1
|
Das Baksi K, Pokhrel V, Pudavar AE, Mande SS, Kuntal BK. BactInt: A domain driven transfer learning approach for extracting inter-bacterial associations from biomedical text. Comput Biol Chem 2024; 109:108012. [PMID: 38198963 DOI: 10.1016/j.compbiolchem.2023.108012] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/04/2023] [Revised: 12/15/2023] [Accepted: 12/30/2023] [Indexed: 01/12/2024]
Abstract
BACKGROUND The healthy as well as dysbiotic state of an ecosystem like human body is known to be influenced not only by the presence of the bacterial groups in it, but also with respect to the associations within themselves. Evidence reported in biomedical text serves as a reliable source for identifying and ascertaining such inter bacterial associations. However, the complexity of the reported text as well as the ever-increasing volume of information necessitates development of methods for automated and accurate extraction of such knowledge. METHODS A BioBERT (biomedical domain specific language model) based information extraction model for bacterial associations is presented that utilizes learning patterns from other publicly available datasets. Additionally, a specialized sentence corpus has been developed to significantly improve the prediction accuracy of the 'transfer learned' model using a fine-tuning approach. RESULTS The final model was seen to outperform all other variations (non-transfer learned and non-fine-tuned models) as well as models trained on BioGPT (a domain trained Generative Pre-trained Transformer). To further demonstrate the utility, a case study was performed using bacterial association network data obtained from experimental studies. CONCLUSION This study attempts to demonstrate the applicability of transfer learning in a niche field of life sciences where understanding of inter bacterial relationships is crucial to obtain meaningful insights in comprehending microbial community structures across different ecosystems. The study further discusses how such a model can be further improved by fine tuning using limited training data. The results presented and the datasets made available are expected to be a valuable addition in the field of medical informatics and bioinformatics.
Collapse
Affiliation(s)
| | - Vatsala Pokhrel
- TCS Research, Tata Consultancy Services Ltd, Pune 411057, India
| | | | | | - Bhusan K Kuntal
- TCS Research, Tata Consultancy Services Ltd, Pune 411057, India.
| |
Collapse
|
2
|
Tumilovich A, Yablokov E, Mezentsev Y, Ershov P, Basina V, Gnedenko O, Kaluzhskiy L, Tsybruk T, Grabovec I, Kisel M, Shabunya P, Soloveva N, Vavilov N, Gilep A, Ivanov A. The Multienzyme Complex Nature of Dehydroepiandrosterone Sulfate Biosynthesis. Int J Mol Sci 2024; 25:2072. [PMID: 38396748 PMCID: PMC10889563 DOI: 10.3390/ijms25042072] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/12/2023] [Revised: 01/16/2024] [Accepted: 01/26/2024] [Indexed: 02/25/2024] Open
Abstract
Dehydroepiandrosterone (DHEA), a precursor of steroid sex hormones, is synthesized by steroid 17-alpha-hydroxylase/17,20-lyase (CYP17A1) with the participation of microsomal cytochrome b5 (CYB5A) and cytochrome P450 reductase (CPR), followed by sulfation by two cytosolic sulfotransferases, SULT1E1 and SULT2A1, for storage and transport to tissues in which its synthesis is not available. The involvement of CYP17A1 and SULTs in these successive reactions led us to consider the possible interaction of SULTs with DHEA-producing CYP17A1 and its redox partners. Text mining analysis, protein-protein network analysis, and gene co-expression analysis were performed to determine the relationships between SULTs and microsomal CYP isoforms. For the first time, using surface plasmon resonance, we detected interactions between CYP17A1 and SULT2A1 or SULT1E1. SULTs also interacted with CYB5A and CPR. The interaction parameters of SULT2A1/CYP17A1 and SULT2A1/CYB5A complexes seemed to be modulated by 3'-phosphoadenosine-5'-phosphosulfate (PAPS). Affinity purification, combined with mass spectrometry (AP-MS), allowed us to identify a spectrum of SULT1E1 potential protein partners, including CYB5A. We showed that the enzymatic activity of SULTs increased in the presence of only CYP17A1 or CYP17A1 and CYB5A mixture. The structures of CYP17A1/SULT1E1 and CYB5A/SULT1E1 complexes were predicted. Our data provide novel fundamental information about the organization of microsomal CYP-dependent macromolecular complexes.
Collapse
Affiliation(s)
- Anastasiya Tumilovich
- Institute of Bioorganic Chemistry NASB, 5 Building 2, V.F. Kuprevich Street, 220141 Minsk, Belarus; (A.T.); (T.T.); (I.G.); (M.K.); (P.S.); (A.G.)
| | - Evgeniy Yablokov
- Institute of Biomedical Chemistry, 10 Building 8, Pogodinskaya Street, 119121 Moscow, Russia; (E.Y.); (P.E.); (O.G.); (L.K.); (N.S.); (N.V.); (A.I.)
| | - Yuri Mezentsev
- Institute of Biomedical Chemistry, 10 Building 8, Pogodinskaya Street, 119121 Moscow, Russia; (E.Y.); (P.E.); (O.G.); (L.K.); (N.S.); (N.V.); (A.I.)
| | - Pavel Ershov
- Institute of Biomedical Chemistry, 10 Building 8, Pogodinskaya Street, 119121 Moscow, Russia; (E.Y.); (P.E.); (O.G.); (L.K.); (N.S.); (N.V.); (A.I.)
| | - Viktoriia Basina
- Research Centre for Medical Genetics, 1 Moskvorechye Street, 115522 Moscow, Russia;
| | - Oksana Gnedenko
- Institute of Biomedical Chemistry, 10 Building 8, Pogodinskaya Street, 119121 Moscow, Russia; (E.Y.); (P.E.); (O.G.); (L.K.); (N.S.); (N.V.); (A.I.)
| | - Leonid Kaluzhskiy
- Institute of Biomedical Chemistry, 10 Building 8, Pogodinskaya Street, 119121 Moscow, Russia; (E.Y.); (P.E.); (O.G.); (L.K.); (N.S.); (N.V.); (A.I.)
| | - Tatsiana Tsybruk
- Institute of Bioorganic Chemistry NASB, 5 Building 2, V.F. Kuprevich Street, 220141 Minsk, Belarus; (A.T.); (T.T.); (I.G.); (M.K.); (P.S.); (A.G.)
| | - Irina Grabovec
- Institute of Bioorganic Chemistry NASB, 5 Building 2, V.F. Kuprevich Street, 220141 Minsk, Belarus; (A.T.); (T.T.); (I.G.); (M.K.); (P.S.); (A.G.)
| | - Maryia Kisel
- Institute of Bioorganic Chemistry NASB, 5 Building 2, V.F. Kuprevich Street, 220141 Minsk, Belarus; (A.T.); (T.T.); (I.G.); (M.K.); (P.S.); (A.G.)
| | - Polina Shabunya
- Institute of Bioorganic Chemistry NASB, 5 Building 2, V.F. Kuprevich Street, 220141 Minsk, Belarus; (A.T.); (T.T.); (I.G.); (M.K.); (P.S.); (A.G.)
| | - Natalia Soloveva
- Institute of Biomedical Chemistry, 10 Building 8, Pogodinskaya Street, 119121 Moscow, Russia; (E.Y.); (P.E.); (O.G.); (L.K.); (N.S.); (N.V.); (A.I.)
| | - Nikita Vavilov
- Institute of Biomedical Chemistry, 10 Building 8, Pogodinskaya Street, 119121 Moscow, Russia; (E.Y.); (P.E.); (O.G.); (L.K.); (N.S.); (N.V.); (A.I.)
| | - Andrei Gilep
- Institute of Bioorganic Chemistry NASB, 5 Building 2, V.F. Kuprevich Street, 220141 Minsk, Belarus; (A.T.); (T.T.); (I.G.); (M.K.); (P.S.); (A.G.)
- Institute of Biomedical Chemistry, 10 Building 8, Pogodinskaya Street, 119121 Moscow, Russia; (E.Y.); (P.E.); (O.G.); (L.K.); (N.S.); (N.V.); (A.I.)
| | - Alexis Ivanov
- Institute of Biomedical Chemistry, 10 Building 8, Pogodinskaya Street, 119121 Moscow, Russia; (E.Y.); (P.E.); (O.G.); (L.K.); (N.S.); (N.V.); (A.I.)
| |
Collapse
|
3
|
Jin Q, Leaman R, Lu Z. PubMed and beyond: biomedical literature search in the age of artificial intelligence. EBioMedicine 2024; 100:104988. [PMID: 38306900 PMCID: PMC10850402 DOI: 10.1016/j.ebiom.2024.104988] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/23/2023] [Revised: 01/14/2024] [Accepted: 01/15/2024] [Indexed: 02/04/2024] Open
Abstract
Biomedical research yields vast information, much of which is only accessible through the literature. Consequently, literature search is crucial for healthcare and biomedicine. Recent improvements in artificial intelligence (AI) have expanded functionality beyond keywords, but they might be unfamiliar to clinicians and researchers. In response, we present an overview of over 30 literature search tools tailored to common biomedical use cases, aiming at helping readers efficiently fulfill their information needs. We first discuss recent improvements and continued challenges of the widely used PubMed. Then, we describe AI-based literature search tools catering to five specific information needs: 1. Evidence-based medicine. 2. Precision medicine and genomics. 3. Searching by meaning, including questions. 4. Finding related articles with literature recommendation. 5. Discovering hidden associations through literature mining. Finally, we discuss the impacts of recent developments of large language models such as ChatGPT on biomedical information seeking.
Collapse
Affiliation(s)
- Qiao Jin
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, MD, USA
| | - Robert Leaman
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, MD, USA
| | - Zhiyong Lu
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, MD, USA.
| |
Collapse
|
4
|
Gayen S, Gupta D, F Loane R, Ide NC, Demner-Fushman D. Effects of Porting Essie Tokenization and Normalization to Solr. AMIA ... ANNUAL SYMPOSIUM PROCEEDINGS. AMIA SYMPOSIUM 2024; 2023:369-378. [PMID: 38222430 PMCID: PMC10785910] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Subscribe] [Scholar Register] [Indexed: 01/16/2024]
Abstract
Search for information is now an integral part of healthcare. Searches are enabled by search engines whose objective is to efficiently retrieve the relevant information for the user query. When it comes to retrieving biomedical text and literature, Essie search engine developed at the National Library of Medicine (NLM) performs exceptionally well. However, Essie is a software system developed for NLM that has ceased development and support. On the other hand, Solr is a popular opensource enterprise search engine used by many of the world's largest internet sites, offering continuous developments and improvements along with the state-of-the-art features. In this paper, we present our approach to porting the key features of Essie and developing custom components to be used in Solr. We demonstrate the effectiveness of the added components on three benchmark biomedical datasets. The custom components may aid the community in improving search methods for biomedical text retrieval.
Collapse
Affiliation(s)
- Soumya Gayen
- Lister Hill National Center for Biomedical Communications, U.S. National Library of Medicine, Bethesda, MD, USA
| | - Deepak Gupta
- Lister Hill National Center for Biomedical Communications, U.S. National Library of Medicine, Bethesda, MD, USA
| | - Russell F Loane
- Lister Hill National Center for Biomedical Communications, U.S. National Library of Medicine, Bethesda, MD, USA
| | - Nicholas C Ide
- Lister Hill National Center for Biomedical Communications, U.S. National Library of Medicine, Bethesda, MD, USA
| | - Dina Demner-Fushman
- Lister Hill National Center for Biomedical Communications, U.S. National Library of Medicine, Bethesda, MD, USA
| |
Collapse
|
5
|
Jin Q, Kim W, Chen Q, Comeau DC, Yeganova L, Wilbur WJ, Lu Z. MedCPT: Contrastive Pre-trained Transformers with large-scale PubMed search logs for zero-shot biomedical information retrieval. Bioinformatics 2023; 39:btad651. [PMID: 37930897 PMCID: PMC10627406 DOI: 10.1093/bioinformatics/btad651] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/07/2023] [Revised: 09/29/2023] [Indexed: 11/08/2023] Open
Abstract
MOTIVATION Information retrieval (IR) is essential in biomedical knowledge acquisition and clinical decision support. While recent progress has shown that language model encoders perform better semantic retrieval, training such models requires abundant query-article annotations that are difficult to obtain in biomedicine. As a result, most biomedical IR systems only conduct lexical matching. In response, we introduce MedCPT, a first-of-its-kind Contrastively Pre-trained Transformer model for zero-shot semantic IR in biomedicine. RESULTS To train MedCPT, we collected an unprecedented scale of 255 million user click logs from PubMed. With such data, we use contrastive learning to train a pair of closely integrated retriever and re-ranker. Experimental results show that MedCPT sets new state-of-the-art performance on six biomedical IR tasks, outperforming various baselines including much larger models, such as GPT-3-sized cpt-text-XL. In addition, MedCPT also generates better biomedical article and sentence representations for semantic evaluations. As such, MedCPT can be readily applied to various real-world biomedical IR tasks. AVAILABILITY AND IMPLEMENTATION The MedCPT code and model are available at https://github.com/ncbi/MedCPT.
Collapse
Affiliation(s)
- Qiao Jin
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, MD, United States
| | - Won Kim
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, MD, United States
| | - Qingyu Chen
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, MD, United States
| | - Donald C Comeau
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, MD, United States
| | - Lana Yeganova
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, MD, United States
| | - W John Wilbur
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, MD, United States
| | - Zhiyong Lu
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, MD, United States
| |
Collapse
|
6
|
Ershov P, Yablokov E, Mezentsev Y, Ivanov A. Uncharacterized Proteins CxORFx: Subinteractome Analysis and Prognostic Significance in Cancers. Int J Mol Sci 2023; 24:10190. [PMID: 37373333 DOI: 10.3390/ijms241210190] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/02/2023] [Revised: 05/25/2023] [Accepted: 05/26/2023] [Indexed: 06/29/2023] Open
Abstract
Functions of about 10% of all the proteins and their associations with diseases are poorly annotated or not annotated at all. Among these proteins, there is a group of uncharacterized chromosome-specific open-reading frame genes (CxORFx) from the 'Tdark' category. The aim of the work was to reveal associations of CxORFx gene expression and ORF proteins' subinteractomes with cancer-driven cellular processes and molecular pathways. We performed systems biology and bioinformatic analysis of 219 differentially expressed CxORFx genes in cancers, an estimation of prognostic significance of novel transcriptomic signatures and analysis of subinteractome composition using several web servers (GEPIA2, KMplotter, ROC-plotter, TIMER, cBioPortal, DepMap, EnrichR, PepPSy, cProSite, WebGestalt, CancerGeneNet, PathwAX II and FunCoup). The subinteractome of each ORF protein was revealed using ten different data sources on physical protein-protein interactions (PPIs) to obtain representative datasets for the exploration of possible cellular functions of ORF proteins through a spectrum of neighboring annotated protein partners. A total of 42 out of 219 presumably cancer-associated ORF proteins and 30 cancer-dependent binary PPIs were found. Additionally, a bibliometric analysis of 204 publications allowed us to retrieve biomedical terms related to ORF genes. In spite of recent progress in functional studies of ORF genes, the current investigations aim at finding out the prognostic value of CxORFx expression patterns in cancers. The results obtained expand the understanding of the possible functions of the poorly annotated CxORFx in the cancer context.
Collapse
Affiliation(s)
- Pavel Ershov
- Institute of Biomedical Chemistry, Moscow 119121, Russia
| | | | - Yuri Mezentsev
- Institute of Biomedical Chemistry, Moscow 119121, Russia
| | - Alexis Ivanov
- Institute of Biomedical Chemistry, Moscow 119121, Russia
| |
Collapse
|
7
|
Hsiao TK, Torvik VI. OpCitance: Citation contexts identified from the PubMed Central open access articles. Sci Data 2023; 10:243. [PMID: 37117220 PMCID: PMC10139909 DOI: 10.1038/s41597-023-02134-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/20/2022] [Accepted: 04/04/2023] [Indexed: 04/30/2023] Open
Abstract
OpCitance contains all the sentences from 2 million PubMed Central open-access (PMCOA) articles, with 137 million inline citations annotated (i.e., the "citation contexts"). Parsing out the references and citation contexts from the PMCOA XML files was non-trivial due to the diversity of referencing style. Only 0.5% citation contexts remain unidentified due to technical or human issues, e.g., references unmentioned by the authors in the text or improper XML nesting, which is more common among older articles (pre-2000). PubMed IDs (PMIDs) linked to inline citations in the XML files compared to citations harvested using the NCBI E-Utilities differed for 70.96% of the articles. Using an in-house citation matcher, called Patci, 6.84% of the referenced PMIDs were supplemented and corrected. OpCitance includes fewer total number of articles than the Semantic Scholar Open Research Corpus, but OpCitance has 160 thousand unique articles, a higher inline citation identification rate, and a more accurate reference mapping to PMIDs. We hope that OpCitance will facilitate citation context studies in particular and benefit text-mining research more broadly.
Collapse
Affiliation(s)
- Tzu-Kun Hsiao
- School of Information Sciences, University of Illinois at Urbana-Champaign, 501 E. Daniel Street, Champaign, IL, 61820, USA.
| | - Vetle I Torvik
- School of Information Sciences, University of Illinois at Urbana-Champaign, 501 E. Daniel Street, Champaign, IL, 61820, USA.
| |
Collapse
|
8
|
Liu Y, Elsworth BL, Gaunt TR. Using language models and ontology topology to perform semantic mapping of traits between biomedical datasets. Bioinformatics 2023; 39:btad169. [PMID: 37010521 PMCID: PMC10097433 DOI: 10.1093/bioinformatics/btad169] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/23/2021] [Revised: 02/12/2023] [Accepted: 03/19/2023] [Indexed: 04/04/2023] Open
Abstract
MOTIVATION Human traits are typically represented in both the biomedical literature and large population studies as descriptive text strings. Whilst a number of ontologies exist, none of these perfectly represent the entire human phenome and exposome. Mapping trait names across large datasets is therefore time-consuming and challenging. Recent developments in language modelling have created new methods for semantic representation of words and phrases, and these methods offer new opportunities to map human trait names in the form of words and short phrases, both to ontologies and to each other. Here, we present a comparison between a range of established and more recent language modelling approaches for the task of mapping trait names from UK Biobank to the Experimental Factor Ontology (EFO), and also explore how they compare to each other in direct trait-to-trait mapping. RESULTS In our analyses of 1191 traits from UK Biobank with manual EFO mappings, the BioSentVec model performed best at predicting these, matching 40.3% of the manual mappings correctly. The BlueBERT-EFO model (finetuned on EFO) performed nearly as well (38.8% of traits matching the manual mapping). In contrast, Levenshtein edit distance only mapped 22% of traits correctly. Pairwise mapping of traits to each other demonstrated that many of the models can accurately group similar traits based on their semantic similarity. AVAILABILITY AND IMPLEMENTATION Our code is available at https://github.com/MRCIEU/vectology.
Collapse
Affiliation(s)
- Yi Liu
- MRC Integrative Epidemiology Unit, Bristol Medical School, University of Bristol, Bristol, United Kingdom
| | | | - Tom R Gaunt
- MRC Integrative Epidemiology Unit, Bristol Medical School, University of Bristol, Bristol, United Kingdom
| |
Collapse
|
9
|
Lara-Clares A, Lastra-Díaz JJ, Garcia-Serrano A. A reproducible experimental survey on biomedical sentence similarity: A string-based method sets the state of the art. PLoS One 2022; 17:e0276539. [PMID: 36409715 PMCID: PMC9678326 DOI: 10.1371/journal.pone.0276539] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/21/2022] [Accepted: 10/08/2022] [Indexed: 11/22/2022] Open
Abstract
This registered report introduces the largest, and for the first time, reproducible experimental survey on biomedical sentence similarity with the following aims: (1) to elucidate the state of the art of the problem; (2) to solve some reproducibility problems preventing the evaluation of most current methods; (3) to evaluate several unexplored sentence similarity methods; (4) to evaluate for the first time an unexplored benchmark, called Corpus-Transcriptional-Regulation (CTR); (5) to carry out a study on the impact of the pre-processing stages and Named Entity Recognition (NER) tools on the performance of the sentence similarity methods; and finally, (6) to bridge the lack of software and data reproducibility resources for methods and experiments in this line of research. Our reproducible experimental survey is based on a single software platform, which is provided with a detailed reproducibility protocol and dataset as supplementary material to allow the exact replication of all our experiments and results. In addition, we introduce a new aggregated string-based sentence similarity method, called LiBlock, together with eight variants of current ontology-based methods, and a new pre-trained word embedding model trained on the full-text articles in the PMC-BioC corpus. Our experiments show that our novel string-based measure establishes the new state of the art in sentence similarity analysis in the biomedical domain and significantly outperforms all the methods evaluated herein, with the only exception of one ontology-based method. Likewise, our experiments confirm that the pre-processing stages, and the choice of the NER tool for ontology-based methods, have a very significant impact on the performance of the sentence similarity methods. We also detail some drawbacks and limitations of current methods, and highlight the need to refine the current benchmarks. Finally, a notable finding is that our new string-based method significantly outperforms all state-of-the-art Machine Learning (ML) models evaluated herein.
Collapse
Affiliation(s)
- Alicia Lara-Clares
- NLP & IR Research Group, E.T.S.I. Informática, Universidad Nacional de Educación a Distancia (UNED), Madrid, Spain
- * E-mail:
| | - Juan J. Lastra-Díaz
- NLP & IR Research Group, E.T.S.I. Informática, Universidad Nacional de Educación a Distancia (UNED), Madrid, Spain
| | - Ana Garcia-Serrano
- NLP & IR Research Group, E.T.S.I. Informática, Universidad Nacional de Educación a Distancia (UNED), Madrid, Spain
| |
Collapse
|
10
|
Kim W, Yeganova L, Comeau DC, Wilbur WJ, Lu Z. Towards a unified search: Improving PubMed retrieval with full text. J Biomed Inform 2022; 134:104211. [PMID: 36152950 PMCID: PMC9561061 DOI: 10.1016/j.jbi.2022.104211] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/07/2022] [Revised: 09/12/2022] [Accepted: 09/15/2022] [Indexed: 10/14/2022]
Abstract
OBJECTIVE A significant number of recent articles in PubMed have full text available in PubMed Central®, and the availability of full texts has been consistently growing. However, it is not currently possible for a user to simultaneously query the contents of both databases and receive a single integrated search result. In this study, we investigate how to score full text articles given a multitoken query and how to combine those full text article scores with scores originating from abstracts and achieve an overall improved retrieval performance. MATERIALS AND METHODS For scoring full text articles, we propose a method to combine information coming from different sections by converting the traditionally used BM25 scores into log odds ratio scores which can be treated uniformly. We further propose a method that successfully combines scores from two heterogenous retrieval sources - full text articles and abstract only articles - by balancing the contributions of their respective scores through a probabilistic transformation. We use PubMed click data that consists of queries sampled from PubMed user logs along with a subset of retrieved and clicked documents to train the probabilistic functions and to evaluate retrieval effectiveness. RESULTS AND CONCLUSIONS Random ranking achieves 0.579 MAP score on our PubMed click data. BM25 ranking on PubMed abstracts improves the MAP by 10.6%. For full text documents, experiments confirm that BM25 section scores are of different value depending on the section type and are not directly comparable. Naïvely using the body text of articles along with abstract text degrades the overall quality of the search. The proposed log odds ratio scores normalize and combine the contributions of occurrences of query tokens in different sections. By including full text where available, we gain another 0.67%, or 7% relative improvement over abstract alone. We find an advantage in the more accurate estimate of the value of BM25 scores depending on the section from which they were produced. Taking the sum of top three section scores performs the best.
Collapse
Affiliation(s)
- Won Kim
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, MD 20894, USA
| | - Lana Yeganova
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, MD 20894, USA
| | - Donald C Comeau
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, MD 20894, USA
| | - W John Wilbur
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, MD 20894, USA
| | - Zhiyong Lu
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, MD 20894, USA.
| |
Collapse
|
11
|
Xu Q, Liu Y, Hu J, Duan X, Song N, Zhou J, Zhai J, Su J, Liu S, Chen F, Zheng W, Guo Z, Li H, Zhou Q, Niu B. OncoPubMiner: a platform for mining oncology publications. Brief Bioinform 2022; 23:6691792. [PMID: 36058206 DOI: 10.1093/bib/bbac383] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/11/2022] [Revised: 08/08/2022] [Accepted: 08/09/2022] [Indexed: 11/12/2022] Open
Abstract
Updated and expert-quality knowledge bases are fundamental to biomedical research. A knowledge base established with human participation and subject to multiple inspections is needed to support clinical decision making, especially in the growing field of precision oncology. The number of original publications in this field has risen dramatically with the advances in technology and the evolution of in-depth research. Consequently, the issue of how to gather and mine these articles accurately and efficiently now requires close consideration. In this study, we present OncoPubMiner (https://oncopubminer.chosenmedinfo.com), a free and powerful system that combines text mining, data structure customisation, publication search with online reading and project-centred and team-based data collection to form a one-stop 'keyword in-knowledge out' oncology publication mining platform. The platform was constructed by integrating all open-access abstracts from PubMed and full-text articles from PubMed Central, and it is updated daily. OncoPubMiner makes obtaining precision oncology knowledge from scientific articles straightforward and will assist researchers in efficiently developing structured knowledge base systems and bring us closer to achieving precision oncology goals.
Collapse
Affiliation(s)
- Quan Xu
- ChosenMed Technology (Beijing) Company Limited, Jinghai Industrial Park, Economic and Technological Development Area, Beijing 100176, China
| | - Yueyue Liu
- ChosenMed Technology (Beijing) Company Limited, Jinghai Industrial Park, Economic and Technological Development Area, Beijing 100176, China.,ChosenMed Gene Technology Co. Ltd., Nanjing, China
| | - Jifang Hu
- ChosenMed Technology (Beijing) Company Limited, Jinghai Industrial Park, Economic and Technological Development Area, Beijing 100176, China.,Computer Network Information Center, Chinese Academy of Sciences, Beijing 100190, China.,University of Chinese Academy of Sciences, Beijing 100190, China
| | - Xiaohong Duan
- ChosenMed Technology (Beijing) Company Limited, Jinghai Industrial Park, Economic and Technological Development Area, Beijing 100176, China.,ChosenMed Gene Technology Co. Ltd., Nanjing, China
| | - Niuben Song
- ChosenMed Technology (Beijing) Company Limited, Jinghai Industrial Park, Economic and Technological Development Area, Beijing 100176, China
| | - Jiale Zhou
- ChosenMed Technology (Beijing) Company Limited, Jinghai Industrial Park, Economic and Technological Development Area, Beijing 100176, China
| | - Jincheng Zhai
- ChosenMed Technology (Beijing) Company Limited, Jinghai Industrial Park, Economic and Technological Development Area, Beijing 100176, China
| | - Junyan Su
- ChosenMed Technology (Beijing) Company Limited, Jinghai Industrial Park, Economic and Technological Development Area, Beijing 100176, China
| | - Siyao Liu
- ChosenMed Technology (Beijing) Company Limited, Jinghai Industrial Park, Economic and Technological Development Area, Beijing 100176, China
| | - Fan Chen
- ChosenMed Technology (Beijing) Company Limited, Jinghai Industrial Park, Economic and Technological Development Area, Beijing 100176, China.,ChosenMed Gene Technology Co. Ltd., Nanjing, China
| | - Wei Zheng
- The Department of Nephrology and Hypertension Medicine, Beijing Electric Power Hospital, Beijing 100073, China
| | - Zhongjia Guo
- ChosenMed Technology (Beijing) Company Limited, Jinghai Industrial Park, Economic and Technological Development Area, Beijing 100176, China
| | - Hexiang Li
- ChosenMed Technology (Beijing) Company Limited, Jinghai Industrial Park, Economic and Technological Development Area, Beijing 100176, China
| | - Qiming Zhou
- ChosenMed Technology (Beijing) Company Limited, Jinghai Industrial Park, Economic and Technological Development Area, Beijing 100176, China.,ChosenMed Gene Technology Co. Ltd., Nanjing, China
| | - Beifang Niu
- ChosenMed Technology (Beijing) Company Limited, Jinghai Industrial Park, Economic and Technological Development Area, Beijing 100176, China.,Computer Network Information Center, Chinese Academy of Sciences, Beijing 100190, China.,University of Chinese Academy of Sciences, Beijing 100190, China
| |
Collapse
|
12
|
Zhang L, Lu W, Chen H, Huang Y, Cheng Q. A comparative evaluation of biomedical similar article recommendation. J Biomed Inform 2022; 131:104106. [PMID: 35661818 DOI: 10.1016/j.jbi.2022.104106] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/10/2021] [Revised: 05/27/2022] [Accepted: 05/28/2022] [Indexed: 11/28/2022]
Abstract
BACKGROUND Biomedical sciences, with their focus on human health and disease, have attracted unprecedented attention in the 21st century. The proliferation of biomedical sciences has also led to a large number of scientific articles being produced, which makes it difficult for biomedical researchers to find relevant articles and hinders the dissemination of valuable discoveries. To bridge this gap, the research community has initiated the article recommendation task, with the aim of recommending articles to biomedical researchers automatically based on their research interests. Over the past two decades, many recommendation methods have been developed. However, an algorithm-level comparison and rigorous evaluation of the most important methods on a shared dataset is still lacking. METHOD In this study, we first investigate 15 methods for automated article recommendation in the biomedical domain. We then conduct an empirical evaluation of the 15 methods, including six term-based methods, two word embedding methods, three sentence embedding methods, two document embedding methods, and two BERT-based methods. These methods are evaluated in two scenarios: article-oriented recommenders and user-oriented recommenders, with two publicly available datasets: TREC 2005 Genomics and RELISH, respectively. RESULTS Our experimental results show that the text representation models BERT and BioSenVec outperform many existing recommendation methods (e.g., BM25, PMRA, XPRC) and web-based recommendation systems (e.g., MScanner, MedlineRanker, BioReader) on both datasets regarding most of the evaluation metrics, and fine-tuning can improve the performance of the BERT-based methods. CONCLUSIONS Our comparison study is useful for researchers and practitioners in selecting the best modeling strategies for building article recommendation systems in the biomedical domain. The code and datasets are publicly available.
Collapse
Affiliation(s)
- Li Zhang
- School of Information Management, Wuhan University, Wuhan, 430074, Hubei Province, China.
| | - Wei Lu
- School of Information Management, Wuhan University, Wuhan, 430074, Hubei Province, China.
| | - Haihua Chen
- Department of Information Science, University of North Texas, Denton, 76203, Texas, USA.
| | - Yong Huang
- School of Information Management, Wuhan University, Wuhan, 430074, Hubei Province, China.
| | - Qikai Cheng
- School of Information Management, Wuhan University, Wuhan, 430074, Hubei Province, China.
| |
Collapse
|
13
|
Clinical language search algorithm from free-text: facilitating appropriate imaging. BMC Med Imaging 2022; 22:18. [PMID: 35120466 PMCID: PMC8815252 DOI: 10.1186/s12880-022-00740-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/20/2021] [Accepted: 01/18/2022] [Indexed: 11/22/2022] Open
Abstract
Background The comprehensiveness and maintenance of the American College of Radiology (ACR) Appropriateness Criteria (AC) makes it a unique resource for evidence-based clinical imaging decision support, but it is underutilized by clinicians. To facilitate the use of imaging recommendations, we develop a natural language processing (NLP) search algorithm that automatically matches clinical indications that physicians write into imaging orders to appropriate AC imaging recommendations. Methods We apply a hybrid model of semantic similarity from a sent2vec model trained on 223 million scientific sentences, combined with term frequency inverse document frequency features. AC documents are ranked based on their embeddings’ cosine distance to query. For model testing, we compiled a dataset of simulated simple and complex indications for each AC document (n = 410) and another with clinical indications from randomly sampled radiology reports (n = 100). We compare our algorithm to a custom google search engine. Results On the simulated indications, our algorithm ranked ground truth documents as top 3 for 98% of simple queries and 85% of complex queries. Similarly, on the randomly sampled radiology report dataset, the algorithm ranked 86% of indications with a single match as top 3. Vague and distracting phrases present in the free-text indications were main sources of errors. Our algorithm provides more relevant results than a custom Google search engine, especially for complex queries. Conclusions We have developed and evaluated an NLP algorithm that matches clinical indications to appropriate AC guidelines. This approach can be integrated into imaging ordering systems for automated access to guidelines. Supplementary Information The online version contains supplementary material available at 10.1186/s12880-022-00740-6.
Collapse
|
14
|
Chen Q, Rankine A, Peng Y, Aghaarabi E, Lu Z. Benchmarking Effectiveness and Efficiency of Deep Learning Models for Semantic Textual Similarity in the Clinical Domain: Validation Study. JMIR Med Inform 2021; 9:e27386. [PMID: 34967748 PMCID: PMC8759018 DOI: 10.2196/27386] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/22/2021] [Revised: 08/06/2021] [Accepted: 08/06/2021] [Indexed: 01/23/2023] Open
Abstract
Background Semantic textual similarity (STS) measures the degree of relatedness between sentence pairs. The Open Health Natural Language Processing (OHNLP) Consortium released an expertly annotated STS data set and called for the National Natural Language Processing Clinical Challenges. This work describes our entry, an ensemble model that leverages a range of deep learning (DL) models. Our team from the National Library of Medicine obtained a Pearson correlation of 0.8967 in an official test set during 2019 National Natural Language Processing Clinical Challenges/Open Health Natural Language Processing shared task and achieved a second rank. Objective Although our models strongly correlate with manual annotations, annotator-level correlation was only moderate (weighted Cohen κ=0.60). We are cautious of the potential use of DL models in production systems and argue that it is more critical to evaluate the models in-depth, especially those with extremely high correlations. In this study, we benchmark the effectiveness and efficiency of top-ranked DL models. We quantify their robustness and inference times to validate their usefulness in real-time applications. Methods We benchmarked five DL models, which are the top-ranked systems for STS tasks: Convolutional Neural Network, BioSentVec, BioBERT, BlueBERT, and ClinicalBERT. We evaluated a random forest model as an additional baseline. For each model, we repeated the experiment 10 times, using the official training and testing sets. We reported 95% CI of the Wilcoxon rank-sum test on the average Pearson correlation (official evaluation metric) and running time. We further evaluated Spearman correlation, R², and mean squared error as additional measures. Results Using only the official training set, all models obtained highly effective results. BioSentVec and BioBERT achieved the highest average Pearson correlations (0.8497 and 0.8481, respectively). BioSentVec also had the highest results in 3 of 4 effectiveness measures, followed by BioBERT. However, their robustness to sentence pairs of different similarity levels varies significantly. A particular observation is that BERT models made the most errors (a mean squared error of over 2.5) on highly similar sentence pairs. They cannot capture highly similar sentence pairs effectively when they have different negation terms or word orders. In addition, time efficiency is dramatically different from the effectiveness results. On average, the BERT models were approximately 20 times and 50 times slower than the Convolutional Neural Network and BioSentVec models, respectively. This results in challenges for real-time applications. Conclusions Despite the excitement of further improving Pearson correlations in this data set, our results highlight that evaluations of the effectiveness and efficiency of STS models are critical. In future, we suggest more evaluations on the generalization capability and user-level testing of the models. We call for community efforts to create more biomedical and clinical STS data sets from different perspectives to reflect the multifaceted notion of sentence-relatedness.
Collapse
Affiliation(s)
- Qingyu Chen
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, United States
| | - Alex Rankine
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, United States.,Harvard College, Cambridge, MA, United States
| | - Yifan Peng
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, United States.,Weill Cornell Medicine, New York, NY, United States
| | - Elaheh Aghaarabi
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, United States.,Towson University, Towson, MD, United States
| | - Zhiyong Lu
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, United States
| |
Collapse
|
15
|
Barupal DK, Schubauer-Berigan MK, Korenjak M, Zavadil J, Guyton KZ. Prioritizing cancer hazard assessments for IARC Monographs using an integrated approach of database fusion and text mining. ENVIRONMENT INTERNATIONAL 2021; 156:106624. [PMID: 33984576 PMCID: PMC8380673 DOI: 10.1016/j.envint.2021.106624] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.7] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Received: 01/12/2021] [Revised: 03/22/2021] [Accepted: 04/30/2021] [Indexed: 05/14/2023]
Abstract
BACKGROUND Systematic evaluation of literature data on the cancer hazards of human exposures is an essential process underlying cancer prevention strategies. The scope and volume of evidence for suspected carcinogens can range from very few to thousands of publications, requiring a complex, systematically planned, and critical procedure to nominate, prioritize and evaluate carcinogenic agents. To aid in this process, database fusion, cheminformatics and text mining techniques can be combined into an integrated approach to inform agent prioritization, selection, and grouping. RESULTS We have applied these techniques to agents recommended for the IARC Monographs evaluations during 2020-2024. An integration of PubMed filters to cover cancer epidemiology, key characteristics of carcinogens, chemical lists from 34 databases relevant for cancer research, chemical structure grouping and a literature data-based clustering was applied in an innovative approach to 119 agents recommended by an advisory group for future IARC Monographs evaluations. The approach also facilitated a rational grouping of these agents and aids in understanding the volume and complexity of relevant information, as well as important gaps in coverage of the available studies on cancer etiology and carcinogenesis. CONCLUSION A new data-science approach has been applied to diverse agents recommended for cancer hazard assessments, and its applications for the IARC Monographs are demonstrated. The prioritization approach has been made available at www.cancer.idsl.me site for ranking cancer agents.
Collapse
Affiliation(s)
- Dinesh Kumar Barupal
- Department of Environmental Medicine and Public Health, Icahn School of Medicine at Mt Sinai, NY, USA.
| | - Mary K Schubauer-Berigan
- Evidence Synthesis and Classification Branch, International Agency for Research on Cancer, Lyon, France
| | - Michael Korenjak
- Epigenomics and Mechanisms Branch, International Agency for Research on Cancer, Lyon, France
| | - Jiri Zavadil
- Epigenomics and Mechanisms Branch, International Agency for Research on Cancer, Lyon, France
| | - Kathryn Z Guyton
- Evidence Synthesis and Classification Branch, International Agency for Research on Cancer, Lyon, France
| |
Collapse
|
16
|
Ershov P, Kaluzhskiy L, Mezentsev Y, Yablokov E, Gnedenko O, Ivanov A. Enzymes in the Cholesterol Synthesis Pathway: Interactomics in the Cancer Context. Biomedicines 2021; 9:biomedicines9080895. [PMID: 34440098 PMCID: PMC8389681 DOI: 10.3390/biomedicines9080895] [Citation(s) in RCA: 21] [Impact Index Per Article: 7.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/01/2021] [Revised: 07/20/2021] [Accepted: 07/22/2021] [Indexed: 02/06/2023] Open
Abstract
A global protein interactome ensures the maintenance of regulatory, signaling and structural processes in cells, but at the same time, aberrations in the repertoire of protein-protein interactions usually cause a disease onset. Many metabolic enzymes catalyze multistage transformation of cholesterol precursors in the cholesterol biosynthesis pathway. Cancer-associated deregulation of these enzymes through various molecular mechanisms results in pathological cholesterol accumulation (its precursors) which can be disease risk factors. This work is aimed at systematization and bioinformatic analysis of the available interactomics data on seventeen enzymes in the cholesterol pathway, encoded by HMGCR, MVK, PMVK, MVD, FDPS, FDFT1, SQLE, LSS, DHCR24, CYP51A1, TM7SF2, MSMO1, NSDHL, HSD17B7, EBP, SC5D, DHCR7 genes. The spectrum of 165 unique and 21 common protein partners that physically interact with target enzymes was selected from several interatomic resources. Among them there were 47 modifying proteins from different protein kinases/phosphatases and ubiquitin-protein ligases/deubiquitinases families. A literature search, enrichment and gene co-expression analysis showed that about a quarter of the identified protein partners was associated with cancer hallmarks and over-represented in cancer pathways. Our results allow to update the current fundamental view on protein-protein interactions and regulatory aspects of the cholesterol synthesis enzymes and annotate of their sub-interactomes in term of possible involvement in cancers that will contribute to prioritization of protein targets for future drug development.
Collapse
|
17
|
Lara-Clares A, Lastra-Díaz JJ, Garcia-Serrano A. Protocol for a reproducible experimental survey on biomedical sentence similarity. PLoS One 2021; 16:e0248663. [PMID: 33760855 PMCID: PMC7990182 DOI: 10.1371/journal.pone.0248663] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/09/2020] [Accepted: 03/02/2021] [Indexed: 11/28/2022] Open
Abstract
Measuring semantic similarity between sentences is a significant task in the fields of Natural Language Processing (NLP), Information Retrieval (IR), and biomedical text mining. For this reason, the proposal of sentence similarity methods for the biomedical domain has attracted a lot of attention in recent years. However, most sentence similarity methods and experimental results reported in the biomedical domain cannot be reproduced for multiple reasons as follows: the copying of previous results without confirmation, the lack of source code and data to replicate both methods and experiments, and the lack of a detailed definition of the experimental setup, among others. As a consequence of this reproducibility gap, the state of the problem can be neither elucidated nor new lines of research be soundly set. On the other hand, there are other significant gaps in the literature on biomedical sentence similarity as follows: (1) the evaluation of several unexplored sentence similarity methods which deserve to be studied; (2) the evaluation of an unexplored benchmark on biomedical sentence similarity, called Corpus-Transcriptional-Regulation (CTR); (3) a study on the impact of the pre-processing stage and Named Entity Recognition (NER) tools on the performance of the sentence similarity methods; and finally, (4) the lack of software and data resources for the reproducibility of methods and experiments in this line of research. Identified these open problems, this registered report introduces a detailed experimental setup, together with a categorization of the literature, to develop the largest, updated, and for the first time, reproducible experimental survey on biomedical sentence similarity. Our aforementioned experimental survey will be based on our own software replication and the evaluation of all methods being studied on the same software platform, which will be specially developed for this work, and it will become the first publicly available software library for biomedical sentence similarity. Finally, we will provide a very detailed reproducibility protocol and dataset as supplementary material to allow the exact replication of all our experiments and results.
Collapse
Affiliation(s)
- Alicia Lara-Clares
- NLP & IR Research Group, E.T.S.I. Informática, Universidad Nacional de Educación a Distancia (UNED), Madrid, Spain
| | - Juan J. Lastra-Díaz
- NLP & IR Research Group, E.T.S.I. Informática, Universidad Nacional de Educación a Distancia (UNED), Madrid, Spain
| | - Ana Garcia-Serrano
- NLP & IR Research Group, E.T.S.I. Informática, Universidad Nacional de Educación a Distancia (UNED), Madrid, Spain
| |
Collapse
|
18
|
Caufield JH, Sigdel D, Fu J, Choi H, Guevara-Gonzalez V, Wang D, Ping P. Cardiovascular Informatics: building a bridge to data harmony. Cardiovasc Res 2021; 118:732-745. [PMID: 33751044 DOI: 10.1093/cvr/cvab067] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Submit a Manuscript] [Subscribe] [Scholar Register] [Received: 09/01/2020] [Accepted: 03/03/2021] [Indexed: 12/11/2022] Open
Abstract
The search for new strategies for better understanding cardiovascular disease is a constant one, spanning multitudinous types of observations and studies. A comprehensive characterization of each disease state and its biomolecular underpinnings relies upon insights gleaned from extensive information collection of various types of data. Researchers and clinicians in cardiovascular biomedicine repeatedly face questions regarding which types of data may best answer their questions, how to integrate information from multiple datasets of various types, and how to adapt emerging advances in machine learning and/or artificial intelligence to their needs in data processing. Frequently lauded as a field with great practical and translational potential, the interface between biomedical informatics and cardiovascular medicine is challenged with staggeringly massive datasets. Successful application of computational approaches to decode these complex and gigantic amounts of information becomes an essential step toward realizing the desired benefits. In this review, we examine recent efforts to adapt informatics strategies to cardiovascular biomedical research: automated information extraction and unification of multifaceted -omics data. We discuss how and why this interdisciplinary space of Cardiovascular Informatics is particularly relevant to and supportive of current experimental and clinical research. We describe in detail how open data sources and methods can drive discovery while demanding few initial resources, an advantage afforded by widespread availability of cloud computing-driven platforms. Subsequently, we provide examples of how interoperable computational systems facilitate exploration of data from multiple sources, including both consistently-formatted structured data and unstructured data. Taken together, these approaches for achieving data harmony enable molecular phenotyping of cardiovascular (CV) diseases and unification of cardiovascular knowledge.
Collapse
Affiliation(s)
- J Harry Caufield
- NHLBI Integrated Cardiovascular Data Science Training Program at University of California, Los Angeles (UCLA), Los Angeles, CA, 90095, USA.,Departments of Physiology at UCLA School of Medicine, Los Angeles, CA, 90095, USA
| | - Dibakar Sigdel
- NHLBI Integrated Cardiovascular Data Science Training Program at University of California, Los Angeles (UCLA), Los Angeles, CA, 90095, USA.,Departments of Physiology at UCLA School of Medicine, Los Angeles, CA, 90095, USA
| | - John Fu
- NHLBI Integrated Cardiovascular Data Science Training Program at University of California, Los Angeles (UCLA), Los Angeles, CA, 90095, USA
| | - Howard Choi
- NHLBI Integrated Cardiovascular Data Science Training Program at University of California, Los Angeles (UCLA), Los Angeles, CA, 90095, USA
| | - Vladimir Guevara-Gonzalez
- NHLBI Integrated Cardiovascular Data Science Training Program at University of California, Los Angeles (UCLA), Los Angeles, CA, 90095, USA
| | - Ding Wang
- Departments of Physiology at UCLA School of Medicine, Los Angeles, CA, 90095, USA
| | - Peipei Ping
- NHLBI Integrated Cardiovascular Data Science Training Program at University of California, Los Angeles (UCLA), Los Angeles, CA, 90095, USA.,Departments of Physiology at UCLA School of Medicine, Los Angeles, CA, 90095, USA.,Department of Medicine (Cardiology) at UCLA School of Medicine, Los Angeles, CA, 90095, USA.,Bioinformatics and Medical Informatics, Los Angeles, CA, 90095, USA.,Scalable Analytics Institute (ScAi) at UCLA School of Engineering, Los Angeles, CA, 90095, USA
| |
Collapse
|
19
|
Chen Q, Allot A, Lu Z. LitCovid: an open database of COVID-19 literature. Nucleic Acids Res 2021; 49:D1534-D1540. [PMID: 33166392 PMCID: PMC7778958 DOI: 10.1093/nar/gkaa952] [Citation(s) in RCA: 130] [Impact Index Per Article: 43.3] [Reference Citation Analysis] [Abstract] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/15/2020] [Revised: 10/02/2020] [Accepted: 10/08/2020] [Indexed: 12/22/2022] Open
Abstract
Since the outbreak of the current pandemic in 2020, there has been a rapid growth of published articles on COVID-19 and SARS-CoV-2, with about 10 000 new articles added each month. This is causing an increasingly serious information overload, making it difficult for scientists, healthcare professionals and the general public to remain up to date on the latest SARS-CoV-2 and COVID-19 research. Hence, we developed LitCovid (https://www.ncbi.nlm.nih.gov/research/coronavirus/), a curated literature hub, to track up-to-date scientific information in PubMed. LitCovid is updated daily with newly identified relevant articles organized into curated categories. To support manual curation, advanced machine-learning and deep-learning algorithms have been developed, evaluated and integrated into the curation workflow. To the best of our knowledge, LitCovid is the first-of-its-kind COVID-19-specific literature resource, with all of its collected articles and curated data freely available. Since its release, LitCovid has been widely used, with millions of accesses by users worldwide for various information needs, such as evidence synthesis, drug discovery and text and data mining, among others.
Collapse
Affiliation(s)
- Qingyu Chen
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, MD 20892, USA
| | - Alexis Allot
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, MD 20892, USA
| | - Zhiyong Lu
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, MD 20892, USA
| |
Collapse
|
20
|
Dinakar B, Boguslav MR, Görg C, Dinakarpandian D. Semantic Changepoint Detection for Finding Potentially Novel Research Publications. PACIFIC SYMPOSIUM ON BIOCOMPUTING. PACIFIC SYMPOSIUM ON BIOCOMPUTING 2021; 26:107-118. [PMID: 33691009 PMCID: PMC8352552] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Figures] [Subscribe] [Scholar Register] [Indexed: 06/12/2023]
Abstract
How has the focus of research papers on a given disease changed over time? Identifying the papers at the cusps of change can help highlight the emergence of a new topic or a change in the direction of research. We present a generally applicable unsupervised approach to this question based on semantic changepoints within a given collection of research papers. We illustrate the approach by a range of examples based on a nascent corpus of literature on COVID-19 as well as subsets of papers from PubMed on the World Health Organization list of neglected tropical diseases. The software is freely available at: https://github.com/pdddinakar/SemanticChangepointDetection.
Collapse
Affiliation(s)
- Bhavish Dinakar
- Department of Chemical and Biomolecular Engineering, University of California, Berkeley, Berkeley, CA 94720, USA,
| | | | | | | |
Collapse
|
21
|
Leaman R, Wei CH, Allot A, Lu Z. Ten tips for a text-mining-ready article: How to improve automated discoverability and interpretability. PLoS Biol 2020; 18:e3000716. [PMID: 32479517 PMCID: PMC7289435 DOI: 10.1371/journal.pbio.3000716] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Revised: 06/11/2020] [Indexed: 12/22/2022] Open
Abstract
Data-driven research in biomedical science requires structured, computable data. Increasingly, these data are created with support from automated text mining. Text-mining tools have rapidly matured: although not perfect, they now frequently provide outstanding results. We describe 10 straightforward writing tips—and a web tool, PubReCheck—guiding authors to help address the most common cases that remain difficult for text-mining tools. We anticipate these guides will help authors’ work be found more readily and used more widely, ultimately increasing the impact of their work and the overall benefit to both authors and readers. PubReCheck is available at http://www.ncbi.nlm.nih.gov/research/pubrecheck. Your published research is already being processed with automated tools, and text mining will become more common; this Community Page article describes how you can help these tools process your work more accurately, including a web tool, PubReCheck.
Collapse
Affiliation(s)
- Robert Leaman
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, Maryland, United States of America
| | - Chih-Hsuan Wei
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, Maryland, United States of America
| | - Alexis Allot
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, Maryland, United States of America
| | - Zhiyong Lu
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, Maryland, United States of America
- * E-mail:
| |
Collapse
|
22
|
Chen Q, Du J, Kim S, Wilbur WJ, Lu Z. Deep learning with sentence embeddings pre-trained on biomedical corpora improves the performance of finding similar sentences in electronic medical records. BMC Med Inform Decis Mak 2020; 20:73. [PMID: 32349758 PMCID: PMC7191680 DOI: 10.1186/s12911-020-1044-0] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.8] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/15/2022] Open
Abstract
Background Capturing sentence semantics plays a vital role in a range of text mining applications. Despite continuous efforts on the development of related datasets and models in the general domain, both datasets and models are limited in biomedical and clinical domains. The BioCreative/OHNLP2018 organizers have made the first attempt to annotate 1068 sentence pairs from clinical notes and have called for a community effort to tackle the Semantic Textual Similarity (BioCreative/OHNLP STS) challenge. Methods We developed models using traditional machine learning and deep learning approaches. For the post challenge, we focused on two models: the Random Forest and the Encoder Network. We applied sentence embeddings pre-trained on PubMed abstracts and MIMIC-III clinical notes and updated the Random Forest and the Encoder Network accordingly. Results The official results demonstrated our best submission was the ensemble of eight models. It achieved a Person correlation coefficient of 0.8328 – the highest performance among 13 submissions from 4 teams. For the post challenge, the performance of both Random Forest and the Encoder Network was improved; in particular, the correlation of the Encoder Network was improved by ~ 13%. During the challenge task, no end-to-end deep learning models had better performance than machine learning models that take manually-crafted features. In contrast, with the sentence embeddings pre-trained on biomedical corpora, the Encoder Network now achieves a correlation of ~ 0.84, which is higher than the original best model. The ensembled model taking the improved versions of the Random Forest and Encoder Network as inputs further increased performance to 0.8528. Conclusions Deep learning models with sentence embeddings pre-trained on biomedical corpora achieve the highest performance on the test set. Through error analysis, we find that end-to-end deep learning models and traditional machine learning models with manually-crafted features complement each other by finding different types of sentences. We suggest a combination of these models can better find similar sentences in practice.
Collapse
Affiliation(s)
- Qingyu Chen
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, USA
| | - Jingcheng Du
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, USA.,School of Biomedical Informatics, UTHealth, Houston, USA
| | - Sun Kim
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, USA
| | - W John Wilbur
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, USA
| | - Zhiyong Lu
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, USA.
| |
Collapse
|
23
|
Chen Q, Lee K, Yan S, Kim S, Wei CH, Lu Z. BioConceptVec: Creating and evaluating literature-based biomedical concept embeddings on a large scale. PLoS Comput Biol 2020; 16:e1007617. [PMID: 32324731 PMCID: PMC7237030 DOI: 10.1371/journal.pcbi.1007617] [Citation(s) in RCA: 27] [Impact Index Per Article: 6.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/06/2019] [Revised: 05/19/2020] [Accepted: 12/19/2019] [Indexed: 12/14/2022] Open
Abstract
A massive number of biological entities, such as genes and mutations, are mentioned in the biomedical literature. The capturing of the semantic relatedness of biological entities is vital to many biological applications, such as protein-protein interaction prediction and literature-based discovery. Concept embeddings—which involve the learning of vector representations of concepts using machine learning models—have been employed to capture the semantics of concepts. To develop concept embeddings, named-entity recognition (NER) tools are first used to identify and normalize concepts from the literature, and then different machine learning models are used to train the embeddings. Despite multiple attempts, existing biomedical concept embeddings generally suffer from suboptimal NER tools, small-scale evaluation, and limited availability. In response, we employed high-performance machine learning-based NER tools for concept recognition and trained our concept embeddings, BioConceptVec, via four different machine learning models on ~30 million PubMed abstracts. BioConceptVec covers over 400,000 biomedical concepts mentioned in the literature and is of the largest among the publicly available biomedical concept embeddings to date. To evaluate the validity and utility of BioConceptVec, we respectively performed two intrinsic evaluations (identifying related concepts based on drug-gene and gene-gene interactions) and two extrinsic evaluations (protein-protein interaction prediction and drug-drug interaction extraction), collectively using over 25 million instances from nine independent datasets (17 million instances from six intrinsic evaluation tasks and 8 million instances from three extrinsic evaluation tasks), which is, by far, the most comprehensive to our best knowledge. The intrinsic evaluation results demonstrate that BioConceptVec consistently has, by a large margin, better performance than existing concept embeddings in identifying similar and related concepts. More importantly, the extrinsic evaluation results demonstrate that using BioConceptVec with advanced deep learning models can significantly improve performance in downstream bioinformatics studies and biomedical text-mining applications. Our BioConceptVec embeddings and benchmarking datasets are publicly available at https://github.com/ncbi-nlp/BioConceptVec. Capturing the semantics of related biological concepts, such as genes and mutations, is of significant importance to many research tasks in computational biology such as protein-protein interaction detection, gene-drug association prediction, and biomedical literature-based discovery. Here, we propose to leverage state-of-the-art text mining tools and machine learning models to learn the semantics via vector representations (aka. embeddings) of over 400,000 biological concepts mentioned in the entire PubMed abstracts. Our learned embeddings, namely BioConceptVec, can capture related concepts based on their surrounding contextual information in the literature, which is beyond exact term match or co-occurrence-based methods. BioConceptVec has been thoroughly evaluated in multiple bioinformatics tasks consisting of over 25 million instances from nine different biological datasets. The evaluation results demonstrate that BioConceptVec has better performance than existing methods in all tasks. Finally, BioConceptVec is made freely available to the research community and general public.
Collapse
Affiliation(s)
- Qingyu Chen
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, Maryland, United States of America
| | - Kyubum Lee
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, Maryland, United States of America
| | - Shankai Yan
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, Maryland, United States of America
| | - Sun Kim
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, Maryland, United States of America
| | - Chih-Hsuan Wei
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, Maryland, United States of America
| | - Zhiyong Lu
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, Maryland, United States of America
| |
Collapse
|