1
Yeganova L, Kim W, Tian S, Comeau DC, Wilbur WJ, Lu Z. LitSense 2.0: AI-powered biomedical information retrieval with sentence and passage level knowledge discovery. Nucleic Acids Res 2025:gkaf417. [PMID: 40377097; DOI: 10.1093/nar/gkaf417] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/26/2025] [Revised: 04/24/2025] [Accepted: 05/02/2025] [Indexed: 05/18/2025] Open
Abstract
LitSense 2.0 (https://www.ncbi.nlm.nih.gov/research/litsense2/) is an advanced biomedical search system enhanced with dense vector semantic retrieval, designed for accessing literature on sentence and paragraph levels. It provides unified access to 38 million PubMed abstracts and 6.6 million full-length articles in the PubMed Central (PMC) Open Access subset, encompassing 1.4 billion sentences and ∼300 million paragraphs, and is updated weekly. Compared to PubMed and PMC, the primary platforms for biomedical information search, LitSense offers cross-platform functionality by searching seamlessly across both PubMed and PMC and returning relevant results at a more granular level. Building on the success of the original LitSense launched in 2018, LitSense 2.0 introduces two major enhancements. The first is the addition of paragraph-level search: users can now choose to search either against sentences or against paragraphs. The second is improved retrieval accuracy via a state-of-the-art biomedical text encoder, ensuring more reliable identification of relevant results across the entire biomedical literature.
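As a rough illustration of the dense vector semantic retrieval mentioned in this abstract, the sketch below ranks text units by cosine similarity to a query vector. It is not the LitSense 2.0 implementation: the 3-dimensional embeddings, the unit identifiers, and the function names `cosine` and `search` are all assumptions made for the example, and a real system would obtain its vectors from a text encoder.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def search(query_vec, index, top_k=2):
    """Rank indexed text units by cosine similarity to the query vector."""
    scored = [(cosine(query_vec, vec), unit_id) for unit_id, vec in index.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [unit_id for _, unit_id in scored[:top_k]]


# Hypothetical toy embeddings standing in for encoder output.
sentence_index = {
    "s1": [0.9, 0.1, 0.0],
    "s2": [0.1, 0.9, 0.0],
    "s3": [0.7, 0.3, 0.1],
}
query = [1.0, 0.0, 0.0]
top_hits = search(query, sentence_index)
print(top_hits)
```

The same ranking function could be pointed at a sentence index or a paragraph index, which is one simple way to picture the choice between the two search granularities.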
Affiliation(s)
- Lana Yeganova
  - Division of Intramural Research (DIR), National Library of Medicine (NLM), National Institutes of Health (NIH), MD 20894 Bethesda, United States
- Won Kim
  - Division of Intramural Research (DIR), National Library of Medicine (NLM), National Institutes of Health (NIH), MD 20894 Bethesda, United States
- Shubo Tian
  - Division of Intramural Research (DIR), National Library of Medicine (NLM), National Institutes of Health (NIH), MD 20894 Bethesda, United States
- Donald C Comeau
  - Division of Intramural Research (DIR), National Library of Medicine (NLM), National Institutes of Health (NIH), MD 20894 Bethesda, United States
- W John Wilbur
  - Division of Intramural Research (DIR), National Library of Medicine (NLM), National Institutes of Health (NIH), MD 20894 Bethesda, United States
- Zhiyong Lu
  - Division of Intramural Research (DIR), National Library of Medicine (NLM), National Institutes of Health (NIH), MD 20894 Bethesda, United States
2
Puy A, Bacon E, Carmona A, Flinders S, Gefen D, Khanjani M, Larsen KR, Lachi A, Linga SN, Lo Piano S, Melsen LA, Murray E, Sheikholeslami R, Sobhani A, Wei N, Saltelli A. Socio-environmental modeling shows physics-like confidence with water modeling surpassing it in numerical claims. iScience 2025; 28:112184. [PMID: 40224017; PMCID: PMC11986976; DOI: 10.1016/j.isci.2025.112184] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/30/2024] [Revised: 02/01/2025] [Accepted: 03/05/2025] [Indexed: 04/15/2025] Open
Abstract
Several modern scientific fields rely on computationally intensive mathematical models to study uncertain, complex socio-environmental phenomena such as the spread of a virus, climate change, or the water cycle. However, the degree of epistemic commitment of these fields is unclear. By using machine learning to extract the knowledge claims of around 755,000 abstracts from 14 scientific fields spanning the human and physical sciences, we show that epidemic, integrated assessment, and water modeling display a degree of linguistic assertiveness akin to physics. Water modeling surpasses even the most accurate physical sciences in substantiating knowledge claims with numbers, which are largely produced without accompanying uncertainty and sensitivity analysis. By exploring the balance between doubt and certainty in academic writing, our study reflects on whether the strong conviction and quantification of fields modeling socio-environmental processes, especially water modeling, are epistemically justified.
Affiliation(s)
- Arnald Puy
  - School of Geography, Earth and Environmental Sciences, University of Birmingham, Birmingham B15 2TT, UK
- Ethan Bacon
  - School of Geography, Earth and Environmental Sciences, University of Birmingham, Birmingham B15 2TT, UK
- Alba Carmona
  - Department of Modern Languages, College of Arts and Law, University of Birmingham, Birmingham B15 2TT, UK
  - School of Languages, Cultures and Societies, Faculty of Arts, Humanities and Cultures, University of Leeds, Leeds LS2 9JT, UK
- Samuel Flinders
  - School of Geography, Earth and Environmental Sciences, University of Birmingham, Birmingham B15 2TT, UK
- David Gefen
  - LeBow College of Business, Drexel University, Philadelphia, PA 19104, USA
- Mohammad Khanjani
  - Department of Civil Engineering, Sharif University of Technology, Azadi Avenue, Tehran 11155-4313, Iran
- Kai R. Larsen
  - Organizational Leadership and Information Analytics, Leeds School of Business, University of Colorado Boulder, Boulder, CO, USA
- Alessio Lachi
  - Saint Camillus International University of Health and Medical Sciences (UniCamillus), Via Sant’Alessandro 8, 00131 Rome, Italy
- Seth N. Linga
  - School of Geography, Earth and Environmental Sciences, University of Birmingham, Birmingham B15 2TT, UK
- Samuele Lo Piano
  - University of Reading, School of the Built Environment, JJ Thompson Building, Whiteknights Campus, Reading RG6 6AF, UK
- Lieke A. Melsen
  - Hydrology and Environmental Hydraulics Group, Wageningen University, P.O. Box 9101, 6700 HB Wageningen, the Netherlands
- Emily Murray
  - School of Geography, Earth and Environmental Sciences, University of Birmingham, Birmingham B15 2TT, UK
- Razi Sheikholeslami
  - Department of Civil Engineering, Sharif University of Technology, Azadi Avenue, Tehran 11155-4313, Iran
- Ariana Sobhani
  - School of Biosciences, University of Birmingham, Birmingham B15 2TT, UK
- Nanxin Wei
  - School of Geography, Earth and Environmental Sciences, University of Birmingham, Birmingham B15 2TT, UK
- Andrea Saltelli
  - Barcelona School of Management, Pompeu Fabra University, Carrer de Balmes 132, 08008 Barcelona, Spain
  - Centre for the Study of the Sciences and the Humanities, University of Bergen, Parkveien 9, PB 7805, 5020 Bergen, Norway
3
Hair K, Arroyo-Araujo M, Vojvodic S, Economou M, Wong C, Tinsdeall F, Smith S, Rackoll T, Sena ES, McCann SK. Connecting the dots in neuroscience research: The future of evidence synthesis. Exp Neurol 2025; 384:115047. [PMID: 39510296; DOI: 10.1016/j.expneurol.2024.115047] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/24/2024] [Revised: 10/31/2024] [Accepted: 11/03/2024] [Indexed: 11/15/2024]
Abstract
Making progress in neuroscience research involves learning from existing data. In this perspective piece, we explore the potential of a data-driven evidence ecosystem to connect all primary data streams, and synthesis efforts to inform evidence-based research and translational success from bench to bedside. To enable this transformation, we set out how we can produce evidence designed with evidence curation in mind. All data should be findable, understandable, and easily synthesisable, using a combination of human and machine effort. This will require shifts in research culture and tailored infrastructure to support rapid dissemination, data sharing, and transparency. We also discuss improvements in the way we can synthesise evidence to better inform primary research, including the potential of emerging technologies, big-data approaches, and breaking down research silos. Through a case study in stroke research, one of the most well-established areas for synthesis efforts, we demonstrate the progress in implementing elements of this ecosystem, with an emphasis on the need for coordinated efforts between laboratory researchers and synthesists.
Affiliation(s)
- Kaitlyn Hair
  - Centre for Clinical Brain Sciences, University of Edinburgh, Chancellor's Building, 49 Little France Crescent, Edinburgh EH16 4SB, United Kingdom.
- María Arroyo-Araujo
  - Berlin Institute of Health at Charité-Universitätsmedizin Berlin, QUEST Center, Charitéplatz 1, 10117 Berlin, Germany.
- Sofija Vojvodic
  - Berlin Institute of Health at Charité-Universitätsmedizin Berlin, QUEST Center, Charitéplatz 1, 10117 Berlin, Germany.
- Maria Economou
  - Berlin Institute of Health at Charité-Universitätsmedizin Berlin, QUEST Center, Charitéplatz 1, 10117 Berlin, Germany.
- Charis Wong
  - Centre for Clinical Brain Sciences, University of Edinburgh, Chancellor's Building, 49 Little France Crescent, Edinburgh EH16 4SB, United Kingdom; Euan MacDonald Centre for Motor Neuron Disease Research, Chancellor's Building, 49 Little France Crescent, Edinburgh EH16 4SB, United Kingdom; Anne Rowling Regenerative Neurology Clinic, 49 Little France Crescent, Edinburgh EH16 4SB, United Kingdom; MRC Clinical Trials Unit, 90 High Holborn, London WC1V 6LJ, United Kingdom.
- Francesca Tinsdeall
  - Centre for Clinical Brain Sciences, University of Edinburgh, Chancellor's Building, 49 Little France Crescent, Edinburgh EH16 4SB, United Kingdom.
- Sean Smith
  - Centre for Clinical Brain Sciences, University of Edinburgh, Chancellor's Building, 49 Little France Crescent, Edinburgh EH16 4SB, United Kingdom.
- Torsten Rackoll
  - Berlin Institute of Health at Charité-Universitätsmedizin Berlin, QUEST Center, Charitéplatz 1, 10117 Berlin, Germany.
- Emily S Sena
  - Centre for Clinical Brain Sciences, University of Edinburgh, Chancellor's Building, 49 Little France Crescent, Edinburgh EH16 4SB, United Kingdom.
- Sarah K McCann
  - Berlin Institute of Health at Charité-Universitätsmedizin Berlin, QUEST Center, Charitéplatz 1, 10117 Berlin, Germany.
4
Lai PT, Coudert E, Aimo L, Axelsen K, Breuza L, de Castro E, Feuermann M, Morgat A, Pourcel L, Pedruzzi I, Poux S, Redaschi N, Rivoire C, Sveshnikova A, Wei CH, Leaman R, Luo L, Lu Z, Bridge A. EnzChemRED, a rich enzyme chemistry relation extraction dataset. Sci Data 2024; 11:982. [PMID: 39251610; PMCID: PMC11384730; DOI: 10.1038/s41597-024-03835-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/01/2024] [Accepted: 08/23/2024] [Indexed: 09/11/2024] Open
Abstract
Expert curation is essential to capture knowledge of enzyme functions from the scientific literature in FAIR open knowledgebases but cannot keep pace with the rate of new discoveries and new publications. In this work we present EnzChemRED, for Enzyme Chemistry Relation Extraction Dataset, a new training and benchmarking dataset to support the development of Natural Language Processing (NLP) methods such as (large) language models that can assist enzyme curation. EnzChemRED consists of 1,210 expert curated PubMed abstracts where enzymes and the chemical reactions they catalyze are annotated using identifiers from the protein knowledgebase UniProtKB and the chemical ontology ChEBI. We show that fine-tuning language models with EnzChemRED significantly boosts their ability to identify proteins and chemicals in text (86.30% F1 score) and to extract the chemical conversions (86.66% F1 score) and the enzymes that catalyze those conversions (83.79% F1 score). We apply our methods to abstracts at PubMed scale to create a draft map of enzyme functions in literature to guide curation efforts in UniProtKB and the reaction knowledgebase Rhea.
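The F1 scores quoted in this abstract combine precision and recall into a single number. The sketch below shows the standard set-based calculation; the triples, document labels, and the function name `precision_recall_f1` are hypothetical illustrations, not data or code from EnzChemRED, and the paper's own evaluation protocol may differ in how matches are counted.

```python
def precision_recall_f1(predicted, gold):
    """Set-based precision, recall, and F1 for extracted items."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical (document, substrate, product) triples, for illustration only.
gold = {
    ("doc1", "A", "B"),
    ("doc1", "B", "C"),
    ("doc2", "D", "E"),
    ("doc2", "E", "F"),
    ("doc3", "G", "H"),
}
predicted = {
    ("doc1", "A", "B"),
    ("doc1", "B", "C"),
    ("doc2", "D", "E"),
    ("doc3", "X", "Y"),  # a spurious prediction
}
precision, recall, f1 = precision_recall_f1(predicted, gold)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

With three of four predictions correct and three of five gold items found, precision is 0.75, recall is 0.60, and F1 is their harmonic mean, about 0.67.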
Grants
- U24 HG007822 NHGRI NIH HHS
- U41 HG007822 NHGRI NIH HHS
- NIH Intramural Research Program, National Library of Medicine
- Expert curation and evaluation of EnzChemRED at Swiss-Prot were supported by the Swiss Federal Government through the State Secretariat for Education, Research and Innovation (SERI) and the National Human Genome Research Institute (NHGRI), Office of Director [OD/DPCPSI/ODSS], National Institute of Allergy and Infectious Diseases (NIAID), National Institute on Aging (NIA), National Institute of General Medical Sciences (NIGMS), National Institute of Diabetes and Digestive and Kidney Diseases (NIDDK), National Eye Institute (NEI), National Cancer Institute (NCI), National Heart, Lung, and Blood Institute (NHLBI) of the National Institutes of Health [U24HG007822], and by the European Union's Horizon Europe Framework Programme (grant number 101080997), supported in Switzerland through the State Secretariat for Education, Research and Innovation (SERI).
- Fundamental Research Funds for the Central Universities [DUT23RC(3)014 to L.L.]
Affiliation(s)
- Po-Ting Lai
  - National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, MD, 20894, USA
- Elisabeth Coudert
  - Swiss-Prot group, SIB Swiss Institute of Bioinformatics, Centre Medical Universitaire, CH-1211, Geneva, 4, Switzerland
- Lucila Aimo
  - Swiss-Prot group, SIB Swiss Institute of Bioinformatics, Centre Medical Universitaire, CH-1211, Geneva, 4, Switzerland
- Kristian Axelsen
  - Swiss-Prot group, SIB Swiss Institute of Bioinformatics, Centre Medical Universitaire, CH-1211, Geneva, 4, Switzerland
- Lionel Breuza
  - Swiss-Prot group, SIB Swiss Institute of Bioinformatics, Centre Medical Universitaire, CH-1211, Geneva, 4, Switzerland
- Edouard de Castro
  - Swiss-Prot group, SIB Swiss Institute of Bioinformatics, Centre Medical Universitaire, CH-1211, Geneva, 4, Switzerland
- Marc Feuermann
  - Swiss-Prot group, SIB Swiss Institute of Bioinformatics, Centre Medical Universitaire, CH-1211, Geneva, 4, Switzerland
- Anne Morgat
  - Swiss-Prot group, SIB Swiss Institute of Bioinformatics, Centre Medical Universitaire, CH-1211, Geneva, 4, Switzerland
- Lucille Pourcel
  - Swiss-Prot group, SIB Swiss Institute of Bioinformatics, Centre Medical Universitaire, CH-1211, Geneva, 4, Switzerland
- Ivo Pedruzzi
  - Swiss-Prot group, SIB Swiss Institute of Bioinformatics, Centre Medical Universitaire, CH-1211, Geneva, 4, Switzerland
- Sylvain Poux
  - Swiss-Prot group, SIB Swiss Institute of Bioinformatics, Centre Medical Universitaire, CH-1211, Geneva, 4, Switzerland
- Nicole Redaschi
  - Swiss-Prot group, SIB Swiss Institute of Bioinformatics, Centre Medical Universitaire, CH-1211, Geneva, 4, Switzerland
- Catherine Rivoire
  - Swiss-Prot group, SIB Swiss Institute of Bioinformatics, Centre Medical Universitaire, CH-1211, Geneva, 4, Switzerland
- Anastasia Sveshnikova
  - Swiss-Prot group, SIB Swiss Institute of Bioinformatics, Centre Medical Universitaire, CH-1211, Geneva, 4, Switzerland
- Chih-Hsuan Wei
  - National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, MD, 20894, USA
- Robert Leaman
  - National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, MD, 20894, USA
- Ling Luo
  - School of Computer Science and Technology, Dalian University of Technology, 116024, Dalian, China
- Zhiyong Lu
  - National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, MD, 20894, USA.
- Alan Bridge
  - Swiss-Prot group, SIB Swiss Institute of Bioinformatics, Centre Medical Universitaire, CH-1211, Geneva, 4, Switzerland.
5
Nastou K, Koutrouli M, Pyysalo S, Jensen LJ. Improving dictionary-based named entity recognition with deep learning. Bioinformatics 2024; 40:ii45-ii52. [PMID: 39230709; PMCID: PMC11373323; DOI: 10.1093/bioinformatics/btae402] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Indexed: 09/05/2024] Open
Abstract
MOTIVATION: Dictionary-based named entity recognition (NER) allows terms to be detected in a corpus and normalized to biomedical databases and ontologies. However, adaptation to different entity types requires new high-quality dictionaries and associated lists of blocked names for each type. The latter are so far created by identifying cases that cause many false positives through manual inspection of individual names, a process that scales poorly. RESULTS: In this work, we aim to improve block lists by automatically identifying names to block, based on the context in which they appear. By comparing results of three well-established biomedical NER methods, we generated a dataset of over 12.5 million text spans where the methods agree on the boundaries and type of entity tagged. These were used to generate positive and negative examples of contexts for four entity types (genes, diseases, species, and chemicals), which were used to train a Transformer-based model (BioBERT) to perform entity type classification. Application of the best model (F1-score = 96.7%) allowed us to generate a list of problematic names that should be blocked. Introducing this into our system doubled the size of the previous list of corpus-wide blocked names. In addition, we generated a document-specific list that allows ambiguous names to be blocked in specific documents. These changes boosted text mining precision by ∼5.5% on average, and over 8.5% for chemical and 7.5% for gene names, positively affecting several biological databases utilizing this NER system, like the STRING database, with only a minor drop in recall (0.6%). AVAILABILITY AND IMPLEMENTATION: All resources are available through Zenodo https://doi.org/10.5281/zenodo.11243139 and GitHub https://doi.org/10.5281/zenodo.10289360.
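To make the role of block lists concrete, the sketch below shows a minimal dictionary-based tagger that skips names found on a corpus-wide or a document-specific block list. It only illustrates how such lists are applied at tagging time; it does not reproduce the authors' tagger or the BioBERT classifier used to build the lists, and the dictionary entries, identifiers, and the function name `tag_names` are hypothetical.

```python
import re


def tag_names(text, dictionary, corpus_blocked=frozenset(), doc_blocked=frozenset()):
    """Return (start, end, matched_text, entity_id) for dictionary names in text.

    Names on either block list are skipped entirely (case-insensitive).
    """
    blocked = {n.lower() for n in corpus_blocked} | {n.lower() for n in doc_blocked}
    matches = []
    for name, entity_id in dictionary.items():
        if name.lower() in blocked:
            continue
        pattern = r"\b" + re.escape(name) + r"\b"
        for m in re.finditer(pattern, text, flags=re.IGNORECASE):
            matches.append((m.start(), m.end(), m.group(0), entity_id))
    return sorted(matches)


# Hypothetical dictionary: "CAT" is an ambiguous gene symbol in ordinary prose.
dictionary = {"TP53": "GENE:1", "CAT": "GENE:2"}
text = "The cat study measured TP53 expression."

unblocked = tag_names(text, dictionary)
blocked = tag_names(text, dictionary, corpus_blocked={"CAT"})
print([m[2] for m in unblocked])
print([m[2] for m in blocked])
```

Passing the ambiguous name through `doc_blocked` instead would suppress it for one document only, which mirrors the document-specific list the abstract describes.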
Affiliation(s)
- Katerina Nastou
  - Novo Nordisk Foundation Center for Protein Research, Faculty of Health and Medical Sciences, University of Copenhagen, Blegdamsvej 3, Copenhagen, 2200, Denmark
- Mikaela Koutrouli
  - Novo Nordisk Foundation Center for Protein Research, Faculty of Health and Medical Sciences, University of Copenhagen, Blegdamsvej 3, Copenhagen, 2200, Denmark
- Sampo Pyysalo
  - TurkuNLP Group, Department of Computing, University of Turku, Turku, 20014, Finland
- Lars Juhl Jensen
  - Novo Nordisk Foundation Center for Protein Research, Faculty of Health and Medical Sciences, University of Copenhagen, Blegdamsvej 3, Copenhagen, 2200, Denmark
6
Wei CH, Allot A, Lai PT, Leaman R, Tian S, Luo L, Jin Q, Wang Z, Chen Q, Lu Z. PubTator 3.0: an AI-powered literature resource for unlocking biomedical knowledge. Nucleic Acids Res 2024; 52:W540-W546. [PMID: 38572754; PMCID: PMC11223843; DOI: 10.1093/nar/gkae235] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/18/2024] [Revised: 03/02/2024] [Accepted: 03/21/2024] [Indexed: 04/05/2024] Open
Abstract
PubTator 3.0 (https://www.ncbi.nlm.nih.gov/research/pubtator3/) is a biomedical literature resource using state-of-the-art AI techniques to offer semantic and relation searches for key concepts like proteins, genetic variants, diseases and chemicals. It currently provides over one billion entity and relation annotations across approximately 36 million PubMed abstracts and 6 million full-text articles from the PMC open access subset, updated weekly. PubTator 3.0's online interface and API utilize these precomputed entity relations and synonyms to provide advanced search capabilities and enable large-scale analyses, streamlining many complex information needs. We showcase the retrieval quality of PubTator 3.0 using a series of entity pair queries, demonstrating that PubTator 3.0 retrieves a greater number of articles than either PubMed or Google Scholar, with higher precision in the top 20 results. We further show that integrating ChatGPT (GPT-4) with PubTator APIs dramatically improves the factuality and verifiability of its responses. In summary, PubTator 3.0 offers a comprehensive set of features and tools that allow researchers to navigate the ever-expanding wealth of biomedical literature, expediting research and unlocking valuable insights for scientific discovery.
Affiliation(s)
- Chih-Hsuan Wei
  - National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, MD 20894, USA
- Alexis Allot
  - National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, MD 20894, USA
- Po-Ting Lai
  - National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, MD 20894, USA
- Robert Leaman
  - National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, MD 20894, USA
- Shubo Tian
  - National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, MD 20894, USA
- Ling Luo
  - National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, MD 20894, USA
- Qiao Jin
  - National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, MD 20894, USA
- Zhizheng Wang
  - National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, MD 20894, USA
- Qingyu Chen
  - National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, MD 20894, USA
- Zhiyong Lu
  - National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, MD 20894, USA
7
Zhao T, He ZA, Shao J, Regmi A, Shi L, Cai Y. Decoding hotline's information with text-mining: A protocol for improving tobacco control in Shanghai. Tob Induc Dis 2024; 22:TID-22-107. [PMID: 38887599; PMCID: PMC11181012; DOI: 10.18332/tid/187864] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/15/2024] [Revised: 04/13/2024] [Accepted: 04/23/2024] [Indexed: 06/20/2024] Open
Abstract
Tobacco consumption in China remains the primary cause of preventable mortality, with Shanghai being particularly affected by issues related to secondhand smoke exposure. This study explores the role of the public service hotline 12345, a grassroots initiative in Shanghai, in capturing public sentiment and assessing the effectiveness of anti-smoking regulations. Our research aims to gain an accurate, in-depth understanding of how smoking control policies are implemented and received, by identifying high-frequency points and prominent issues in smoking control work based on the smoking control work order data received by the health hotline 12320. The results of this study will assist government enforcement agencies in improving smoking monitoring and clarify the direction for improving smoking control measures. Text-mining techniques were employed to analyze a dataset comprising 78,011 call sheets, all related to tobacco control and collected from the hotline between 1 January 2015 and 31 December 2019. This methodological approach aims to uncover prevalent themes and sentiments in the public discourse on smoking and its regulation, as reflected in the hotline interactions. Our study identified hotspots and the issues of greatest concern to citizens. Additionally, it provided recommendations to enforcement agencies to enhance their capabilities, optimize the allocation of human resources for smoking control monitoring, reduce enforcement costs, and support anti-smoking campaigns, thereby contributing to more effective tobacco control policies in the region.
Affiliation(s)
- Tong Zhao
  - School of Public Health, Shanghai Jiao Tong University School of Medicine, Shanghai, China
- Zi-an He
  - School of Public Health, Shanghai Jiao Tong University School of Medicine, Shanghai, China
  - Jiading District Center for Disease Control and Prevention, Shanghai, China
- Jiaqi Shao
  - Zhongshan Hospital, Fudan University, Shanghai, China
- Aksara Regmi
  - School of Public Health, Shanghai Jiao Tong University School of Medicine, Shanghai, China
- Lili Shi
  - Xinhua Hospital, Shanghai Jiao Tong University School of Medicine, Shanghai, China
- Yuyang Cai
  - School of Public Health, Shanghai Jiao Tong University School of Medicine, Shanghai, China
8
Corradi M, Luechtefeld T, de Haan AM, Pieters R, Freedman JH, Vanhaecke T, Vinken M, Teunis M. The application of natural language processing for the extraction of mechanistic information in toxicology. Front Toxicol 2024; 6:1393662. [PMID: 38800806; PMCID: PMC11116573; DOI: 10.3389/ftox.2024.1393662] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/29/2024] [Accepted: 04/16/2024] [Indexed: 05/29/2024] Open
Abstract
To study the ways in which compounds can induce adverse effects, toxicologists have been constructing Adverse Outcome Pathways (AOPs). An AOP can be considered as a pragmatic tool to capture and visualize mechanisms underlying different types of toxicity inflicted by any kind of stressor, and describes the interactions between key entities that lead to the adverse outcome on multiple biological levels of organization. The construction or optimization of an AOP is a labor-intensive process, which currently depends on the manual search, collection, reviewing and synthesis of available scientific literature. This process could, however, be greatly facilitated by using Natural Language Processing (NLP) to extract information contained in scientific literature in a systematic, objective, and rapid manner that would lead to greater accuracy and reproducibility. This would let researchers invest their expertise in the substantive assessment of AOPs, replacing the time spent on evidence gathering with a critical review of the data extracted by NLP. As case examples, we selected two frequent adversities observed in the liver, namely cholestasis and steatosis, which denote the accumulation of bile and lipid, respectively. We used deep learning language models to recognize entities of interest in text and establish causal relationships between them. We demonstrate how an NLP pipeline combining Named Entity Recognition and a simple rules-based relationship extraction model helps not only to screen compounds related to liver adversities in the literature, but also to extract mechanistic information on how such adversities develop, from the molecular to the organismal level. Finally, we provide some perspectives opened by the recent progress in Large Language Models and how these could be used in the future.
We propose that this work makes two main contributions: 1) a proof-of-concept that NLP can support the extraction of information from text for modern toxicology and 2) a template open-source model for recognition of toxicological entities and extraction of their relationships. All resources are openly accessible via GitHub (https://github.com/ontox-project/en-tox).
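A pipeline that pairs Named Entity Recognition with a simple rules-based relationship step, as this abstract describes, can be pictured with the sketch below. It assumes the entities have already been recognized and labelled; the labels `COMPOUND` and `PHENOTYPE`, the trigger phrases, the example sentence, and the function name `extract_relations` are all hypothetical and are not taken from the en-tox repository.

```python
# Hypothetical trigger phrases that signal a causal link between two entities.
TRIGGERS = ("induces", "causes", "leads to")


def extract_relations(sentence, entities):
    """Return (source, trigger, target) triples from one sentence.

    `entities` is a list of (text, label) pairs, assumed to come from an
    upstream NER step. A relation is emitted when a COMPOUND precedes a
    PHENOTYPE and a trigger phrase occurs between them.
    """
    lowered = sentence.lower()
    relations = []
    for source, source_label in entities:
        if source_label != "COMPOUND":
            continue
        for target, target_label in entities:
            if target_label != "PHENOTYPE":
                continue
            start = lowered.find(source.lower())
            end = lowered.find(target.lower())
            if start == -1 or end == -1 or start >= end:
                continue
            between = lowered[start + len(source):end]
            for trigger in TRIGGERS:
                if trigger in between:
                    relations.append((source, trigger, target))
                    break
    return relations


sentence = "Compound X induces cholestasis in rats."
entities = [("Compound X", "COMPOUND"), ("cholestasis", "PHENOTYPE")]
relations = extract_relations(sentence, entities)
print(relations)
```

A rule this simple ignores negation and word order beyond "compound before outcome", so it is best read as a picture of the idea rather than a usable extractor.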
Affiliation(s)
- Marie Corradi
  - Innovative Testing in Life Sciences and Chemistry, Utrecht University of Applied Sciences, Utrecht, Netherlands
- Thomas Luechtefeld
  - ToxTrack, Bethesda, MD, United States
  - Environmental Health and Engineering, Johns Hopkins Bloomberg School of Public Health, Baltimore, MD, United States
- Alyanne M. de Haan
  - Innovative Testing in Life Sciences and Chemistry, Utrecht University of Applied Sciences, Utrecht, Netherlands
- Raymond Pieters
  - Innovative Testing in Life Sciences and Chemistry, Utrecht University of Applied Sciences, Utrecht, Netherlands
  - Institute for Risk Assessment Sciences, Utrecht University, Utrecht, Netherlands
- Jonathan H. Freedman
  - Gillings School of Global Public Health, University of North Carolina at Chapel Hill, Chapel Hill, NC, United States
- Tamara Vanhaecke
  - Department of Pharmaceutical and Pharmacological Sciences, Vrije Universiteit Brussel-Belgium, Brussels, Belgium
- Mathieu Vinken
  - Department of Pharmaceutical and Pharmacological Sciences, Vrije Universiteit Brussel-Belgium, Brussels, Belgium
- Marc Teunis
  - Innovative Testing in Life Sciences and Chemistry, Utrecht University of Applied Sciences, Utrecht, Netherlands
9
Méndez-Cruz CF, Rodríguez-Herrera J, Varela-Vega A, Mateo-Estrada V, Castillo-Ramírez S. Unsupervised learning and natural language processing highlight research trends in a superbug. Front Artif Intell 2024; 7:1336071. [PMID: 38576460; PMCID: PMC10991725; DOI: 10.3389/frai.2024.1336071] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/13/2023] [Accepted: 03/11/2024] [Indexed: 04/06/2024] Open
Abstract
Introduction: Antibiotic-resistant Acinetobacter baumannii is a very important nosocomial pathogen worldwide. Thousands of studies have been conducted about this pathogen. However, there has not been any attempt to use all this information to highlight the research trends concerning this pathogen. Methods: Here we use unsupervised learning and natural language processing (NLP), two areas of Artificial Intelligence, to analyse the most extensive database of articles created (5,500+ articles, from 851 different journals, published over 3 decades). Results: K-means clustering found 113 theme clusters and these were defined with representative terms automatically obtained with topic modelling, summarising different research areas. The biggest clusters, all with over 100 articles, are biased toward multidrug resistance, carbapenem resistance, clinical treatment, and nosocomial infections. However, we also found that some research areas, such as ecology and non-human infections, have received very little attention. This approach allowed us to study research themes over time unveiling those of recent interest, such as the use of Cefiderocol (a recently approved antibiotic) against A. baumannii. Discussion: In a broader context, our results show that unsupervised learning, NLP and topic modelling can be used to describe and analyse the research themes for important infectious diseases. This strategy should be very useful to analyse other ESKAPE pathogens or any other pathogens relevant to Public Health.
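For readers unfamiliar with K-means, the sketch below runs the basic assign-then-update loop (Lloyd's algorithm) on a handful of toy vectors. It is an illustration only: the 2-dimensional points stand in for whatever document representation the authors used, the starting centroids are picked by hand rather than by a proper initialization scheme, and the function name `kmeans` is an assumption of this example.

```python
def kmeans(points, centroids, iterations=10):
    """Plain Lloyd's algorithm: assign each point to its nearest centroid,
    then move each centroid to the mean of its assigned points."""
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for point in points:
            distances = [
                sum((a - b) ** 2 for a, b in zip(point, centroid))
                for centroid in centroids
            ]
            clusters[distances.index(min(distances))].append(point)
        # Keep the old centroid if a cluster ends up empty.
        centroids = [
            [sum(dim) / len(cluster) for dim in zip(*cluster)] if cluster else centroid
            for cluster, centroid in zip(clusters, centroids)
        ]
    return centroids, clusters


# Hypothetical term-weight vectors for four articles (two themes).
points = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
centroids, clusters = kmeans(points, centroids=[points[0], points[2]])
print(clusters)
```

In a study like the one described, each resulting cluster would then be labelled with representative terms, for example from a topic model, to summarise its research theme.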
Collapse
Affiliation(s)
- Carlos-Francisco Méndez-Cruz
- Programa de Genómica Computacional, Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, Cuernavaca, Mexico
| | - Joel Rodríguez-Herrera
- Programa de Genómica Computacional, Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, Cuernavaca, Mexico
| | - Alfredo Varela-Vega
- Programa de Genómica Computacional, Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, Cuernavaca, Mexico
| | - Valeria Mateo-Estrada
- Programa de Genómica Evolutiva, Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, Cuernavaca, Mexico
| | - Santiago Castillo-Ramírez
- Programa de Genómica Evolutiva, Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, Cuernavaca, Mexico
| |
|
10
|
Yang X, Saha S, Venkatesan A, Tirunagari S, Vartak V, McEntyre J. Europe PMC annotated full-text corpus for gene/proteins, diseases and organisms. Sci Data 2023; 10:722. [PMID: 37857688 PMCID: PMC10587067 DOI: 10.1038/s41597-023-02617-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/26/2023] [Accepted: 10/03/2023] [Indexed: 10/21/2023] Open
Abstract
Named entity recognition (NER) is a widely used text-mining and natural language processing (NLP) subtask. In recent years, deep learning methods have superseded traditional dictionary- and rule-based NER approaches. A high-quality dataset is essential to fully leverage recent deep learning advancements. While several gold-standard corpora for biomedical entities in abstracts exist, only a few are based on full-text research articles. The Europe PMC literature database routinely annotates Gene/Proteins, Diseases, and Organisms entities. To transition this pipeline from a dictionary-based to a machine learning-based approach, we have developed a human-annotated full-text corpus for these entities, comprising 300 full-text open-access research articles. Over 72,000 mentions of biomedical concepts have been identified within approximately 114,000 sentences. This article describes the corpus and details how to access and reuse this open community resource.
Affiliation(s)
- Xiao Yang
- Literature Services, EMBL-EBI, Wellcome Trust Genome Campus, Cambridge, UK
| | - Shyamasree Saha
- Literature Services, EMBL-EBI, Wellcome Trust Genome Campus, Cambridge, UK
- Open Targets, Wellcome Genome Campus, Hinxton, Cambridge, CB10 1SD, UK
| | - Aravind Venkatesan
- Literature Services, EMBL-EBI, Wellcome Trust Genome Campus, Cambridge, UK
| | - Santosh Tirunagari
- Literature Services, EMBL-EBI, Wellcome Trust Genome Campus, Cambridge, UK
- Open Targets, Wellcome Genome Campus, Hinxton, Cambridge, CB10 1SD, UK
| | - Vid Vartak
- Literature Services, EMBL-EBI, Wellcome Trust Genome Campus, Cambridge, UK
| | - Johanna McEntyre
- Literature Services, EMBL-EBI, Wellcome Trust Genome Campus, Cambridge, UK
| |
|
11
|
Leal Rodríguez C, Haue AD, Mazzoni G, Eriksson R, Hernansanz Biel J, Cantwell L, Westergaard D, Belling KG, Brunak S. Drug dosage modifications in 24 million in-patient prescriptions covering eight years: A Danish population-wide study of polypharmacy. PLOS Digit Health 2023; 2:e0000336. [PMID: 37676853 PMCID: PMC10484442 DOI: 10.1371/journal.pdig.0000336] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/14/2023] [Accepted: 07/20/2023] [Indexed: 09/09/2023]
Abstract
Polypharmacy has generally been assessed by raw counts of different drugs administered concomitantly to the same patients, not with respect to the likelihood of dosage adjustments. To address this aspect of polypharmacy, the objective of the present study was to identify co-medications associated with more frequent dosage adjustments. The data foundation was electronic health records from 3.2 million inpatient admissions at Danish hospitals (2008-2016). The likelihood of dosage adjustments when two drugs were administered concomitantly was computed using Bayesian logistic regressions. We identified 3,993 co-medication pairs that associate significantly with dosage changes when administered together. Of these pairs, 2,412 (60%) were associated with readmission, mortality or longer stays, while 308 (8%) were associated with reduced kidney function. Pairs not previously classified as drug-drug interactions had higher odds ratios of dosage modifications than pairs with an established interaction. Drug pairs that did not correspond to known drug-drug interactions but were still significantly associated with dosage changes were prescribed to fewer patients and mentioned together more rarely in the literature. We hypothesize that some of these pairs could be associated with yet-to-be-discovered interactions, as they may be harder to identify in smaller-scale studies.
Affiliation(s)
- Cristina Leal Rodríguez
- Novo Nordisk Foundation Center for Protein Research, Faculty of Health and Medical Sciences, University of Copenhagen, DK-2200 Copenhagen, Denmark
| | - Amalie Dahl Haue
- Novo Nordisk Foundation Center for Protein Research, Faculty of Health and Medical Sciences, University of Copenhagen, DK-2200 Copenhagen, Denmark
- The Heart Center, Rigshospitalet, Copenhagen University Hospital, DK-2100 Copenhagen, Denmark
| | - Gianluca Mazzoni
- Novo Nordisk Foundation Center for Protein Research, Faculty of Health and Medical Sciences, University of Copenhagen, DK-2200 Copenhagen, Denmark
| | - Robert Eriksson
- Novo Nordisk Foundation Center for Protein Research, Faculty of Health and Medical Sciences, University of Copenhagen, DK-2200 Copenhagen, Denmark
- Department of Pulmonary and Infectious Diseases, Nordsjællands Hospital, DK-3400 Hillerød, Denmark
| | - Jorge Hernansanz Biel
- Novo Nordisk Foundation Center for Protein Research, Faculty of Health and Medical Sciences, University of Copenhagen, DK-2200 Copenhagen, Denmark
| | - Lisa Cantwell
- Novo Nordisk Foundation Center for Protein Research, Faculty of Health and Medical Sciences, University of Copenhagen, DK-2200 Copenhagen, Denmark
| | - David Westergaard
- Novo Nordisk Foundation Center for Protein Research, Faculty of Health and Medical Sciences, University of Copenhagen, DK-2200 Copenhagen, Denmark
- Department of Obstetrics and Gynaecology, Copenhagen University Hospital Hvidovre, DK-2650 Hvidovre, Denmark
| | - Kirstine G. Belling
- Novo Nordisk Foundation Center for Protein Research, Faculty of Health and Medical Sciences, University of Copenhagen, DK-2200 Copenhagen, Denmark
| | - Søren Brunak
- Novo Nordisk Foundation Center for Protein Research, Faculty of Health and Medical Sciences, University of Copenhagen, DK-2200 Copenhagen, Denmark
| |
|
12
|
Lundgaard AT, Burdet F, Siggaard T, Westergaard D, Vagiaki D, Cantwell L, Röder T, Vistisen D, Sparsø T, Giordano GN, Ibberson M, Banasik K, Brunak S. BALDR: A Web-based platform for informed comparison and prioritization of biomarker candidates for type 2 diabetes mellitus. PLoS Comput Biol 2023; 19:e1011403. [PMID: 37590326 PMCID: PMC10464978 DOI: 10.1371/journal.pcbi.1011403] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/21/2022] [Revised: 08/29/2023] [Accepted: 07/31/2023] [Indexed: 08/19/2023] Open
Abstract
Novel biomarkers are key to addressing the ongoing pandemic of type 2 diabetes mellitus. While new technologies have improved the potential of identifying such biomarkers, at the same time there is an increasing need for informed prioritization to ensure efficient downstream verification. We have built BALDR, an automated pipeline for biomarker comparison and prioritization in the context of diabetes. BALDR includes protein, gene, and disease data from major public repositories, text-mining data, and human and mouse experimental data from the IMI2 RHAPSODY consortium. These data are provided as easy-to-read figures and tables enabling direct comparison of up to 20 biomarker candidates for diabetes through the public website https://baldr.cpr.ku.dk.
Affiliation(s)
- Agnete T. Lundgaard
- Novo Nordisk Foundation Center for Protein Research, University of Copenhagen, Blegdamsvej 3B, Copenhagen, Denmark
| | - Frédéric Burdet
- Vital-IT, Swiss Institute of Bioinformatics (SIB), Lausanne, Switzerland
| | - Troels Siggaard
- Novo Nordisk Foundation Center for Protein Research, University of Copenhagen, Blegdamsvej 3B, Copenhagen, Denmark
| | - David Westergaard
- Novo Nordisk Foundation Center for Protein Research, University of Copenhagen, Blegdamsvej 3B, Copenhagen, Denmark
| | - Danai Vagiaki
- Novo Nordisk Foundation Center for Protein Research, University of Copenhagen, Blegdamsvej 3B, Copenhagen, Denmark
| | - Lisa Cantwell
- Novo Nordisk Foundation Center for Protein Research, University of Copenhagen, Blegdamsvej 3B, Copenhagen, Denmark
| | - Timo Röder
- Novo Nordisk Foundation Center for Protein Research, University of Copenhagen, Blegdamsvej 3B, Copenhagen, Denmark
| | - Dorte Vistisen
- Clinical Epidemiological Research, Steno Diabetes Center Copenhagen, Herlev, Denmark
- Department of Public Health, University of Copenhagen, Copenhagen, Denmark
| | - Thomas Sparsø
- Bioinformatics and Data Mining, Global Research Technologies, Novo Nordisk A/S, Måløv, Denmark
| | - Giuseppe N. Giordano
- Genetic and Molecular Epidemiology Unit, Lund University Diabetes Centre, Department of Clinical Sciences, Clinical Research Centre, Lund University, Skåne University Hospital, Malmö, Sweden
| | - Mark Ibberson
- Vital-IT, Swiss Institute of Bioinformatics (SIB), Lausanne, Switzerland
| | - Karina Banasik
- Novo Nordisk Foundation Center for Protein Research, University of Copenhagen, Blegdamsvej 3B, Copenhagen, Denmark
| | - Søren Brunak
- Novo Nordisk Foundation Center for Protein Research, University of Copenhagen, Blegdamsvej 3B, Copenhagen, Denmark
| |
|
13
|
Bachman JA, Gyori BM, Sorger PK. Automated assembly of molecular mechanisms at scale from text mining and curated databases. Mol Syst Biol 2023; 19:e11325. [PMID: 36938926 PMCID: PMC10167483 DOI: 10.15252/msb.202211325] [Citation(s) in RCA: 17] [Impact Index Per Article: 8.5] [Received: 08/31/2022] [Revised: 02/24/2023] [Accepted: 02/27/2023] [Indexed: 03/21/2023] Open
Abstract
The analysis of omic data depends on machine-readable information about protein interactions, modifications, and activities as found in protein interaction networks, databases of post-translational modifications, and curated models of gene and protein function. These resources typically depend heavily on human curation. Natural language processing systems that read the primary literature have the potential to substantially extend knowledge resources while reducing the burden on human curators. However, machine-reading systems are limited by high error rates and commonly generate fragmentary and redundant information. Here, we describe an approach to precisely assemble molecular mechanisms at scale using multiple natural language processing systems and the Integrated Network and Dynamical Reasoning Assembler (INDRA). INDRA identifies full and partial overlaps in information extracted from published papers and pathway databases, uses predictive models to improve the reliability of machine reading, and thereby assembles individual pieces of information into non-redundant and broadly usable mechanistic knowledge. Using INDRA to create high-quality corpora of causal knowledge we show it is possible to extend protein-protein interaction databases and explain co-dependencies in the Cancer Dependency Map.
Affiliation(s)
- John A Bachman
- Laboratory of Systems Pharmacology, Harvard Medical School, Boston, MA, USA
| | - Benjamin M Gyori
- Laboratory of Systems Pharmacology, Harvard Medical School, Boston, MA, USA
| | - Peter K Sorger
- Laboratory of Systems Pharmacology, Harvard Medical School, Boston, MA, USA
- Department of Systems Biology, Harvard Medical School, Boston, MA, USA
| |
|
14
|
Bucur CI, Kuhn T, Ceolin D, van Ossenbruggen J. Nanopublication-based semantic publishing and reviewing: a field study with formalization papers. PeerJ Comput Sci 2023; 9:e1159. [PMID: 37346675 PMCID: PMC10280262 DOI: 10.7717/peerj-cs.1159] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 03/04/2022] [Accepted: 10/25/2022] [Indexed: 06/23/2023]
Abstract
With the rapidly increasing amount of scientific literature, it is getting continuously more difficult for researchers in different disciplines to keep up-to-date with the recent findings in their field of study. Processing scientific articles in an automated fashion has been proposed as a solution to this problem, but the accuracy of such processing remains very poor for extraction tasks beyond the most basic ones (like locating and identifying entities and simple classification based on predefined categories). Few approaches have tried to change how we publish scientific results in the first place, such as by making articles machine-interpretable by expressing them with formal semantics from the start. In the work presented here, we propose a first step in this direction by setting out to demonstrate that we can formally publish high-level scientific claims in formal logic, and publish the results in a special issue of an existing journal. We use the concept and technology of nanopublications for this endeavor, and represent not just the submissions and final papers in this RDF-based format, but also the whole process in between, including reviews, responses, and decisions. We do this by performing a field study with what we call formalization papers, which contribute a novel formalization of a previously published claim. We received 15 submissions from 18 authors, who then went through the whole publication process leading to the publication of their contributions in the special issue. Our evaluation shows the technical and practical feasibility of our approach. The participating authors mostly showed high levels of interest and confidence, and mostly experienced the process as not very difficult, despite the technical nature of the current user interfaces. 
We believe that these results indicate that it is possible to publish scientific results from different fields with machine-interpretable semantics from the start, which in turn opens countless possibilities to radically improve the effectiveness and efficiency of the scientific endeavor as a whole.
Affiliation(s)
- Cristina-Iulia Bucur
- Computer Science Department, Vrije Universiteit Amsterdam, Amsterdam, The Netherlands
| | - Tobias Kuhn
- Computer Science Department, Vrije Universiteit Amsterdam, Amsterdam, The Netherlands
| | - Davide Ceolin
- Human-Centered Data Analytics Group, Centrum Wiskunde & Informatica, Amsterdam, The Netherlands
| | | |
|
15
|
Scott-Fordsmand JJ, Amorim MJB. Using Machine Learning to make nanomaterials sustainable. Sci Total Environ 2023; 859:160303. [PMID: 36410486 DOI: 10.1016/j.scitotenv.2022.160303] [Citation(s) in RCA: 12] [Impact Index Per Article: 6.0] [Received: 07/22/2022] [Revised: 11/06/2022] [Accepted: 11/15/2022] [Indexed: 06/16/2023]
Abstract
Sustainable development is a key challenge for contemporary human societies; failure to achieve sustainability could threaten human survival. In this review article, we illustrate how Machine Learning (ML) could support more sustainable development, covering the basics of data gathering through each step of the Environmental Risk Assessment (ERA). The literature provides several examples showing how ML can be employed in most steps of a typical ERA. A key observation is that there is currently no clear guidance for using such autonomous technologies in ERAs or on which standards/checks are required. Steering thus seems to be the most important task for supporting the use of ML in the ERA of nano- and smart-materials. Resources should be devoted to developing a strategy for implementing ML in ERA with a strong emphasis on data foundations, methodologies, and the related sensitivities/uncertainties. We should recognise historical errors and biases (e.g., in data) to avoid embedding them during ML programming.
Affiliation(s)
| | - Mónica J B Amorim
- Department of Biology & CESAM, University of Aveiro, 3810-193 Aveiro, Portugal.
| |
|
16
|
Ferraz de Arruda H, Aleta A, Moreno Y. Food composition databases in the era of Big Data: Vegetable oils as a case study. Front Nutr 2023; 9:1052934. [PMID: 36687693 PMCID: PMC9851468 DOI: 10.3389/fnut.2022.1052934] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/24/2022] [Accepted: 12/07/2022] [Indexed: 01/07/2023] Open
Abstract
Understanding the population's dietary patterns and their impacts on health requires many different sources of information. The development of reliable food composition databases is a key step in this pursuit. With them, nutrition and health care professionals can provide better public health advice and guide society toward achieving a better and healthier life. Unfortunately, these databases are full of caveats. Focusing on the specific case of vegetable oils, we analyzed the possible obsolescence of the information and the differences or inconsistencies among databases. We show that in many cases, the information is limited, incompletely documented, old or unreliable. More importantly, despite the many efforts carried out in the last decades, there is still much work to be done. As such, institutions should develop long-standing programs that can ensure the quality of the information on what we eat in the long term. In the face of climate change and complex societal challenges in an interconnected world, the full diversity of the food system needs to be recognized and more efforts should be put toward achieving a data-driven food system.
Affiliation(s)
- Henrique Ferraz de Arruda
- ISI Foundation, Turin, Italy
- CENTAI Institute, Turin, Italy
- Correspondence: Henrique Ferraz de Arruda
| | - Alberto Aleta
- ISI Foundation, Turin, Italy
- Institute for Biocomputation and Physics of Complex Systems (BIFI), University of Zaragoza, Zaragoza, Spain
- Department of Theoretical Physics, Faculty of Sciences, University of Zaragoza, Zaragoza, Spain
| | - Yamir Moreno
- ISI Foundation, Turin, Italy
- CENTAI Institute, Turin, Italy
- Institute for Biocomputation and Physics of Complex Systems (BIFI), University of Zaragoza, Zaragoza, Spain
- Department of Theoretical Physics, Faculty of Sciences, University of Zaragoza, Zaragoza, Spain
| |
|
17
|
Kart Ö, Mestiashvili A, Lachmann K, Kwasnicki R, Schroeder M. Emati: a recommender system for biomedical literature based on supervised learning. Database (Oxford) 2022; 2022:6885256. [PMID: 36484479 PMCID: PMC9732843 DOI: 10.1093/database/baac104] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/28/2022] [Revised: 11/07/2022] [Accepted: 11/17/2022] [Indexed: 12/13/2022]
Abstract
The scientific literature continues to grow at an ever-increasing rate. Considering that thousands of new articles are published every week, it is obvious how challenging it is to keep up with newly published literature on a regular basis. Using a recommender system that improves the user experience in the online environment can be a solution to this problem. In the present study, we aimed to develop a web-based article recommender service, called Emati. Since the data are text-based by nature and we wanted our system to be independent of the number of users, a content-based approach has been adopted in this study. A supervised machine learning model has been proposed to generate article recommendations. Two different supervised learning approaches, namely the naïve Bayes model with Term Frequency-Inverse Document Frequency (TF-IDF) vectorizer and the state-of-the-art language model bidirectional encoder representations from transformers (BERT), have been implemented. In the first one, a list of documents is converted into TF-IDF-weighted features and fed into a classifier to distinguish relevant articles from irrelevant ones. Multinomial naïve Bayes algorithm is used as a classifier since, along with the class label, it also gives the probability that the input belongs to this class. The second approach is based on fine-tuning the pretrained state-of-the-art language model BERT for the text classification task. Emati provides a weekly updated list of article recommendations and presents it to the user, sorted by probability scores. New article recommendations are also sent to users' email addresses on a weekly basis. Additionally, Emati has a personalized search feature to search online services' (such as PubMed and arXiv) content and have the results sorted by the user's classifier. Database URL: https://emati.biotec.tu-dresden.de.
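The first approach in this abstract (TF-IDF-weighted features fed to a multinomial naïve Bayes classifier, with candidates sorted by class probability) can be sketched in plain Python as follows. The class name `TfidfNaiveBayes`, the smoothed IDF formula, the whitespace tokenizer, and the toy training texts are all assumptions made for illustration; this is not Emati's code.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase whitespace tokenizer (a deliberate simplification)."""
    return text.lower().split()


class TfidfNaiveBayes:
    """Multinomial naive Bayes over TF-IDF-weighted term counts (hypothetical sketch)."""

    def fit(self, docs, labels):
        tokenized = [tokenize(d) for d in docs]
        n = len(docs)
        # Document frequency of each term, then a smoothed IDF (assumed formula).
        df = Counter(w for toks in tokenized for w in set(toks))
        self.idf = {w: math.log((1 + n) / (1 + c)) + 1.0 for w, c in df.items()}
        self.classes = sorted(set(labels))
        self.log_prior = {c: math.log(labels.count(c) / n) for c in self.classes}
        # Accumulate TF-IDF mass per class and term.
        weight = {c: defaultdict(float) for c in self.classes}
        for toks, y in zip(tokenized, labels):
            for w, tf in Counter(toks).items():
                weight[y][w] += tf * self.idf[w]
        # Laplace-smoothed log-likelihood of each term given the class.
        vocab_size = len(self.idf)
        self.log_lik = {}
        for c in self.classes:
            total = sum(weight[c].values())
            self.log_lik[c] = {
                w: math.log((weight[c][w] + 1.0) / (total + vocab_size))
                for w in self.idf
            }
        return self

    def predict_proba(self, text):
        """Class probabilities for one text; unseen terms are ignored."""
        feats = Counter(w for w in tokenize(text) if w in self.idf)
        scores = {
            c: self.log_prior[c]
            + sum(tf * self.idf[w] * self.log_lik[c][w] for w, tf in feats.items())
            for c in self.classes
        }
        # Normalise log scores into probabilities in a numerically stable way.
        top = max(scores.values())
        exp = {c: math.exp(s - top) for c, s in scores.items()}
        z = sum(exp.values())
        return {c: e / z for c, e in exp.items()}


# Toy training data, invented for illustration.
train_docs = [
    "protein structure prediction",
    "protein folding dynamics",
    "stock market prices",
    "market interest rates",
]
train_labels = ["relevant", "relevant", "irrelevant", "irrelevant"]
model = TfidfNaiveBayes().fit(train_docs, train_labels)

candidates = ["interest rates market", "protein folding prediction"]
ranked = sorted(candidates, key=lambda d: model.predict_proba(d)["relevant"], reverse=True)
print(ranked)
```

Sorting by the probability of the "relevant" class mirrors the ranked recommendation list the abstract describes; a production system would more likely rely on an established library implementation than on this hand-written one.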
Affiliation(s)
- Özge Kart
- Biotechnology Center (BIOTEC), Center for Molecular and Cellular Bioengineering (CMCB), Technische Universität Dresden, Tatzberg 47-49, Dresden 01307, Germany
- Department of Computer Engineering, Dokuz Eylül University, Tinaztepe Campus, Buca 35160 Izmir, Turkey
| | - Alexandre Mestiashvili
- Biotechnology Center (BIOTEC), Center for Molecular and Cellular Bioengineering (CMCB), Technische Universität Dresden, Tatzberg 47-49, Dresden 01307, Germany
| | - Kurt Lachmann
- Biotechnology Center (BIOTEC), Center for Molecular and Cellular Bioengineering (CMCB), Technische Universität Dresden, Tatzberg 47-49, Dresden 01307, Germany
| | - Richard Kwasnicki
- Biotechnology Center (BIOTEC), Center for Molecular and Cellular Bioengineering (CMCB), Technische Universität Dresden, Tatzberg 47-49, Dresden 01307, Germany
| | | |
|
18
|
Feng Z, Shen Z, Li H, Li S. e-TSN: an interactive visual exploration platform for target-disease knowledge mapping from literature. Brief Bioinform 2022; 23:bbac465. [PMID: 36347537 PMCID: PMC9677481 DOI: 10.1093/bib/bbac465] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 07/07/2022] [Revised: 09/20/2022] [Accepted: 09/27/2022] [Indexed: 11/10/2022] Open
Abstract
Target discovery and identification processes are driven by the increasing amount of biomedical data. The vast numbers of unstructured texts of biomedical publications provide a rich source of knowledge for drug target discovery research and demand the development of specific algorithms or tools to facilitate finding disease genes and proteins. Text mining is a method that can automatically mine helpful information related to drug target discovery from massive biomedical literature. However, there is a substantial lag between biomedical publications and the subsequent abstraction of information extracted by text mining to databases. The knowledge graph is introduced to integrate heterogeneous biomedical data. Here, we describe e-TSN (Target significance and novelty explorer, http://www.lilab-ecust.cn/etsn/), a knowledge visualization web server integrating the largest database of associations between targets and diseases from the full scientific literature by constructing significance and novelty scoring methods based on bibliometric statistics. The platform aims to visualize target-disease knowledge graphs to assist in prioritizing candidate disease-related proteins. Approved drugs and associated bioactivities for each target of interest are also provided to facilitate the visualization of drug-target relationships. In summary, e-TSN is a fast and customizable visualization resource for investigating and analyzing intricate target-disease networks, which could help researchers understand the mechanisms underlying complex disease phenotypes and improve the efficiency of drug discovery and development, especially during unexpected outbreaks of infectious disease pandemics like COVID-19.
Affiliation(s)
- Ziyan Feng
- Shanghai Key Laboratory of New Drug Design, School of Pharmacy, East China University of Science and Technology, Shanghai 200237, China
| | - Zihao Shen
- Shanghai Key Laboratory of New Drug Design, School of Pharmacy, East China University of Science and Technology, Shanghai 200237, China
| | - Honglin Li
- Shanghai Key Laboratory of New Drug Design, School of Pharmacy, East China University of Science and Technology, Shanghai 200237, China
- Innovation Center for AI and Drug Discovery, East China Normal University, Shanghai 200062, China
- Lingang Laboratory, Shanghai 200031, China
| | - Shiliang Li
- Shanghai Key Laboratory of New Drug Design, School of Pharmacy, East China University of Science and Technology, Shanghai 200237, China
- Innovation Center for AI and Drug Discovery, East China Normal University, Shanghai 200062, China
| |
|
19
|
Nicholson DN, Himmelstein DS, Greene CS. Expanding a database-derived biomedical knowledge graph via multi-relation extraction from biomedical abstracts. BioData Min 2022; 15:26. [PMID: 36258252 PMCID: PMC9578183 DOI: 10.1186/s13040-022-00311-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/01/2022] [Accepted: 09/17/2022] [Indexed: 02/04/2023] Open
Abstract
BACKGROUND Knowledge graphs support biomedical research efforts by providing contextual information for biomedical entities, constructing networks, and supporting the interpretation of high-throughput analyses. These databases are populated via manual curation, which is challenging to scale with an exponentially rising publication rate. Data programming is a paradigm that circumvents this arduous manual process by combining databases with simple rules and heuristics written as label functions, which are programs designed to annotate textual data automatically. Unfortunately, writing a useful label function requires substantial error analysis and is a nontrivial task that takes multiple days per function. This bottleneck makes populating a knowledge graph with multiple nodes and edge types practically infeasible. Thus, we sought to accelerate the label function creation process by evaluating how label functions can be re-used across multiple edge types. RESULTS We obtained entity-tagged abstracts and subsetted these entities to only contain compounds, genes, and disease mentions. We extracted sentences containing co-mentions of certain biomedical entities contained in a previously described knowledge graph, Hetionet v1. We trained a baseline model that used database-only label functions and then used a sampling approach to measure how well adding edge-specific or edge-mismatch label function combinations improved over our baseline. Next, we trained a discriminator model to detect sentences that indicated a biomedical relationship and then estimated the number of edge types that could be recalled and added to Hetionet v1. We found that adding edge-mismatch label functions rarely improved relationship extraction, while control edge-specific label functions did. There were two exceptions to this trend, Compound-binds-Gene and Gene-interacts-Gene, which both indicated physical relationships and showed signs of transferability. 
Across the scenarios tested, discriminative model performance strongly depends on generated annotations. Using the best discriminative model for each edge type, we recalled close to 30% of established edges within Hetionet v1. CONCLUSIONS Our results show that this framework can incorporate novel edges into our source knowledge graph. However, results with label function transfer were mixed. Only label functions describing very similar edge types supported improved performance when transferred. We expect that the continued development of this strategy may provide essential building blocks to populating biomedical knowledge graphs with discoveries, ensuring that these resources include cutting-edge results.
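The idea of label functions in this abstract, small programs that annotate candidate sentences automatically, can be sketched as below. The function names, the keyword lists, the `KNOWN_PAIRS` set standing in for a curated database, and the example sentences are all hypothetical. The votes are combined here by a simple majority vote, which is a swapped-in simplification: data programming normally fits a model over the label-function outputs instead.

```python
# Label values used by every label function in this sketch.
ABSTAIN, NEGATIVE, POSITIVE = -1, 0, 1

# Hypothetical stand-in for compound-gene pairs found in a curated database.
KNOWN_PAIRS = {("aspirin", "PTGS1")}


def lf_in_database(cand):
    """Database-derived label function: vote POSITIVE for pairs already curated."""
    return POSITIVE if (cand["compound"], cand["gene"]) in KNOWN_PAIRS else ABSTAIN


def lf_binding_keyword(cand):
    """Text heuristic: interaction verbs suggest a Compound-binds-Gene relation."""
    text = cand["sentence"].lower()
    return POSITIVE if any(k in text for k in ("binds", "inhibits", "targets")) else ABSTAIN


def lf_negation(cand):
    """Text heuristic: explicit negation suggests no relation is asserted."""
    text = cand["sentence"].lower()
    return NEGATIVE if any(k in text for k in ("does not", "no evidence", "failed to")) else ABSTAIN


LABEL_FUNCTIONS = [lf_in_database, lf_binding_keyword, lf_negation]


def apply_label_functions(cand):
    """One row of the label matrix: the vote of every label function."""
    return [lf(cand) for lf in LABEL_FUNCTIONS]


def majority_vote(votes):
    """Combine votes; ties and all-abstain rows stay unlabeled (ABSTAIN)."""
    pos, neg = votes.count(POSITIVE), votes.count(NEGATIVE)
    if pos == neg:
        return ABSTAIN
    return POSITIVE if pos > neg else NEGATIVE


# Toy candidate sentences with pre-tagged entities, invented for illustration.
candidates = [
    {"sentence": "Aspirin irreversibly inhibits PTGS1 in platelets.",
     "compound": "aspirin", "gene": "PTGS1"},
    {"sentence": "Compound X does not bind GENE2 in vitro.",
     "compound": "compound x", "gene": "GENE2"},
    {"sentence": "GENE3 expression was measured in patients receiving drug Y.",
     "compound": "drug y", "gene": "GENE3"},
]
for cand in candidates:
    votes = apply_label_functions(cand)
    print(votes, majority_vote(votes))
```

Reusing a function such as `lf_binding_keyword` for a different edge type is the kind of transfer the abstract evaluates; the sketch makes no claim about how well that works.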
Affiliation(s)
- David N. Nicholson
- Department of Systems Pharmacology and Translational Therapeutics, University of Pennsylvania, Philadelphia, PA, USA
| | - Daniel S. Himmelstein
- Department of Systems Pharmacology and Translational Therapeutics, University of Pennsylvania, Philadelphia, PA, USA
| | - Casey S. Greene
- Department of Biomedical Informatics, University of Colorado School of Medicine and Center for Health Artificial Intelligence (CHAI), University of Colorado School of Medicine, Aurora, USA
| |
|
20
|
Jadhav A, Kumar T, Raghavendra M, Loganathan T, Narayanan M. Predicting cross-tissue hormone-gene relations using balanced word embeddings. Bioinformatics 2022; 38:4771-4781. [PMID: 36000859 PMCID: PMC9563690 DOI: 10.1093/bioinformatics/btac578] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 02/23/2022] [Revised: 07/29/2022] [Accepted: 08/23/2022] [Indexed: 11/12/2022] Open
Abstract
MOTIVATION Inter-organ/inter-tissue communication is central to multi-cellular organisms including humans, and mapping inter-tissue interactions can advance system-level whole-body modeling efforts. Large volumes of biomedical literature have fostered studies that map within-tissue or tissue-agnostic interactions, but literature-mining studies that infer inter-tissue relations, such as between hormones and genes, are sorely missing. RESULTS We present a first study to predict from biomedical literature the hormone-gene associations mediating inter-tissue signaling in the human body. Our BioEmbedS* models use neural network-based Biomedical word Embeddings with a Support Vector Machine classifier to predict if a hormone-gene pair is associated or not, and whether an associated gene is involved in the hormone's production or response. Model training relies on our unified dataset Hormone-Gene version 1 of ground-truth associations between genes and endocrine hormones, which we compiled and carefully balanced in the embedded space to handle data disparities, such as between poorly- versus well-studied hormones. Our BioEmbedS model recapitulates known gene mediators of tissue-tissue signaling with 70.4% accuracy; predicts novel inter-tissue communication genes in humans, which are enriched for hormone-related disorders; and generalizes well to mouse, thereby holding promise for its extension to other multi-cellular organisms as well. AVAILABILITY AND IMPLEMENTATION Our model predictions and datasets are freely available at https://cross-tissue-signaling.herokuapp.com; all relevant code is at https://github.com/BIRDSgroup/BioEmbedS. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Aditya Jadhav
- Department of Computer Science and Engineering, Indian Institute of Technology (IIT) Madras, Chennai, India
| | - Tarun Kumar
- Department of Computer Science and Engineering, Indian Institute of Technology (IIT) Madras, Chennai, India
- Initiative for Biological Systems Engineering, IIT Madras, Chennai, India
- Robert Bosch Centre for Data Science and Artificial Intelligence, IIT Madras, Chennai, India
| | - Mohit Raghavendra
- Department of Information Technology, National Institute of Technology Karnataka, Surathkal, India
| | - Tamizhini Loganathan
- Initiative for Biological Systems Engineering, IIT Madras, Chennai, India
- Robert Bosch Centre for Data Science and Artificial Intelligence, IIT Madras, Chennai, India
| | - Manikandan Narayanan
- Department of Computer Science and Engineering, Indian Institute of Technology (IIT) Madras, Chennai, India
- Initiative for Biological Systems Engineering, IIT Madras, Chennai, India
- Robert Bosch Centre for Data Science and Artificial Intelligence, IIT Madras, Chennai, India
| |
|
21
|
Kim W, Yeganova L, Comeau DC, Wilbur WJ, Lu Z. Towards a unified search: Improving PubMed retrieval with full text. J Biomed Inform 2022; 134:104211. [PMID: 36152950 PMCID: PMC9561061 DOI: 10.1016/j.jbi.2022.104211] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 04/07/2022] [Revised: 09/12/2022] [Accepted: 09/15/2022] [Indexed: 10/14/2022]
Abstract
OBJECTIVE A significant number of recent articles in PubMed have full text available in PubMed Central®, and the availability of full texts has been consistently growing. However, it is not currently possible for a user to simultaneously query the contents of both databases and receive a single integrated search result. In this study, we investigate how to score full-text articles given a multitoken query and how to combine those full-text article scores with scores originating from abstracts to achieve an overall improved retrieval performance. MATERIALS AND METHODS For scoring full-text articles, we propose a method to combine information coming from different sections by converting the traditionally used BM25 scores into log odds ratio scores, which can be treated uniformly. We further propose a method that successfully combines scores from two heterogeneous retrieval sources - full-text articles and abstract-only articles - by balancing the contributions of their respective scores through a probabilistic transformation. We use PubMed click data that consists of queries sampled from PubMed user logs along with a subset of retrieved and clicked documents to train the probabilistic functions and to evaluate retrieval effectiveness. RESULTS AND CONCLUSIONS Random ranking achieves 0.579 MAP score on our PubMed click data. BM25 ranking on PubMed abstracts improves the MAP by 10.6%. For full-text documents, experiments confirm that BM25 section scores are of different value depending on the section type and are not directly comparable. Naïvely using the body text of articles along with abstract text degrades the overall quality of the search. The proposed log odds ratio scores normalize and combine the contributions of occurrences of query tokens in different sections. By including full text where available, we gain another 0.67%, or 7% relative improvement over abstract alone.
We find an advantage in the more accurate estimate of the value of BM25 scores depending on the section from which they were produced. Taking the sum of top three section scores performs the best.
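The scoring idea in this abstract can be illustrated with a minimal Python sketch. It assumes that each section's BM25 score is mapped to a log-odds value by a per-section linear calibration, and that the document score is the sum of the three largest section values. The `CALIBRATION` table, its numbers, and the function names are hypothetical placeholders; the paper fits its probabilistic functions on PubMed click data, and those functions are not reproduced here.

```python
import heapq

# Hypothetical per-section calibration: log_odds = slope * bm25 + intercept.
# The values are invented for illustration only; they are not from the paper.
CALIBRATION = {
    "title": (0.30, -1.0),
    "abstract": (0.20, -1.5),
    "results": (0.10, -2.0),
    "methods": (0.05, -2.5),
}


def section_log_odds(section, bm25):
    """Map a raw BM25 score to a section-specific log-odds value."""
    slope, intercept = CALIBRATION[section]
    return slope * bm25 + intercept


def document_score(section_bm25, top_k=3):
    """Sum the top_k largest section log-odds values for one document."""
    values = [section_log_odds(name, score) for name, score in section_bm25.items()]
    return sum(heapq.nlargest(top_k, values))


if __name__ == "__main__":
    doc = {"title": 10.0, "abstract": 10.0, "results": 10.0, "methods": 10.0}
    print(document_score(doc))
```

The point of the calibration step is that equal BM25 scores from different sections no longer contribute equally: in this toy setup a title match counts for more than a methods match, which is the kind of section-dependent weighting the abstract describes.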
Affiliation(s)
- Won Kim
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, MD 20894, USA
| | - Lana Yeganova
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, MD 20894, USA
| | - Donald C Comeau
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, MD 20894, USA
| | - W John Wilbur
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, MD 20894, USA
| | - Zhiyong Lu
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, MD 20894, USA.
| |
|
22
|
Abdalla M, Abdalla M, Abdalla S, Saad M, Jones DS, Podolsky SH. Insights from full-text analyses of the Journal of the American Medical Association and the New England Journal of Medicine. eLife 2022; 11:e72602. [PMID: 35796055 PMCID: PMC9262397 DOI: 10.7554/elife.72602] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/28/2021] [Accepted: 03/22/2022] [Indexed: 11/13/2022] Open
Abstract
Analysis of the content of medical journals enables us to frame the shifting scientific, material, ethical, and epistemic underpinnings of medicine over time, including today. Leveraging a dataset comprised of nearly half-a-million articles published in the Journal of the American Medical Association (JAMA) and the New England Journal of Medicine (NEJM) over the past 200 years, we (a) highlight the evolution of medical language, and its manifestations in shifts of usage and meaning, (b) examine traces of the medical profession's changing self-identity over time, reflected in its shifting ethical and epistemic underpinnings, (c) analyze medicine's material underpinnings and how we describe where medicine is practiced, (d) demonstrate how the occurrence of specific disease terms within the journals reflects the changing burden of disease itself over time and the interests and perspectives of authors and editors, and (e) showcase how this dataset can allow us to explore the evolution of modern medical ideas and further our understanding of how modern disease concepts came to be, and of the retained legacies of prior embedded values.
Affiliation(s)
- Moustafa Abdalla
- Harvard Medical School, Boston, United States
- Department of Statistics, University of Oxford, Oxford, United Kingdom
| | - Mohamed Abdalla
- Department of Computer Science, University of Toronto, Toronto, Canada
- The Vector Institute for Artificial Intelligence, Toronto, Canada
| | - Salwa Abdalla
- Department of Computer Science, University of Toronto, Toronto, Canada
| | - Mohamed Saad
- University of Bahrain & the Royal Academy, Manama, Bahrain
| | - David S Jones
- Harvard Medical School, Boston, United States
- Department of the History of Science, Harvard University, Cambridge, United States
| | | |
|
23
|
Hyams TC, Luo L, Hair B, Lee K, Lu Z, Seminara D. Machine Learning Approach to Facilitate Knowledge Synthesis at the Intersection of Liver Cancer, Epidemiology, and Health Disparities Research. JCO Clin Cancer Inform 2022; 6:e2100129. [PMID: 35623021 DOI: 10.1200/cci.21.00129] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/20/2022] Open
Abstract
PURPOSE Liver cancer is a global challenge, and disparities exist across multiple domains and throughout the disease continuum. However, liver cancer's global epidemiology and etiology are shifting, and the literature is rapidly evolving, presenting a challenge to the synthesis of knowledge needed to identify areas of research needs and to develop research agendas focusing on disparities. Machine learning (ML) techniques can be used to semiautomate the literature review process and improve efficiency. In this study, we detail our approach and provide practical benchmarks for the development of a ML approach to classify literature and extract data at the intersection of three fields: liver cancer, health disparities, and epidemiology. METHODS We performed a six-phase process including: training (I), validating (II), confirming (III), and performing error analysis (IV) for a ML classifier. We then developed an extraction model (V) and applied it (VI) to the liver cancer literature identified through PubMed. We present precision, recall, F1, and accuracy metrics for the classifier and extraction models as appropriate for each phase of the process. We also provide the results for the application of our extraction model. RESULTS With limited training data, we achieved a high degree of accuracy for both our classifier and for the extraction model for liver cancer disparities research literature performed using epidemiologic methods. The disparities concept was the most challenging to accurately classify, and concepts that appeared infrequently in our data set were the most difficult to extract. CONCLUSION We provide a roadmap for using ML to classify and extract comprehensive information on multidisciplinary literature. Our technique can be adapted and modified for other cancers or diseases where disparities persist.
Affiliation(s)
- Travis C Hyams
- Office of the Director, Division of Cancer Control and Population Sciences, National Cancer Institute, National Institutes of Health, Bethesda, MD
| | - Ling Luo
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD
| | - Brionna Hair
- Office of the Director, Division of Cancer Control and Population Sciences, National Cancer Institute, National Institutes of Health, Bethesda, MD
| | - Kyubum Lee
- Department of Biostatistics and Bioinformatics, H. Lee Moffitt Cancer Center and Research Institute, Tampa, FL
| | - Zhiyong Lu
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD
| | - Daniela Seminara
- Office of the Director, Division of Cancer Control and Population Sciences, National Cancer Institute, National Institutes of Health, Bethesda, MD
| |
|
24
|
Farrell MJ, Brierley L, Willoughby A, Yates A, Mideo N. Past and future uses of text mining in ecology and evolution. Proc Biol Sci 2022; 289:20212721. [PMID: 35582795 PMCID: PMC9114983 DOI: 10.1098/rspb.2021.2721] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Indexed: 01/04/2023] Open
Abstract
Ecology and evolutionary biology, like other scientific fields, are experiencing an exponential growth of academic manuscripts. As domain knowledge accumulates, scientists will need new computational approaches for identifying relevant literature to read and include in formal literature reviews and meta-analyses. Importantly, these approaches can also facilitate automated, large-scale data synthesis tasks and build structured databases from the information in the texts of primary journal articles, books, grey literature, and websites. The increasing availability of digital text, computational resources, and machine-learning based language models has led to a revolution in text analysis and natural language processing (NLP) in recent years. NLP has been widely adopted across the biomedical sciences but is rarely used in ecology and evolutionary biology. Applying computational tools from text mining and NLP will increase the efficiency of data synthesis, improve the reproducibility of literature reviews, formalize analyses of research biases and knowledge gaps, and promote data-driven discovery of patterns across ecology and evolutionary biology. Here we present recent use cases from ecology and evolution, and discuss future applications, limitations and ethical issues.
Affiliation(s)
- Maxwell J. Farrell
- Department of Ecology and Evolutionary Biology, University of Toronto, Toronto, Canada
| | - Liam Brierley
- Department of Health Data Science, University of Liverpool, Liverpool, UK
| | - Anna Willoughby
- Odum School of Ecology, University of Georgia, Athens, GA, USA
- Center for the Ecology of Infectious Diseases, University of Georgia, Athens, GA, USA
| | - Andrew Yates
- University of Amsterdam, Amsterdam, The Netherlands
| | - Nicole Mideo
- Department of Ecology and Evolutionary Biology, University of Toronto, Toronto, Canada
| |
|
25
|
Grissa D, Junge A, Oprea TI, Jensen LJ. Diseases 2.0: a weekly updated database of disease–gene associations from text mining and data integration. Database (Oxford) 2022; 2022:6554833. [PMID: 35348648 PMCID: PMC9216524 DOI: 10.1093/database/baac019] [Citation(s) in RCA: 54] [Impact Index Per Article: 18.0] [Received: 12/07/2021] [Revised: 02/14/2022] [Accepted: 03/11/2022] [Indexed: 12/04/2022]
Abstract
The scientific knowledge about which genes are involved in which diseases grows rapidly, which makes it difficult to keep up with new publications and genetics datasets. The DISEASES database aims to provide a comprehensive overview by systematically integrating and assigning confidence scores to evidence for disease–gene associations from curated databases, genome-wide association studies (GWAS) and automatic text mining of the biomedical literature. Here, we present a major update to this resource, which greatly increases the number of associations from all these sources. This is especially true for the text-mined associations, which have increased by at least 9-fold at all confidence cutoffs. We show that this dramatic increase is primarily due to adding full-text articles to the text corpus, secondarily due to improvements to both the disease and gene dictionaries used for named entity recognition, and only to a very small extent due to the growth in number of PubMed abstracts. DISEASES now also makes use of a new GWAS database, Target Illumination by GWAS Analytics, which considerably increased the number of GWAS-derived disease–gene associations. DISEASES itself is also integrated into several other databases and resources, including GeneCards/MalaCards, Pharos/Target Central Resource Database and the Cytoscape stringApp. All data in DISEASES are updated on a weekly basis and are available via a web interface at https://diseases.jensenlab.org, from where they can also be downloaded under open licenses. Database URL: https://diseases.jensenlab.org
Affiliation(s)
- Dhouha Grissa
- Novo Nordisk Foundation Center for Protein Research, Faculty of Health and Medical Sciences, University of Copenhagen, Copenhagen 2200, Denmark
| | - Alexander Junge
- Novo Nordisk Foundation Center for Protein Research, Faculty of Health and Medical Sciences, University of Copenhagen, Copenhagen 2200, Denmark
| | - Tudor I Oprea
- Novo Nordisk Foundation Center for Protein Research, Faculty of Health and Medical Sciences, University of Copenhagen, Copenhagen 2200, Denmark
- Department of Internal Medicine, Division of Translational Informatics, University of New Mexico Health Sciences Center, Albuquerque, NM, USA
| | - Lars Juhl Jensen
- Novo Nordisk Foundation Center for Protein Research, Faculty of Health and Medical Sciences, University of Copenhagen, Copenhagen 2200, Denmark
| |
|
26
|
Zafeiropoulos H, Paragkamian S, Ninidakis S, Pavlopoulos GA, Jensen LJ, Pafilis E. PREGO: A Literature and Data-Mining Resource to Associate Microorganisms, Biological Processes, and Environment Types. Microorganisms 2022; 10:microorganisms10020293. [PMID: 35208748 PMCID: PMC8879827 DOI: 10.3390/microorganisms10020293] [Citation(s) in RCA: 13] [Impact Index Per Article: 4.3] [Received: 12/27/2021] [Revised: 01/19/2022] [Accepted: 01/20/2022] [Indexed: 12/12/2022] Open
Abstract
To elucidate ecosystem functioning, it is fundamental to recognize what processes occur in which environments (where) and which microorganisms carry them out (who). Here, we present PREGO, a one-stop-shop knowledge base providing such associations. PREGO combines text mining and data integration techniques to mine such what-where-who associations from data and metadata scattered in the scientific literature and in public omics repositories. Microorganisms, biological processes, and environment types are identified and mapped to ontology terms from established community resources. Analyses of comentions in text and co-occurrences in metagenomics data/metadata are performed to extract associations and a level of confidence is assigned to each of them thanks to a scoring scheme. The PREGO knowledge base contains associations for 364,508 microbial taxa, 1090 environmental types, 15,091 biological processes, and 7971 molecular functions with a total of almost 58 million associations. These associations are available through a web portal, an Application Programming Interface (API), and bulk download. By exploring environments and/or processes associated with each other or with microbes, PREGO aims to assist researchers in design and interpretation of experiments and their results. To demonstrate PREGO’s capabilities, a thorough presentation of its web interface is given along with a meta-analysis of experimental results from a lagoon-sediment study of sulfur-cycle related microbes.
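The co-mention analysis mentioned in this abstract can be pictured with a minimal Python sketch. It assumes that entity recognition has already been done upstream, so each document arrives as a pair of sets (microbial taxa, environment types). The function name and the example entities are hypothetical, and the raw count used here is only a stand-in for PREGO's scoring scheme, which the abstract does not spell out.

```python
from collections import Counter
from itertools import product


def comention_counts(documents):
    """Count, per (taxon, environment) pair, the documents mentioning both.

    documents: iterable of (set_of_taxa, set_of_environments), one per document.
    """
    counts = Counter()
    for taxa, environments in documents:
        # Every taxon/environment combination within one document is a co-mention.
        for pair in product(sorted(taxa), sorted(environments)):
            counts[pair] += 1
    return counts


if __name__ == "__main__":
    docs = [
        ({"Desulfovibrio"}, {"lagoon sediment"}),
        ({"Desulfovibrio", "Escherichia"}, {"lagoon sediment"}),
    ]
    for pair, n in comention_counts(docs).most_common():
        print(pair, n)
```

A production scoring scheme would typically normalize such counts against how often each entity appears overall, so that very frequently mentioned taxa do not dominate; that step is left out of this sketch.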
Affiliation(s)
- Haris Zafeiropoulos
- Department of Biology, University of Crete, Voutes University Campus, P.O. Box 2208, 70013 Heraklion, Crete, Greece
- Institute of Marine Biology, Biotechnology and Aquaculture (IMBBC), Hellenic Centre for Marine Research (HCMR), Former U.S. Base of Gournes, P.O. Box 2214, 71003 Heraklion, Crete, Greece
| | - Savvas Paragkamian
- Department of Biology, University of Crete, Voutes University Campus, P.O. Box 2208, 70013 Heraklion, Crete, Greece
- Institute of Marine Biology, Biotechnology and Aquaculture (IMBBC), Hellenic Centre for Marine Research (HCMR), Former U.S. Base of Gournes, P.O. Box 2214, 71003 Heraklion, Crete, Greece
| | - Stelios Ninidakis
- Institute of Marine Biology, Biotechnology and Aquaculture (IMBBC), Hellenic Centre for Marine Research (HCMR), Former U.S. Base of Gournes, P.O. Box 2214, 71003 Heraklion, Crete, Greece
| | - Georgios A. Pavlopoulos
- Institute for Fundamental Biomedical Research, Biomedical Sciences Research Center “Alexander Fleming”, 16672 Vari, Greece
- Center for New Biotechnologies and Precision Medicine, School of Medicine, National and Kapodistrian University of Athens, 11527 Athens, Greece
| | - Lars Juhl Jensen
- Novo Nordisk Foundation Center for Protein Research, Faculty of Health and Medical Sciences, University of Copenhagen, 2200 Copenhagen, Denmark
| | - Evangelos Pafilis
- Institute of Marine Biology, Biotechnology and Aquaculture (IMBBC), Hellenic Centre for Marine Research (HCMR), Former U.S. Base of Gournes, P.O. Box 2214, 71003 Heraklion, Crete, Greece
- Correspondence: Tel.: +30-2810-337748
| |
|
27
|
Media discourse in China and Japan on the COVID-19 pandemic: comparative analysis of the first three months. Journal of Information Communication & Ethics in Society 2022. [DOI: 10.1108/jices-05-2021-0047] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/17/2022]
Abstract
Purpose
This study aims to analyze how English-language versions of e-newspapers in the first two countries affected, China and Japan, which are non-English-speaking countries and have different socio-economic and political settings, have highlighted Coronavirus disease 2019 (COVID-19) pandemic news and informed the global community.
Design/methodology/approach
A text-mining approach was used to explore experts’ thoughts as published by the two leading English-language newspapers in China and Japan from January to March 2020. This study analyzes the Opinion section, which mainly comprises editorials and op-eds. The study groups all editorial discussions and highlights into ten major aspects, which cover health, economy, politics, culture and others.
Findings
Within the first three months, the media in both China and Japan shifted their focus from health and preparedness to the economy, politics and social welfare. Governance and social welfare were key concerns in China’s news media, while, in contrast, global politics received the highest level of attention from experts in Japan’s news media. Environment and technology aspects did not receive much attention in the experts’ columns.
Originality/value
This study highlights how, at the initial stage of a world crisis, leading nations and initially affected nations deal with the problem, and how the media play their role and guide the general population with experts’ thoughts. The understanding developed in this study can provide guidance to news media in other countries in playing effective roles in the management of this health crisis and other catastrophes.
|
28
|
Bhasuran B. Combining Literature Mining and Machine Learning for Predicting Biomedical Discoveries. Methods Mol Biol 2022; 2496:123-140. [PMID: 35713862 DOI: 10.1007/978-1-0716-2305-3_7] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 06/15/2023]
Abstract
The major outcomes and insights of scientific research and clinical studies end up as publications or clinical records in an unstructured text format. Owing to advances in biomedical research, the volume of published literature has grown tremendously in recent years. Scientists and clinical researchers face a major challenge in staying current with this knowledge and in extracting hidden information from millions of published biomedical articles. A potential one-stop automated solution to this problem is biomedical literature mining. One of the long-standing goals in biology is to discover disease-causing genes and their specific roles in personalized precision medicine and drug repurposing. However, empirical approaches and clinical affirmation are expensive and time-consuming. An in silico approach that uses text mining to identify disease-causing genes can contribute towards biomarker discovery. This chapter presents a protocol on combining literature mining and machine learning for predicting biomedical discoveries, with a special emphasis on gene-disease relation based discovery. The protocol is presented as a literature based discovery (LBD) pipeline for gene-disease based discovery. The protocol includes our web based tools: (1) DNER (Disease Named Entity Recognizer) for disease entity recognition, (2) BCCNER (Bidirectional, Contextual clues Named Entity Tagger) for gene/protein entity recognition, (3) DisGeReExT (Disease-Gene Relation Extractor) for statistically validated results and visualization, and (4) a newly introduced deep learning based method for association discovery. Our proposed deep learning based method can be generalized and applied to other important biomedical discoveries focusing on entities such as drugs/chemicals or miRNA.
Affiliation(s)
- Balu Bhasuran
- DRDO-BU Center for Life Sciences, Bharathiar University Campus, Coimbatore, Tamilnadu, India.
- Bakar Computational Health Sciences Institute, University of California, San Francisco, CA, USA.
| |
|
29
|
Bhasuran B. BioBERT and Similar Approaches for Relation Extraction. Methods Mol Biol 2022; 2496:221-235. [PMID: 35713867 DOI: 10.1007/978-1-0716-2305-3_12] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/15/2023]
Abstract
In biomedicine, facts about relations between entities (disease, gene, drug, etc.) are hidden in a large trove of 30 million scientific publications. Curated information of this kind has proven to play an important role in applications such as drug repurposing and precision medicine. Recently, owing to advances in deep learning, a transformer architecture named BERT (Bidirectional Encoder Representations from Transformers) was proposed. This pretrained language model, trained on the Books Corpus (800M words) and English Wikipedia (2500M words), reported state-of-the-art results in various NLP (Natural Language Processing) tasks, including relation extraction. It is widely accepted that, because of the shift in word distribution, general-domain models perform poorly on information extraction tasks in the biomedical domain. For this reason, the architecture was later adapted to the biomedical domain by training the language models on 28 million scientific articles from PubMed and PubMed Central. This chapter presents a protocol for relation extraction using BERT and discusses state-of-the-art BERT versions for the biomedical domain, such as BioBERT. The protocol emphasizes the general BERT architecture, pretraining and fine-tuning, leveraging biomedical information, and finally the infusion of a knowledge graph into the BERT model layer.
Affiliation(s)
- Balu Bhasuran
- DRDO-BU Center for Life Sciences, Bharathiar University Campus, Coimbatore, Tamilnadu, India.
- Bakar Computational Health Sciences Institute, University of California, San Francisco, CA, USA.
| |
|
30
|
Software review: The JATSdecoder package-extract metadata, abstract and sectioned text from NISO-JATS coded XML documents; Insights to PubMed central's open access database. Scientometrics 2021; 126:9585-9601. [PMID: 34720253 PMCID: PMC8542361 DOI: 10.1007/s11192-021-04162-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/19/2021] [Accepted: 09/08/2021] [Indexed: 11/17/2022]
Abstract
JATSdecoder is a general toolbox which facilitates text extraction and analytical tasks on NISO-JATS coded XML documents. Its function JATSdecoder() outputs metadata, the abstract, the sectioned text and reference list as easily selectable elements. One of the biggest repositories for open access full texts covering biology and the medical and health sciences is PubMed Central (PMC), with more than 3.2 million files. This report provides an overview of the PMC document collection processed with JATSdecoder(). The development of extracted tags is displayed for the full corpus over time and in greater detail for some meta tags. Possibilities and limitations for text miners working with scientific literature are outlined. The NISO-JATS tags are used quite consistently nowadays and allow a reliable extraction of metadata and text elements. International collaborations are more present than ever. There are obvious errors in the date stamps of some documents. Only about half of all articles from 2020 contain at least one author listed with an author identification code. Since many authors share the same name, the identification of person-related content is problematic, especially for authors with Asian names. JATSdecoder() reliably extracts key metadata and text elements from NISO-JATS coded XML files. When combined with the rich, publicly available content within PMC's database, new monitoring and text mining approaches can be carried out easily. Any selection of article subsets should be carefully performed with inclusion and exclusion criteria on several NISO-JATS tags, as both the subject and keyword tags are used quite inconsistently.
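To give a feel for the kind of extraction this abstract describes, here is a minimal Python sketch that pulls the title, abstract, and sectioned text out of a small JATS-style XML string with the standard library. It is not JATSdecoder (which is an R package) and does not mirror its interface; the `decode_jats` function, the `text_of` helper, and the toy `SAMPLE` document are assumptions made for illustration, and real PMC files carry far more markup than this handles.

```python
import xml.etree.ElementTree as ET

# Toy document using a few common JATS element names; invented for illustration.
SAMPLE = """<article>
  <front><article-meta>
    <title-group><article-title>A toy article</article-title></title-group>
    <abstract><p>Short abstract.</p></abstract>
  </article-meta></front>
  <body>
    <sec><title>Methods</title><p>We did things.</p></sec>
    <sec><title>Results</title><p>It worked.</p><p>Mostly.</p></sec>
  </body>
</article>"""


def text_of(elem):
    """Return all nested text of an element with whitespace normalized."""
    return " ".join("".join(elem.itertext()).split())


def decode_jats(xml_string):
    """Extract title, abstract, and top-level body sections from JATS-style XML."""
    root = ET.fromstring(xml_string)
    title = root.find(".//article-title")
    abstract = root.find(".//abstract")
    sections = {}
    for sec in root.findall("./body/sec"):
        heading = sec.find("title")
        name = text_of(heading) if heading is not None else ""
        sections[name] = [text_of(p) for p in sec.findall("p")]
    return {
        "title": text_of(title) if title is not None else None,
        "abstract": text_of(abstract) if abstract is not None else None,
        "sections": sections,
    }


if __name__ == "__main__":
    print(decode_jats(SAMPLE))
```

Only top-level sections are collected here; nested subsections, tables, and the reference list would need additional handling.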
|
31
|
Tewari S, Toledo Margalef P, Kareem A, Abdul-Hussein A, White M, Wazana A, Davidge ST, Delrieux C, Connor KL. Mining Early Life Risk and Resiliency Factors and Their Influences in Human Populations from PubMed: A Machine Learning Approach to Discover DOHaD Evidence. J Pers Med 2021; 11:jpm11111064. [PMID: 34834416 PMCID: PMC8621659 DOI: 10.3390/jpm11111064] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 08/20/2021] [Revised: 10/01/2021] [Accepted: 10/18/2021] [Indexed: 01/03/2023] Open
Abstract
The Developmental Origins of Health and Disease (DOHaD) framework aims to understand how early life exposures shape lifecycle health. To date, no comprehensive list of these exposures and their interactions has been developed, which limits our ability to predict trajectories of risk and resiliency in humans. To address this gap, we developed a model that uses text-mining, machine learning, and natural language processing approaches to automate search, data extraction, and content analysis from DOHaD-related research articles available in PubMed. Our first model captured 2469 articles, which were subsequently categorised into topics based on word frequencies within the titles and abstracts. A manual screening validated 848 of these as relevant, which were used to develop a revised model that finally captured 2098 articles that largely fell under the most prominently researched domains related to our specific DOHaD focus. The articles were clustered according to latent topic extraction, and 23 experts in the field independently labelled the perceived topics. Consensus analysis on this labelling yielded mostly fair to substantial agreement, which demonstrates that automated models can be developed to successfully retrieve and classify research literature, as a first step to gather evidence related to DOHaD risk and resilience factors that influence later life human health.
Affiliation(s)
- Shrankhala Tewari
- Health Sciences, Carleton University, Ottawa, ON K1S 5B6, Canada
| | - Pablo Toledo Margalef
- CONICET, National Science and Technology Council of Argentina, Buenos Aires C1425FQD, Argentina
| | - Ayesha Kareem
- Health Sciences, Carleton University, Ottawa, ON K1S 5B6, Canada
| | - Ayah Abdul-Hussein
- Health Sciences, Carleton University, Ottawa, ON K1S 5B6, Canada
| | - Marina White
- Health Sciences, Carleton University, Ottawa, ON K1S 5B6, Canada
| | - Ashley Wazana
- Department of Psychiatry, McGill University, Montreal, QC H3A 0G4, Canada
| | - Sandra T. Davidge
- Women and Children’s Health Research Institute, University of Alberta, Edmonton, AB T6G 1C9, Canada
| | - Claudio Delrieux
- CONICET, National Science and Technology Council of Argentina, Buenos Aires C1425FQD, Argentina
- DIEC (Electric and Computer Engineering Department), Universidad Nacional del Sur, Bahía Blanca B8000, Argentina
| | - Kristin L. Connor
- Health Sciences, Carleton University, Ottawa, ON K1S 5B6, Canada
| |
|
32
|
Azevedo S, Seixas MR, Jurberg AD, Mermelstein C, Costa ML. Do medicine and cell biology talk to each other? A study of vocabulary similarities between fields. Braz J Med Biol Res 2021; 54:e11728. [PMID: 34669784 PMCID: PMC8521539 DOI: 10.1590/1414-431x2021e11728] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 07/02/2021] [Accepted: 08/25/2021] [Indexed: 11/22/2022] Open
Abstract
A close interaction between basic science and applied medicine is to be expected. Therefore, it is important to measure how far apart the fields of cell biology and medicine are. Our approach to estimating the distance between these fields was to compare their vocabularies and to quantify the difference in word repertoire. We compared the vocabulary of the title and abstract of articles available in PubMed in two selected high-impact journals in each field: cell biology, medicine, and translational science. Although each journal has its own editorial policy, we showed that within each field there is a small vocabulary difference between the two journals. We developed a word similarity index that can measure how much journals share a common vocabulary. We found a high similarity index between each cell biology (91%), medical (71-74%), and translational journal (65%). In contrast, the comparison between medicine and biology journals produced low correlation values (22-36%), suggesting that their vocabularies are quite dissimilar. Translational medicine journals had medium similarity values when compared to cell biology journals (52-70%) and medicine journals (27-59%). This approach was also applied over 10-year periods to evaluate the evolution of each field. Using the “onomics” strategy presented here, we observed that differences in vocabulary of basic science and medicine have been increasing over time. Since translational medicine has an intermediate vocabulary, we confirmed that translational medicine is an efficient approach to bridge this gap.
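A vocabulary comparison of this kind can be sketched in a few lines of Python. The abstract does not define its word similarity index, so the sketch below substitutes a plain Jaccard overlap between the most frequent words of two text collections, expressed as a percentage. The function names, the `top_n` cutoff, and the tokenization rule are all assumptions; the numbers it produces are not comparable with the percentages reported in the paper.

```python
import re
from collections import Counter


def vocabulary(texts, top_n=1000):
    """Return the set of the top_n most frequent lowercase words in texts."""
    counts = Counter()
    for text in texts:
        counts.update(re.findall(r"[a-z]+", text.lower()))
    return {word for word, _ in counts.most_common(top_n)}


def similarity_index(texts_a, texts_b, top_n=1000):
    """Jaccard overlap of two vocabularies, as a percentage (stand-in metric)."""
    vocab_a = vocabulary(texts_a, top_n)
    vocab_b = vocabulary(texts_b, top_n)
    union = vocab_a | vocab_b
    if not union:
        return 0.0
    return 100.0 * len(vocab_a & vocab_b) / len(union)


if __name__ == "__main__":
    cell_biology = ["actin myosin cell"]
    medicine = ["cell patient trial"]
    print(similarity_index(cell_biology, medicine))
```

In practice one would feed in the titles and abstracts of each journal and probably remove common stop words first, since function words shared by all English text would otherwise inflate the overlap.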
Affiliation(s)
- S Azevedo
- Instituto de Ciências Biomédicas, Universidade Federal do Rio de Janeiro, Rio de Janeiro, RJ, Brasil
| | - M R Seixas
- Instituto de Ciências Biomédicas, Universidade Federal do Rio de Janeiro, Rio de Janeiro, RJ, Brasil
| | - A D Jurberg
- Instituto de Ciências Biomédicas, Universidade Federal do Rio de Janeiro, Rio de Janeiro, RJ, Brasil.,Faculdade de Medicina, Universidade Estácio de Sá (Campus Presidente Vargas), Rio de Janeiro, RJ, Brasil
| | - C Mermelstein
- Instituto de Ciências Biomédicas, Universidade Federal do Rio de Janeiro, Rio de Janeiro, RJ, Brasil
| | - M L Costa
- Instituto de Ciências Biomédicas, Universidade Federal do Rio de Janeiro, Rio de Janeiro, RJ, Brasil
| |
|
33
|
Rosário-Ferreira N, Guimarães V, Costa VS, Moreira IS. SicknessMiner: a deep-learning-driven text-mining tool to abridge disease-disease associations. BMC Bioinformatics 2021; 22:482. [PMID: 34607568 PMCID: PMC8491382 DOI: 10.1186/s12859-021-04397-w] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 06/30/2021] [Accepted: 09/24/2021] [Indexed: 12/24/2022] Open
Abstract
Background: Blood cancers (BCs) are responsible for over 720 K yearly deaths worldwide. Their prevalence and mortality-rate uphold the relevance of research related to BCs. Despite the availability of different resources establishing Disease-Disease Associations (DDAs), the knowledge is scattered and not accessible in a straightforward way to the scientific community. Here, we propose SicknessMiner, a biomedical Text-Mining (TM) approach towards the centralization of DDAs. Our methodology encompasses Named Entity Recognition (NER) and Named Entity Normalization (NEN) steps, and the DDAs retrieved were compared to the DisGeNET resource for qualitative and quantitative comparison. Results: We obtained the DDAs via co-mention using our SicknessMiner or gene- or variant-disease similarity on DisGeNET. SicknessMiner was able to retrieve around 92% of the DisGeNET results and nearly 15% of the SicknessMiner results were specific to our pipeline. Conclusions: SicknessMiner is a valuable tool to extract disease-disease relationship from RAW input corpus. Supplementary Information: The online version contains supplementary material available at 10.1186/s12859-021-04397-w.
Affiliation(s)
- Nícia Rosário-Ferreira
- CQC - Coimbra Chemistry Center, Chemistry Department, Faculty of Science and Technology, University of Coimbra, 3004-535, Coimbra, Portugal.,CNC - Center for Neuroscience and Cell Biology, University of Coimbra, Coimbra, Portugal.
| | - Victor Guimarães
- Department of Sciences, University of Porto, Porto, Portugal.,INESC-TEC - Centre of Advanced Computing Systems, Porto, Portugal
| | - Vítor S Costa
- Department of Sciences, University of Porto, Porto, Portugal.,INESC-TEC - Centre of Advanced Computing Systems, Porto, Portugal
| | - Irina S Moreira
- Department of Life Sciences, University of Coimbra, Calçada Martim de Freitas, 3000-456, Coimbra, Portugal.,CNC - Center for Neuroscience and Cell Biology, CIBB - Center for Innovative Biomedicine and Biotechnology, University of Coimbra, Coimbra, Portugal.
| |
|
34
|
Karimi K, Agalakov S, Telmer CA, Beatman TR, Pells TJ, Arshinoff BI, Ku CJ, Foley S, Hinman VF, Ettensohn CA, Vize PD. Classifying domain-specific text documents containing ambiguous keywords. DATABASE-THE JOURNAL OF BIOLOGICAL DATABASES AND CURATION 2021; 2021:6377760. [PMID: 34585729 PMCID: PMC8588847 DOI: 10.1093/database/baab062] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 05/20/2021] [Revised: 08/23/2021] [Accepted: 09/16/2021] [Indexed: 11/14/2022]
Abstract
A keyword-based search of comprehensive databases such as PubMed may return irrelevant papers, especially if the keywords are used in multiple fields of study. In such cases, domain experts (curators) need to verify the results and remove the irrelevant articles. Automating this filtering process will save time, but it has to be done well enough to ensure few relevant papers are rejected and few irrelevant papers are accepted. A good solution would be fast, work with the limited amount of data freely available (full paper body may be missing), handle ambiguous keywords and be as domain-neutral as possible. In this paper, we evaluate a number of classification algorithms for identifying a domain-specific set of papers about echinoderm species and show that the resulting tool satisfies most of the abovementioned requirements. Echinoderms consist of a number of very different organisms, including brittle stars, sea stars (starfish), sea urchins and sea cucumbers. While their taxonomic identifiers are specific, the common names are used in many other contexts, creating ambiguity and making a keyword search prone to error. We try classifiers using Linear, Naïve Bayes, Nearest Neighbor, Tree, SVM, Bagging, AdaBoost and Neural Network learning models and compare their performance. We show how effective the resulting classifiers are in filtering irrelevant articles returned from PubMed. The methodology used is more dependent on the good selection of training data and is a practical solution that can be applied to other fields of study facing similar challenges. Database URL: The code and data reported in this paper are freely available at http://xenbaseturbofrog.org/pub/Text-Topic-Classifier/
Affiliation(s)
- Kamran Karimi
- Department of Biological Sciences, University of Calgary, Calgary, AB T2N 1N4, Canada
| | - Sergei Agalakov
- Department of Biological Sciences, University of Calgary, Calgary, AB T2N 1N4, Canada
| | - Cheryl A Telmer
- Department of Biological Sciences, Carnegie Mellon University, Pittsburgh, PA 15213, USA
| | - Thomas R Beatman
- Department of Biological Sciences, Carnegie Mellon University, Pittsburgh, PA 15213, USA
| | - Troy J Pells
- Department of Biological Sciences, University of Calgary, Calgary, AB T2N 1N4, Canada
| | - Bradley Im Arshinoff
- Department of Biological Sciences, University of Calgary, Calgary, AB T2N 1N4, Canada
| | - Carolyn J Ku
- Department of Biological Sciences, Carnegie Mellon University, Pittsburgh, PA 15213, USA
| | - Saoirse Foley
- Department of Biological Sciences, Carnegie Mellon University, Pittsburgh, PA 15213, USA
| | - Veronica F Hinman
- Department of Biological Sciences, Carnegie Mellon University, Pittsburgh, PA 15213, USA
| | - Charles A Ettensohn
- Department of Biological Sciences, Carnegie Mellon University, Pittsburgh, PA 15213, USA
| | - Peter D Vize
- Department of Biological Sciences, University of Calgary, Calgary, AB T2N 1N4, Canada
| |
|
35
|
Parolo S, Tomasoni D, Bora P, Ramponi A, Kaddi C, Azer K, Domenici E, Neves-Zaph S, Lombardo R. Reconstruction of the Cytokine Signaling in Lysosomal Storage Diseases by Literature Mining and Network Analysis. Front Cell Dev Biol 2021; 9:703489. [PMID: 34490253 PMCID: PMC8417786 DOI: 10.3389/fcell.2021.703489] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Received: 04/30/2021] [Accepted: 07/30/2021] [Indexed: 11/13/2022] Open
Abstract
Lysosomal storage diseases (LSDs) are characterized by the abnormal accumulation of substrates in tissues due to the deficiency of lysosomal proteins. Among the numerous clinical manifestations, chronic inflammation has been consistently reported for several LSDs. However, the molecular mechanisms involved in the inflammatory response are still not completely understood. In this study, we performed text-mining and systems biology analyses to investigate the inflammatory signals in three LSDs characterized by sphingolipid accumulation: Gaucher disease, Acid Sphingomyelinase Deficiency (ASMD), and Fabry Disease. We first identified the cytokines linked to the LSDs, and then built on the extracted knowledge to investigate the inflammatory signals. We found numerous transcription factors that are putative regulators of cytokine expression in a cell-specific context, such as the signaling axes controlled by STAT2, JUN, and NR4A2 as candidate regulators of the monocyte Gaucher disease cytokine network. Overall, our results suggest the presence of a complex inflammatory signaling in LSDs involving many cellular and molecular players that could be further investigated as putative targets of anti-inflammatory therapies.
Affiliation(s)
- Silvia Parolo
- Fondazione the Microsoft Research-University of Trento Centre for Computational and Systems Biology, Rovereto, Italy
| | - Danilo Tomasoni
- Fondazione the Microsoft Research-University of Trento Centre for Computational and Systems Biology, Rovereto, Italy
| | - Pranami Bora
- Fondazione the Microsoft Research-University of Trento Centre for Computational and Systems Biology, Rovereto, Italy
| | - Alan Ramponi
- Fondazione the Microsoft Research-University of Trento Centre for Computational and Systems Biology, Rovereto, Italy
| | - Chanchala Kaddi
- Data and Data Science - Translational Disease Modeling, Sanofi, Bridgewater, NJ, United States
| | - Karim Azer
- Data and Data Science - Translational Disease Modeling, Sanofi, Bridgewater, NJ, United States
| | - Enrico Domenici
- Fondazione the Microsoft Research-University of Trento Centre for Computational and Systems Biology, Rovereto, Italy.,Department of Cellular, Computational and Integrative Biology (CIBIO), University of Trento, Trento, Italy
| | - Susana Neves-Zaph
- Data and Data Science - Translational Disease Modeling, Sanofi, Bridgewater, NJ, United States
| | - Rosario Lombardo
- Fondazione the Microsoft Research-University of Trento Centre for Computational and Systems Biology, Rovereto, Italy
| |
|
36
|
Chen Q, Leaman R, Allot A, Luo L, Wei CH, Yan S, Lu Z. Artificial Intelligence in Action: Addressing the COVID-19 Pandemic with Natural Language Processing. Annu Rev Biomed Data Sci 2021; 4:313-339. [PMID: 34465169 DOI: 10.1146/annurev-biodatasci-021821-061045] [Citation(s) in RCA: 18] [Impact Index Per Article: 4.5] [Indexed: 11/09/2022]
Abstract
The COVID-19 (coronavirus disease 2019) pandemic has had a significant impact on society, both because of the serious health effects of COVID-19 and because of public health measures implemented to slow its spread. Many of these difficulties are fundamentally information needs; attempts to address these needs have caused an information overload for both researchers and the public. Natural language processing (NLP), the branch of artificial intelligence that interprets human language, can be applied to address many of the information needs made urgent by the COVID-19 pandemic. This review surveys approximately 150 NLP studies and more than 50 systems and datasets addressing the COVID-19 pandemic. We detail work on four core NLP tasks: information retrieval, named entity recognition, literature-based discovery, and question answering. We also describe work that directly addresses aspects of the pandemic through four additional tasks: topic modeling, sentiment and emotion analysis, caseload forecasting, and misinformation detection. We conclude by discussing observable trends and remaining challenges.
Affiliation(s)
- Qingyu Chen
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, Maryland 20894, USA;
| | - Robert Leaman
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, Maryland 20894, USA;
| | - Alexis Allot
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, Maryland 20894, USA;
| | - Ling Luo
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, Maryland 20894, USA;
| | - Chih-Hsuan Wei
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, Maryland 20894, USA;
| | - Shankai Yan
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, Maryland 20894, USA;
| | - Zhiyong Lu
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, Maryland 20894, USA;
| |
|
37
|
A Systematic Literature Review of Sexual Harassment Studies with Text Mining. SUSTAINABILITY 2021. [DOI: 10.3390/su13126589] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.8] [Indexed: 01/19/2023]
Abstract
Sexual harassment has been the topic of thousands of research articles in the 20th and 21st centuries. Several review papers have been developed to synthesize the literature about sexual harassment. While traditional literature review studies provide valuable insights, these studies have some limitations including analyzing a limited number of papers, being time-consuming and labor-intensive, focusing on a few topics, and lacking temporal trend analysis. To address these limitations, this paper employs both computational and qualitative approaches to identify major research topics, explore temporal trends of sexual harassment topics over the past few decades, and point to future possible directions in sexual harassment studies. We collected 5320 research papers published between 1977 and 2020, identified and analyzed sexual harassment topics, and explored the temporal trend of topics. Our findings indicate that sexual harassment in the workplace was the most popular research theme, and sexual harassment was investigated in a wide range of spaces ranging from school to military settings. Our analysis shows that 62.5% of the topics having a significant trend had an increasing (hot) temporal trend that is expected to be studied more in the coming years. This study offers a bird’s eye view to better understand sexual harassment literature with text mining, qualitative, and temporal trend analysis methods. This research could be beneficial to researchers, educators, publishers, and policymakers by providing a broad overview of the sexual harassment field.
|
38
|
Badal VD, Kundrotas PJ, Vakser IA. Text mining for modeling of protein complexes enhanced by machine learning. Bioinformatics 2021; 37:497-505. [PMID: 32960948 PMCID: PMC8088328 DOI: 10.1093/bioinformatics/btaa823] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Received: 08/26/2019] [Revised: 09/04/2020] [Accepted: 09/08/2020] [Indexed: 11/14/2022] Open
Abstract
MOTIVATION: Procedures for structural modeling of protein-protein complexes (protein docking) produce a number of models which need to be further analyzed and scored. Scoring can be based on independently determined constraints on the structure of the complex, such as knowledge of amino acids essential for the protein interaction. Previously, we showed that text mining of residues in freely available PubMed abstracts of papers on studies of protein-protein interactions may generate such constraints. However, absence of post-processing of the spotted residues reduced usability of the constraints, as a significant number of the residues were not relevant for the binding of the specific proteins. RESULTS: We explored filtering of the irrelevant residues by two machine learning approaches, Deep Recursive Neural Network (DRNN) and Support Vector Machine (SVM) models with different training/testing schemes. The results showed that the DRNN model is superior to the SVM model when training is performed on the PMC-OA full-text articles and applied to classification (interface or non-interface) of the residues spotted in the PubMed abstracts. When both training and testing is performed on full-text articles or on abstracts, the performance of these models is similar. Thus, in such cases, there is no need to utilize computationally demanding DRNN approach, which is computationally expensive especially at the training stage. The reason is that SVM success is often determined by the similarity in data/text patterns in the training and the testing sets, whereas the sentence structures in the abstracts are, in general, different from those in the full text articles. AVAILABILITY AND IMPLEMENTATION: The code and the datasets generated in this study are available at https://gitlab.ku.edu/vakser-lab-public/text-mining/-/tree/2020-09-04. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Affiliation(s)
| | | | - Ilya A Vakser
- Computational Biology Program.,Department of Molecular Biosciences, The University of Kansas, Lawrence, KS 66045, USA
| |
|
39
|
Text Mining Gene Selection to Understand Pathological Phenotype Using Biological Big Data. Bioinformatics 2021. [DOI: 10.36255/exonpublications.bioinformatics.2021.ch1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] Open
|
40
|
Classification of Full Text Biomedical Documents: Sections Importance Assessment. APPLIED SCIENCES-BASEL 2021. [DOI: 10.3390/app11062674] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Indexed: 11/16/2022]
Abstract
The exponential growth of documents on the web makes it very hard for researchers to be aware of the relevant work being done within the scientific community. The task of efficiently retrieving information has therefore become an important research topic. The objective of this study is to test how the efficiency of the text classification changes if different weights are previously assigned to the sections that compose the documents. The proposal takes into account the place (section) where terms are located in the document, and each section has a weight that can be modified depending on the corpus. To carry out the study, an extended version of the OHSUMED corpus with full documents has been created. Through the use of WEKA, we compared the use of abstracts only with that of full texts, as well as the use of section weighting combinations to assess their significance in the scientific article classification process using SMO (Sequential Minimal Optimization), the WEKA implementation of the Support Vector Machine (SVM) algorithm. The experimental results show that the proposed combinations of the preprocessing techniques and feature selection achieve promising results for the task of full text scientific document classification. We also have evidence to conclude that enriched datasets with text from certain sections achieve better results than using only titles and abstracts.
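The idea of weighting terms by the section they occur in can be sketched in a few lines. This is not the study's WEKA pipeline: the function name, the sample document, and the weight values below are hypothetical, and whitespace tokenization is a simplifying assumption.

```python
from collections import Counter


def weighted_term_counts(sections: dict, weights: dict) -> Counter:
    """Sum term occurrences, scaling each one by the weight of its section.

    Sections without an explicit weight default to 1.0 (an assumption).
    """
    counts = Counter()
    for name, text in sections.items():
        weight = weights.get(name, 1.0)
        for term in text.lower().split():
            counts[term] += weight
    return counts


# Hypothetical document split into sections, with illustrative weights.
doc = {
    "title": "protein docking",
    "abstract": "docking models scored",
    "methods": "svm models",
}
weights = {"title": 3.0, "abstract": 2.0, "methods": 0.5}
features = weighted_term_counts(doc, weights)
print(features.most_common(3))
```

The resulting weighted counts could then be fed to any classifier as features; changing the weights per corpus corresponds to the adjustable section weights the abstract describes.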
|
41
|
Espinosa C, Becker M, Marić I, Wong RJ, Shaw GM, Gaudilliere B, Aghaeepour N, Stevenson DK. Data-Driven Modeling of Pregnancy-Related Complications. Trends Mol Med 2021; 27:762-776. [PMID: 33573911 DOI: 10.1016/j.molmed.2021.01.007] [Citation(s) in RCA: 20] [Impact Index Per Article: 5.0] [Received: 10/09/2020] [Revised: 12/01/2020] [Accepted: 01/20/2021] [Indexed: 12/11/2022]
Abstract
A healthy pregnancy depends on complex interrelated biological adaptations involving placentation, maternal immune responses, and hormonal homeostasis. Recent advances in high-throughput technologies have provided access to multiomics biological data that, combined with clinical and social data, can provide a deeper understanding of normal and abnormal pregnancies. Integration of these heterogeneous datasets using state-of-the-art machine-learning methods can enable the prediction of short- and long-term health trajectories for a mother and offspring and the development of treatments to prevent or minimize complications. We review advanced machine-learning methods that could: provide deeper biological insights into a pregnancy not yet unveiled by current methodologies; clarify the etiologies and heterogeneity of pathologies that affect a pregnancy; and suggest the best approaches to address disparities in outcomes affecting vulnerable populations.
Affiliation(s)
- Camilo Espinosa
- Department of Anesthesiology, Perioperative and Pain Medicine, Stanford University School of Medicine, Stanford, CA, USA; Department of Biomedical Data Sciences, Stanford University, Stanford, CA, USA
| | - Martin Becker
- Department of Anesthesiology, Perioperative and Pain Medicine, Stanford University School of Medicine, Stanford, CA, USA; Department of Biomedical Data Sciences, Stanford University, Stanford, CA, USA
| | - Ivana Marić
- Department of Pediatrics, Division of Neonatal and Developmental Medicine, Stanford University School of Medicine, Stanford, CA, USA
| | - Ronald J Wong
- Department of Pediatrics, Division of Neonatal and Developmental Medicine, Stanford University School of Medicine, Stanford, CA, USA
| | - Gary M Shaw
- Department of Pediatrics, Division of Neonatal and Developmental Medicine, Stanford University School of Medicine, Stanford, CA, USA
| | - Brice Gaudilliere
- Department of Anesthesiology, Perioperative and Pain Medicine, Stanford University School of Medicine, Stanford, CA, USA; Department of Pediatrics, Division of Neonatal and Developmental Medicine, Stanford University School of Medicine, Stanford, CA, USA
| | - Nima Aghaeepour
- Department of Anesthesiology, Perioperative and Pain Medicine, Stanford University School of Medicine, Stanford, CA, USA; Department of Biomedical Data Sciences, Stanford University, Stanford, CA, USA; Department of Pediatrics, Division of Neonatal and Developmental Medicine, Stanford University School of Medicine, Stanford, CA, USA
| | - David K Stevenson
- Department of Pediatrics, Division of Neonatal and Developmental Medicine, Stanford University School of Medicine, Stanford, CA, USA.
| | | |
|
42
|
Sousa D, Lamurias A, Couto FM. Using Neural Networks for Relation Extraction from Biomedical Literature. Methods Mol Biol 2021; 2190:289-305. [PMID: 32804372 DOI: 10.1007/978-1-0716-0826-5_14] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 06/11/2023]
Abstract
Using different sources of information to support automated extracting of relations between biomedical concepts contributes to the development of our understanding of biological systems. The primary comprehensive source of these relations is biomedical literature. Several relation extraction approaches have been proposed to identify relations between concepts in biomedical literature, namely, using neural networks algorithms. The use of multichannel architectures composed of multiple data representations, as in deep neural networks, is leading to state-of-the-art results. The right combination of data representations can eventually lead us to even higher evaluation scores in relation extraction tasks. Thus, biomedical ontologies play a fundamental role by providing semantic and ancestry information about an entity. The incorporation of biomedical ontologies has already been proved to enhance previous state-of-the-art results.
Affiliation(s)
- Diana Sousa
- LASIGE, Faculdade de Ciências, Universidade de Lisboa, Lisbon, Portugal.
| | - Andre Lamurias
- LASIGE, Faculdade de Ciências, Universidade de Lisboa, Lisbon, Portugal
| | - Francisco M Couto
- LASIGE, Faculdade de Ciências, Universidade de Lisboa, Lisbon, Portugal
| |
|
43
|
Pandi MT, van der Spek PJ, Koromina M, Patrinos GP. A Novel Text-Mining Approach for Retrieving Pharmacogenomics Associations From the Literature. Front Pharmacol 2020; 11:602030. [PMID: 33343371 PMCID: PMC7748107 DOI: 10.3389/fphar.2020.602030] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Received: 09/02/2020] [Accepted: 09/30/2020] [Indexed: 11/13/2022] Open
Abstract
Text mining in biomedical literature is an emerging field which has already been shown to have a variety of implementations in many research areas, including genetics, personalized medicine, and pharmacogenomics. In this study, we describe a novel text-mining approach for the extraction of pharmacogenomics associations. The code that was used toward this end was implemented using R programming language, either through custom scripts, where needed, or through utilizing functions from existing libraries. Articles (abstracts or full texts) that correspond to a specified query were extracted from PubMed, while concept annotations were derived by PubTator Central. Terms that denote a Mutation or a Gene as well as Chemical compound terms corresponding to drug compounds were normalized and the sentences containing the aforementioned terms were filtered and preprocessed to create appropriate training sets. Finally, after training and adequate hyperparameter tuning, four text classifiers were created and evaluated (FastText, Linear kernel SVMs, XGBoost, and Lasso and Elastic-Net Regularized Generalized Linear Models) with regard to their performance in identifying pharmacogenomics associations. Although further improvements are essential toward proper implementation of this text-mining approach in the clinical practice, our study stands as a comprehensive, simplified, and up-to-date approach for the identification and assessment of research articles enriched in clinically relevant pharmacogenomics relationships. Furthermore, this work highlights a series of challenges concerning the effective application of text mining in biomedical literature, whose resolution could substantially contribute to the further development of this field.
Affiliation(s)
- Maria-Theodora Pandi
- Laboratory of Pharmacogenomics and Individualized Therapy, Department of Pharmacy, School of Health Sciences, University of Patras, Patras, Greece.,Erasmus University Medical Center, Faculty of Medicine and Health Sciences, Department of Pathology, Bioinformatics Unit, Rotterdam, Netherlands
| | - Peter J van der Spek
- Erasmus University Medical Center, Faculty of Medicine and Health Sciences, Department of Pathology, Bioinformatics Unit, Rotterdam, Netherlands
| | - Maria Koromina
- Laboratory of Pharmacogenomics and Individualized Therapy, Department of Pharmacy, School of Health Sciences, University of Patras, Patras, Greece
| | - George P Patrinos
- Laboratory of Pharmacogenomics and Individualized Therapy, Department of Pharmacy, School of Health Sciences, University of Patras, Patras, Greece.,Erasmus University Medical Center, Faculty of Medicine and Health Sciences, Department of Pathology, Bioinformatics Unit, Rotterdam, Netherlands.,Department of Pathology, College of Medicine and Health Sciences, United Arab Emirates University, Al-Ain, United Arab Emirates.,Zayed Center of Health Sciences, United Arab Emirates University, Al-Ain, United Arab Emirates
| |
|
44
|
Niss K, Jakobsson ME, Westergaard D, Belling KG, Olsen JV, Brunak S. Effects of active farnesoid X receptor on GLUTag enteroendocrine L cells. Mol Cell Endocrinol 2020; 517:110923. [PMID: 32702472 DOI: 10.1016/j.mce.2020.110923] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Received: 01/29/2020] [Revised: 05/27/2020] [Accepted: 06/23/2020] [Indexed: 12/21/2022]
Abstract
Activated transcription factor (TF) farnesoid X receptor (FXR) represses glucagon-like peptide-1 (GLP-1) secretion in enteroendocrine L cells. This, in turn, reduces insulin secretion, which is triggered when β cells bind GLP-1. Preventing FXR activation could boost GLP-1 production and insulin secretion. Yet, FXR's broader role in L cell biology still lacks understanding. Here, we show that FXR is a multifaceted TF in L cells using proteomics and gene expression data generated on GLUTag L cells. Most striking, 252 proteins regulated upon glucose stimulation have their abundances neutralized upon FXR activation. Mitochondrial repression or glucose import block are likely mechanisms of this. Further, FXR physically targets bile acid metabolism proteins, growth factors and other TFs, regulates ChREBP, while extensive text-mining found 30 FXR-regulated proteins to be well-known in L cell biology. Taken together, this outlines FXR as a powerful TF, where GLP-1 secretion block is just one of many downstream effects.
Affiliation(s)
- Kristoffer Niss
- Novo Nordisk Foundation Center for Protein Research, Faculty of Health and Medical Sciences, University of Copenhagen, DK-2200, Copenhagen, Denmark
| | - Magnus E Jakobsson
- Novo Nordisk Foundation Center for Protein Research, Faculty of Health and Medical Sciences, University of Copenhagen, DK-2200, Copenhagen, Denmark; Department of Immunotechnology, Lund University, Medicon Village, 22100, Lund, Sweden
| | - David Westergaard
- Novo Nordisk Foundation Center for Protein Research, Faculty of Health and Medical Sciences, University of Copenhagen, DK-2200, Copenhagen, Denmark; Dept. of Health Technology, Technical University of Denmark, DK-2800, Lyngby, Denmark
| | - Kirstine G Belling
- Novo Nordisk Foundation Center for Protein Research, Faculty of Health and Medical Sciences, University of Copenhagen, DK-2200, Copenhagen, Denmark
| | - Jesper V Olsen
- Novo Nordisk Foundation Center for Protein Research, Faculty of Health and Medical Sciences, University of Copenhagen, DK-2200, Copenhagen, Denmark
| | - Søren Brunak
- Novo Nordisk Foundation Center for Protein Research, Faculty of Health and Medical Sciences, University of Copenhagen, DK-2200, Copenhagen, Denmark; Dept. of Health Technology, Technical University of Denmark, DK-2800, Lyngby, Denmark.
| |
|
45
|
Gobeill J, Caucheteur D, Michel PA, Mottin L, Pasche E, Ruch P. SIB Literature Services: RESTful customizable search engines in biomedical literature, enriched with automatically mapped biomedical concepts. Nucleic Acids Res 2020; 48:W12-W16. [PMID: 32379317 PMCID: PMC7319474 DOI: 10.1093/nar/gkaa328] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Received: 03/10/2020] [Revised: 04/09/2020] [Accepted: 04/22/2020] [Indexed: 01/05/2023] Open
Abstract
Thanks to recent efforts by the text mining community, biocurators have now access to plenty of good tools and Web interfaces for identifying and visualizing biomedical entities in literature. Yet, many of these systems start with a PubMed query, which is limited by strong Boolean constraints. Some semantic search engines exploit entities for Information Retrieval, and/or deliver relevance-based ranked results. Yet, they are not designed for supporting a specific curation workflow, and allow very limited control on the search process. The Swiss Institute of Bioinformatics Literature Services (SIBiLS) provide personalized Information Retrieval in the biological literature. Indeed, SIBiLS allow fully customizable search in semantically enriched contents, based on keywords and/or mapped biomedical entities from a growing set of standardized and legacy vocabularies. The services have been used and favourably evaluated to assist the curation of genes and gene products, by delivering customized literature triage engines to different curation teams. SIBiLS (https://candy.hesge.ch/SIBiLS) are freely accessible via REST APIs and are ready to empower any curation workflow, built on modern technologies scalable with big data: MongoDB and Elasticsearch. They cover MEDLINE and PubMed Central Open Access enriched by nearly 2 billion of mapped biomedical entities, and are daily updated.
Affiliation(s)
- Julien Gobeill
- To whom correspondence should be addressed. Tel: +41 22 388 17 86; Fax: +41 22 546 97 38;
- Déborah Caucheteur
- BiTeM group, Information Sciences, HES-SO / HEG Geneva, 1227 Carouge, Switzerland
- Pierre-André Michel
- SIB Text Mining group, Swiss Institute of Bioinformatics, 1206 Geneva, Switzerland
- Luc Mottin
- BiTeM group, Information Sciences, HES-SO / HEG Geneva, 1227 Carouge, Switzerland
- Emilie Pasche
- SIB Text Mining group, Swiss Institute of Bioinformatics, 1206 Geneva, Switzerland
- BiTeM group, Information Sciences, HES-SO / HEG Geneva, 1227 Carouge, Switzerland
- Patrick Ruch
- Correspondence may also be addressed to Patrick Ruch. Tel: +41 22 388 17 81; Fax: +41 22 546 97 38;
46
Wang CCN, Jin J, Chang JG, Hayakawa M, Kitazawa A, Tsai JJP, Sheu PCY. Identification of most influential co-occurring gene suites for gastrointestinal cancer using biomedical literature mining and graph-based influence maximization. BMC Med Inform Decis Mak 2020; 20:208. [PMID: 32883271 PMCID: PMC7469322 DOI: 10.1186/s12911-020-01227-6] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Received: 02/19/2020] [Accepted: 08/20/2020] [Indexed: 12/02/2022]
Abstract
Background: Gastrointestinal (GI) cancers, including colorectal cancer, gastric cancer, pancreatic cancer, etc., are among the most frequent malignancies diagnosed annually and represent a major public health problem worldwide. Methods: This paper reports an aided curation pipeline to identify potential influential genes for gastrointestinal cancer. The curation pipeline integrates biomedical literature to identify named entities by Bi-LSTM-CNN-CRF methods. The entities and their associations can be used to construct a graph, from which we can compute the sets of co-occurring genes that are the most influential based on an influence maximization algorithm. Results: The most influential sets of co-occurring genes that we discover include RARA - CRBP1, CASP3 - BCL2, BCL2 - CASP3 - CRBP1, RARA - CASP3 - CRBP1, FOXJ1 - RASSF3 - ESR1, FOXJ1 - RASSF1A - ESR1, and FOXJ1 - RASSF1A - TNFAIP8 - ESR1. With TCGA and functional and pathway enrichment analysis, we prove the proposed approach works well in the context of gastrointestinal cancer. Conclusions: Our pipeline, which uses text mining to identify objects and relationships to construct a graph and uses graph-based influence maximization to discover the most influential co-occurring genes, presents a viable direction to assist knowledge discovery for clinical applications.
Affiliation(s)
- Charles C N Wang
- Department of Bioinformatics and Medical Engineering, Asia University, Taichung, Taiwan; Center for Artificial Intelligence in Precision Medicine, Asia University, Taichung, Taiwan
- Jennifer Jin
- Department of EECS and BME, University of California, Irvine, USA
- Jan-Gowth Chang
- Department of Laboratory Medicine, China Medical University Hospital, Taichung, Taiwan; Center for Precision Medicine, China Medical University Hospital, Taichung, Taiwan; Graduate Institute of Clinical Medical Science, School of Medicine, College of Medicine, China Medical University, Taichung, Taiwan
- Jeffrey J P Tsai
- Department of Bioinformatics and Medical Engineering, Asia University, Taichung, Taiwan
- Phillip C-Y Sheu
- Department of EECS and BME, University of California, Irvine, USA
47
Nguyen A, O'Dwyer J, Vu T, Webb PM, Johnatty SE, Spurdle AB. Generating high-quality data abstractions from scanned clinical records: text-mining-assisted extraction of endometrial carcinoma pathology features as proof of principle. BMJ Open 2020; 10:e037740. [PMID: 32532784 PMCID: PMC7295399 DOI: 10.1136/bmjopen-2020-037740] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.0] [Indexed: 11/20/2022]
Abstract
OBJECTIVE: Medical research studies often rely on the manual collection of data from scanned typewritten clinical records, which can be laborious, time consuming and error prone because of the need to review individual clinical records. We aimed to use text mining to assist with the extraction of clinical features from complex text-based scanned pathology records for medical research studies. DESIGN: Text mining performance was measured by extracting and annotating three distinct pathological features from scanned photocopies of endometrial carcinoma clinical pathology reports, and comparing results to manually abstracted terms. Inclusion and exclusion keyword trigger terms to capture leiomyomas, endometriosis and adenomyosis were provided based on expert knowledge. Terms were expanded with character variations based on common optical character recognition (OCR) error patterns as well as negation phrases found in sample reports. The approach was evaluated on an unseen test set of 1293 scanned pathology reports originating from laboratories across Australia. SETTING: Scanned typewritten pathology reports for women aged 18-79 years with newly diagnosed endometrial cancer (2005-2007) in Australia. RESULTS: High concordance with final abstracted codes was observed for identifying the presence of three pathology features (94%-98% F-measure). The approach was more consistent and reliable than manual abstractions, identifying 3%-14% additional feature instances. CONCLUSION: Keyword trigger-based automation with OCR error correction and negation handling proved not only to be rapid and convenient, but also to provide consistent and reliable data abstractions from scanned clinical records. In conjunction with manual review, it can assist in the generation of high-quality data abstractions for medical research studies.
Affiliation(s)
- Anthony Nguyen
- The Australian e-Health Research Centre, Commonwealth Scientific and Industrial Research Organisation, Brisbane, Queensland, Australia
- John O'Dwyer
- The Australian e-Health Research Centre, Commonwealth Scientific and Industrial Research Organisation, Brisbane, Queensland, Australia
- Thanh Vu
- The Australian e-Health Research Centre, Commonwealth Scientific and Industrial Research Organisation, Brisbane, Queensland, Australia
- Penelope M Webb
- Department of Genetics and Computational Biology, QIMR Berghofer Medical Research Institute, Brisbane, Queensland, Australia
- Sharon E Johnatty
- Department of Genetics and Computational Biology, QIMR Berghofer Medical Research Institute, Brisbane, Queensland, Australia
- Amanda B Spurdle
- Department of Genetics and Computational Biology, QIMR Berghofer Medical Research Institute, Brisbane, Queensland, Australia
48
Comeau DC, Wei CH, Islamaj Doğan R, Lu Z. PMC text mining subset in BioC: about three million full-text articles and growing. Bioinformatics 2020; 35:3533-3535. [PMID: 30715220 DOI: 10.1093/bioinformatics/btz070] [Citation(s) in RCA: 40] [Impact Index Per Article: 8.0] [Received: 08/03/2018] [Revised: 01/17/2018] [Accepted: 01/28/2019] [Indexed: 12/19/2022]
Abstract
MOTIVATION: Interest in text mining full-text biomedical research articles is growing. To facilitate automated processing of nearly 3 million full-text articles (in PubMed Central® Open Access and Author Manuscript subsets) and to improve interoperability, we convert these articles to BioC, a community-driven simple data structure in either XML or JavaScript Object Notation format for conveniently sharing text and annotations. RESULTS: The resultant articles can be downloaded via both File Transfer Protocol for bulk access and a Web API for updates or a more focused collection. Since the availability of the Web API in 2017, our BioC collection has been widely used by the research community. AVAILABILITY AND IMPLEMENTATION: https://www.ncbi.nlm.nih.gov/research/bionlp/APIs/BioC-PMC/.
Affiliation(s)
- Donald C Comeau
- National Center for Biotechnology Information (NCBI), U.S. National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, MD, USA
- Chih-Hsuan Wei
- National Center for Biotechnology Information (NCBI), U.S. National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, MD, USA
- Rezarta Islamaj Doğan
- National Center for Biotechnology Information (NCBI), U.S. National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, MD, USA
- Zhiyong Lu
- National Center for Biotechnology Information (NCBI), U.S. National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, MD, USA
49
Nicholson DN, Greene CS. Constructing knowledge graphs and their biomedical applications. Comput Struct Biotechnol J 2020; 18:1414-1428. [PMID: 32637040 PMCID: PMC7327409 DOI: 10.1016/j.csbj.2020.05.017] [Citation(s) in RCA: 97] [Impact Index Per Article: 19.4] [Received: 03/12/2020] [Revised: 05/22/2020] [Accepted: 05/23/2020] [Indexed: 12/31/2022]
Abstract
Knowledge graphs can support many biomedical applications. These graphs represent biomedical concepts and relationships in the form of nodes and edges. In this review, we discuss how these graphs are constructed and applied with a particular focus on how machine learning approaches are changing these processes. Biomedical knowledge graphs have often been constructed by integrating databases that were populated by experts via manual curation, but we are now seeing a more robust use of automated systems. A number of techniques are used to represent knowledge graphs, but often machine learning methods are used to construct a low-dimensional representation that can support many different applications. This representation is designed to preserve a knowledge graph's local and/or global structure. Additional machine learning methods can be applied to this representation to make predictions within genomic, pharmaceutical, and clinical domains. We frame our discussion first around knowledge graph construction and then around unifying representational learning techniques and unifying applications. Advances in machine learning for biomedicine are creating new opportunities across many domains, and we note potential avenues for future work with knowledge graphs that appear particularly promising.
Affiliation(s)
- David N. Nicholson
- Department of Systems Pharmacology and Translational Therapeutics, University of Pennsylvania, United States
- Casey S. Greene
- Department of Systems Pharmacology and Translational Therapeutics, Perelman School of Medicine, University of Pennsylvania; Childhood Cancer Data Lab, Alex’s Lemonade Stand Foundation, United States
50
Leaman R, Wei CH, Allot A, Lu Z. Ten tips for a text-mining-ready article: How to improve automated discoverability and interpretability. PLoS Biol 2020; 18:e3000716. [PMID: 32479517 PMCID: PMC7289435 DOI: 10.1371/journal.pbio.3000716] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.2] [Revised: 06/11/2020] [Indexed: 12/22/2022]
Abstract
Data-driven research in biomedical science requires structured, computable data. Increasingly, these data are created with support from automated text mining. Text-mining tools have rapidly matured: although not perfect, they now frequently provide outstanding results. We describe 10 straightforward writing tips—and a web tool, PubReCheck—guiding authors to help address the most common cases that remain difficult for text-mining tools. We anticipate these guides will help authors’ work be found more readily and used more widely, ultimately increasing the impact of their work and the overall benefit to both authors and readers. PubReCheck is available at http://www.ncbi.nlm.nih.gov/research/pubrecheck. Your published research is already being processed with automated tools, and text mining will become more common; this Community Page article describes how you can help these tools process your work more accurately, including a web tool, PubReCheck.
Affiliation(s)
- Robert Leaman
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, Maryland, United States of America
- Chih-Hsuan Wei
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, Maryland, United States of America
- Alexis Allot
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, Maryland, United States of America
- Zhiyong Lu
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, Maryland, United States of America