1. Rosonovski S, Levchenko M, Bhatnagar R, Chandrasekaran U, Faulk L, Hassan I, Jeffryes M, Mubashar SI, Nassar M, Jayaprabha Palanisamy M, Parkin M, Poluru J, Rogers F, Saha S, Selim M, Shafique Z, Ide-Smith M, Stephenson D, Tirunagari S, Venkatesan A, Xing L, Harrison M. Europe PMC in 2023. Nucleic Acids Res 2024; 52:D1668-D1676. [PMID: 37994696 PMCID: PMC10767826 DOI: 10.1093/nar/gkad1085] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/15/2023] [Revised: 10/23/2023] [Accepted: 10/30/2023] [Indexed: 11/24/2023] Open
Abstract
Europe PMC (https://europepmc.org/) is an open access database of life science journal articles and preprints, which contains over 42 million abstracts and over 9 million full text articles accessible via the website, APIs and bulk download. This publication outlines new developments to the Europe PMC platform since the last database update in 2020 (1) and focuses on five main areas. (i) Improving discoverability, reproducibility and trust in preprints by indexing new preprint content, enriching preprint metadata and identifying withdrawn and removed preprints. (ii) Enhancing support for text and data mining by expanding the types of annotations provided and developing the Europe PMC Annotations Corpus, which can be used to train machine learning models to increase their accuracy and precision. (iii) Developing the Article Status Monitor tool and email alerts, to notify users about new articles and updates to existing records. (iv) Positioning Europe PMC as an open scholarly infrastructure through increasing the portion of open source core software, improving sustainability and accessibility of the service.
Affiliation(s)
- Summer Rosonovski
  - Literature Services, EMBL-EBI, Wellcome Trust Genome Campus, Cambridge, UK
- Maria Levchenko
  - Literature Services, EMBL-EBI, Wellcome Trust Genome Campus, Cambridge, UK
- Rajat Bhatnagar
  - Literature Services, EMBL-EBI, Wellcome Trust Genome Campus, Cambridge, UK
- Lynne Faulk
  - Literature Services, EMBL-EBI, Wellcome Trust Genome Campus, Cambridge, UK
- Islam Hassan
  - Literature Services, EMBL-EBI, Wellcome Trust Genome Campus, Cambridge, UK
- Matt Jeffryes
  - Literature Services, EMBL-EBI, Wellcome Trust Genome Campus, Cambridge, UK
- Maaly Nassar
  - Literature Services, EMBL-EBI, Wellcome Trust Genome Campus, Cambridge, UK
- Michael Parkin
  - Literature Services, EMBL-EBI, Wellcome Trust Genome Campus, Cambridge, UK
- Frances Rogers
  - Literature Services, EMBL-EBI, Wellcome Trust Genome Campus, Cambridge, UK
- Shyamasree Saha
  - Literature Services, EMBL-EBI, Wellcome Trust Genome Campus, Cambridge, UK
- Mohamed Selim
  - Literature Services, EMBL-EBI, Wellcome Trust Genome Campus, Cambridge, UK
- Zunaira Shafique
  - Literature Services, EMBL-EBI, Wellcome Trust Genome Campus, Cambridge, UK
- Michele Ide-Smith
  - Literature Services, EMBL-EBI, Wellcome Trust Genome Campus, Cambridge, UK
- David Stephenson
  - Literature Services, EMBL-EBI, Wellcome Trust Genome Campus, Cambridge, UK
- Santosh Tirunagari
  - Literature Services, EMBL-EBI, Wellcome Trust Genome Campus, Cambridge, UK
- Aravind Venkatesan
  - Literature Services, EMBL-EBI, Wellcome Trust Genome Campus, Cambridge, UK
- Lijun Xing
  - Literature Services, EMBL-EBI, Wellcome Trust Genome Campus, Cambridge, UK
- Melissa Harrison
  - Literature Services, EMBL-EBI, Wellcome Trust Genome Campus, Cambridge, UK
2. Yang X, Saha S, Venkatesan A, Tirunagari S, Vartak V, McEntyre J. Europe PMC annotated full-text corpus for gene/proteins, diseases and organisms. Sci Data 2023; 10:722. [PMID: 37857688 PMCID: PMC10587067 DOI: 10.1038/s41597-023-02617-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/26/2023] [Accepted: 10/03/2023] [Indexed: 10/21/2023] Open
Abstract
Named entity recognition (NER) is a widely used text-mining and natural language processing (NLP) subtask. In recent years, deep learning methods have superseded traditional dictionary- and rule-based NER approaches. A high-quality dataset is essential to fully leverage recent deep learning advancements. While several gold-standard corpora for biomedical entities in abstracts exist, only a few are based on full-text research articles. The Europe PMC literature database routinely annotates Gene/Proteins, Diseases, and Organisms entities. To transition this pipeline from a dictionary-based to a machine learning-based approach, we have developed a human-annotated full-text corpus for these entities, comprising 300 full-text open-access research articles. Over 72,000 mentions of biomedical concepts have been identified within approximately 114,000 sentences. This article describes the corpus and details how to access and reuse this open community resource.
Affiliation(s)
- Xiao Yang
  - Literature Services, EMBL-EBI, Wellcome Trust Genome Campus, Cambridge, UK
- Shyamasree Saha
  - Literature Services, EMBL-EBI, Wellcome Trust Genome Campus, Cambridge, UK
  - Open Targets, Wellcome Genome Campus, Hinxton, Cambridge, CB10 1SD, UK
- Aravind Venkatesan
  - Literature Services, EMBL-EBI, Wellcome Trust Genome Campus, Cambridge, UK
- Santosh Tirunagari
  - Literature Services, EMBL-EBI, Wellcome Trust Genome Campus, Cambridge, UK
  - Open Targets, Wellcome Genome Campus, Hinxton, Cambridge, CB10 1SD, UK
- Vid Vartak
  - Literature Services, EMBL-EBI, Wellcome Trust Genome Campus, Cambridge, UK
- Johanna McEntyre
  - Literature Services, EMBL-EBI, Wellcome Trust Genome Campus, Cambridge, UK
3. Caucheteur D, May Pendlington Z, Roncaglia P, Gobeill J, Mottin L, Matentzoglu N, Agosti D, Osumi-Sutherland D, Parkinson H, Ruch P. COVoc and COVTriage: novel resources to support literature triage. Bioinformatics 2023; 39:6895097. [PMID: 36511598 PMCID: PMC9825781 DOI: 10.1093/bioinformatics/btac800] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/07/2022] [Revised: 10/28/2022] [Accepted: 12/12/2022] [Indexed: 12/15/2022] Open
Abstract
MOTIVATION: Since early 2020, the coronavirus disease 2019 (COVID-19) pandemic has confronted the biomedical community with an unprecedented challenge. The rapid spread of COVID-19 and ease of transmission seen worldwide is due to increased population flow and international trade. Front-line medical care, treatment research and vaccine development also require rapid and informative interpretation of the literature and COVID-19 data produced around the world, with 177 500 papers published between January 2020 and November 2021, i.e. almost 8500 papers per month. To extract knowledge and enable interoperability across resources, we developed the COVID-19 Vocabulary (COVoc), an application ontology related to the research on this pandemic. The main objective of COVoc development was to enable seamless navigation from biomedical literature to core databases and tools of ELIXIR, a European-wide intergovernmental organization for life sciences. RESULTS: This collaborative work provided data integration into SIB Literature services, an application ontology (COVoc) and a triage service named COVTriage and based on annotation processing to search for COVID-related information across pre-defined aspects with daily updates. Thanks to its interoperability potential, COVoc lends itself to wider applications, hopefully through further connections with other novel COVID-19 ontologies as has been established with Coronavirus Infectious Disease Ontology. AVAILABILITY AND IMPLEMENTATION: The data at https://github.com/EBISPOT/covoc and the service at https://candy.hesge.ch/COVTriage.
Affiliation(s)
- Zoë May Pendlington
  - European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Cambridge CB10 1SD, UK
- Paola Roncaglia
  - European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Cambridge CB10 1SD, UK
- Julien Gobeill
  - SIB Text Mining Group, Swiss Institute of Bioinformatics, Geneva 1206, Switzerland
  - BiTeM Group, Information Sciences, HES-SO/HEG Genève, Carouge 1227, Switzerland
- Luc Mottin
  - SIB Text Mining Group, Swiss Institute of Bioinformatics, Geneva 1206, Switzerland
  - BiTeM Group, Information Sciences, HES-SO/HEG Genève, Carouge 1227, Switzerland
  - Department of Microbiology and Molecular Medicine, Faculty of Medicine, University of Geneva, Geneva 1205, Switzerland
- Nicolas Matentzoglu
  - European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Cambridge CB10 1SD, UK
  - Semanticly Ltd, London, WC2H 9JQ, UK
- Donat Agosti
  - SIB Text Mining Group, Swiss Institute of Bioinformatics, Geneva 1206, Switzerland
  - Plazi, Bern 3007, Switzerland
- David Osumi-Sutherland
  - European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Cambridge CB10 1SD, UK
- Helen Parkinson
  - European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Cambridge CB10 1SD, UK
- Patrick Ruch
  - SIB Text Mining Group, Swiss Institute of Bioinformatics, Geneva 1206, Switzerland
  - BiTeM Group, Information Sciences, HES-SO/HEG Genève, Carouge 1227, Switzerland
4. Ahmed YW, Alemu BA, Bekele SA, Gizaw ST, Zerihun MF, Wabalo EK, Teklemariam MD, Mihrete TK, Hanurry EY, Amogne TG, Gebrehiwot AD, Berga TN, Haile EA, Edo DO, Alemu BD. Epigenetic tumor heterogeneity in the era of single-cell profiling with nanopore sequencing. Clin Epigenetics 2022; 14:107. [PMID: 36030244 PMCID: PMC9419648 DOI: 10.1186/s13148-022-01323-6] [Citation(s) in RCA: 7] [Impact Index Per Article: 3.5] [Received: 01/27/2022] [Accepted: 08/12/2022] [Indexed: 11/29/2022] Open
Abstract
Nanopore sequencing has brought the technology to the next generation in the science of sequencing. This is achieved through research advancing on: pore efficiency, creating mechanisms to control DNA translocation, enhancing signal-to-noise ratio, and expanding to long-read ranges. Heterogeneity regarding epigenetics would be broad as mutations in the epigenome are sensitive to cause new challenges in cancer research. Epigenetic enzymes which catalyze DNA methylation and histone modification are dysregulated in cancer cells and cause numerous heterogeneous clones to evolve. Detection of this heterogeneity in these clones plays an indispensable role in the treatment of various cancer types. With single-cell profiling, the nanopore sequencing technology could provide a simple sequence at long reads and is expected to be used soon at the bedside or doctor's office. Here, we review the advancements of nanopore sequencing and its use in the detection of epigenetic heterogeneity in cancer.
Affiliation(s)
- Yohannis Wondwosen Ahmed
  - Department of Medical Biochemistry, School of Medicine, College of Health Sciences, Addis Ababa University, P.O. Box: 9086, Addis Ababa, Ethiopia
- Berhan Ababaw Alemu
  - Department of Medical Biochemistry, School of Medicine, St. Paul's Hospital, Millennium Medical College, Addis Ababa, Ethiopia
- Sisay Addisu Bekele
  - Department of Medical Biochemistry, School of Medicine, College of Health Sciences, Addis Ababa University, P.O. Box: 9086, Addis Ababa, Ethiopia
- Solomon Tebeje Gizaw
  - Department of Medical Biochemistry, School of Medicine, College of Health Sciences, Addis Ababa University, P.O. Box: 9086, Addis Ababa, Ethiopia
- Muluken Fekadie Zerihun
  - Department of Medical Biochemistry, School of Medicine, College of Health Sciences, Addis Ababa University, P.O. Box: 9086, Addis Ababa, Ethiopia
- Endriyas Kelta Wabalo
  - Department of Medical Biochemistry, School of Medicine, College of Health Sciences, Addis Ababa University, P.O. Box: 9086, Addis Ababa, Ethiopia
- Maria Degef Teklemariam
  - Department of Medical Biochemistry, School of Medicine, College of Health Sciences, Addis Ababa University, P.O. Box: 9086, Addis Ababa, Ethiopia
- Tsehayneh Kelemu Mihrete
  - Department of Medical Biochemistry, School of Medicine, College of Health Sciences, Addis Ababa University, P.O. Box: 9086, Addis Ababa, Ethiopia
- Endris Yibru Hanurry
  - Department of Medical Biochemistry, School of Medicine, College of Health Sciences, Addis Ababa University, P.O. Box: 9086, Addis Ababa, Ethiopia
- Tensae Gebru Amogne
  - Department of Medical Biochemistry, School of Medicine, College of Health Sciences, Addis Ababa University, P.O. Box: 9086, Addis Ababa, Ethiopia
- Assaye Desalegne Gebrehiwot
  - Department of Medical Anatomy, School of Medicine, College of Health Sciences, Addis Ababa University, Addis Ababa, Ethiopia
- Tamirat Nida Berga
  - Department of Medical Biochemistry, School of Medicine, College of Health Sciences, Addis Ababa University, P.O. Box: 9086, Addis Ababa, Ethiopia
- Ebsitu Abate Haile
  - Department of Medical Biochemistry, School of Medicine, College of Health Sciences, Addis Ababa University, P.O. Box: 9086, Addis Ababa, Ethiopia
- Dessiet Oma Edo
  - Department of Medical Biochemistry, School of Medicine, College of Health Sciences, Addis Ababa University, P.O. Box: 9086, Addis Ababa, Ethiopia
- Bizuwork Derebew Alemu
  - Department of Statistics, College of Natural and Computational Sciences, Mizan Tepi University, Tepi, Ethiopia
5. Cognitive analysis of metabolomics data for systems biology. Nat Protoc 2021; 16:1376-1418. [PMID: 33483720 DOI: 10.1038/s41596-020-00455-4] [Citation(s) in RCA: 13] [Impact Index Per Article: 4.3] [Received: 03/27/2020] [Accepted: 10/27/2020] [Indexed: 01/30/2023]
Abstract
Cognitive computing is revolutionizing the way big data are processed and integrated, with artificial intelligence (AI) natural language processing (NLP) platforms helping researchers to efficiently search and digest the vast scientific literature. Most available platforms have been developed for biomedical researchers, but new NLP tools are emerging for biologists in other fields and an important example is metabolomics. NLP provides literature-based contextualization of metabolic features that decreases the time and expert-level subject knowledge required during the prioritization, identification and interpretation steps in the metabolomics data analysis pipeline. Here, we describe and demonstrate four workflows that combine metabolomics data with NLP-based literature searches of scientific databases to aid in the analysis of metabolomics data and their biological interpretation. The four procedures can be used in isolation or consecutively, depending on the research questions. The first, used for initial metabolite annotation and prioritization, creates a list of metabolites that would be interesting for follow-up. The second workflow finds literature evidence of the activity of metabolites and metabolic pathways in governing the biological condition on a systems biology level. The third is used to identify candidate biomarkers, and the fourth looks for metabolic conditions or drug-repurposing targets that the two diseases have in common. The protocol can take 1-4 h or more to complete, depending on the processing time of the various software used.
6. Ferguson C, Araújo D, Faulk L, Gou Y, Hamelers A, Huang Z, Ide-Smith M, Levchenko M, Marinos N, Nambiar R, Nassar M, Parkin M, Pi X, Rahman F, Rogers F, Roochun Y, Saha S, Selim M, Shafique Z, Sharma S, Stephenson D, Talo' F, Thouvenin A, Tirunagari S, Vartak V, Venkatesan A, Yang X, McEntyre J. Europe PMC in 2020. Nucleic Acids Res 2021; 49:D1507-D1514. [PMID: 33180112 PMCID: PMC7778976 DOI: 10.1093/nar/gkaa994] [Citation(s) in RCA: 30] [Impact Index Per Article: 10.0] [Received: 09/22/2020] [Revised: 10/08/2020] [Accepted: 10/19/2020] [Indexed: 12/23/2022] Open
Abstract
Europe PMC (https://europepmc.org) is a database of research articles, including peer reviewed full text articles and abstracts, and preprints - all freely available for use via website, APIs and bulk download. This article outlines new developments since 2017 where work has focussed on three key areas: (i) Europe PMC has added to its core content to include life science preprint abstracts and a special collection of full text of COVID-19-related preprints. Europe PMC is unique as an aggregator of biomedical preprints alongside peer-reviewed articles, with over 180 000 preprints available to search. (ii) Europe PMC has significantly expanded its links to content related to the publications, such as links to Unpaywall, providing wider access to full text, preprint peer-review platforms, all major curated data resources in the life sciences, and experimental protocols. The redesigned Europe PMC website features the PubMed abstract and corresponding PMC full text merged into one article page; there is more evident and user-friendly navigation within articles and to related content, plus a figure browse feature. (iii) The expanded annotations platform offers ∼1.3 billion text mined biological terms and concepts sourced from 10 providers and over 40 global data resources.
Affiliation(s)
- Christine Ferguson
  - Literature Services, EMBL-EBI, Wellcome Trust Genome Campus, Cambridge, UK
- Dayane Araújo
  - Literature Services, EMBL-EBI, Wellcome Trust Genome Campus, Cambridge, UK
- Lynne Faulk
  - Literature Services, EMBL-EBI, Wellcome Trust Genome Campus, Cambridge, UK
- Yuci Gou
  - Literature Services, EMBL-EBI, Wellcome Trust Genome Campus, Cambridge, UK
- Audrey Hamelers
  - Literature Services, EMBL-EBI, Wellcome Trust Genome Campus, Cambridge, UK
- Zhan Huang
  - Literature Services, EMBL-EBI, Wellcome Trust Genome Campus, Cambridge, UK
- Michele Ide-Smith
  - Literature Services, EMBL-EBI, Wellcome Trust Genome Campus, Cambridge, UK
- Maria Levchenko
  - Literature Services, EMBL-EBI, Wellcome Trust Genome Campus, Cambridge, UK
- Nikos Marinos
  - Literature Services, EMBL-EBI, Wellcome Trust Genome Campus, Cambridge, UK
- Rakesh Nambiar
  - Literature Services, EMBL-EBI, Wellcome Trust Genome Campus, Cambridge, UK
- Maaly Nassar
  - Literature Services, EMBL-EBI, Wellcome Trust Genome Campus, Cambridge, UK
- Michael Parkin
  - Literature Services, EMBL-EBI, Wellcome Trust Genome Campus, Cambridge, UK
- Xingjun Pi
  - Literature Services, EMBL-EBI, Wellcome Trust Genome Campus, Cambridge, UK
- Faisal Rahman
  - Literature Services, EMBL-EBI, Wellcome Trust Genome Campus, Cambridge, UK
- Frances Rogers
  - Literature Services, EMBL-EBI, Wellcome Trust Genome Campus, Cambridge, UK
- Yogmatee Roochun
  - Literature Services, EMBL-EBI, Wellcome Trust Genome Campus, Cambridge, UK
- Shyamasree Saha
  - Literature Services, EMBL-EBI, Wellcome Trust Genome Campus, Cambridge, UK
- Mohamed Selim
  - Literature Services, EMBL-EBI, Wellcome Trust Genome Campus, Cambridge, UK
- Zunaira Shafique
  - Literature Services, EMBL-EBI, Wellcome Trust Genome Campus, Cambridge, UK
- Shrey Sharma
  - Literature Services, EMBL-EBI, Wellcome Trust Genome Campus, Cambridge, UK
- David Stephenson
  - Literature Services, EMBL-EBI, Wellcome Trust Genome Campus, Cambridge, UK
- Francesco Talo'
  - Literature Services, EMBL-EBI, Wellcome Trust Genome Campus, Cambridge, UK
- Arthur Thouvenin
  - Literature Services, EMBL-EBI, Wellcome Trust Genome Campus, Cambridge, UK
- Santosh Tirunagari
  - Literature Services, EMBL-EBI, Wellcome Trust Genome Campus, Cambridge, UK
- Vid Vartak
  - Literature Services, EMBL-EBI, Wellcome Trust Genome Campus, Cambridge, UK
- Aravind Venkatesan
  - Literature Services, EMBL-EBI, Wellcome Trust Genome Campus, Cambridge, UK
- Xiao Yang
  - Literature Services, EMBL-EBI, Wellcome Trust Genome Campus, Cambridge, UK
- Johanna McEntyre
  - Literature Services, EMBL-EBI, Wellcome Trust Genome Campus, Cambridge, UK
7. Gobeill J, Caucheteur D, Michel PA, Mottin L, Pasche E, Ruch P. SIB Literature Services: RESTful customizable search engines in biomedical literature, enriched with automatically mapped biomedical concepts. Nucleic Acids Res 2020; 48:W12-W16. [PMID: 32379317 PMCID: PMC7319474 DOI: 10.1093/nar/gkaa328] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Received: 03/10/2020] [Revised: 04/09/2020] [Accepted: 04/22/2020] [Indexed: 01/05/2023] Open
Abstract
Thanks to recent efforts by the text mining community, biocurators have now access to plenty of good tools and Web interfaces for identifying and visualizing biomedical entities in literature. Yet, many of these systems start with a PubMed query, which is limited by strong Boolean constraints. Some semantic search engines exploit entities for Information Retrieval, and/or deliver relevance-based ranked results. Yet, they are not designed for supporting a specific curation workflow, and allow very limited control on the search process. The Swiss Institute of Bioinformatics Literature Services (SIBiLS) provide personalized Information Retrieval in the biological literature. Indeed, SIBiLS allow fully customizable search in semantically enriched contents, based on keywords and/or mapped biomedical entities from a growing set of standardized and legacy vocabularies. The services have been used and favourably evaluated to assist the curation of genes and gene products, by delivering customized literature triage engines to different curation teams. SIBiLS (https://candy.hesge.ch/SIBiLS) are freely accessible via REST APIs and are ready to empower any curation workflow, built on modern technologies scalable with big data: MongoDB and Elasticsearch. They cover MEDLINE and PubMed Central Open Access enriched by nearly 2 billion of mapped biomedical entities, and are daily updated.
Affiliation(s)
- Julien Gobeill
  - To whom correspondence should be addressed. Tel: +41 22 388 17 86; Fax: +41 22 546 97 38;
- Déborah Caucheteur
  - BiTeM group, Information Sciences, HES-SO / HEG Geneva, 1227 Carouge, Switzerland
- Pierre-André Michel
  - SIB Text Mining group, Swiss Institute of Bioinformatics, 1206 Geneva, Switzerland
- Luc Mottin
  - BiTeM group, Information Sciences, HES-SO / HEG Geneva, 1227 Carouge, Switzerland
- Emilie Pasche
  - SIB Text Mining group, Swiss Institute of Bioinformatics, 1206 Geneva, Switzerland
  - BiTeM group, Information Sciences, HES-SO / HEG Geneva, 1227 Carouge, Switzerland
- Patrick Ruch
  - Correspondence may also be addressed to Patrick Ruch. Tel: +41 22 388 17 81; Fax: +41 22 546 97 38;
8. Wei CH, Allot A, Leaman R, Lu Z. PubTator central: automated concept annotation for biomedical full text articles. Nucleic Acids Res 2019; 47:W587-W593. [PMID: 31114887 DOI: 10.1093/nar/gkz389] [Citation(s) in RCA: 175] [Impact Index Per Article: 43.8] [Received: 02/17/2019] [Revised: 04/08/2019] [Accepted: 04/30/2019] [Indexed: 11/12/2022] Open
Abstract
PubTator Central (https://www.ncbi.nlm.nih.gov/research/pubtator/) is a web service for viewing and retrieving bioconcept annotations in full text biomedical articles. PubTator Central (PTC) provides automated annotations from state-of-the-art text mining systems for genes/proteins, genetic variants, diseases, chemicals, species and cell lines, all available for immediate download. PTC annotates PubMed (29 million abstracts) and the PMC Text Mining subset (3 million full text articles). The new PTC web interface allows users to build full text document collections and visualize concept annotations in each document. Annotations are downloadable in multiple formats (XML, JSON and tab delimited) via the online interface, a RESTful web service and bulk FTP. Improved concept identification systems and a new disambiguation module based on deep learning increase annotation accuracy, and the new server-side architecture is significantly faster. PTC is synchronized with PubMed and PubMed Central, with new articles added daily. The original PubTator service has served annotated abstracts for ∼300 million requests, enabling third-party research in use cases such as biocuration support, gene prioritization, genetic disease analysis, and literature-based knowledge discovery. We demonstrate the full text results in PTC significantly increase biomedical concept coverage and anticipate this expansion will both enhance existing downstream applications and enable new use cases.
Affiliation(s)
- Chih-Hsuan Wei
  - National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, MD, USA
- Alexis Allot
  - National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, MD, USA
- Robert Leaman
  - National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, MD, USA
- Zhiyong Lu
  - National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, MD, USA
9. Palopoli N, Iserte JA, Chemes LB, Marino-Buslje C, Parisi G, Gibson TJ, Davey NE. The articles.ELM resource: simplifying access to protein linear motif literature by annotation, text-mining and classification. Database (Oxford) 2020; 2020:baaa040. [PMID: 32507889 PMCID: PMC7276420 DOI: 10.1093/database/baaa040] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Received: 11/24/2019] [Revised: 04/24/2020] [Accepted: 05/06/2020] [Indexed: 11/12/2022]
Abstract
Modern biology produces data at a staggering rate. Yet, much of these biological data is still isolated in the text, figures, tables and supplementary materials of articles. As a result, biological information created at great expense is significantly underutilised. The protein motif biology field does not have sufficient resources to curate the corpus of motif-related literature and, to date, only a fraction of the available articles have been curated. In this study, we develop a set of tools and a web resource, 'articles.ELM', to rapidly identify the motif literature articles pertinent to a researcher's interest. At the core of the resource is a manually curated set of about 8000 motif-related articles. These articles are automatically annotated with a range of relevant biological data allowing in-depth search functionality. Machine-learning article classification is used to group articles based on their similarity to manually curated motif classes in the Eukaryotic Linear Motif resource. Articles can also be manually classified within the resource. The 'articles.ELM' resource permits the rapid and accurate discovery of relevant motif articles thereby improving the visibility of motif literature and simplifying the recovery of valuable biological insights sequestered within scientific articles. Consequently, this web resource removes a critical bottleneck in scientific productivity for the motif biology field. Database URL: http://slim.icr.ac.uk/articles/.
Affiliation(s)
- N Palopoli
  - Departamento de Ciencia y Tecnología, Universidad Nacional de Quilmes, CONICET, Roque Saenz Peña 352, Bernal, Buenos Aires B1876BXD, Argentina
- J A Iserte
  - Fundación Instituto Leloir, Instituto de Investigaciones Bioquímicas de Buenos Aires, CONICET, Av. Patricias Argentinas 435, Ciudad de Buenos Aires C1405BWE, Argentina
- L B Chemes
  - Instituto de Investigaciones Biotecnológicas, Universidad Nacional de General San Martín, IIB-INTECH-CONICET, Av. 25 de Mayo y Francia, San Martín, Buenos Aires B1650, Argentina
- C Marino-Buslje
  - Fundación Instituto Leloir, Instituto de Investigaciones Bioquímicas de Buenos Aires, CONICET, Av. Patricias Argentinas 435, Ciudad de Buenos Aires C1405BWE, Argentina
- G Parisi
  - Departamento de Ciencia y Tecnología, Universidad Nacional de Quilmes, CONICET, Roque Saenz Peña 352, Bernal, Buenos Aires B1876BXD, Argentina
- T J Gibson
  - Structural and Computational Biology Unit, European Molecular Biology Laboratory, Meyerhofstraße 1, Heidelberg 69117, Germany
- N E Davey
  - Division of Cancer Biology, The Institute of Cancer Research, 237 Fulham Road, London SW3 6JB, UK
10. Holmås S, Riudavets Puig R, Acencio ML, Mironov V, Kuiper M. The Cytoscape BioGateway App: explorative network building from the BioGateway triple store. Bioinformatics 2019; 36:btz835. [PMID: 31710663 PMCID: PMC7703768 DOI: 10.1093/bioinformatics/btz835] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.4] [Received: 06/17/2019] [Revised: 10/11/2019] [Accepted: 11/05/2019] [Indexed: 01/04/2023] Open
Abstract
SUMMARY: The BioGateway App is a Cytoscape (version 3) plugin designed to provide easy query access to the BioGateway RDF triple store, which contains functional and interaction information for proteins from several curated resources. For explorative network building, we have added a comprehensive dataset with regulatory relationships of mammalian DNA binding transcription factors and their target genes, compiled both from curated resources and from a text mining effort. Query results are visualised using the inherent flexibility of the Cytoscape framework, and network links can be checked against curated database records or against the original publication. AVAILABILITY: Install through the Cytoscape application manager or visit www.biogateway.eu for download and tutorial documents. SUPPLEMENTARY INFORMATION: Supplementary information is available at Bioinformatics online.
Affiliation(s)
- Stian Holmås
  - Semantic Systems Biology Group, Department of Biology
- Marcio Luis Acencio
  - Department of Clinical and Molecular Medicine, Norwegian University of Science and Technology, 7491 Trondheim, Norway
- Martin Kuiper
  - Semantic Systems Biology Group, Department of Biology
11. Firth R, Talo F, Venkatesan A, Mukhopadhyay A, McEntyre J, Velankar S, Morris C. Automatic annotation of protein residues in published papers. Acta Crystallogr F Struct Biol Commun 2019; 75:665-672. [PMID: 31702580 DOI: 10.1107/s2053230x1901210x] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Received: 12/06/2018] [Accepted: 09/01/2019] [Indexed: 11/10/2022] Open
Abstract
This work presents an annotation tool that automatically locates mentions of particular amino-acid residues in published papers and identifies the protein concerned. These matches can be provided in context or in a searchable format in order for researchers to better use the existing and future literature.
Affiliation(s)
- Robert Firth
  - STFC, Daresbury Laboratory, Warrington WA4 4AD, England
- Francesco Talo
  - European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, England
- Aravind Venkatesan
  - European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, England
- Abhik Mukhopadhyay
  - European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, England
- Johanna McEntyre
  - European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, England
- Sameer Velankar
  - European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, England
- Chris Morris
  - STFC, Daresbury Laboratory, Warrington WA4 4AD, England
12
Levchenko M, Gou Y, Graef F, Hamelers A, Huang Z, Ide-Smith M, Iyer A, Kilian O, Katuri J, Kim JH, Marinos N, Nambiar R, Parkin M, Pi X, Rogers F, Talo F, Vartak V, Venkatesan A, McEntyre J. Europe PMC in 2017. Nucleic Acids Res 2018; 46:D1254-D1260. [PMID: 29161421 PMCID: PMC5753258 DOI: 10.1093/nar/gkx1005] [Citation(s) in RCA: 37] [Impact Index Per Article: 7.4] [Received: 09/15/2017] [Accepted: 11/13/2017] [Indexed: 11/13/2022] Open
Abstract
Europe PMC (https://europepmc.org) is a comprehensive resource of biomedical research publications that offers advanced tools for search, retrieval, and interaction with the scientific literature. This article outlines new developments since 2014. In addition to delivering the core database and services, Europe PMC focuses on three areas of development: individual user services, data integration, and infrastructure to support text and data mining. Europe PMC now provides user accounts to save search queries and claim publications to ORCIDs, as well as open access profiles for authors based on public ORCID records. We continue to foster connections between scientific data and literature in a number of ways. All the data behind the paper - whether in structured archives, generic archives or as supplemental files - are now available via links to the BioStudies database. Text-mined biological concepts, including database accession numbers and data DOIs, are highlighted in the text and linked to the appropriate data resources. The SciLite community annotation platform accepts text-mining results from various contributors and overlays them on research articles as licence allows. In addition, text miners and developers can access all open content via APIs or via the FTP site.
Affiliation(s)
- Maria Levchenko
  - European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Cambridge, UK
- Yuci Gou
  - European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Cambridge, UK
- Florian Graef
  - European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Cambridge, UK
- Audrey Hamelers
  - European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Cambridge, UK
- Zhan Huang
  - European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Cambridge, UK
- Michele Ide-Smith
  - European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Cambridge, UK
- Anusha Iyer
  - European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Cambridge, UK
- Oliver Kilian
  - European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Cambridge, UK
- Jyothi Katuri
  - European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Cambridge, UK
- Jee-Hyub Kim
  - European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Cambridge, UK
- Nikos Marinos
  - European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Cambridge, UK
- Rakesh Nambiar
  - European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Cambridge, UK
- Michael Parkin
  - European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Cambridge, UK
- Xingjun Pi
  - European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Cambridge, UK
- Frances Rogers
  - European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Cambridge, UK
- Francesco Talo
  - European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Cambridge, UK
- Vid Vartak
  - European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Cambridge, UK
- Aravind Venkatesan
  - European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Cambridge, UK
- Johanna McEntyre
  - European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Cambridge, UK
13
Vamathevan J, Apweiler R, Birney E. Biomolecular Data Resources: Bioinformatics Infrastructure for Biomedical Data Science. Annu Rev Biomed Data Sci 2019. [DOI: 10.1146/annurev-biodatasci-072018-021321] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.0] [Indexed: 11/09/2022]
Abstract
Technological advances have continuously driven the generation of bio-molecular data and the development of bioinformatics infrastructure, which enables data reuse for scientific discovery. Several types of data management resources have arisen, such as data deposition databases, added-value databases or knowledgebases, and biology-driven portals. In this review, we provide a unique overview of the gradual evolution of these resources and discuss the goals and features that must be considered in their development. With the increasing application of genomics in the health care context and with 60 to 500 million whole genomes estimated to be sequenced by 2022, biomedical research infrastructure is transforming, too. Systems for federated access, portable tools, provision of reference data, and interpretation tools will enable researchers to derive maximal benefits from these data. Collaboration, coordination, and sustainability of data resources are key to ensure that biomedical knowledge management can scale with technology shifts and growing data volumes.
Affiliation(s)
- Jessica Vamathevan
  - European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, United Kingdom
- Rolf Apweiler
  - European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, United Kingdom
- Ewan Birney
  - European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, United Kingdom
14
Palmblad M. Visual and Semantic Enrichment of Analytical Chemistry Literature Searches by Combining Text Mining and Computational Chemistry. Anal Chem 2019; 91:4312-4316. [PMID: 30835438 PMCID: PMC6448173 DOI: 10.1021/acs.analchem.8b05818] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Indexed: 11/30/2022]
Abstract
The open-access scientific literature contains a wealth of information for meaningful text mining. However, this information is not always easy to retrieve. This technical note addresses the problem by a new flexible method combining in a single workflow existing resources for literature searches, text mining, and large-scale prediction of physicochemical and biological properties. The results are visualized as virtual mass spectra, chromatograms, or images in styles new to text mining but familiar to analytical chemistry. The method is demonstrated on comparisons of analytical-chemistry techniques and semantically enriched searches for proteins and their activities, but it may also be of general utility in experimental design, drug discovery, chemical syntheses, business intelligence, and historical studies. The method is realized in shareable scientific workflows using only freely available data, services, and software that scale to millions of publications and named chemical entities in the literature.
Affiliation(s)
- Magnus Palmblad
  - Center for Proteomics and Metabolomics, Leiden University Medical Center, Postzone S3-P, Postbus 9600, 2300 RC Leiden, The Netherlands
15
Cook CE, Lopez R, Stroe O, Cochrane G, Brooksbank C, Birney E, Apweiler R. The European Bioinformatics Institute in 2018: tools, infrastructure and training. Nucleic Acids Res 2019; 47:D15-D22. [PMID: 30445657 PMCID: PMC6323906 DOI: 10.1093/nar/gky1124] [Citation(s) in RCA: 27] [Impact Index Per Article: 5.4] [Received: 10/03/2018] [Revised: 10/19/2018] [Accepted: 11/11/2018] [Indexed: 02/03/2023] Open
Abstract
The European Bioinformatics Institute (https://www.ebi.ac.uk/) archives, curates and analyses life sciences data produced by researchers throughout the world, and makes these data available for re-use globally. Data volumes continue to grow exponentially: total raw storage capacity now exceeds 160 petabytes, and we manage these increasing data flows while maintaining the quality of our services. This year we have improved the efficiency of our computational infrastructure and doubled the bandwidth of our connection to the World Wide Web. We report two new data resources, the Single Cell Expression Atlas (https://www.ebi.ac.uk/gxa/sc/), which is a component of the Expression Atlas; and the PDBe-Knowledgebase (https://www.ebi.ac.uk/pdbe/pdbe-kb), which collates functional annotations and predictions for structure data in the Protein Data Bank. Additionally, Europe PMC (http://europepmc.org/) has added preprint abstracts to its search results, supplementing results from peer-reviewed publications. EMBL-EBI maintains over 150 analytical bioinformatics tools that complement our data resources. We make these tools available for users through a web interface as well as programmatically using application programming interfaces, whilst ensuring the latest versions are available for our users. Our training team, with support from all of our staff, continued to provide on-site, off-site and web-based training opportunities for thousands of researchers worldwide this year.
Affiliation(s)
- Charles E Cook
  - European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK
- Rodrigo Lopez
  - European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK
- Oana Stroe
  - European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK
- Guy Cochrane
  - European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK
- Cath Brooksbank
  - European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK
- Ewan Birney
  - European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK
- Rolf Apweiler
  - European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK
16
Venkatesan A, Tagny Ngompe G, Hassouni NE, Chentli I, Guignon V, Jonquet C, Ruiz M, Larmande P. Agronomic Linked Data (AgroLD): A knowledge-based system to enable integrative biology in agronomy. PLoS One 2018; 13:e0198270. [PMID: 30500839 PMCID: PMC6269127 DOI: 10.1371/journal.pone.0198270] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.3] [Received: 05/14/2018] [Accepted: 09/03/2018] [Indexed: 12/22/2022] Open
Abstract
Recent advances in high-throughput technologies have resulted in a tremendous increase in the amount of omics data produced in plant science. This increase, in conjunction with the heterogeneity and variability of the data, presents a major challenge to adopting an integrative research approach. We are facing an urgent need to effectively integrate and assimilate complementary datasets to understand the biological system as a whole. The Semantic Web offers technologies for the integration of heterogeneous data and their transformation into explicit knowledge thanks to ontologies. We have developed Agronomic Linked Data (AgroLD, www.agrold.org), a knowledge-based system relying on Semantic Web technologies and exploiting standard domain ontologies, to integrate data about plant species of high interest for the plant science community, e.g., rice, wheat and arabidopsis. We present some integration results of the project, which initially focused on genomics, proteomics and phenomics. AgroLD is now an RDF (Resource Description Framework) knowledge base of 100M triples created by annotating and integrating more than 50 datasets coming from 10 data sources (such as Gramene.org and TropGeneDB) with 10 ontologies (such as the Gene Ontology and Plant Trait Ontology). Our evaluation results show users appreciate the multiple query modes which support different use cases. AgroLD's objective is to offer a domain specific knowledge platform to solve complex biological and agronomical questions related to the implication of genes/proteins in, for instance, plant disease resistance or high yield traits. We expect the resolution of these questions to facilitate the formulation of new scientific hypotheses to be validated with a knowledge-oriented approach.
Affiliation(s)
- Aravind Venkatesan
  - Institut de Biologie Computationnelle (IBC), Univ. of Montpellier, Montpellier, France
  - LIRMM, Univ. of Montpellier & CNRS, Montpellier, France
- Gildas Tagny Ngompe
  - Institut de Biologie Computationnelle (IBC), Univ. of Montpellier, Montpellier, France
  - LIRMM, Univ. of Montpellier & CNRS, Montpellier, France
- Nordine El Hassouni
  - Institut de Biologie Computationnelle (IBC), Univ. of Montpellier, Montpellier, France
  - UMR AGAP, CIRAD, Montpellier, France
  - South Green Bioinformatics Platform, Montpellier, France
- Imene Chentli
  - Institut de Biologie Computationnelle (IBC), Univ. of Montpellier, Montpellier, France
  - LIRMM, Univ. of Montpellier & CNRS, Montpellier, France
- Valentin Guignon
  - South Green Bioinformatics Platform, Montpellier, France
  - Bioversity International, Montpellier, France
- Clement Jonquet
  - Institut de Biologie Computationnelle (IBC), Univ. of Montpellier, Montpellier, France
  - LIRMM, Univ. of Montpellier & CNRS, Montpellier, France
- Manuel Ruiz
  - Institut de Biologie Computationnelle (IBC), Univ. of Montpellier, Montpellier, France
  - UMR AGAP, CIRAD, Montpellier, France
  - South Green Bioinformatics Platform, Montpellier, France
  - AGAP, Univ. of Montpellier, CIRAD, INRA, INRIA, SupAgro, Montpellier, France
- Pierre Larmande
  - Institut de Biologie Computationnelle (IBC), Univ. of Montpellier, Montpellier, France
  - LIRMM, Univ. of Montpellier & CNRS, Montpellier, France
  - South Green Bioinformatics Platform, Montpellier, France
  - DIADE, IRD, Univ. of Montpellier, Montpellier, France
17
Thompson P, Daikou S, Ueno K, Batista-Navarro R, Tsujii J, Ananiadou S. Annotation and detection of drug effects in text for pharmacovigilance. J Cheminform 2018; 10:37. [PMID: 30105604 PMCID: PMC6089860 DOI: 10.1186/s13321-018-0290-y] [Citation(s) in RCA: 17] [Impact Index Per Article: 2.8] [Received: 01/12/2018] [Accepted: 07/20/2018] [Indexed: 02/02/2023] Open
Abstract
Pharmacovigilance (PV) databases record the benefits and risks of different drugs, as a means to ensure their safe and effective use. Creating and maintaining such resources can be complex, since a particular medication may have divergent effects in different individuals, due to specific patient characteristics and/or interactions with other drugs being administered. Textual information from various sources can provide important evidence to curators of PV databases about the usage and effects of drug targets in different medical subjects. However, the efficient identification of relevant evidence can be challenging, due to the increasing volume of textual data. Text mining (TM) techniques can support curators by automatically detecting complex information, such as interactions between drugs, diseases and adverse effects. This semantic information supports the quick identification of documents containing information of interest (e.g., the different types of patients in which a given adverse drug reaction has been observed to occur). TM tools are typically adapted to different domains by applying machine learning methods to corpora that are manually labelled by domain experts using annotation guidelines to ensure consistency. We present a semantically annotated corpus of 597 MEDLINE abstracts, PHAEDRA, encoding rich information on drug effects and their interactions, whose quality is assured through the use of detailed annotation guidelines and the demonstration of high levels of inter-annotator agreement (e.g., 92.6% F-Score for identifying named entities and 78.4% F-Score for identifying complex events, when relaxed matching criteria are applied). To our knowledge, the corpus is unique in the domain of PV, according to the level of detail of its annotations. To illustrate the utility of the corpus, we have trained TM tools based on its rich labels to recognise drug effects in text automatically. 
The corpus and annotation guidelines are available at: http://www.nactem.ac.uk/PHAEDRA/.
Affiliation(s)
- Paul Thompson
  - National Centre for Text Mining, School of Computer Science, Manchester Institute of Biotechnology, University of Manchester, 131 Princess Street, Manchester, M1 7DN UK
- Sophia Daikou
  - National Centre for Text Mining, School of Computer Science, Manchester Institute of Biotechnology, University of Manchester, 131 Princess Street, Manchester, M1 7DN UK
- Kenju Ueno
  - Artificial Intelligence Research Center, National Research and Development Agency (AIST), Tokyo Waterfront 2-3-2 Aomi, Koto-ku, Tokyo, 135-0064 Japan
- Riza Batista-Navarro
  - National Centre for Text Mining, School of Computer Science, Manchester Institute of Biotechnology, University of Manchester, 131 Princess Street, Manchester, M1 7DN UK
- Jun’ichi Tsujii
  - National Centre for Text Mining, School of Computer Science, Manchester Institute of Biotechnology, University of Manchester, 131 Princess Street, Manchester, M1 7DN UK
  - Artificial Intelligence Research Center, National Research and Development Agency (AIST), Tokyo Waterfront 2-3-2 Aomi, Koto-ku, Tokyo, 135-0064 Japan
- Sophia Ananiadou
  - National Centre for Text Mining, School of Computer Science, Manchester Institute of Biotechnology, University of Manchester, 131 Princess Street, Manchester, M1 7DN UK
18
Britan A, Cusin I, Hinard V, Mottin L, Pasche E, Gobeill J, Rech de Laval V, Gleizes A, Teixeira D, Michel PA, Ruch P, Gaudet P. Accelerating annotation of articles via automated approaches: evaluation of the neXtA5 curation-support tool by neXtProt. Database (Oxford) 2018; 2018:5255187. [PMID: 30576492 PMCID: PMC6301339 DOI: 10.1093/database/bay129] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.7] [Received: 07/21/2018] [Revised: 10/04/2018] [Accepted: 11/09/2018] [Indexed: 11/14/2022]
Abstract
The development of efficient text-mining tools promises to boost the curation workflow by significantly reducing the time needed to process the literature into biological databases. We have developed a curation support tool, neXtA5, that provides a search engine coupled with an annotation system directly integrated into a biocuration workflow. neXtA5 assists curation with modules optimized for the various curation tasks: document triage, entity recognition and information extraction. Here, we describe the evaluation of neXtA5 by expert curators. We first assessed the annotations of two independent curators to provide a baseline for comparison. To evaluate the performance of neXtA5, we submitted requests and compared the neXtA5 results with the manual curation. The analysis focuses on the usability of neXtA5 to support the curation of two types of data: biological processes (BPs) and diseases (Ds). We evaluated the relevance of the papers proposed as well as the recall and precision of the suggested annotations. The evaluation of the precision of document triage by neXtA5 showed that both curators agree with neXtA5 for 67% (BP) and 63% (D) of abstracts, while curators agree on accepting or rejecting an abstract ~80% of the time. Hence, the precision of the triage system is satisfactory. For concept extraction, curators approved 35% (BP) and 25% (D) of the neXtA5 annotations. Conversely, neXtA5 successfully annotated up to 36% (BP) and 68% (D) of the terms identified by curators. The user feedback obtained in these tests highlighted the need for improvement in the ranking function of neXtA5 annotations. Therefore, we transformed the information extraction component into an annotation ranking system. This improvement results in a top precision (precision at first rank) of 59% (D) and 63% (BP). These results suggest that when considering only the first extracted entity, the current system achieves a precision comparable with expert biocurators.
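The measures named in this abstract (precision and recall of suggested annotations, and precision at first rank) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the evaluation code used by the authors: the function names and the toy annotation labels are hypothetical, and an annotation is treated as correct only if it exactly matches a curator's term.

```python
def precision_recall(suggested, gold):
    """Precision and recall of suggested annotations against curator (gold) terms."""
    suggested, gold = set(suggested), set(gold)
    true_positives = len(suggested & gold)
    precision = true_positives / len(suggested) if suggested else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


def precision_at_first_rank(ranked_lists, gold_sets):
    """Fraction of documents whose top-ranked suggestion is a curator term."""
    if not ranked_lists:
        return 0.0
    hits = sum(
        1
        for ranked, gold in zip(ranked_lists, gold_sets)
        if ranked and ranked[0] in gold
    )
    return hits / len(ranked_lists)


if __name__ == "__main__":
    # Toy data, invented for illustration only.
    tool_terms = ["apoptosis", "cell cycle", "autophagy", "transport"]
    curator_terms = ["apoptosis", "cell cycle", "DNA repair"]
    print(precision_recall(tool_terms, curator_terms))

    ranked = [["apoptosis", "transport"], ["transport", "DNA repair"]]
    gold = [{"apoptosis"}, {"DNA repair"}]
    print(precision_at_first_rank(ranked, gold))
```

Precision corresponds to the share of tool annotations that curators approve, recall to the share of curator terms that the tool recovers, and precision at first rank looks only at the single top suggestion per document.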
Affiliation(s)
- Aurore Britan
  - Computer and Laboratory Investigation of Proteins of Human Origin Group, SIB Swiss Institute of Bioinformatics, Geneva 4, Switzerland
- Isabelle Cusin
  - Computer and Laboratory Investigation of Proteins of Human Origin Group, SIB Swiss Institute of Bioinformatics, Geneva 4, Switzerland
- Valérie Hinard
  - Computer and Laboratory Investigation of Proteins of Human Origin Group, SIB Swiss Institute of Bioinformatics, Geneva 4, Switzerland
- Luc Mottin
  - Haute école spécialisée de Suisse occidentale, Haute Ecole de Gestion de Genève, Carouge, Switzerland
  - SIB Text Mining, SIB Swiss Institute of Bioinformatics, Geneva 4, Switzerland
- Emilie Pasche
  - Haute école spécialisée de Suisse occidentale, Haute Ecole de Gestion de Genève, Carouge, Switzerland
  - SIB Text Mining, SIB Swiss Institute of Bioinformatics, Geneva 4, Switzerland
- Julien Gobeill
  - Haute école spécialisée de Suisse occidentale, Haute Ecole de Gestion de Genève, Carouge, Switzerland
  - SIB Text Mining, SIB Swiss Institute of Bioinformatics, Geneva 4, Switzerland
- Valentine Rech de Laval
  - Computer and Laboratory Investigation of Proteins of Human Origin Group, SIB Swiss Institute of Bioinformatics, Geneva 4, Switzerland
- Anne Gleizes
  - Computer and Laboratory Investigation of Proteins of Human Origin Group, SIB Swiss Institute of Bioinformatics, Geneva 4, Switzerland
- Daniel Teixeira
  - Computer and Laboratory Investigation of Proteins of Human Origin Group, SIB Swiss Institute of Bioinformatics, Geneva 4, Switzerland
- Pierre-André Michel
  - Computer and Laboratory Investigation of Proteins of Human Origin Group, SIB Swiss Institute of Bioinformatics, Geneva 4, Switzerland
- Patrick Ruch
  - Haute école spécialisée de Suisse occidentale, Haute Ecole de Gestion de Genève, Carouge, Switzerland
  - SIB Text Mining, SIB Swiss Institute of Bioinformatics, Geneva 4, Switzerland
- Pascale Gaudet
  - Computer and Laboratory Investigation of Proteins of Human Origin Group, SIB Swiss Institute of Bioinformatics, Geneva 4, Switzerland
19
Mottin L, Pasche E, Gobeill J, Rech de Laval V, Gleizes A, Michel PA, Bairoch A, Gaudet P, Ruch P. Triage by ranking to support the curation of protein interactions. Database (Oxford) 2017; 2017:3866793. [PMID: 29220432 PMCID: PMC5502361 DOI: 10.1093/database/bax040] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.4] [Received: 11/05/2016] [Revised: 04/19/2017] [Accepted: 04/20/2017] [Indexed: 01/08/2023]
Abstract
Database URL: http://candy.hesge.ch/nextA5
Affiliation(s)
- Luc Mottin
  - Information Science Department, BiTeM Group, HES-SO/HEG Genève, 17 Rue de la Tambourine, Carouge CH-1227, Switzerland
  - SIB Text Mining, Swiss Institute of Bioinformatics, 17 Rue de la Tambourine, Carouge CH-1227, Switzerland
- Emilie Pasche
  - Information Science Department, BiTeM Group, HES-SO/HEG Genève, 17 Rue de la Tambourine, Carouge CH-1227, Switzerland
  - SIB Text Mining, Swiss Institute of Bioinformatics, 17 Rue de la Tambourine, Carouge CH-1227, Switzerland
- Julien Gobeill
  - Information Science Department, BiTeM Group, HES-SO/HEG Genève, 17 Rue de la Tambourine, Carouge CH-1227, Switzerland
  - SIB Text Mining, Swiss Institute of Bioinformatics, 17 Rue de la Tambourine, Carouge CH-1227, Switzerland
- Valentine Rech de Laval
  - CALIPHO Group, Swiss Institute of Bioinformatics, 1 Rue Michel-Servet, Geneva CH-1206, Switzerland
  - University of Geneva, Geneva
- Anne Gleizes
  - CALIPHO Group, Swiss Institute of Bioinformatics, 1 Rue Michel-Servet, Geneva CH-1206, Switzerland
- Pierre-André Michel
  - CALIPHO Group, Swiss Institute of Bioinformatics, 1 Rue Michel-Servet, Geneva CH-1206, Switzerland
- Amos Bairoch
  - CALIPHO Group, Swiss Institute of Bioinformatics, 1 Rue Michel-Servet, Geneva CH-1206, Switzerland
  - University of Geneva, Geneva
- Pascale Gaudet
  - CALIPHO Group, Swiss Institute of Bioinformatics, 1 Rue Michel-Servet, Geneva CH-1206, Switzerland
  - University of Geneva, Geneva
- Patrick Ruch
  - Information Science Department, BiTeM Group, HES-SO/HEG Genève, 17 Rue de la Tambourine, Carouge CH-1227, Switzerland
  - SIB Text Mining, Swiss Institute of Bioinformatics, 17 Rue de la Tambourine, Carouge CH-1227, Switzerland