1. Carpenter B, Kamalakannan S, Saikam P, Alvarez DV, Hanass-Hancock J, Murthy GVS, Pinilla-Roncancio M, Velarde MR, Teodoro D, Mitra S. Data resource profile: the disability statistics questionnaire review database (DS-QR Database): a database of population censuses and household surveys with internationally comparable disability questions. Int J Popul Data Sci 2024; 8:2477. [PMID: 40109444 PMCID: PMC11922099 DOI: 10.23889/ijpds.v8i6.2477] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 03/22/2025]
Abstract
Introduction The 2030 Sustainable Development Agenda and the United Nations Convention on the Rights of Persons with Disabilities (CRPD) aspire to leave no one behind and call for the inclusion of persons with disabilities in all spheres of life. To monitor this goal of inclusion, CRPD's Article 31 requires state parties to collect data about the situation of persons with disabilities. The Disability Statistics - Questionnaire Review Database (DS-QR Database) reports on whether population and housing censuses and household surveys include internationally recommended disability questions for adults ages 15 and older. Methods The Disability Data Initiative (DDI), an international consortium of researchers, regularly retrieves and analyses a list of surveys and censuses from international catalogs, libraries and websites of national statistics offices. Questionnaires are reviewed to identify if they include internationally recommended questions on functional difficulties (e.g. difficulty seeing), more specifically (i) the Washington Group Short Set (WG-SS) or (ii) questions that meet at least the United Nations 2017 guidelines for disability measurement in censuses (other functional difficulty questions thereafter). Results The DS-QR Database includes the review results for the questionnaires of 3027 population censuses and surveys from 199 countries and territories collected from 2009 to 2023. The review has information on whether each dataset has the WG-SS or other functional difficulty questions and overall results per country, region, type of dataset and over time. Conclusion By identifying countries that collect internationally comparable disability data, the DS-QR Database can help researchers, policymakers and advocates determine whether countries fulfill their obligations as per CRPD Article 31. 
It can also assist in identifying which datasets use functional difficulty questions and can be used to research and monitor disability rights over time and across countries. The DS-QR Database is in a Supplementary file and will be accessible on a website upon publication of this article.
Affiliation(s)
- Bradley Carpenter
  - Gender and Health Research Unit, South African Medical Research Council, 491 Peter Mokaba Ridge Road, Overport, Durban, South Africa
  - College of Health Science, University of KwaZulu-Natal, University Road, Westville, Durban, South Africa
- Sureshkumar Kamalakannan
  - PRASHO (Pragyaan Sustainable Health Outcomes Foundation), Level 2, Kapil Kavuri Hub, No. 144, Survey 37, Financial District, Nanakramguda, Hyderabad, Telangana, India
  - Department of Social Work Education and Community Wellbeing, Northumbria University, Newcastle Upon Tyne, NE7 7TR, England, United Kingdom
  - Royal College of Occupational Therapists (RCOT), Phoenix House, 106-114 Borough High Street, London SE1 1LB, England, United Kingdom
- Pavani Saikam
  - PRASHO (Pragyaan Sustainable Health Outcomes Foundation), Level 2, Kapil Kavuri Hub, No. 144, Survey 37, Financial District, Nanakramguda, Hyderabad, Telangana, India
- Jill Hanass-Hancock
  - Gender and Health Research Unit, South African Medical Research Council, 491 Peter Mokaba Ridge Road, Overport, Durban, South Africa
  - College of Health Science, University of KwaZulu-Natal, University Road, Westville, Durban, South Africa
- GVS Murthy
  - PRASHO (Pragyaan Sustainable Health Outcomes Foundation), Level 2, Kapil Kavuri Hub, No. 144, Survey 37, Financial District, Nanakramguda, Hyderabad, Telangana, India
- Monica Pinilla-Roncancio
  - School of Medicine and Centre of Sustainable Development Goals (CODS), Universidad de los Andes, Cra. 1 #18a-12, Bogota, Colombia
- Minerva Rivas Velarde
  - Geneva School of Health Sciences, HES-SO, Av. de Champel 47, 1206 Genève, Switzerland
  - The Institute for Ethics, History, and the Humanities (iEH2), Faculty of Medicine, Université de Genève
- Douglas Teodoro
  - Université de Genève, 24 rue du Général-Dufour, 1211 Genève 4, Switzerland
- Sophie Mitra
  - Fordham University, 441 East Fordham Road, Bronx, NY 10458, USA
2. Nascimento Dial A, Vicente D, Mitra S, Teodoro D, Rivas Velarde M. Did high frequency phone surveys during the COVID-19 pandemic include disability questions? An assessment of COVID-19 surveys from March 2020 to December 2022. BMJ Open 2024; 14:e079760. [PMID: 38991678 PMCID: PMC11288142 DOI: 10.1136/bmjopen-2023-079760] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/11/2023] [Accepted: 06/19/2024] [Indexed: 07/13/2024]
Abstract
OBJECTIVES In the midst of the pandemic, face-to-face data collection for national censuses and surveys was suspended due to limitations on mobility and social distancing, limiting the collection of already scarce disability data. Responses to these constraints were met with a surge of high-frequency phone surveys (HFPSs) that aimed to provide timely data for understanding the socioeconomic impacts of and responses to the pandemic. This paper provides an assessment of HFPS datasets and their inclusion of disability questions to evaluate the visibility of persons with disabilities during the COVID-19 pandemic. DESIGN We collected HFPS questionnaires conducted globally from the onset of the pandemic emergency in March 2020 until December 2022 from various online survey repositories. Each HFPS questionnaire was searched using a set of keywords for inclusion of different types of disability questions. Results were recorded in an Excel review log, which was manually reviewed by two researchers. METHODS The review of HFPS datasets involved two stages: (1) a main review of 294 HFPS dataset-waves and (2) a semiautomated review of the same dataset-waves using a search engine-powered questionnaire review tool developed by our team. The results from the main review were compared with those of a sensitivity analysis using and testing the tool as an alternative to manual search. RESULTS Roughly half of HFPS datasets reviewed and 60% of the countries included in this study had some type of question on disability. While disability questions were not widely absent from HFPS datasets, only 3% of HFPS datasets included functional difficulty questions that meet international standards. The search engine-powered questionnaire review tool proved to be able to streamline the search process for future research on inclusive data. 
CONCLUSIONS The dearth of functional difficulty questions and the Washington-Group Short Set in particular in HFPS has contributed to the relative invisibility of persons with disabilities during the pandemic emergency, the lingering effects of which could impede policy-making, monitoring and advocacy on behalf of persons with disabilities.
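The keyword search step described in the DESIGN section can be illustrated with a minimal Python sketch. It is not the authors' review tool: the keyword list, the questionnaire names and the questionnaire texts below are hypothetical, and a hit only flags a questionnaire for the manual review the abstract describes.

```python
import re

# Hypothetical keyword list; the actual search terms used in the study are not given here.
KEYWORDS = ["disability", "difficulty seeing", "difficulty hearing", "difficulty walking"]


def find_keywords(text, keywords=KEYWORDS):
    """Return the keywords that occur in the text as whole words or phrases (case-insensitive)."""
    found = []
    for kw in keywords:
        pattern = r"\b" + re.escape(kw) + r"\b"
        if re.search(pattern, text, flags=re.IGNORECASE):
            found.append(kw)
    return found


# Hypothetical questionnaire excerpts, keyed by dataset-wave.
questionnaires = {
    "survey_a_wave1": "Q12. Do you have difficulty seeing, even if wearing glasses?",
    "survey_b_wave3": "Q5. How many days did you work last week?",
}

# One row per dataset-wave, analogous to a review log that reviewers then check by hand.
review_log = {name: find_keywords(text) for name, text in questionnaires.items()}
print(review_log)
```

A plain keyword match like this cannot tell whether a question meets the Washington Group Short Set wording, which is why the matches feed a manual check rather than replace it.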
Affiliation(s)
- David Vicente
  - Department of Radiology and Medical Informatics, University of Geneva, Geneva, Switzerland
- Sophie Mitra
  - Department of Economics, Fordham University, New York, New York, USA
- Douglas Teodoro
  - Department of Radiology and Medical Informatics, University of Geneva, Geneva, Switzerland
3. Zhang B, Naderi N, Mishra R, Teodoro D. Online Health Search Via Multidimensional Information Quality Assessment Based on Deep Language Models: Algorithm Development and Validation. JMIR AI 2024; 3:e42630. [PMID: 38875551 PMCID: PMC11099810 DOI: 10.2196/42630] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/12/2022] [Revised: 07/12/2023] [Accepted: 01/15/2024] [Indexed: 06/16/2024]
Abstract
BACKGROUND Widespread misinformation in web resources can lead to serious implications for individuals seeking health advice. Despite that, information retrieval models are often focused only on the query-document relevance dimension to rank results. OBJECTIVE We investigate a multidimensional information quality retrieval model based on deep learning to enhance the effectiveness of online health care information search results. METHODS In this study, we simulated online health information search scenarios with a topic set of 32 different health-related inquiries and a corpus containing 1 billion web documents from the April 2019 snapshot of Common Crawl. Using state-of-the-art pretrained language models, we assessed the quality of the retrieved documents according to their usefulness, supportiveness, and credibility dimensions for a given search query on 6030 human-annotated, query-document pairs. We evaluated this approach using transfer learning and more specific domain adaptation techniques. RESULTS In the transfer learning setting, the usefulness model provided the largest distinction between help- and harm-compatible documents, with a difference of +5.6%, leading to a majority of helpful documents in the top 10 retrieved. The supportiveness model achieved the best harm compatibility (+2.4%), while the combination of usefulness, supportiveness, and credibility models achieved the largest distinction between help- and harm-compatibility on helpful topics (+16.9%). In the domain adaptation setting, the linear combination of different models showed robust performance, with help-harm compatibility above +4.4% for all dimensions and going as high as +6.8%. CONCLUSIONS These results suggest that integrating automatic ranking models created for specific information quality dimensions can increase the effectiveness of health-related information retrieval. Thus, our approach could be used to enhance searches made by individuals seeking online health information.
Affiliation(s)
- Boya Zhang
  - Department of Radiology and Medical Informatics, University of Geneva, Geneva, Switzerland
- Nona Naderi
  - Department of Computer Science, Université Paris-Saclay, Centre national de la recherche scientifique, Laboratoire Interdisciplinaire des Sciences du Numérique, Orsay, France
- Rahul Mishra
  - Department of Radiology and Medical Informatics, University of Geneva, Geneva, Switzerland
- Douglas Teodoro
  - Department of Radiology and Medical Informatics, University of Geneva, Geneva, Switzerland
4. Hawkins NT, Maldaver M, Yannakopoulos A, Guare LA, Krishnan A. Systematic tissue annotations of genomics samples by modeling unstructured metadata. Nat Commun 2022; 13:6736. [PMID: 36347858 PMCID: PMC9643451 DOI: 10.1038/s41467-022-34435-x] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 06/09/2021] [Accepted: 10/25/2022] [Indexed: 11/10/2022]
Abstract
There are currently >1.3 million human -omics samples that are publicly available. This valuable resource remains acutely underused because discovering particular samples from this ever-growing data collection remains a significant challenge. The major impediment is that sample attributes are routinely described using varied terminologies written in unstructured natural language. We propose a natural-language-processing-based machine learning approach (NLP-ML) to infer tissue and cell-type annotations for genomics samples based only on their free-text metadata. NLP-ML works by creating numerical representations of sample descriptions and using these representations as features in a supervised learning classifier that predicts tissue/cell-type terms. Our approach significantly outperforms an advanced graph-based reasoning annotation method (MetaSRA) and a baseline exact string matching method (TAGGER). Model similarities between related tissues demonstrate that NLP-ML models capture biologically-meaningful signals in text. Additionally, these models correctly classify tissue-associated biological processes and diseases based on their text descriptions alone. NLP-ML models are nearly as accurate as models based on gene-expression profiles in predicting sample tissue annotations but have the distinct capability to classify samples irrespective of the genomics experiment type based on their text metadata. Python NLP-ML prediction code and trained tissue models are available at https://github.com/krishnanlab/txt2onto .
Affiliation(s)
- Nathaniel T Hawkins
  - Department of Computational Mathematics, Science and Engineering, Michigan State University, East Lansing, MI, 48824, USA
- Marc Maldaver
  - Department of Computational Mathematics, Science and Engineering, Michigan State University, East Lansing, MI, 48824, USA
- Anna Yannakopoulos
  - Department of Computational Mathematics, Science and Engineering, Michigan State University, East Lansing, MI, 48824, USA
- Lindsay A Guare
  - Department of Computational Mathematics, Science and Engineering, Michigan State University, East Lansing, MI, 48824, USA
  - Department of Biochemistry and Molecular Biology, Michigan State University, East Lansing, MI, 48824, USA
  - Department of Microbiology and Molecular Genetics, Michigan State University, East Lansing, MI, 48824, USA
- Arjun Krishnan
  - Department of Computational Mathematics, Science and Engineering, Michigan State University, East Lansing, MI, 48824, USA
  - Department of Biochemistry and Molecular Biology, Michigan State University, East Lansing, MI, 48824, USA
  - Department of Biomedical Informatics, University of Colorado Anschutz Medical Campus, Aurora, CO, 80045, USA
5. Penev L, Koureas D, Groom Q, Lanfear J, Agosti D, Casino A, Miller J, Arvanitidis C, Cochrane G, Hobern D, Banki O, Addink W, Kõljalg U, Copas K, Mergen P, Güntsch A, Benichou L, Benito Gonzalez Lopez J, Ruch P, Martin C, Barov B, Hristova K. Biodiversity Community Integrated Knowledge Library (BiCIKL). Research Ideas and Outcomes 2022. [DOI: 10.3897/rio.8.e81136] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Indexed: 11/12/2022]
Abstract
BiCIKL is a European Union Horizon 2020 project that will initiate and build a new European starting community of key research infrastructures, establishing open science practices in the domain of biodiversity through provision of access to data, associated tools and services at each separate stage of and along the entire research cycle. BiCIKL will provide new methods and workflows for integrated access to harvesting, liberating, linking, accessing and re-using of subarticle-level data (specimens, material citations, samples, sequences, taxonomic names, taxonomic treatments, figures, tables) extracted from literature. BiCIKL will provide, for the first time, access and tools for seamless linking and usage tracking of data along the line: specimens > sequences > species > analytics > publications > biodiversity knowledge graph > re-use.
6. Teodoro D, Ferdowsi S, Borissov N, Kashani E, Vicente Alvarez D, Copara J, Gouareb R, Naderi N, Amini P. Information retrieval in an infodemic: the case of COVID-19 publications. J Med Internet Res 2021; 23:e30161. [PMID: 34375298 PMCID: PMC8451964 DOI: 10.2196/30161] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 05/03/2021] [Revised: 07/22/2021] [Accepted: 08/05/2021] [Indexed: 12/31/2022]
Abstract
Background The COVID-19 global health crisis has led to an exponential surge in published scientific literature. In an attempt to tackle the pandemic, extremely large COVID-19–related corpora are being created, sometimes with inaccurate information, at a scale that is no longer amenable to human analysis. Objective In the context of searching for scientific evidence in the deluge of COVID-19–related literature, we present an information retrieval methodology for effective identification of relevant sources to answer biomedical queries posed using natural language. Methods Our multistage retrieval methodology combines probabilistic weighting models and reranking algorithms based on deep neural architectures to boost the ranking of relevant documents. The similarity between COVID-19 queries and documents is computed, and a series of postprocessing methods is applied to the initial ranking list to improve the match between the query and the biomedical information source and boost the position of relevant documents. Results The methodology was evaluated in the context of the TREC-COVID challenge, achieving competitive results with the top-ranking teams participating in the competition. Particularly, the combination of bag-of-words and deep neural language models significantly outperformed an Okapi Best Match 25–based baseline, retrieving, on average, 83% of relevant documents in the top 20. Conclusions These results indicate that multistage retrieval supported by deep learning could enhance identification of literature for COVID-19–related questions posed using natural language.
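For readers unfamiliar with the Okapi Best Match 25 (BM25) baseline mentioned in the Results, the following is a minimal textbook sketch of BM25 scoring in Python. It is not the authors' pipeline: the toy documents and query are hypothetical, `k1=1.2` and `b=0.75` are commonly used defaults assumed here, and the smoothed IDF is one common variant among several.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each tokenized document against a tokenized query with Okapi BM25."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents that contain each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            # Smoothed IDF variant that stays non-negative for frequent terms.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term-frequency saturation with document-length normalization.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


# Hypothetical toy corpus and query.
docs = [
    "covid vaccine trial results".split(),
    "influenza vaccine safety".split(),
    "hospital staffing report".split(),
]
query = "covid vaccine".split()

scores = bm25_scores(query, docs)
ranking = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
print(ranking)
```

In a multistage setup of the kind the abstract describes, a bag-of-words ranking like this would only produce the initial candidate list, which a neural reranker then reorders.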
Affiliation(s)
- Douglas Teodoro
  - HES-SO University of Applied Arts and Sciences of Western Switzerland, Rue de la Tambourine 17, Carouge, CH
  - SIB Swiss Institute of Bioinformatics, Lausanne, CH
- Sohrab Ferdowsi
  - HES-SO University of Applied Arts and Sciences of Western Switzerland, Rue de la Tambourine 17, Carouge, CH
- Elham Kashani
  - Institute of Pathology, University of Bern, Bern, CH
- David Vicente Alvarez
  - HES-SO University of Applied Arts and Sciences of Western Switzerland, Rue de la Tambourine 17, Carouge, CH
- Jenny Copara
  - HES-SO University of Applied Arts and Sciences of Western Switzerland, Rue de la Tambourine 17, Carouge, CH
  - SIB Swiss Institute of Bioinformatics, Lausanne, CH
  - University of Geneva, Geneva, CH
- Racha Gouareb
  - HES-SO University of Applied Arts and Sciences of Western Switzerland, Rue de la Tambourine 17, Carouge, CH
- Nona Naderi
  - HES-SO University of Applied Arts and Sciences of Western Switzerland, Rue de la Tambourine 17, Carouge, CH
  - SIB Swiss Institute of Bioinformatics, Lausanne, CH
- Poorya Amini
  - Risklick AG, Bern, CH
  - Clinical Trials Unit Bern, Bern, CH
7. Haas Q, Alvarez DV, Borissov N, Ferdowsi S, von Meyenn L, Trelle S, Teodoro D, Amini P. Utilizing Artificial Intelligence to Manage COVID-19 Scientific Evidence Torrent with Risklick AI: A Critical Tool for Pharmacology and Therapy Development. Pharmacology 2021; 106:244-253. [PMID: 33910199 PMCID: PMC8247831 DOI: 10.1159/000515908] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Received: 11/12/2020] [Accepted: 03/11/2021] [Indexed: 01/19/2023]
Abstract
INTRODUCTION The SARS-CoV-2 pandemic has led to one of the most critical and boundless waves of publications in the history of modern science. The necessity to find and pursue relevant information and quantify its quality is broadly acknowledged. Modern information retrieval techniques combined with artificial intelligence (AI) appear as one of the key strategies for COVID-19 living evidence management. Nevertheless, most AI projects that retrieve COVID-19 literature still require manual tasks. METHODS In this context, we present a novel, automated search platform, called Risklick AI, which aims to automatically gather COVID-19 scientific evidence and enables scientists, policy makers, and healthcare professionals to find the most relevant information tailored to their question of interest in real time. RESULTS Here, we compare the capacity of Risklick AI to find COVID-19-related clinical trials and scientific publications with that of clinicaltrials.gov and PubMed in the field of pharmacology and clinical intervention. DISCUSSION The results demonstrate that Risklick AI is able to find COVID-19 references more effectively, both in terms of precision and recall, compared to the baseline platforms. Hence, Risklick AI could become a useful alternative assistant to scientists fighting the COVID-19 pandemic.
Affiliation(s)
- Quentin Haas
  - Risklick AG, Spin-off University of Bern, Bern, Switzerland
  - Clinical Trial Unit Bern, University of Bern, Bern, Switzerland
- David Vicente Alvarez
  - HES-SO University of Applied Sciences and Arts Western Switzerland, Geneva, Switzerland
- Nikolay Borissov
  - Risklick AG, Spin-off University of Bern, Bern, Switzerland
  - Clinical Trial Unit Bern, University of Bern, Bern, Switzerland
- Sohrab Ferdowsi
  - HES-SO University of Applied Sciences and Arts Western Switzerland, Geneva, Switzerland
- Sven Trelle
  - Clinical Trial Unit Bern, University of Bern, Bern, Switzerland
- Douglas Teodoro
  - HES-SO University of Applied Sciences and Arts Western Switzerland, Geneva, Switzerland
- Poorya Amini
  - Risklick AG, Spin-off University of Bern, Bern, Switzerland
  - Clinical Trial Unit Bern, University of Bern, Bern, Switzerland
8. Teodoro D, Knafou J, Naderi N, Pasche E, Gobeill J, Arighi CN, Ruch P. UPCLASS: a deep learning-based classifier for UniProtKB entry publications. Database (Oxford) 2020; 2020:5822772. [PMID: 32367111 PMCID: PMC7198315 DOI: 10.1093/database/baaa026] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Received: 10/31/2019] [Revised: 02/19/2020] [Accepted: 03/11/2020] [Indexed: 12/20/2022]
Abstract
In the UniProt Knowledgebase (UniProtKB), publications providing evidence for a specific protein annotation entry are organized across different categories, such as function, interaction and expression, based on the type of data they contain. To provide a systematic way of categorizing computationally mapped bibliographies in UniProt, we investigate a convolutional neural network (CNN) model to classify publications with accession annotations according to UniProtKB categories. The main challenge of categorizing publications at the accession annotation level is that the same publication can be annotated with multiple proteins and thus be associated with different category sets according to the evidence provided for the protein. We propose a model that divides the document into parts containing and not containing evidence for the protein annotation. Then, we use these parts to create different feature sets for each accession and feed them to separate layers of the network. The CNN model achieved a micro F1-score of 0.72 and a macro F1-score of 0.62, outperforming baseline models based on logistic regression and support vector machine by up to 22 and 18 percentage points, respectively. We believe that such an approach could be used to systematically categorize the computationally mapped bibliography in UniProtKB, which represents a significant set of the publications, and help curators to decide whether a publication is relevant for further curation for a protein accession. Database URL: https://goldorak.hesge.ch/bioexpclass/upclass/.
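The abstract reports both a micro and a macro F1-score. The difference between the two can be shown with a minimal Python sketch for multi-label predictions. The label names are taken from the categories named in the abstract, but the gold and predicted label sets are hypothetical and do not reproduce the paper's evaluation.

```python
def f1(tp, fp, fn):
    """F1 from raw counts; defined as 0.0 when there are no positives at all."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def micro_macro_f1(y_true, y_pred, labels):
    """Compute micro- and macro-averaged F1 for multi-label data given as sets of labels."""
    counts = {lab: [0, 0, 0] for lab in labels}  # per-label [tp, fp, fn]
    for true_set, pred_set in zip(y_true, y_pred):
        for lab in labels:
            if lab in pred_set and lab in true_set:
                counts[lab][0] += 1
            elif lab in pred_set:
                counts[lab][1] += 1
            elif lab in true_set:
                counts[lab][2] += 1
    # Macro: average the per-label F1 scores, so every label weighs the same.
    macro = sum(f1(*c) for c in counts.values()) / len(labels)
    # Micro: pool the counts over all labels first, so frequent labels dominate.
    tp, fp, fn = (sum(c[i] for c in counts.values()) for i in range(3))
    micro = f1(tp, fp, fn)
    return micro, macro


labels = ["function", "interaction", "expression"]
# Hypothetical gold and predicted category sets for four publication-accession pairs.
y_true = [{"function"}, {"function", "interaction"}, {"function"}, {"expression"}]
y_pred = [{"function"}, {"function"}, {"function"}, {"interaction"}]

micro, macro = micro_macro_f1(y_true, y_pred, labels)
print(round(micro, 3), round(macro, 3))
```

In this toy case the frequent category is predicted well and the rare ones are missed, so the micro score is higher than the macro score, which is the same direction as the gap reported in the abstract.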
Affiliation(s)
- Douglas Teodoro
  - Geneva School of Business Administration, CH-1227, University of Applied Sciences and Arts Western Switzerland, HES-SO, Geneva, Switzerland
  - Text Mining Group, Rue Michel-Servet 1, CH-1206, SIB Swiss Institute of Bioinformatics, Geneva, Switzerland
- Julien Knafou
  - Geneva School of Business Administration, CH-1227, University of Applied Sciences and Arts Western Switzerland, HES-SO, Geneva, Switzerland
  - Text Mining Group, Rue Michel-Servet 1, CH-1206, SIB Swiss Institute of Bioinformatics, Geneva, Switzerland
- Nona Naderi
  - Geneva School of Business Administration, CH-1227, University of Applied Sciences and Arts Western Switzerland, HES-SO, Geneva, Switzerland
  - Text Mining Group, Rue Michel-Servet 1, CH-1206, SIB Swiss Institute of Bioinformatics, Geneva, Switzerland
- Emilie Pasche
  - Geneva School of Business Administration, CH-1227, University of Applied Sciences and Arts Western Switzerland, HES-SO, Geneva, Switzerland
  - Text Mining Group, Rue Michel-Servet 1, CH-1206, SIB Swiss Institute of Bioinformatics, Geneva, Switzerland
- Julien Gobeill
  - Geneva School of Business Administration, CH-1227, University of Applied Sciences and Arts Western Switzerland, HES-SO, Geneva, Switzerland
  - Text Mining Group, Rue Michel-Servet 1, CH-1206, SIB Swiss Institute of Bioinformatics, Geneva, Switzerland
- Cecilia N Arighi
  - Center of Bioinformatics and Computational Biology, 15 Innovation Way, 19711, Department of Computer and Information Sciences, University of Delaware, Newark, DE, USA
- Patrick Ruch
  - Geneva School of Business Administration, CH-1227, University of Applied Sciences and Arts Western Switzerland, HES-SO, Geneva, Switzerland
  - Text Mining Group, Rue Michel-Servet 1, CH-1206, SIB Swiss Institute of Bioinformatics, Geneva, Switzerland
9. Cieslewicz A, Dutkiewicz J, Jedrzejek C. Baseline and extensions approach to information retrieval of complex medical data: Poznan's approach to the bioCADDIE 2016. Database (Oxford) 2018; 2018:4930756. [PMID: 29688372 PMCID: PMC5846287 DOI: 10.1093/database/bax103] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Received: 03/23/2017] [Revised: 12/18/2017] [Accepted: 12/18/2017] [Indexed: 11/23/2022]
Abstract
Database URL https://biocaddie.org/benchmark-data.
Affiliation(s)
- Artur Cieslewicz
  - Department of Clinical Pharmacology, Poznan University of Medical Sciences, Dluga 1/2 Str., 61-848 Poznan, Poland
- Jakub Dutkiewicz
  - Institute of Control, Robotics and Information Engineering, Poznan University of Technology, ul. Piotrowo 3a, 60-965 Poznań, Poland
- Czeslaw Jedrzejek
  - Institute of Control, Robotics and Information Engineering, Poznan University of Technology, ul. Piotrowo 3a, 60-965 Poznań, Poland
10. Britan A, Cusin I, Hinard V, Mottin L, Pasche E, Gobeill J, Rech de Laval V, Gleizes A, Teixeira D, Michel PA, Ruch P, Gaudet P. Accelerating annotation of articles via automated approaches: evaluation of the neXtA5 curation-support tool by neXtProt. Database (Oxford) 2018; 2018:5255187. [PMID: 30576492 PMCID: PMC6301339 DOI: 10.1093/database/bay129] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.6] [Received: 07/21/2018] [Revised: 10/04/2018] [Accepted: 11/09/2018] [Indexed: 11/14/2022]
Abstract
The development of efficient text-mining tools promises to boost the curation workflow by significantly reducing the time needed to process the literature into biological databases. We have developed a curation support tool, neXtA5, that provides a search engine coupled with an annotation system directly integrated into a biocuration workflow. neXtA5 assists curation with modules optimized for the various curation tasks: document triage, entity recognition and information extraction. Here, we describe the evaluation of neXtA5 by expert curators. We first assessed the annotations of two independent curators to provide a baseline for comparison. To evaluate the performance of neXtA5, we submitted requests and compared the neXtA5 results with the manual curation. The analysis focuses on the usability of neXtA5 to support the curation of two types of data: biological processes (BPs) and diseases (Ds). We evaluated the relevance of the papers proposed as well as the recall and precision of the suggested annotations. The precision evaluation of document triage by neXtA5 showed that both curators agree with neXtA5 for 67% (BP) and 63% (D) of abstracts, while curators agree on accepting or rejecting an abstract ~80% of the time. Hence, the precision of the triage system is satisfactory. For concept extraction, curators approved 35% (BP) and 25% (D) of the neXtA5 annotations. Conversely, neXtA5 successfully annotated up to 36% (BP) and 68% (D) of the terms identified by curators. The user feedback obtained in these tests highlighted the need for improvement in the ranking function of neXtA5 annotations. Therefore, we transformed the information extraction component into an annotation ranking system. This improvement results in a top precision (precision at first rank) of 59% (D) and 63% (BP). These results suggest that when considering only the first extracted entity, the current system achieves a precision comparable with expert biocurators.
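The metrics named in this abstract (precision, recall, and precision at first rank) can be made concrete with a minimal Python sketch. The term identifiers and curator annotations below are hypothetical placeholders, and the functions are generic definitions rather than the evaluation code used for neXtA5.

```python
def precision_recall(suggested, gold):
    """Precision and recall of a set of suggested annotations against curator annotations."""
    suggested, gold = set(suggested), set(gold)
    hits = len(suggested & gold)
    precision = hits / len(suggested) if suggested else 0.0
    recall = hits / len(gold) if gold else 0.0
    return precision, recall


def precision_at_1(ranked_lists, gold_sets):
    """Fraction of documents whose top-ranked suggestion was also chosen by the curator."""
    top_hits = sum(
        1 for ranked, gold in zip(ranked_lists, gold_sets)
        if ranked and ranked[0] in gold
    )
    return top_hits / len(ranked_lists)


# Hypothetical ranked suggestions for two abstracts, best suggestion first.
ranked = [["term_a", "term_b", "term_c"], ["term_d", "term_e"]]
# Hypothetical curator annotations for the same two abstracts.
gold = [{"term_a", "term_x"}, {"term_e"}]

p, r = precision_recall(ranked[0], gold[0])
p_at_1 = precision_at_1(ranked, gold)
print(p, r, p_at_1)
```

Precision at first rank ignores everything below the top suggestion, which is why a ranking-oriented system can score well on it even when overall precision across all suggestions is lower.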
Affiliation(s)
- Aurore Britan
  - Computer and Laboratory Investigation of Proteins of Human Origin Group, SIB Swiss Institute of Bioinformatics, Geneva 4, Switzerland
- Isabelle Cusin
  - Computer and Laboratory Investigation of Proteins of Human Origin Group, SIB Swiss Institute of Bioinformatics, Geneva 4, Switzerland
- Valérie Hinard
  - Computer and Laboratory Investigation of Proteins of Human Origin Group, SIB Swiss Institute of Bioinformatics, Geneva 4, Switzerland
- Luc Mottin
  - Haute école spécialisée de Suisse occidentale, Haute Ecole de Gestion de Genève, Carouge, Switzerland
  - SIB Text Mining, SIB Swiss Institute of Bioinformatics, Geneva 4, Switzerland
- Emilie Pasche
  - Haute école spécialisée de Suisse occidentale, Haute Ecole de Gestion de Genève, Carouge, Switzerland
  - SIB Text Mining, SIB Swiss Institute of Bioinformatics, Geneva 4, Switzerland
- Julien Gobeill
  - Haute école spécialisée de Suisse occidentale, Haute Ecole de Gestion de Genève, Carouge, Switzerland
  - SIB Text Mining, SIB Swiss Institute of Bioinformatics, Geneva 4, Switzerland
- Valentine Rech de Laval
  - Computer and Laboratory Investigation of Proteins of Human Origin Group, SIB Swiss Institute of Bioinformatics, Geneva 4, Switzerland
- Anne Gleizes
  - Computer and Laboratory Investigation of Proteins of Human Origin Group, SIB Swiss Institute of Bioinformatics, Geneva 4, Switzerland
- Daniel Teixeira
  - Computer and Laboratory Investigation of Proteins of Human Origin Group, SIB Swiss Institute of Bioinformatics, Geneva 4, Switzerland
- Pierre-André Michel
  - Computer and Laboratory Investigation of Proteins of Human Origin Group, SIB Swiss Institute of Bioinformatics, Geneva 4, Switzerland
- Patrick Ruch
  - Haute école spécialisée de Suisse occidentale, Haute Ecole de Gestion de Genève, Carouge, Switzerland
  - SIB Text Mining, SIB Swiss Institute of Bioinformatics, Geneva 4, Switzerland
- Pascale Gaudet
  - Computer and Laboratory Investigation of Proteins of Human Origin Group, SIB Swiss Institute of Bioinformatics, Geneva 4, Switzerland