1
|
Cassim N, Mapundu M, Olago V, Celik T, George JA, Glencross DK. Using text mining techniques to extract prostate cancer predictive information (Gleason score) from semi-structured narrative laboratory reports in the Gauteng province, South Africa. BMC Med Inform Decis Mak 2021; 21:330. [PMID: 34823522 PMCID: PMC8614040 DOI: 10.1186/s12911-021-01697-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/04/2021] [Accepted: 11/18/2021] [Indexed: 12/24/2022] Open
Abstract
Background Prostate cancer (PCa) is the leading male neoplasm in South Africa with an age-standardised incidence rate of 68.0 per 100,000 population in 2018. The Gleason score (GS) is the strongest predictive factor for PCa treatment and is embedded within semi-structured prostate biopsy narrative reports. The manual extraction of the GS is labour-intensive. The objective of our study was to explore the use of text mining techniques to automate the extraction of the GS from irregularly reported text-intensive patient reports. Methods We used the associated Systematized Nomenclature of Medicine clinical terms morphology and topography codes to identify prostate biopsies with a PCa diagnosis for men aged > 30 years between 2006 and 2016 in the Gauteng Province, South Africa. We developed a text mining algorithm to extract the GS from 1000 biopsy reports with a PCa diagnosis from the National Health Laboratory Service database and validated the algorithm using 1000 biopsies from the private sector. The logical steps for the algorithm were data acquisition, pre-processing, feature extraction, feature value representation, feature selection, information extraction, classification, and discovered knowledge. We evaluated the algorithm using precision, recall and F-score. The GS was manually coded by two experts for both datasets. The top five GS were reported, with the remaining scores categorised as “Other” for both datasets. The percentage of biopsies with a high-risk GS (≥ 8) was also reported. Results The first output reported an F-score of 0.99 that improved to 1.00 after the algorithm was amended (the GS reported in clinical history was ignored). For the validation dataset, an F-score of 0.99 was reported. The most commonly reported GS were 5 + 4 = 9 (17.6%), 3 + 3 = 6 (17.5%), 4 + 3 = 7 (16.4%), 3 + 4 = 7 (14.7%) and 4 + 4 = 8 (14.2%). For the validation dataset, the most commonly reported GS were: (i) 3 + 3 = 6 (37.7%), (ii) 3 + 4 = 7 (19.4%), (iii) 4 + 3 = 7 (14.9%), (iv) 4 + 4 = 8 (10.0%) and (v) 4 + 5 = 9 (7.4%). A high-risk GS was reported for 31.8% compared to 17.4% for the validation dataset. Conclusions We demonstrated reliable extraction of information about GS from narrative text-based patient reports using an in-house developed text mining algorithm. A secondary outcome was that late presentation could be assessed.
Collapse
Affiliation(s)
- Naseem Cassim
- Department of Molecular Medicine and Haematology, Faculty of Health Sciences, University of Witwatersrand and National Health Laboratory Service (NHLS), 7 York Road, Parktown, Johannesburg, South Africa.
| | - Michael Mapundu
- School of Public Health, Faculty of Health Sciences, University of Witwatersrand, 7 York Road, Parktown, Johannesburg, South Africa
| | - Victor Olago
- National Health Laboratory Service (NHLS), National Cancer Registry (NCR), 1 Modderfontein Road, Sandringham, Johannesburg, South Africa
| | - Turgay Celik
- School of Electrical & Information Engineering and Wits Institute of Data Science, University of Witwatersrand, 1 Jan Smuts Avenue, Braamfontein, Johannesburg, South Africa
| | - Jaya Anna George
- Department of Chemical Pathology, Faculty of Health Sciences, University of Witwatersrand and National Health Laboratory Service (NHLS), 7 York Road, Parktown, Johannesburg, South Africa
| | - Deborah Kim Glencross
- Department of Molecular Medicine and Haematology, Faculty of Health Sciences, University of Witwatersrand and National Health Laboratory Service (NHLS), 7 York Road, Parktown, Johannesburg, South Africa
| |
Collapse
|
2
|
Bousquet C, Souvignet J, Sadou É, Jaulent MC, Declerck G. Ontological and Non-Ontological Resources for Associating Medical Dictionary for Regulatory Activities Terms to SNOMED Clinical Terms With Semantic Properties. Front Pharmacol 2019; 10:975. [PMID: 31551780 PMCID: PMC6747929 DOI: 10.3389/fphar.2019.00975] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/18/2018] [Accepted: 07/31/2019] [Indexed: 11/20/2022] Open
Abstract
Background: Formal definitions allow selecting terms (e.g., identifying all terms related to “Infectious disease” using the query “has causative agent organism”) and terminological reasoning (e.g., “hepatitis B” is a “hepatitis” and is an “infectious disease”). However, the standard international terminology Medical Dictionary for Regulatory Activities (MedDRA) used for coding adverse drug reactions in pharmacovigilance databases does not beneficiate from such formal definitions. Our objective was to evaluate the potential of reuse of ontological and non-ontological resources for generating such definitions for MedDRA. Methods: We developed several methods that collectively allow a semiautomatic semantic enrichment of MedDRA: 1) using MedDRA-to-SNOMED Clinical Terms (SNOMED CT) mappings (available in the Unified Medical Language System metathesaurus or other mapping resources, e.g., the MedDRA preferred term “hepatitis B” is associated to the SNOMED CT concept “type B viral hepatitis”) to extract term definitions (e.g., “hepatitis B” is associated with the following properties: has finding site liver structure, has associated morphology inflammation morphology, and has causative agent hepatitis B virus); 2) using MedDRA labels and lexical/syntactic methods for automatic decomposition of complex MedDRA terms (e.g., the MedDRA systems organ class “blood and lymphatic system disorders” is decomposed in blood system disorders and lymphatic system disorders) or automatic suggestions of properties (e.g., the string “cyclic” in preferred term “cyclic neutropenia” leads to the property has clinical course cyclic). Results: The Unified Medical Language System metathesaurus was the main ontological resource reusable for generating formal definitions for MedDRA terms. The non-ontological resources (another mapping resource provided by Nadkarni and Darer in 2010 and MedDRA labels) allowed defining few additional preferred terms. While the Ci4SeR tool helped the curator to define 1,935 terms by suggesting potential supplemental relations based on the parents’ and siblings’ semantic definition, defining manually all MedDRA terms remains expensive in time. Discussion: Several ontological and non-ontological resources are available for associating MedDRA terms to SNOMED CT concepts with semantic properties, but providing manual definitions is still necessary. The ontology of adverse events is a possible alternative but does not cover all MedDRA terms either. Perspectives are to implement more efficient techniques to find more logical relations between SNOMED CT and MedDRA in an automated way.
Collapse
Affiliation(s)
- Cédric Bousquet
- Laboratoire d'Informatique Médicale et d'Ingénierie des Connaissances en e-Santé, LIMICS, Sorbonne Université, Inserm, Université Paris 13, Paris, France.,Unit of Public Health and Medical Informatics, University of Saint Etienne, Saint Etienne, France
| | - Julien Souvignet
- Laboratoire d'Informatique Médicale et d'Ingénierie des Connaissances en e-Santé, LIMICS, Sorbonne Université, Inserm, Université Paris 13, Paris, France.,Unit of Public Health and Medical Informatics, University of Saint Etienne, Saint Etienne, France
| | - Éric Sadou
- Laboratoire d'Informatique Médicale et d'Ingénierie des Connaissances en e-Santé, LIMICS, Sorbonne Université, Inserm, Université Paris 13, Paris, France
| | - Marie-Christine Jaulent
- Laboratoire d'Informatique Médicale et d'Ingénierie des Connaissances en e-Santé, LIMICS, Sorbonne Université, Inserm, Université Paris 13, Paris, France
| | - Gunnar Declerck
- EA 2223 Costech (Connaissance, Organisation et Systèmes Techniques), Centre de Recherche, Sorbonne Universités, Université de technologie de Compiègne, Compiègne, France
| |
Collapse
|
3
|
Névéol A, Dalianis H, Velupillai S, Savova G, Zweigenbaum P. Clinical Natural Language Processing in languages other than English: opportunities and challenges. J Biomed Semantics 2018; 9:12. [PMID: 29602312 PMCID: PMC5877394 DOI: 10.1186/s13326-018-0179-8] [Citation(s) in RCA: 91] [Impact Index Per Article: 15.2] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/22/2017] [Accepted: 02/14/2018] [Indexed: 01/22/2023] Open
Abstract
Background Natural language processing applied to clinical text or aimed at a clinical outcome has been thriving in recent years. This paper offers the first broad overview of clinical Natural Language Processing (NLP) for languages other than English. Recent studies are summarized to offer insights and outline opportunities in this area. Main Body We envision three groups of intended readers: (1) NLP researchers leveraging experience gained in other languages, (2) NLP researchers faced with establishing clinical text processing in a language other than English, and (3) clinical informatics researchers and practitioners looking for resources in their languages in order to apply NLP techniques and tools to clinical practice and/or investigation. We review work in clinical NLP in languages other than English. We classify these studies into three groups: (i) studies describing the development of new NLP systems or components de novo, (ii) studies describing the adaptation of NLP architectures developed for English to another language, and (iii) studies focusing on a particular clinical application. Conclusion We show the advantages and drawbacks of each method, and highlight the appropriate application context. Finally, we identify major challenges and opportunities that will affect the impact of NLP on clinical practice and public health studies in a context that encompasses English as well as other languages.
Collapse
Affiliation(s)
- Aurélie Névéol
- LIMSI, CNRS, Université Paris Saclay, Rue John von Neumann, Paris, F-91405 Orsay, France
| | | | - Sumithra Velupillai
- School of Computer Science and Communication, KTH, Stockholm, Sweden.,Institute of Psychiatry, Psychology and Neuroscience, King's College, London, UK
| | - Guergana Savova
- Children's Hospital Boston and Harvard Medical School, Boston, Massachusetts, USA
| | - Pierre Zweigenbaum
- LIMSI, CNRS, Université Paris Saclay, Rue John von Neumann, Paris, F-91405 Orsay, France
| |
Collapse
|
4
|
Souvignet J, Declerck G, Asfari H, Jaulent MC, Bousquet C. OntoADR a semantic resource describing adverse drug reactions to support searching, coding, and information retrieval. J Biomed Inform 2016; 63:100-107. [PMID: 27369567 DOI: 10.1016/j.jbi.2016.06.010] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.4] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/26/2016] [Revised: 06/25/2016] [Accepted: 06/27/2016] [Indexed: 10/21/2022]
Abstract
INTRODUCTION Efficient searching and coding in databases that use terminological resources requires that they support efficient data retrieval. The Medical Dictionary for Regulatory Activities (MedDRA) is a reference terminology for several countries and organizations to code adverse drug reactions (ADRs) for pharmacovigilance. Ontologies that are available in the medical domain provide several advantages such as reasoning to improve data retrieval. The field of pharmacovigilance does not yet benefit from a fully operational ontology to formally represent the MedDRA terms. Our objective was to build a semantic resource based on formal description logic to improve MedDRA term retrieval and aid the generation of on-demand custom groupings by appropriately and efficiently selecting terms: OntoADR. METHODS The method consists of the following steps: (1) mapping between MedDRA terms and SNOMED-CT, (2) generation of semantic definitions using semi-automatic methods, (3) storage of the resource and (4) manual curation by pharmacovigilance experts. RESULTS We built a semantic resource for ADRs enabling a new type of semantics-based term search. OntoADR adds new search capabilities relative to previous approaches, overcoming the usual limitations of computation using lightweight description logic, such as the intractability of unions or negation queries, bringing it closer to user needs. Our automated approach for defining MedDRA terms enabled the association of at least one defining relationship with 67% of preferred terms. The curation work performed on our sample showed an error level of 14% for this automated approach. We tested OntoADR in practice, which allowed us to build custom groupings for several medical topics of interest. DISCUSSION The methods we describe in this article could be adapted and extended to other terminologies which do not benefit from a formal semantic representation, thus enabling better data retrieval performance. Our custom groupings of MedDRA terms were used while performing signal detection, which suggests that the graphical user interface we are currently implementing to process OntoADR could be usefully integrated into specialized pharmacovigilance software that rely on MedDRA.
Collapse
Affiliation(s)
- Julien Souvignet
- INSERM, U1142, LIMICS, F-75006 Paris, France; Sorbonne Universités, UPMC Univ Paris 06, UMR_S 1142, LIMICS, F-75006 Paris, France; Université Paris 13, Sorbonne Paris Cité, LIMICS, (UMR_S 1142), F-93430 Villetaneuse, France; SSPIM, CHU University Hospital of Saint Etienne, Saint Etienne, France
| | - Gunnar Declerck
- Sorbonne Universités, Université de technologie de Compiègne, EA 2223 Costech (Connaissance, Organisation et Systèmes Techniques), Centre Pierre Guillaumat, CS 60 319, 60 203 Compiègne cedex, France
| | - Hadyl Asfari
- INSERM, U1142, LIMICS, F-75006 Paris, France; Sorbonne Universités, UPMC Univ Paris 06, UMR_S 1142, LIMICS, F-75006 Paris, France; Université Paris 13, Sorbonne Paris Cité, LIMICS, (UMR_S 1142), F-93430 Villetaneuse, France
| | - Marie-Christine Jaulent
- INSERM, U1142, LIMICS, F-75006 Paris, France; Sorbonne Universités, UPMC Univ Paris 06, UMR_S 1142, LIMICS, F-75006 Paris, France; Université Paris 13, Sorbonne Paris Cité, LIMICS, (UMR_S 1142), F-93430 Villetaneuse, France
| | - Cédric Bousquet
- INSERM, U1142, LIMICS, F-75006 Paris, France; Sorbonne Universités, UPMC Univ Paris 06, UMR_S 1142, LIMICS, F-75006 Paris, France; Université Paris 13, Sorbonne Paris Cité, LIMICS, (UMR_S 1142), F-93430 Villetaneuse, France; SSPIM, CHU University Hospital of Saint Etienne, Saint Etienne, France.
| |
Collapse
|
5
|
Nadkarni PM, Ohno-Machado L, Chapman WW. Natural language processing: an introduction. J Am Med Inform Assoc 2012; 18:544-51. [PMID: 21846786 DOI: 10.1136/amiajnl-2011-000464] [Citation(s) in RCA: 440] [Impact Index Per Article: 36.7] [Reference Citation Analysis] [Abstract] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/25/2023] Open
Abstract
OBJECTIVES To provide an overview and tutorial of natural language processing (NLP) and modern NLP-system design. TARGET AUDIENCE This tutorial targets the medical informatics generalist who has limited acquaintance with the principles behind NLP and/or limited knowledge of the current state of the art. SCOPE We describe the historical evolution of NLP, and summarize common NLP sub-problems in this extensive field. We then provide a synopsis of selected highlights of medical NLP efforts. After providing a brief description of common machine-learning approaches that are being used for diverse NLP sub-problems, we discuss how modern NLP architectures are designed, with a summary of the Apache Foundation's Unstructured Information Management Architecture. We finally consider possible future directions for NLP, and reflect on the possible impact of IBM Watson on the medical field.
Collapse
|
6
|
Demner-Fushman D, Chapman WW, McDonald CJ. What can natural language processing do for clinical decision support? J Biomed Inform 2009; 42:760-72. [PMID: 19683066 PMCID: PMC2757540 DOI: 10.1016/j.jbi.2009.08.007] [Citation(s) in RCA: 266] [Impact Index Per Article: 17.7] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/21/2008] [Revised: 08/10/2009] [Accepted: 08/11/2009] [Indexed: 11/29/2022]
Abstract
Computerized clinical decision support (CDS) aims to aid decision making of health care providers and the public by providing easily accessible health-related information at the point and time it is needed. natural language processing (NLP) is instrumental in using free-text information to drive CDS, representing clinical knowledge and CDS interventions in standardized formats, and leveraging clinical narrative. The early innovative NLP research of clinical narrative was followed by a period of stable research conducted at the major clinical centers and a shift of mainstream interest to biomedical NLP. This review primarily focuses on the recently renewed interest in development of fundamental NLP methods and advances in the NLP systems for CDS. The current solutions to challenges posed by distinct sublanguages, intended user groups, and support goals are discussed.
Collapse
Affiliation(s)
- Dina Demner-Fushman
- U.S. National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA.
| | | | | |
Collapse
|