1
Bilal M, Hamza A, Malik N. NLP for Analyzing Electronic Health Records and Clinical Notes in Cancer Research: A Review. J Pain Symptom Manage 2025; 69:e374-e394. [PMID: 39894080 DOI: 10.1016/j.jpainsymman.2025.01.019] [Received: 10/31/2024] [Revised: 12/31/2024] [Accepted: 01/20/2025] [Indexed: 02/04/2025]
Abstract
This review examines the application of natural language processing (NLP) techniques in cancer research using electronic health records (EHRs) and clinical notes. It addresses gaps in existing literature by providing a broader perspective than previous studies focused on specific cancer types or applications. A comprehensive literature search in the Scopus database identified 94 relevant studies published between 2019 and 2024. The analysis revealed a growing trend in NLP applications for cancer research, with information extraction (47 studies) and text classification (40 studies) emerging as predominant NLP tasks, followed by named entity recognition (7 studies). Among cancer types, breast, lung, and colorectal cancers were found to be the most studied. A significant shift from rule-based and traditional machine learning approaches to advanced deep learning techniques and transformer-based models was observed. It was found that dataset sizes used in existing studies varied widely, ranging from small, manually annotated datasets to large-scale EHRs. The review highlighted key challenges, including the limited generalizability of proposed solutions and the need for improved integration into clinical workflows. While NLP techniques show significant potential in analyzing EHRs and clinical notes for cancer research, future work should focus on improving model generalizability, enhancing robustness in handling complex clinical language, and expanding applications to understudied cancer types. The integration of NLP tools into palliative medicine and addressing ethical considerations remain crucial for utilizing the full potential of NLP in enhancing cancer diagnosis, treatment, and patient outcomes. This review provides valuable insights into the current state and future directions of NLP applications in cancer research.
Affiliation(s)
- Muhammad Bilal
- Department of Pharmaceutical Outcomes and Policy (M.B.), University of Florida, Gainesville, Florida, USA; Department of Software Engineering (M.B.), National University of Computer and Emerging Sciences, Islamabad, Pakistan.
- Ameer Hamza
- Department of Computer Science (A.H.), Faculty of Computing and IT, University of Sargodha, Sargodha, Punjab, Pakistan
- Nadia Malik
- Department of Software Engineering (N.M.), Faculty of Computing and IT, University of Sargodha, Sargodha, Punjab, Pakistan
2
Stammers M, Ramgopal B, Owusu Nimako A, Vyas A, Nouraei R, Metcalf C, Batchelor J, Shepherd J, Gwiggner M. A foundation systematic review of natural language processing applied to gastroenterology & hepatology. BMC Gastroenterol 2025; 25:58. [PMID: 39915703 PMCID: PMC11800601 DOI: 10.1186/s12876-025-03608-5] [Received: 10/15/2024] [Accepted: 01/13/2025] [Indexed: 02/11/2025]
Abstract
OBJECTIVE This review assesses the progress of NLP in gastroenterology to date, grades the robustness of the methodology, exposes the field to a new generation of authors, and highlights opportunities for future research. DESIGN Seven scholarly databases (ACM Digital Library, arXiv, Embase, IEEE Xplore, PubMed, Scopus and Google Scholar) were searched for studies published between 2015 and 2023 that met the inclusion criteria. Studies lacking a description of appropriate validation or NLP methods were excluded, as were studies unavailable in English, those focused on non-gastrointestinal diseases and those that were duplicates. Two independent reviewers extracted study information, clinical/algorithm details, and relevant outcome data. Methodological quality and bias risks were appraised using a checklist of quality indicators for NLP studies. RESULTS Fifty-three studies were identified utilising NLP in endoscopy, inflammatory bowel disease, gastrointestinal bleeding, liver and pancreatic disease. Colonoscopy was the focus of 21 (38.9%) studies; 13 (24.1%) focused on liver disease, 7 (13.0%) on inflammatory bowel disease, 4 (7.4%) on gastroscopy, 4 (7.4%) on pancreatic disease and 2 (3.7%) on endoscopic sedation/ERCP and gastrointestinal bleeding. Only 30 (56.6%) of the studies reported patient demographics, and only 13 (24.5%) had a low risk of validation bias. Thirty-five (66%) studies mentioned generalisability, but only 5 (9.4%) mentioned explainability or shared code/models. CONCLUSION NLP can unlock substantial clinical information from free-text notes stored in EPRs and is already being used, particularly to interpret colonoscopy and radiology reports. However, the models we have thus far lack transparency, leading to duplication, bias, and doubts about generalisability. Therefore, greater clinical engagement, collaboration, and open sharing of appropriate datasets and code are needed.
Affiliation(s)
- Matthew Stammers
- University Hospital Southampton, Tremona Road, Southampton, SO16 6YD, UK.
- Southampton Emerging Therapies and Technologies (SETT) Centre, Southampton, SO16 6YD, UK.
- Clinical Informatics Research Unit (CIRU), Coxford Road, Southampton, SO16 5AF, UK.
- University of Southampton, Southampton, SO17 1BJ, UK.
- Anand Vyas
- University Hospital Southampton, Tremona Road, Southampton, SO16 6YD, UK
- Reza Nouraei
- Clinical Informatics Research Unit (CIRU), Coxford Road, Southampton, SO16 5AF, UK
- University of Southampton, Southampton, SO17 1BJ, UK
- Queen's Medical Centre, ENT Department, Nottingham, NG7 2UH, UK
- Cheryl Metcalf
- University of Southampton, Southampton, SO17 1BJ, UK
- School of Healthcare Enterprise and Innovation, University of Southampton, University of Southampton Science Park, Enterprise Road, Chilworth, Southampton, SO16 7NS, UK
- James Batchelor
- Clinical Informatics Research Unit (CIRU), Coxford Road, Southampton, SO16 5AF, UK
- University of Southampton, Southampton, SO17 1BJ, UK
- Jonathan Shepherd
- Southampton Health Technologies Assessment Centre (SHTAC), Enterprise Road, Alpha House, Southampton, SO16 7NS, England
- Markus Gwiggner
- University Hospital Southampton, Tremona Road, Southampton, SO16 6YD, UK
- University of Southampton, Southampton, SO17 1BJ, UK
3
Russell CD, Daley AV, Van Arnem DR, Hila AV, Johnson KJ, Davies JN, Cytron HS, Ready KJ, Armstrong CM, Sylvester ME, Caleshu CA. Validation of a guidelines-based digital tool to assess the need for germline cancer genetic testing. Hered Cancer Clin Pract 2024; 22:24. [PMID: 39516903 PMCID: PMC11545665 DOI: 10.1186/s13053-024-00298-0] [Received: 07/15/2024] [Accepted: 10/17/2024] [Indexed: 11/16/2024]
Abstract
BACKGROUND Efficient and scalable solutions are needed to identify patients who qualify for germline cancer genetic testing. We evaluated the clinical validity of a brief, patient-administered hereditary cancer risk assessment digital tool programmed to assess if patients meet criteria for germline genetic testing, based on personal and family history, and in line with national guidelines. METHODS We applied the tool to cases seen in a nationwide telehealth genetic counseling practice. Validity of the tool was evaluated by comparing the tool's assessment to that of the genetic counselor who saw the patient. Patients' histories were extracted from genetic counselor-collected pedigrees and input into the tool by the research team to model how a patient would complete the tool. We also validated the tool's assessment of which specific aspects of the personal and family history met criteria for genetic testing. Descriptive statistics were used. RESULTS Of the 152 cases (80% female, mean age 52.3), 56% had a personal history of cancer and 66% met genetic testing criteria. The tool and genetic counselor agreed in 96% of cases. Most disagreements (4/6; 67%) occurred because the genetic counselor's assessment relied on details the tool was not programmed to collect since patients typically don't have access to the relevant information (pathology details, risk models). We also found complete agreement between the tool and research team on which specific aspects of the patient's history met criteria for genetic testing. CONCLUSION We observed a high level of agreement with genetic counselor assessments, affirming the tool's clinical validity in identifying individuals for hereditary cancer predisposition testing and its potential for increasing access to hereditary cancer risk assessment.
Affiliation(s)
- Callan D Russell
- Genome Medical, 611 Gateway Blvd Suite 120, South San Francisco, CA, 94080, USA
- Northside Hospital, 1000 Johnson Ferry Rd NE, Atlanta, GA, 30342, USA
- Ashley V Daley
- Genome Medical, 611 Gateway Blvd Suite 120, South San Francisco, CA, 94080, USA
- Durand R Van Arnem
- Genome Medical, 611 Gateway Blvd Suite 120, South San Francisco, CA, 94080, USA
- Andi V Hila
- Genome Medical, 611 Gateway Blvd Suite 120, South San Francisco, CA, 94080, USA
- Kiley J Johnson
- Genome Medical, 611 Gateway Blvd Suite 120, South San Francisco, CA, 94080, USA
- Jill N Davies
- Genome Medical, 611 Gateway Blvd Suite 120, South San Francisco, CA, 94080, USA
- Hanah S Cytron
- Genome Medical, 611 Gateway Blvd Suite 120, South San Francisco, CA, 94080, USA
- Kaylene J Ready
- GeneMatters, 611 Gateway Blvd Suite 120, South San Francisco, CA, 94080, USA
- Cary M Armstrong
- Genome Medical, 611 Gateway Blvd Suite 120, South San Francisco, CA, 94080, USA
- Mark E Sylvester
- Genome Medical, 611 Gateway Blvd Suite 120, South San Francisco, CA, 94080, USA
- Colleen A Caleshu
- Genome Medical, 611 Gateway Blvd Suite 120, South San Francisco, CA, 94080, USA.
4
Kiser D, Elhanan G, Bolze A, Neveux I, Schlauch KA, Metcalf WJ, Cirulli ET, McCarthy C, Greenberg LA, Grime S, Blitstein JMS, Plauth W, Grzymski JJ. Screening Familial Risk for Hereditary Breast and Ovarian Cancer. JAMA Netw Open 2024; 7:e2435901. [PMID: 39320887 PMCID: PMC11425146 DOI: 10.1001/jamanetworkopen.2024.35901] [Indexed: 09/26/2024]
Abstract
IMPORTANCE Most patients with pathogenic or likely pathogenic (P/LP) variants for breast cancer have not undergone genetic testing. OBJECTIVE To identify patients meeting family history criteria for genetic testing in the electronic health record (EHR). DESIGN, SETTING, AND PARTICIPANTS This study included both cross-sectional (observation date, February 1, 2024) and retrospective cohort (observation period, January 1, 2018, to February 1, 2024) analyses. Participants included patients aged 18 to 79 years enrolled in Renown Health, a large health system in Northern Nevada. Genotype was known for 38 003 patients enrolled in Healthy Nevada Project (HNP), a population genomics study. EXPOSURE An EHR indicating that a patient is positive for criteria according to the Seven-Question Family History Questionnaire (hereafter, FHS7 positive) assessing familial risk for hereditary breast and ovarian cancer (HBOC). MAIN OUTCOMES AND MEASURES The primary outcomes were the presence of P/LP variants in the ATM, BRCA1, BRCA2, CHEK2, or PALB2 genes (cross-sectional analysis) or a diagnosis of cancer (cohort analysis). Age-adjusted cancer incidence rates per 100 000 patients per year were calculated using the 2020 US population as the standard. Hazard ratios (HRs) for cancer attributable to FHS7-positive status were estimated using cause-specific hazard models. RESULTS Among 835 727 patients, 423 393 (50.7%) were female and 29 913 (3.6%) were FHS7 positive. Among those who were FHS7 positive, 24 535 (82.0%) had no evidence of prior genetic testing for HBOC in their EHR. Being FHS7 positive was associated with increased prevalence of P/LP variants in BRCA1/BRCA2 (odds ratio [OR], 3.34; 95% CI, 2.48-4.47), CHEK2 (OR, 1.62; 95% CI, 1.05-2.43), and PALB2 (OR, 2.84; 95% CI, 1.23-6.16) among HNP female individuals, and in BRCA1/BRCA2 (OR, 3.35; 95% CI, 1.93-5.56) among HNP male individuals. Being FHS7 positive was also associated with significantly increased risk of cancer among 131 622 non-HNP female individuals (HR, 1.44; 95% CI, 1.22-1.70) but not among 114 982 non-HNP male individuals (HR, 1.11; 95% CI, 0.87-1.42). Among 1527 HNP survey respondents, 352 of 383 EHR-FHS7 positive patients (91.9%) were survey-FHS7 positive, but only 352 of 883 survey-FHS7 positive patients (39.9%) were EHR-FHS7 positive. Of the 29 913 FHS7-positive patients, 19 764 (66.1%) were identified only after parsing free-text family history comments. Socioeconomic differences were also observed between EHR-FHS7-negative and EHR-FHS7-positive patients, suggesting disparities in recording family history. CONCLUSIONS AND RELEVANCE In this cross-sectional study, EHR-derived FHS7 identified thousands of patients with familial risk for breast cancer, indicating a substantial gap in genetic testing. However, limitations in EHR family history data suggested that other identification methods, such as direct-to-patient questionnaires, are required to fully address this gap.
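The proportions quoted in this abstract can be re-derived from the counts it reports. The following is a minimal arithmetic check, not the authors' analysis code; the variable names and the `pct` helper are illustrative assumptions.

```python
# Counts copied from the abstract; all names here are illustrative only.
both_positive = 352      # positive on both the EHR-FHS7 and the survey-FHS7
ehr_positive = 383       # EHR-FHS7 positive among the 1527 survey respondents
survey_positive = 883    # survey-FHS7 positive among the same respondents
fhs7_positive_total = 29_913


def pct(numerator: int, denominator: int) -> float:
    """Return a percentage rounded to one decimal place."""
    return round(100 * numerator / denominator, 1)


# Share of EHR-positive patients that the survey also flagged.
ehr_confirmed_by_survey = pct(both_positive, ehr_positive)
# Share of survey-positive patients that the EHR also flagged.
survey_captured_by_ehr = pct(both_positive, survey_positive)
# Share of FHS7-positive patients found only by parsing free-text comments.
found_only_in_free_text = pct(19_764, fhs7_positive_total)
# Share of FHS7-positive patients with no prior HBOC testing in the EHR.
no_prior_testing = pct(24_535, fhs7_positive_total)

print(ehr_confirmed_by_survey, survey_captured_by_ehr)
print(found_only_in_free_text, no_prior_testing)
```

The asymmetry between the first two figures is the point of the comparison: most EHR-flagged patients are confirmed by the survey, while the EHR captures well under half of the patients the survey flags.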
Affiliation(s)
- Daniel Kiser
- University of Nevada Reno School of Medicine, Reno
- Gai Elhanan
- University of Nevada Reno School of Medicine, Reno
- Iva Neveux
- University of Nevada Reno School of Medicine, Reno
- Joseph J Grzymski
- University of Nevada Reno School of Medicine, Reno
- Renown Health, Reno, Nevada
5
Wang L, Larki NR, Dobkin J, Salgado S, Ahmad N, Kaplan DE, Yang W, Yang YX. A Clinical Prediction Model to Assess Risk for Pancreatic Cancer Among Patients With Acute Pancreatitis. Pancreas 2024; 53:e254-e259. [PMID: 38266222 PMCID: PMC11214820 DOI: 10.1097/mpa.0000000000002295] [Indexed: 01/26/2024]
Abstract
OBJECTIVES We aimed to develop and validate a prediction model as the first step in a sequential screening strategy to identify acute pancreatitis (AP) individuals at risk for pancreatic cancer (PC). MATERIALS AND METHODS We performed a population-based retrospective cohort study among individuals 40 years or older with a hospitalization for AP in the US Veterans Health Administration. For variable selection, we used least absolute shrinkage and selection operator regression with 10-fold cross-validation to identify a parsimonious logistic regression model for predicting the outcome, PC diagnosed within 2 years after AP. We evaluated model discrimination and calibration. RESULTS Among 51,613 eligible study patients with AP, 801 individuals were diagnosed with PC within 2 years. The final model (area under the receiver operating curve, 0.70; 95% confidence interval, 0.67-0.73) included histories of gallstones, pancreatic cyst, alcohol use, smoking, and levels of bilirubin, triglycerides, alkaline phosphatase, aspartate aminotransferase, alanine aminotransferase, and albumin. If the predicted risk threshold was set at 2% over 2 years, 20.3% of the AP population would undergo definitive screening, identifying nearly 50% of PC associated with AP. CONCLUSIONS We developed a prediction model using widely available clinical factors to identify high-risk patients with PC-associated AP, the first step in a sequential screening strategy.
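To make the screening trade-off in these results concrete, the sketch below works through the arithmetic implied by the 2% risk threshold. It is a rough illustration, not part of the study: the abstract says "nearly 50%" of cancers would be identified, and the code treats that as exactly 50%, so the derived yield is an approximate upper bound. All variable names are illustrative.

```python
# Figures copied from the abstract.
total_ap_patients = 51_613     # eligible patients with acute pancreatitis
pc_within_2_years = 801        # pancreatic cancer diagnosed within 2 years
fraction_screened = 0.203      # share with predicted risk at or above 2%

# ASSUMPTION: "nearly 50%" is treated as exactly 50% (an upper bound).
fraction_pc_captured = 0.50

# Risk of pancreatic cancer in the whole cohort, before any selection.
baseline_risk = pc_within_2_years / total_ap_patients

# Patients sent on to definitive screening and cancers found among them.
n_screened = round(total_ap_patients * fraction_screened)
pc_found = fraction_pc_captured * pc_within_2_years

# Risk within the screened group, and how it compares with the baseline.
yield_among_screened = pc_found / n_screened
enrichment = yield_among_screened / baseline_risk

print(f"baseline risk: {baseline_risk:.2%}")
print(f"patients screened: {n_screened}")
print(f"yield among screened: {yield_among_screened:.2%}")
print(f"enrichment over baseline: {enrichment:.1f}x")
```

Under this assumption the model concentrates roughly half of the cancers into about a fifth of the cohort, which is what makes it usable as a first step before a more definitive test.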
Affiliation(s)
- Louise Wang
- Section of Digestive Diseases, Yale School of Medicine, New Haven, CT
- VA Connecticut Healthcare System, West Haven, CT, USA
- Navid Rahimi Larki
- Section of Digestive Diseases, Yale School of Medicine, New Haven, CT
- VA Connecticut Healthcare System, West Haven, CT, USA
- Jane Dobkin
- Columbia Irving Medical Center, New York City, NY
- Sanjay Salgado
- Division of Gastroenterology and Hepatology, New York Presbyterian Hospital/Weill Cornell Medical College, New York, New York, USA
- Nuzhat Ahmad
- Division of Gastroenterology and Hepatology, Perelman School of Medicine, Philadelphia, PA
- David E. Kaplan
- Division of Gastroenterology and Hepatology, Perelman School of Medicine, Philadelphia, PA
- Corporal Michael J. Crescenz VA Medical Center, Philadelphia, PA
- Wei Yang
- Department of Biostatistics, Epidemiology and Informatics, Perelman School of Medicine, Philadelphia, PA
- Yu-Xiao Yang
- Division of Gastroenterology and Hepatology, Perelman School of Medicine, Philadelphia, PA
- Corporal Michael J. Crescenz VA Medical Center, Philadelphia, PA
6
Bradshaw RL, Kawamoto K, Bather JR, Goodman MS, Kohlmann WK, Chavez-Yenter D, Volkmar M, Monahan R, Kaphingst KA, Del Fiol G. Enhanced family history-based algorithms increase the identification of individuals meeting criteria for genetic testing of hereditary cancer syndromes but would not reduce disparities on their own. J Biomed Inform 2024; 149:104568. [PMID: 38081564 PMCID: PMC10842777 DOI: 10.1016/j.jbi.2023.104568] [Received: 09/21/2023] [Revised: 11/07/2023] [Accepted: 12/07/2023] [Indexed: 12/17/2023]
Abstract
OBJECTIVE This study aimed to 1) investigate algorithm enhancements for identifying patients eligible for genetic testing of hereditary cancer syndromes using family history data from electronic health records (EHRs); and 2) assess their impact on relative differences across sex, race, ethnicity, and language preference. MATERIALS AND METHODS The study used EHR data from a tertiary academic medical center. A baseline rule-based algorithm, relying on structured family history data (structured data; SD), was enhanced using a natural language processing (NLP) component and a relaxed criteria algorithm (partial match [PM]). The identification rates and differences were analyzed considering sex, race, ethnicity, and language preference. RESULTS Among 120,007 patients aged 25-60, detection rate differences were found across all groups using the SD (all P < 0.001). Both enhancements increased identification rates; NLP led to a 1.9% increase and the relaxed criteria algorithm (PM) led to an 18.5% increase (both P < 0.001). Combining SD with NLP and PM yielded a 20.4% increase (P < 0.001). Similar increases were observed within subgroups. Relative differences persisted across most categories for the enhanced algorithms, with disproportionately higher identification of patients who are White, female, non-Hispanic, and whose preferred language is English. CONCLUSION Algorithm enhancements increased identification rates for patients eligible for genetic testing of hereditary cancer syndromes, regardless of sex, race, ethnicity, and language preference. However, differences in identification rates persisted, emphasizing the need for additional strategies to reduce disparities such as addressing underlying biases in EHR family health information and selectively applying algorithm enhancements for disadvantaged populations. Systematic assessment of differences in algorithm performance across population subgroups should be incorporated into algorithm development processes.
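The three-part design described in this abstract (a structured-data rule, an NLP component, and a relaxed partial-match rule) can be illustrated with a toy sketch. Everything below is hypothetical: the abstract does not give the study's criteria or describe its NLP component, so the age cut-off, the regular expression, the class and function names, and the four sample patients are invented purely to show how the layers combine.

```python
import re
from dataclasses import dataclass
from typing import List, Optional

FIRST_DEGREE = {"mother", "father", "sister", "brother", "daughter", "son"}
MAX_ONSET_AGE = 45  # HYPOTHETICAL cut-off, not a guideline or study value


@dataclass
class FamilyHistoryEntry:
    relative: str
    condition: str
    onset_age: Optional[int]  # None when the age was not recorded


def meets_rule(entry: FamilyHistoryEntry, partial_match: bool = False) -> bool:
    """Baseline rule; partial_match relaxes it when the onset age is missing."""
    if entry.relative not in FIRST_DEGREE or entry.condition != "breast cancer":
        return False
    if entry.onset_age is None:
        return partial_match
    return entry.onset_age <= MAX_ONSET_AGE


# Toy stand-in for an NLP component: one regular expression over the comment.
COMMENT_PATTERN = re.compile(
    r"\b(mother|father|sister|brother|daughter|son)\b.*?\bbreast cancer\b"
    r"(?:.*?\bage (\d{1,3})\b)?",
    re.IGNORECASE,
)


def parse_comment(comment: str) -> Optional[FamilyHistoryEntry]:
    """Turn a free-text comment into a structured entry, if one is found."""
    match = COMMENT_PATTERN.search(comment)
    if match is None:
        return None
    age = int(match.group(2)) if match.group(2) else None
    return FamilyHistoryEntry(match.group(1).lower(), "breast cancer", age)


def identify(structured: List[FamilyHistoryEntry], comment: str,
             use_nlp: bool = False, use_pm: bool = False) -> bool:
    """Apply the rule to structured entries, optionally adding parsed text."""
    entries = list(structured)
    if use_nlp:
        parsed = parse_comment(comment)
        if parsed is not None:
            entries.append(parsed)
    return any(meets_rule(e, partial_match=use_pm) for e in entries)


# Four invented patients: (structured entries, free-text comment).
patients = [
    ([FamilyHistoryEntry("mother", "breast cancer", 40)], ""),
    ([], "Sister dx breast cancer at age 41"),
    ([FamilyHistoryEntry("mother", "breast cancer", None)], ""),
    ([], "Mother had breast cancer, age unknown"),
]


def count(use_nlp: bool, use_pm: bool) -> int:
    return sum(identify(sd, text, use_nlp, use_pm) for sd, text in patients)


counts = {
    "SD": count(False, False),
    "SD+NLP": count(True, False),
    "SD+PM": count(False, True),
    "SD+NLP+PM": count(True, True),
}
print(counts)
```

In this toy data the fourth patient is identified only when both enhancements are active, since the relative appears only in free text and the onset age is missing. The sketch says nothing about subgroup differences, which the study measured on real EHR data.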
Affiliation(s)
- Richard L Bradshaw
- Department of Biomedical Informatics, University of Utah, Salt Lake City, UT, USA; University of Utah Health, Salt Lake City, UT, USA
- Kensaku Kawamoto
- Department of Biomedical Informatics, University of Utah, Salt Lake City, UT, USA; University of Utah Health, Salt Lake City, UT, USA
- Jemar R Bather
- Department of Biostatistics, New York University School of Global Public Health, New York, NY, USA; Center for Anti-racism, Social Justice, & Public Health, New York University School of Global Public Health, New York, NY, USA
- Melody S Goodman
- Department of Biostatistics, New York University School of Global Public Health, New York, NY, USA; Center for Anti-racism, Social Justice, & Public Health, New York University School of Global Public Health, New York, NY, USA
- Wendy K Kohlmann
- University of Utah Health, Salt Lake City, UT, USA; Department of Population Health Sciences, University of Utah, Salt Lake City, UT, USA; Huntsman Cancer Institute, University of Utah, Salt Lake City, UT, USA
- Daniel Chavez-Yenter
- Huntsman Cancer Institute, University of Utah, Salt Lake City, UT, USA; Department of Communication, University of Utah, Salt Lake City, UT, USA
- Molly Volkmar
- Huntsman Cancer Institute, University of Utah, Salt Lake City, UT, USA
- Kimberly A Kaphingst
- Huntsman Cancer Institute, University of Utah, Salt Lake City, UT, USA; Department of Communication, University of Utah, Salt Lake City, UT, USA
- Guilherme Del Fiol
- Department of Biomedical Informatics, University of Utah, Salt Lake City, UT, USA; University of Utah Health, Salt Lake City, UT, USA.