1
Wang S, Wang X, Xia J, Mu Q. Identification of M1 macrophage infiltration-related genes for immunotherapy in Her2-positive breast cancer based on bioinformatics analysis and machine learning. Sci Rep 2025; 15:12525. [PMID: 40216945 PMCID: PMC11992169 DOI: 10.1038/s41598-025-96917-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/26/2024] [Accepted: 04/01/2025] [Indexed: 04/14/2025]
Abstract
Over the past several decades, the number of breast cancer patients has increased significantly. Among the four subtypes of breast cancer, Her2-positive breast cancer is one of the most aggressive. In this study, we screened differentially expressed genes from The Cancer Genome Atlas-Breast cancer database and analyzed the relationship between immune cell infiltration and differentially expressed genes using weighted gene co-expression network analysis. By constructing a module-trait relationship heatmap, we selected the red module, which had the highest correlation with M1 macrophages. Twenty hub genes were selected based on a protein-protein interaction network. Then, four overlapping M1 macrophage infiltration-related genes (M1 MIRGs), namely CCDC69, PPP1R16B, IL21R, and FOXP3, were obtained using five machine-learning algorithms. Subsequently, nomogram models were constructed to predict the incidence of Her2-positive breast cancer. External datasets and receiver operating characteristic curve analysis were used to validate the accuracy of the four M1 MIRGs and the nomogram models. The average area under the curve for the nomogram models was higher than 0.75 in both the training and testing sets. Survival analysis then showed that higher expression of CCDC69, PPP1R16B, and IL21R was associated with overall survival of Her2-positive breast cancer patients. The expression of CCDC69 and PPP1R16B could lead to more benefits than the expression of IL21R and FOXP3 for immunotherapy. Lastly, we conducted immunohistochemistry staining to validate these results. In conclusion, we found four M1 MIRGs that may be helpful for the diagnosis, prognosis, and immunotherapy of Her2-positive breast cancer.
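The area under the receiver operating characteristic curve cited in this abstract can be read as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative one. The following is a minimal sketch of that pairwise definition, not the authors' code; the function name `roc_auc` and the toy labels and scores are assumptions made for illustration.

```python
from typing import Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Area under the ROC curve via pairwise comparison of scores.

    labels: 1 for a positive case, 0 for a negative case (hypothetical data).
    scores: model output, higher meaning "more likely positive".
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0      # positive ranked above negative
            elif p == n:
                wins += 0.5      # ties count as half
    return wins / (len(pos) * len(neg))


# Toy example: three of the four positive/negative pairs are ordered correctly.
auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(auc)
```

The quadratic loop is fine for a sketch; a rank-based formulation would be the usual choice for large datasets.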
Affiliation(s)
- Sizhang Wang
  - Qingdao Medical College of Qingdao University, Qingdao, 266042, Shandong, China
  - Department of Breast surgery, Qingdao Central Hospital, University of Health and Rehabilitation Sciences, Qingdao, 266042, Shandong, China
- Xiaoyan Wang
  - General Practice Department, Qingdao Central Hospital, University of Health and Rehabilitation Sciences, Qingdao, 266042, Shandong, China
- Jing Xia
  - Department of Breast surgery, Qingdao Central Hospital, University of Health and Rehabilitation Sciences, Qingdao, 266042, Shandong, China
- Qiang Mu
  - Department of Breast surgery, Qingdao Central Hospital, University of Health and Rehabilitation Sciences, Qingdao, 266042, Shandong, China
2
Lee K, Liu Z, Huang Q, Corrigan D, Kalsekar I, Jun T, Stolovitzky G, Oh WK, Rajaram R, Wang X. Decoding Recurrence in Early-Stage and Locoregionally Advanced Non-Small Cell Lung Cancer: Insights From Electronic Health Records and Natural Language Processing. JCO Clin Cancer Inform 2025; 9:e2400227. [PMID: 40249880 PMCID: PMC12011440 DOI: 10.1200/cci-24-00227] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/06/2024] [Revised: 01/20/2025] [Accepted: 03/04/2025] [Indexed: 04/20/2025]
Abstract
PURPOSE Recurrences after curative resection in early-stage and locoregionally advanced non-small cell lung cancer (NSCLC) are common, necessitating a nuanced understanding of associated risk factors. This study aimed to establish a natural language processing (NLP) system to efficiently curate recurrence data in NSCLC and analyze risk factors longitudinally. PATIENTS AND METHODS Electronic health records of 6,351 patients with NSCLC with >700,000 notes were obtained from Mount Sinai's data sets. A deep learning-based customized NLP system was developed to identify cohorts experiencing recurrence. Recurrence types and rates over time were stratified by various clinical features. Cohort description analysis, Kaplan-Meier analysis for overall recurrence-free survival (RFS) and distant metastasis-free survival (DMFS), and Cox proportional hazards analysis were performed. RESULTS Of 1,295 patients with stage I-IIIA NSCLC with surgical resections, 336 patients (25.9%) experienced recurrence, as identified through NLP. The NLP system achieved a precision of 94.3%, a recall of 93%, and an F1 score of 93.5. Among 336 patients, 52.4% had local/regional recurrences, 44% distant metastases, and 3.6% unknown recurrence. RFS rates at years 1-5 were 93%, 81%, 73%, 67%, and 61%, respectively (96%, 89%, 84%, 80%, and 75% for distant metastasis). Stage-specific RFS rates at year 5 were 73% (IA), 62% (IB), 47% (IIA), 46% (IIB), and 20% (IIIA). Stage IB patients had a significantly higher likelihood of recurrence versus stage IA (adjusted hazard ratio [aHR], 1.63; P = .02). The RFS was lower in patients with clinically significant TP53 alteration (v TP53-negative or unknown significance), affecting overall RFS (aHR, 1.89; P = .007) and DMFS (aHR, 2.47; P = .009) among stage IA/IB patients. CONCLUSION Our scalable NLP system enabled us to generate real-world insights into NSCLC recurrences, paving the way for predictive models for preventing, diagnosing, and treating NSCLC recurrence.
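The recurrence-free survival rates in this abstract come from Kaplan-Meier analysis. As a rough illustration of how such a curve is computed, the sketch below implements the product-limit estimator in plain Python. It is not the study's code, and the function name `kaplan_meier` and the follow-up data are hypothetical.

```python
from typing import List, Sequence, Tuple


def kaplan_meier(times: Sequence[float], events: Sequence[int]) -> List[Tuple[float, float]]:
    """Product-limit estimate of the survival function.

    times:  follow-up time for each patient (hypothetical units, e.g. years).
    events: 1 if recurrence was observed at that time, 0 if the patient was censored.
    Returns (time, estimated survival probability) at each distinct event time.
    """
    if len(times) != len(events):
        raise ValueError("times and events must have the same length")
    survival = 1.0
    curve = []
    event_times = sorted({t for t, e in zip(times, events) if e})
    for t in event_times:
        at_risk = sum(1 for u in times if u >= t)            # still under follow-up at t
        observed = sum(1 for u, e in zip(times, events) if u == t and e)
        survival *= 1.0 - observed / at_risk                 # conditional survival at t
        curve.append((t, survival))
    return curve


# Hypothetical cohort of five patients; two are censored (event = 0).
curve = kaplan_meier([1, 2, 2, 3, 4], [1, 1, 0, 1, 0])
for t, s in curve:
    print(t, round(s, 3))
```

Censored patients stay in the at-risk count until their last follow-up time, which is what distinguishes this estimate from a simple proportion of patients without recurrence.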
Affiliation(s)
- Qing Huang
  - Lung Cancer Initiative, Johnson & Johnson, New Brunswick, NJ
- William K. Oh
  - GeneDx (Sema4), Stamford, CT
  - Icahn School of Medicine at Mount Sinai, New York, NY
- Ravi Rajaram
  - Department of Thoracic and Cardiovascular Surgery, The University of Texas MD Anderson Cancer Center, Houston, TX
3
Silveira JA, da Silva AR, de Lima MZT. Harnessing artificial intelligence for predicting breast cancer recurrence: a systematic review of clinical and imaging data. Discov Oncol 2025; 16:135. [PMID: 39921795 PMCID: PMC11807043 DOI: 10.1007/s12672-025-01908-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/17/2024] [Accepted: 02/03/2025] [Indexed: 02/10/2025]
Abstract
Breast cancer is a leading cause of mortality among women, with recurrence prediction remaining a significant challenge. In this context, artificial intelligence (AI) applications can serve as powerful tools for analyzing large amounts of data and predicting cancer recurrence, potentially enabling personalized medical treatment and improving the patient's quality of life. This systematic review examines the role of AI in predicting breast cancer recurrence using clinical data, imaging data, and combined datasets. Support Vector Machines (SVMs) and Neural Networks, especially when applied to combined data, demonstrate strong potential in improving prediction accuracy. SVMs are effective with high-dimensional clinical data, while Neural Networks are effective in genetic and molecular analysis. Despite these advancements, limitations such as dataset diversity, sample size, and evaluation standardization persist, emphasizing the need for further research. AI integration in recurrence prediction offers promising prospects for personalized care but requires rigorous validation for safe clinical application.
Affiliation(s)
- Alexandre Ray da Silva
  - OncoAI, Oncologia Inteligência Artificial, Cel Jose Eusebio, 95, Sao Paulo, Sao Paulo, 01239-030, Brazil
- Mariana Zuliani Theodoro de Lima
  - OncoAI, Oncologia Inteligência Artificial, Cel Jose Eusebio, 95, Sao Paulo, Sao Paulo, 01239-030, Brazil
  - Engineering School, Mackenzie Presbyterian University, Consolacao street, 930, Sao Paulo, Sao Paulo, 01302-907, Brazil
4
Lee JJ, Zepeda A, Arbour G, Isaac KV, Ng RT, Nichol AM. Automated Identification of Breast Cancer Relapse in Computed Tomography Reports Using Natural Language Processing. JCO Clin Cancer Inform 2024; 8:e2400107. [PMID: 39705642 DOI: 10.1200/cci.24.00107] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/01/2024] [Revised: 08/15/2024] [Accepted: 10/18/2024] [Indexed: 12/22/2024]
Abstract
PURPOSE Breast cancer relapses are rarely collected by cancer registries because of logistical and financial constraints. Hence, we investigated natural language processing (NLP), enhanced with state-of-the-art deep learning transformer tools and large language models, to automate relapse identification in the text of computed tomography (CT) reports. METHODS We analyzed follow-up CT reports from patients diagnosed with breast cancer between January 1, 2005, and December 31, 2014. The reports were curated and annotated for the presence or absence of local, regional, and distant breast cancer relapses. We performed 10-fold cross-validation to evaluate models identifying different types of relapses in CT reports. Model performance was assessed with classification metrics, reported with 95% confidence intervals. RESULTS In our data set of 1,445 CT reports, 799 (55.3%) described any relapse, 72 (5.0%) local relapses, 97 (6.7%) regional relapses, and 743 (51.4%) distant relapses. The any-relapse model achieved an accuracy of 89.6% (87.8-91.1), with a sensitivity of 93.2% (91.4-94.9) and a specificity of 84.2% (80.9-87.1). The local relapse model achieved an accuracy of 94.6% (93.3-95.7), a sensitivity of 44.4% (32.8-56.3), and a specificity of 97.2% (96.2-98.0). The regional relapse model showed an accuracy of 93.6% (92.3-94.9), a sensitivity of 70.1% (60.0-79.1), and a specificity of 95.3% (94.2-96.5). Finally, the distant relapse model demonstrated an accuracy of 88.1% (86.2-89.7), a sensitivity of 91.8% (89.9-93.8), and a specificity of 83.7% (80.5-86.4). CONCLUSION We developed NLP models to identify local, regional, and distant breast cancer relapses from CT reports. Automating the identification of breast cancer relapses can enhance data collection about patient outcomes.
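The local relapse figures above show how a model can reach high accuracy while missing most positives when the positive class is rare (72 of 1,445 reports). The sketch below computes the three metrics from a confusion matrix. The counts are an assumption: they were chosen to be consistent with the reported percentages and are not taken from the paper, and the function name `classification_metrics` is hypothetical.

```python
from typing import Dict


def classification_metrics(tp: int, fp: int, tn: int, fn: int) -> Dict[str, float]:
    """Sensitivity, specificity, and accuracy from confusion-matrix counts."""
    return {
        "sensitivity": tp / (tp + fn),              # share of true relapses that were flagged
        "specificity": tn / (tn + fp),              # share of non-relapses correctly cleared
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
    }


# Assumed counts: 72 local relapses (32 detected) among 1,445 reports.
metrics = classification_metrics(tp=32, fp=38, tn=1335, fn=40)
for name, value in metrics.items():
    print(name, round(100 * value, 1))
```

Because 1,373 of the 1,445 reports are negative in this scenario, accuracy is dominated by specificity, which is why sensitivity needs to be reported alongside it.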
Affiliation(s)
- Jaimie J Lee
  - Department of Radiation Oncology, BC Cancer, Vancouver, BC, Canada
  - Department of Surgery, University of British Columbia, Vancouver, BC, Canada
- Andres Zepeda
  - Department of Computer Science, University of British Columbia, Vancouver, BC, Canada
- Gregory Arbour
  - Department of Computer Science, University of British Columbia, Vancouver, BC, Canada
- Kathryn V Isaac
  - Department of Surgery, University of British Columbia, Vancouver, BC, Canada
- Raymond T Ng
  - Department of Computer Science, University of British Columbia, Vancouver, BC, Canada
- Alan M Nichol
  - Department of Radiation Oncology, BC Cancer, Vancouver, BC, Canada
  - Department of Surgery, University of British Columbia, Vancouver, BC, Canada
5
Lee S, Kim JH, Ha HI, Lim MC, Cho H. Development of an Automatic Rule-Based Algorithm for the Detection of Ovarian Cancer Recurrence From Electronic Health Records. JCO Clin Cancer Inform 2024; 8:e2300150. [PMID: 38442323 PMCID: PMC10927333 DOI: 10.1200/cci.23.00150] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/14/2023] [Revised: 12/18/2023] [Accepted: 01/10/2024] [Indexed: 03/07/2024]
Abstract
PURPOSE As the onset of cancer recurrence is not explicitly recorded in the electronic health record (EHR), a high volume of manual chart review is required to detect the cancer recurrence. This study aims to develop an automatic rule-based algorithm for detecting ovarian cancer (OC) recurrence on the basis of minimally preprocessed EHR data. METHODS The automatic rule-based recurrence detection algorithm (Auto-Recur), using notes on image reading (positron emission tomography-computed tomography [PET-CT], CT, magnetic resonance imaging [MRI]), biomarker (CA125), and treatment information (surgery, chemotherapy, radiotherapy), was developed to detect the first OC recurrence. Auto-Recur contains three single algorithms (images, biomarkers, treatments) and hybrid algorithms (combinations of the single algorithms). The performance of Auto-Recur was assessed using sensitivity, specificity, and accuracy of the recurrence time detected. The recurrence-free survival probabilities were estimated and compared with the retrospective chart review results. RESULTS The proposed Auto-Recur considerably reduced human resources and time; it saved approximately 1,340 days when scaled to 100,000 patients compared with the conventional retrospective chart review. The hybrid algorithm on the basis of a combination of image, biomarker, and treatment information was the most efficient (sensitivity: 93.4%, specificity: 97.4%) and precisely captured recurrence time (average time error: 8.5 days). The estimated 3-year recurrence-free survival probability (44%) was close to the estimates by the retrospective chart review (45%, log-rank P value = .894). CONCLUSION Our rule-based algorithm effectively captured the first OC recurrence from large-scale EHR while closely approximating the recurrence-free survival estimates obtained by conventional retrospective chart reviews. The study findings facilitate large-scale EHR analysis, enhancing clinical research opportunities.
Affiliation(s)
- Sanghee Lee
  - Department of Cancer Control & Population Health, National Cancer Center Graduate School of Cancer Science and Policy, Goyang, Republic of Korea
  - Health Insurance Research Institute, National Health Insurance Service, Wonju, Republic of Korea
- Ji Hyun Kim
  - Center for Gynecologic Cancer, Research Institute and Hospital, National Cancer Center, Goyang, Republic of Korea
- Hyeong In Ha
  - Department of Obstetrics and Gynecology, Pusan National University Yangsan Hospital, Pusan National University School of Medicine, Yangsan, Korea
- Myong Cheol Lim
  - Department of Cancer Control & Population Health, National Cancer Center Graduate School of Cancer Science and Policy, Goyang, Republic of Korea
  - Center for Gynecologic Cancer, Research Institute and Hospital, National Cancer Center, Goyang, Republic of Korea
  - Rare and Pediatric Cancer Branch and Immuno-oncology Branch, Division of Rare and Refractory Cancer, Research Institute, National Cancer Center, Goyang, Republic of Korea
  - Center for Clinical Trials, Hospital, National Cancer Center, Goyang, Republic of Korea
- Hyunsoon Cho
  - Department of Cancer Control & Population Health, National Cancer Center Graduate School of Cancer Science and Policy, Goyang, Republic of Korea
  - Department of Cancer AI and Digital Health, National Cancer Center Graduate School of Cancer Science and Policy, National Cancer Center, Goyang, South Korea
  - Integrated Biostatistics Branch, Division of Cancer Data Science, Research Institute, National Cancer Center, Goyang, Republic of Korea
6
Aiello Bowles EJ, Kroenke CH, Chubak J, Bhimani J, O'Connell K, Brandzel S, Valice E, Doud R, Theis MK, Roh JM, Heon N, Persaud S, Griggs JJ, Bandera EV, Kushi LH, Kantor ED. Evaluation of Algorithms Using Automated Health Plan Data to Identify Breast Cancer Recurrences. Cancer Epidemiol Biomarkers Prev 2024; 33:355-364. [PMID: 38088912 PMCID: PMC10922110 DOI: 10.1158/1055-9965.epi-23-0782] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/18/2023] [Revised: 11/20/2023] [Accepted: 12/11/2023] [Indexed: 02/12/2024]
Abstract
BACKGROUND We updated algorithms to identify breast cancer recurrences from administrative data, extending previously developed methods. METHODS In this validation study, we evaluated pairs of breast cancer recurrence algorithms (vs. individual algorithms) to identify recurrences. We generated algorithm combinations that categorized discordant algorithm results as no recurrence [High Specificity and PPV (positive predictive value) Combination] or recurrence (High Sensitivity Combination). We compared individual and combined algorithm results to manually abstracted recurrence outcomes from a sample of 600 people with incident stage I-IIIA breast cancer diagnosed between 2004 and 2015. We used Cox regression to evaluate risk factors associated with age- and stage-adjusted recurrence rates using different recurrence definitions, weighted by inverse sampling probabilities. RESULTS Among 600 people, we identified 117 recurrences using the High Specificity and PPV Combination, 505 using the High Sensitivity Combination, and 118 using manual abstraction. The High Specificity and PPV Combination had good specificity [98%, 95% confidence interval (CI): 97-99] and PPV (72%, 95% CI: 63-80) but modest sensitivity (64%, 95% CI: 44-80). The High Sensitivity Combination had good sensitivity (80%, 95% CI: 49-94) and specificity (83%, 95% CI: 80-86) but low PPV (29%, 95% CI: 25-34). Recurrence rates using combined algorithms were similar in magnitude for most risk factors. CONCLUSIONS By combining algorithms, we identified breast cancer recurrences with greater PPV than individual algorithms, without additional review of discordant records. IMPACT Researchers should consider tradeoffs between accuracy and manual chart abstraction resources when using previously developed algorithms. We provided guidance for future studies that use breast cancer recurrence algorithms with or without supplemental manual chart abstraction.
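The two combinations described in this abstract differ only in how they resolve disagreement between a pair of algorithms: discordant results become "no recurrence" in the High Specificity and PPV Combination and "recurrence" in the High Sensitivity Combination. A minimal sketch of that rule follows; the function name `combine` and the mode strings are hypothetical, and the underlying algorithms themselves are not reproduced here.

```python
def combine(alg_a: bool, alg_b: bool, mode: str) -> bool:
    """Combine two recurrence flags (True = recurrence) from a pair of algorithms.

    mode "high_specificity": discordant results count as no recurrence (logical AND).
    mode "high_sensitivity": discordant results count as recurrence (logical OR).
    """
    if mode == "high_specificity":
        return alg_a and alg_b
    if mode == "high_sensitivity":
        return alg_a or alg_b
    raise ValueError(f"unknown mode: {mode}")


# The two modes agree when the algorithms agree and differ only on discordant pairs.
for a, b in [(True, True), (True, False), (False, True), (False, False)]:
    print(a, b, combine(a, b, "high_specificity"), combine(a, b, "high_sensitivity"))
```

Requiring agreement can only remove positive calls, which favors specificity and PPV, while accepting either flag can only add them, which favors sensitivity.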
Affiliation(s)
- Erin J Aiello Bowles
  - Kaiser Permanente Washington Health Research Institute, Kaiser Permanente Washington, Seattle, Washington
- Candyce H Kroenke
  - Division of Research, Kaiser Permanente Northern California, Oakland, California
- Jessica Chubak
  - Kaiser Permanente Washington Health Research Institute, Kaiser Permanente Washington, Seattle, Washington
- Jenna Bhimani
  - Department of Epidemiology and Biostatistics, Memorial Sloan Kettering Cancer Center, New York, New York
- Kelli O'Connell
  - Department of Epidemiology and Biostatistics, Memorial Sloan Kettering Cancer Center, New York, New York
- Susan Brandzel
  - Kaiser Permanente Washington Health Research Institute, Kaiser Permanente Washington, Seattle, Washington
- Emily Valice
  - Division of Research, Kaiser Permanente Northern California, Oakland, California
- Rachael Doud
  - Kaiser Permanente Washington Health Research Institute, Kaiser Permanente Washington, Seattle, Washington
- Mary Kay Theis
  - Kaiser Permanente Washington Health Research Institute, Kaiser Permanente Washington, Seattle, Washington
- Janise M Roh
  - Division of Research, Kaiser Permanente Northern California, Oakland, California
- Narre Heon
  - Department of Epidemiology and Biostatistics, Memorial Sloan Kettering Cancer Center, New York, New York
  - Office of Faculty Professional Development, Diversity and Inclusion, Columbia University Irving Medical Center, New York, New York
- Sonia Persaud
  - Department of Epidemiology and Biostatistics, Memorial Sloan Kettering Cancer Center, New York, New York
- Jennifer J Griggs
  - Departments of Internal Medicine, Hematology and Oncology Division, and Health Management and Policy, Institute for Healthcare Policy and Innovation, University of Michigan, Ann Arbor, Michigan
- Elisa V Bandera
  - Cancer Epidemiology and Health Outcomes, Rutgers Cancer Institute of New Jersey, Rutgers, the State University of New Jersey, New Brunswick, New Jersey
- Lawrence H Kushi
  - Division of Research, Kaiser Permanente Northern California, Oakland, California
- Elizabeth D Kantor
  - Department of Epidemiology and Biostatistics, Memorial Sloan Kettering Cancer Center, New York, New York
7
Shahzad M, Rafi M, Alhalabi W, Minaz Ali N, Anwar MS, Jamal S, Barket Ali M, Alqurashi FA. Classification of clinically actionable genetic mutations in cancer patients. Front Mol Biosci 2024; 10:1277862. [PMID: 38274098 PMCID: PMC10808303 DOI: 10.3389/fmolb.2023.1277862] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/21/2023] [Accepted: 12/20/2023] [Indexed: 01/27/2024]
Abstract
Personalized medicine in cancer treatment aims to treat each individual's tumor based on the patient's genetic sequence, and it is much more effective than traditional methods, which treat each type of cancer in the same, generic manner. However, personalized treatment requires the classification of cancer-related genes once profiled, a highly labor-intensive and time-consuming task for pathologists that has slowed the adoption of personalized medicine worldwide. In this paper, we propose an intelligent multi-class classifier system that uses a combination of Natural Language Processing (NLP) techniques and Machine Learning algorithms to automatically classify clinically actionable genetic mutations using evidence from text-based medical literature. The training data set for the classifier was obtained from the Memorial Sloan Kettering Cancer Center, and the Random Forest algorithm was applied with TF-IDF for feature extraction and truncated SVD for dimensionality reduction. The results show that the proposed model outperforms previous research in terms of accuracy and precision scores, giving an accuracy score of approximately 82%. The system has the potential to revolutionize cancer treatment and lead to significant improvements in cancer therapy.
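To make the feature-extraction step concrete, the sketch below computes TF-IDF weights in plain Python. It covers only that step (the truncated SVD and Random Forest stages are not shown), it uses the textbook weighting tf × log(N/df), which may differ from the variant the authors used, and the function name `tfidf` and the two example sentences are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List, Sequence


def tfidf(docs: Sequence[str]) -> List[Dict[str, float]]:
    """Textbook TF-IDF: term frequency times log(N / document frequency)."""
    tokenized = [doc.lower().split() for doc in docs]   # naive whitespace tokenizer
    n_docs = len(tokenized)
    doc_freq: Counter = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))                    # count each term once per document
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors


# Hypothetical evidence snippets: a term shared by every document gets weight 0.
vectors = tfidf(["brca1 mutation pathogenic", "brca1 variant benign"])
print(vectors[0])
```

Terms that appear in every document carry no discriminating information under this weighting, which is why the shared gene name drops out while the distinguishing words keep a positive weight.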
Affiliation(s)
- Muhammad Shahzad
  - National University of Computer and Emerging Sciences, Karachi, Pakistan
- Muhammad Rafi
  - National University of Computer and Emerging Sciences, Karachi, Pakistan
- Wadee Alhalabi
  - Department of Computer Science, Immersive Virtual Reality Research Group, King Abdulaziz University, Jeddah, Saudi Arabia
  - Department of Computer Science, HECI School, Dar Alhekma University, Jeddah, Saudi Arabia
- Naz Minaz Ali
  - National University of Computer and Emerging Sciences, Karachi, Pakistan
- Sara Jamal
  - National University of Computer and Emerging Sciences, Karachi, Pakistan
- Muskan Barket Ali
  - National University of Computer and Emerging Sciences, Karachi, Pakistan
- Fahad Abdullah Alqurashi
  - Department of Computer Science, Faculty of Computing and Information Technology, Jeddah, Saudi Arabia
8
Simoulin A, Thiebaut N, Neuberger K, Ibnouhsein I, Brunel N, Viné R, Bousquet N, Latapy J, Reix N, Molière S, Lodi M, Mathelin C. From free-text electronic health records to structured cohorts: Onconum, an innovative methodology for real-world data mining in breast cancer. Comput Methods Programs Biomed 2023; 240:107693. [PMID: 37453367 DOI: 10.1016/j.cmpb.2023.107693] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/30/2022] [Revised: 05/25/2023] [Accepted: 06/23/2023] [Indexed: 07/18/2023]
Abstract
PURPOSE A considerable amount of valuable information is present in electronic health records (EHRs); however, it remains inaccessible because it is embedded in unstructured narrative documents that cannot be easily analyzed. We aimed to develop and evaluate a methodology able to extract and structure information from electronic health records in breast cancer. METHODS We developed a software platform called Onconum (ClinicalTrials.gov Identifier: NCT02810093), which uses a hybrid method relying on machine learning approaches and rule-based lexical methods. It is based on natural language processing techniques that allow a targeted analysis of free-text medical data related to breast cancer, independently of any pre-existing dictionary, in a French context (available in N files). We then evaluated it on a validation cohort called Senometry. FINDINGS The Senometry cohort included 9,599 patients with breast cancer (both invasive and in situ), treated between 2000 and 2017 in the breast cancer unit of Strasbourg University Hospitals. Extraction rates ranged from 45% to 100%, depending on the type of parameter. Precision of extracted information was 68%-94% compared to a structured cohort and 89%-98% compared to manually structured databases, and the method retrieved more rare occurrences than another database search engine (+17%). INTERPRETATION This innovative method can accurately structure relevant medical information embedded in EHRs in the context of breast cancer. Missing data handling is the main limitation of this method; however, multiple sources can be incorporated to reduce this limitation. Nevertheless, this methodology needs neither pre-existing dictionaries nor manually annotated corpora. It can therefore be easily implemented in non-English-speaking countries and for diseases other than breast cancer, and it allows prospective inclusion of new patients.
Affiliation(s)
- Nicolas Bousquet
  - Quantmetry, 52 rue d'Anjou, 75008 Paris, France
  - Sorbonne University, 4 place Jussieu, 75005 Paris, France
- Nathalie Reix
  - ICube UMR 7537, Strasbourg University / CNRS, Fédération de Médecine Translationnelle de Strasbourg, 67200 Strasbourg, France
  - Biochemistry and Molecular Biology Laboratory, Strasbourg University Hospitals, 1 place de l'Hôpital, 67091 Strasbourg, France
- Sébastien Molière
  - Radiology Department, Strasbourg University Hospitals, 1 avenue Molière, 67098 Strasbourg, France
- Massimo Lodi
  - Institut de cancérologie Strasbourg Europe (ICANS), 17 avenue Albert Calmette, 67033 Strasbourg Cedex, France
  - Department of Functional Genomics and Cancer, Institut de Génétique et de Biologie Moléculaire et Cellulaire, CNRS UMR 7104, INSERM U964, Strasbourg University, Illkirch, France
  - Strasbourg University Hospitals, 1 place de l'Hôpital, 67091 Strasbourg, France
- Carole Mathelin
  - Institut de cancérologie Strasbourg Europe (ICANS), 17 avenue Albert Calmette, 67033 Strasbourg Cedex, France
  - Department of Functional Genomics and Cancer, Institut de Génétique et de Biologie Moléculaire et Cellulaire, CNRS UMR 7104, INSERM U964, Strasbourg University, Illkirch, France
  - Strasbourg University Hospitals, 1 place de l'Hôpital, 67091 Strasbourg, France
9
Adamson B, Waskom M, Blarre A, Kelly J, Krismer K, Nemeth S, Gippetti J, Ritten J, Harrison K, Ho G, Linzmayer R, Bansal T, Wilkinson S, Amster G, Estola E, Benedum CM, Fidyk E, Estévez M, Shapiro W, Cohen AB. Approach to machine learning for extraction of real-world data variables from electronic health records. Front Pharmacol 2023; 14:1180962. [PMID: 37781703 PMCID: PMC10541019 DOI: 10.3389/fphar.2023.1180962] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 03/06/2023] [Accepted: 08/25/2023] [Indexed: 10/03/2023]
Abstract
Background: As artificial intelligence (AI) continues to advance with breakthroughs in natural language processing (NLP) and machine learning (ML), such as the development of models like OpenAI's ChatGPT, new opportunities are emerging for efficient curation of electronic health records (EHR) into real-world data (RWD) for evidence generation in oncology. Our objective is to describe the research and development of industry methods to promote transparency and explainability. Methods: We applied NLP with ML techniques to train, validate, and test the extraction of information from unstructured documents (e.g., clinician notes, radiology reports, lab reports, etc.) to output a set of structured variables required for RWD analysis. This research used a nationwide electronic health record (EHR)-derived database. Models were selected based on performance. Variables curated with an approach using ML extraction are those where the value is determined solely based on an ML model (i.e. not confirmed by abstraction), which identifies key information from visit notes and documents. These models do not predict future events or infer missing information. Results: We developed an approach using NLP and ML for extraction of clinically meaningful information from unstructured EHR documents and found high performance of output variables compared with variables curated by manually abstracted data. These extraction methods resulted in research-ready variables including initial cancer diagnosis with date, advanced/metastatic diagnosis with date, disease stage, histology, smoking status, surgery status with date, biomarker test results with dates, and oral treatments with dates. Conclusion: NLP and ML enable the extraction of retrospective clinical data in EHR with speed and scalability to help researchers learn from the experience of every person with cancer.
Affiliation(s)
- Blythe Adamson
  - Flatiron Health, Inc., New York, NY, United States
  - The Comparative Health Outcomes, Policy, and Economics (CHOICE) Institute, Department of Pharmacy, University of Washington, Seattle, WA, United States
- John Ritten
  - Flatiron Health, Inc., New York, NY, United States
- George Ho
  - Flatiron Health, Inc., New York, NY, United States
- Tarun Bansal
  - Flatiron Health, Inc., New York, NY, United States
- Guy Amster
  - Flatiron Health, Inc., New York, NY, United States
- Evan Estola
  - Flatiron Health, Inc., New York, NY, United States
- Erin Fidyk
  - Flatiron Health, Inc., New York, NY, United States
- Will Shapiro
  - Flatiron Health, Inc., New York, NY, United States
- Aaron B. Cohen
  - Flatiron Health, Inc., New York, NY, United States
  - Department of Medicine, NYU Grossman School of Medicine, New York, NY, United States
10
González-Castro L, Chávez M, Duflot P, Bleret V, Martin AG, Zobel M, Nateqi J, Lin S, Pazos-Arias JJ, Del Fiol G, López-Nores M. Machine Learning Algorithms to Predict Breast Cancer Recurrence Using Structured and Unstructured Sources from Electronic Health Records. Cancers (Basel) 2023; 15:2741. [PMID: 37345078 DOI: 10.3390/cancers15102741] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Received: 03/28/2023] [Revised: 04/26/2023] [Accepted: 05/06/2023] [Indexed: 06/23/2023]
Abstract
Recurrence is a critical aspect of breast cancer (BC) that is inexorably tied to mortality. Reuse of healthcare data through Machine Learning (ML) algorithms offers great opportunities to improve the stratification of patients at risk of cancer recurrence. We hypothesized that combining features from structured and unstructured sources would provide better prediction results for 5-year cancer recurrence than either source alone. We collected and preprocessed clinical data from a cohort of BC patients, resulting in 823 valid subjects for analysis. We derived three sets of features: structured information, features from free text, and a combination of both. We evaluated the performance of five ML algorithms to predict 5-year cancer recurrence and selected the best-performing one to test our hypothesis. The XGB (eXtreme Gradient Boosting) model yielded the best performance among the five evaluated algorithms, with precision = 0.900, recall = 0.907, F1-score = 0.897, and area under the receiver operating characteristic curve (AUROC) = 0.807. The best prediction results were achieved with the structured dataset, followed by the unstructured dataset, while the combined dataset achieved the poorest performance. ML algorithms for BC recurrence prediction are valuable tools to improve patient risk stratification, help with post-cancer monitoring, and plan more effective follow-up. Structured data provides the best results when fed to ML algorithms. However, an approach based on natural language processing offers comparable results while potentially requiring less mapping effort.
Affiliation(s)
- Marcela Chávez
- Department of Information System Management, Centre Hospitalier Universitaire de Liège, 4000 Liège, Belgium
- Patrick Duflot
- Department of Information System Management, Centre Hospitalier Universitaire de Liège, 4000 Liège, Belgium
- Valérie Bleret
- Senology Department, Centre Hospitalier Universitaire de Liège, 4000 Liège, Belgium
- Marc Zobel
- Science Department, Symptoma GmbH, 1030 Vienna, Austria
- Jama Nateqi
- Science Department, Symptoma GmbH, 1030 Vienna, Austria
- Department of Internal Medicine, Paracelsus Medical University, 5020 Salzburg, Austria
- Simon Lin
- Science Department, Symptoma GmbH, 1030 Vienna, Austria
- Department of Internal Medicine, Paracelsus Medical University, 5020 Salzburg, Austria
- José J Pazos-Arias
- atlanTTic Research Center, Department of Telematics Engineering, University of Vigo, 36310 Vigo, Spain
- Guilherme Del Fiol
- Department of Biomedical Informatics, University of Utah School of Medicine, Salt Lake City, UT 84108, USA
- Martín López-Nores
- atlanTTic Research Center, Department of Telematics Engineering, University of Vigo, 36310 Vigo, Spain
|
11
|
Diab KM, Deng J, Wu Y, Yesha Y, Collado-Mesa F, Nguyen P. Natural Language Processing for Breast Imaging: A Systematic Review. Diagnostics (Basel) 2023; 13:1420. [PMID: 37189521 DOI: 10.3390/diagnostics13081420] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/16/2023] [Revised: 04/05/2023] [Accepted: 04/11/2023] [Indexed: 05/17/2023] Open
Abstract
Natural Language Processing (NLP) has gained prominence in diagnostic radiology, offering a promising tool for improving breast imaging triage, diagnosis, lesion characterization, and treatment management in breast cancer and other breast diseases. This review provides a comprehensive overview of recent advances in NLP for breast imaging, covering the main techniques and applications in this field. Specifically, we discuss various NLP methods used to extract relevant information from clinical notes, radiology reports, and pathology reports and their potential impact on the accuracy and efficiency of breast imaging. In addition, we reviewed the state-of-the-art in NLP-based decision support systems for breast imaging, highlighting the challenges and opportunities of NLP applications for breast imaging in the future. Overall, this review underscores the potential of NLP in enhancing breast imaging care and offers insights for clinicians and researchers interested in this exciting and rapidly evolving field.
Affiliation(s)
- Kareem Mahmoud Diab
- Institute for Data Science and Computing, University of Miami, Miami, FL 33146, USA
- Jamie Deng
- Department of Computer Science, University of Miami, Miami, FL 33146, USA
- Yusen Wu
- Institute for Data Science and Computing, University of Miami, Miami, FL 33146, USA
- Yelena Yesha
- Institute for Data Science and Computing, University of Miami, Miami, FL 33146, USA
- Department of Computer Science, University of Miami, Miami, FL 33146, USA
- Department of Radiology, Miller School of Medicine, University of Miami, Miami, FL 33146, USA
- Fernando Collado-Mesa
- Department of Radiology, Miller School of Medicine, University of Miami, Miami, FL 33146, USA
- Phuong Nguyen
- Institute for Data Science and Computing, University of Miami, Miami, FL 33146, USA
- Department of Computer Science, University of Miami, Miami, FL 33146, USA
- OpenKnect Inc., Halethorpe, MD 21227, USA
|
12
|
Habbous S, Barisic A, Homenauth E, Kandasamy S, Forster K, Eisen A, Holloway C. Estimating the incidence of breast cancer recurrence using administrative data. Breast Cancer Res Treat 2023; 198:509-522. [PMID: 36422755 DOI: 10.1007/s10549-022-06812-z] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 09/20/2022] [Accepted: 11/09/2022] [Indexed: 11/25/2022]
Abstract
BACKGROUND Breast cancer is the most common cancer among women, but most cancer registries do not capture recurrences. We estimated the incidence of local, regional, and distant recurrences using administrative data. METHODS Patients diagnosed with stage I-III primary breast cancer in Ontario, Canada from 2013 to 2017 were included. Patients were followed until 31 December 2021, death, or a new primary cancer diagnosis. We used hospital administrative data (diagnostic and intervention codes) to identify local recurrence, regional recurrence, and distant metastasis after primary diagnosis. We used logistic regression to explore factors associated with developing a distant metastasis. RESULTS With a median follow-up of 67 months, 5,431 of 45,857 patients (11.8%) developed a distant metastasis a median of 23 (9, 42) months after diagnosis of the primary tumor. 1,086 (2.4%) and 1,069 (2.3%) patients developed an isolated regional or a local recurrence, respectively. Patients with distant metastatic disease had a median overall survival of 15.4 months (95% CI 14.4-16.4 months) from the time recurrence/metastasis was identified. In contrast, the median survival for all other patients was not reached. Patients were more likely to develop a distant metastasis if they had more advanced stage, greater comorbidity, and presented with symptoms (p < 0.0001). Trastuzumab halved the risk of recurrence [OR 0.53 (0.45-0.63), p < 0.0001]. CONCLUSION Distant metastasis is not a rare outcome for patients diagnosed with breast cancer, translating to an annual incidence of 2,132 new cases (17.8% of all breast cancer diagnoses). Overall survival remains high for patients with locoregional recurrences, but was poor following a diagnosis of a distant metastasis.
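The recurrence percentages quoted in the RESULTS can be reproduced directly from the reported counts. The following is only an arithmetic check on the figures in the abstract, not the authors' analysis code; the variable names are illustrative assumptions.

```python
# Counts as reported in the abstract (cohort of stage I-III patients).
cohort_size = 45857
events = {
    "distant_metastasis": 5431,
    "regional_recurrence": 1086,
    "local_recurrence": 1069,
}

# Share of the cohort with each outcome, rounded to one decimal place
# to match the precision used in the abstract.
percentages = {
    name: round(100 * count / cohort_size, 1)
    for name, count in events.items()
}

for name, pct in percentages.items():
    print(f"{name}: {pct}%")
```

The rounded values agree with the 11.8%, 2.4%, and 2.3% given in the abstract.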
Affiliation(s)
- Steven Habbous
- Ontario Health (Cancer Care Ontario), 525 University Ave, Toronto, ON, M5G2L3, Canada
- Department of Epidemiology & Biostatistics, Western University, London, ON, N6A 5C1, Canada
- Andriana Barisic
- Ontario Health (Cancer Care Ontario), 525 University Ave, Toronto, ON, M5G2L3, Canada
- Esha Homenauth
- Ontario Health (Cancer Care Ontario), 525 University Ave, Toronto, ON, M5G2L3, Canada
- Sharmilaa Kandasamy
- Ontario Health (Cancer Care Ontario), 525 University Ave, Toronto, ON, M5G2L3, Canada
- Katharina Forster
- Ontario Health (Cancer Care Ontario), 525 University Ave, Toronto, ON, M5G2L3, Canada
- Andrea Eisen
- Ontario Health (Cancer Care Ontario), 525 University Ave, Toronto, ON, M5G2L3, Canada
- Department of Medical Oncology, Sunnybrook Health Sciences Centre, Toronto, ON, M4Y 1H1, Canada
- Claire Holloway
- Ontario Health (Cancer Care Ontario), 525 University Ave, Toronto, ON, M5G2L3, Canada
- Department of Surgery, University of Toronto, Toronto, ON, M5T1P5, Canada
|
13
|
Bhakar S, Sinwar D, Pradhan N, Dhaka VS, Cherrez-Ojeda I, Parveen A, Hassan MU. Computational Intelligence-Based Disease Severity Identification: A Review of Multidisciplinary Domains. Diagnostics (Basel) 2023; 13:1212. [PMID: 37046431 PMCID: PMC10093052 DOI: 10.3390/diagnostics13071212] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/29/2023] [Revised: 03/06/2023] [Accepted: 03/08/2023] [Indexed: 04/14/2023] Open
Abstract
Disease severity identification using computational intelligence-based approaches is gaining popularity. Artificial intelligence and deep-learning-assisted approaches are proving to be significant in the rapid and accurate diagnosis of several diseases. In addition to disease identification, these approaches have the potential to identify the severity of a disease. The problem of disease severity identification can be considered multi-class classification, where the class labels are the severity levels of the disease. Researchers have presented many computational intelligence-based solutions for severity identification. This paper presents a comprehensive review of recent approaches for identifying disease severity levels using computational intelligence-based approaches. We followed the PRISMA guidelines and compiled works from the last decade on the severity identification of diseases across multiple disciplines from well-known publishers, such as MDPI, Springer, IEEE, and Elsevier. This article focuses on the severity identification of two main diseases, viz. Parkinson's Disease and Diabetic Retinopathy. However, severity identification of a few other diseases, such as COVID-19, autonomic nervous system dysfunction, tuberculosis, sepsis, sleep apnea, psychosis, traumatic brain injury, breast cancer, knee osteoarthritis, and Alzheimer's disease, was also briefly covered. Each work has been carefully examined with respect to its methodology, the dataset used, the type of disease, and several performance metrics, such as accuracy and specificity. In addition, we present a few public repositories that can be used to conduct research on disease severity identification. We hope that this review not only acts as a compendium but also provides insights to researchers working on disease severity identification using computational intelligence-based approaches.
Affiliation(s)
- Suman Bhakar
- Department of Computer and Communication Engineering, Manipal University Jaipur, Dehmi Kalan, Jaipur 303007, Rajasthan, India
- Deepak Sinwar
- Department of Computer and Communication Engineering, Manipal University Jaipur, Dehmi Kalan, Jaipur 303007, Rajasthan, India
- Nitesh Pradhan
- Department of Computer Science and Engineering, Manipal University Jaipur, Dehmi Kalan, Jaipur 303007, Rajasthan, India
- Vijaypal Singh Dhaka
- Department of Computer and Communication Engineering, Manipal University Jaipur, Dehmi Kalan, Jaipur 303007, Rajasthan, India
- Ivan Cherrez-Ojeda
- Allergy and Pulmonology, Espíritu Santo University, Samborondón 0901-952, Ecuador
- Amna Parveen
- College of Pharmacy, Gachon University, Medical Campus, No. 191, Hambakmoero, Yeonsu-gu, Incheon 21936, Republic of Korea
- Muhammad Umair Hassan
- Department of ICT and Natural Sciences, Norwegian University of Science and Technology (NTNU), 6009 Ålesund, Norway
|
14
|
Puts S, Nobel M, Zegers C, Bermejo I, Robben S, Dekker A. How Natural Language Processing Can Aid With Pulmonary Oncology Tumor Node Metastasis Staging From Free-Text Radiology Reports: Algorithm Development and Validation. JMIR Form Res 2023; 7:e38125. [PMID: 36947118 PMCID: PMC10131747 DOI: 10.2196/38125] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 03/20/2022] [Revised: 09/25/2022] [Accepted: 12/22/2022] [Indexed: 03/23/2023] Open
Abstract
BACKGROUND Natural language processing (NLP) is thought to be a promising solution to extract and store concepts from free text in a structured manner for data mining purposes. This is also true for radiology reports, which still consist mostly of free text. Accurate and complete reports are very important for clinical decision support, for instance, in oncological staging. As such, NLP can be a tool to structure the content of the radiology report, thereby increasing the report's value. OBJECTIVE This study describes the implementation and validation of an N-stage classifier for pulmonary oncology. It is based on free-text radiological chest computed tomography reports according to the tumor, node, and metastasis (TNM) classification, which has been added to the already existing T-stage classifier to create a combined TN-stage classifier. METHODS SpaCy, PyContextNLP, and regular expressions were used for proper information extraction, after additional rules were set to accurately extract N-stage. RESULTS The overall TN-stage classifier accuracy scores were 0.84 and 0.85, respectively, for the training (N=95) and validation (N=97) sets. This is comparable to the outcomes of the T-stage classifier (0.87-0.92). CONCLUSIONS This study shows that NLP has potential in classifying pulmonary oncology from free-text radiological reports according to the TNM classification system as both the T- and N-stages can be extracted with high accuracy.
Affiliation(s)
- Sander Puts
- GROW School for Oncology and Reproduction, Maastricht University Medical Centre+, Maastricht, Netherlands
- Department of Radiation Oncology, Maastro, Maastricht, Netherlands
- Martijn Nobel
- School of Health Professions Education, Maastricht University, Maastricht, Netherlands
- Department of Radiology and Nuclear Medicine, Maastricht University Medical Center+, Maastricht, Netherlands
- Catharina Zegers
- GROW School for Oncology and Reproduction, Maastricht University Medical Centre+, Maastricht, Netherlands
- Department of Radiation Oncology, Maastro, Maastricht, Netherlands
- Iñigo Bermejo
- GROW School for Oncology and Reproduction, Maastricht University Medical Centre+, Maastricht, Netherlands
- Simon Robben
- School of Health Professions Education, Maastricht University, Maastricht, Netherlands
- Department of Radiology and Nuclear Medicine, Maastricht University Medical Center+, Maastricht, Netherlands
- Andre Dekker
- GROW School for Oncology and Reproduction, Maastricht University Medical Centre+, Maastricht, Netherlands
- Department of Radiation Oncology, Maastro, Maastricht, Netherlands
|
15
|
Wi S, Goldhoff PE, Fuller LA, Grewal K, Wentzensen N, Clarke MA, Lorey TS. Using Natural Language Processing to Improve Discrete Data Capture From Interpretive Cervical Biopsy Diagnoses at a Large Health Care Organization. Arch Pathol Lab Med 2023; 147:222-226. [PMID: 35390126 DOI: 10.5858/arpa.2021-0410-oa] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Accepted: 12/15/2021] [Indexed: 02/05/2023]
Abstract
CONTEXT.— The terminology used by pathologists to describe and grade dysplasia and premalignant changes of the cervical epithelium has evolved over time. Unfortunately, coexistence of different classification systems combined with nonstandardized interpretive text has created multiple layers of interpretive ambiguity. OBJECTIVE.— To use natural language processing (NLP) to automate and expedite translation of interpretive text to a single most severe, and thus actionable, cervical intraepithelial neoplasia (CIN) diagnosis. DESIGN.— We developed and applied NLP algorithms to 35 847 unstructured cervical pathology reports and assessed NLP performance in identifying the most severe diagnosis, compared to expert manual review. NLP performance was determined by calculating precision, recall, and F score. RESULTS.— The NLP algorithms yielded a precision of 0.957, a recall of 0.925, and an F score of 0.94. Additionally, we estimated that the time to evaluate each monthly biopsy file was significantly reduced, from 30 hours to 0.5 hours. CONCLUSIONS.— A set of validated NLP algorithms applied to pathology reports can rapidly and efficiently assign a discrete, actionable diagnosis using CIN classification to assist with clinical management of cervical pathology and disease. Moreover, discrete diagnostic data encoded as CIN terminology can enhance the efficiency of clinical research.
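The reported F score is consistent with the harmonic mean of the reported precision and recall. The sketch below assumes the abstract's "F score" is the balanced F1 (the abstract does not state the weighting), and the function and variable names are illustrative.

```python
def f_score(precision: float, recall: float, beta: float = 1.0) -> float:
    """F-beta score; beta = 1 gives the harmonic mean of precision and recall."""
    beta_sq = beta * beta
    denominator = beta_sq * precision + recall
    if denominator == 0:
        return 0.0
    return (1 + beta_sq) * precision * recall / denominator


# Values as reported in the abstract.
reported_precision = 0.957
reported_recall = 0.925

# Assumption: the abstract's "F score" is F1 (beta = 1).
f1 = f_score(reported_precision, reported_recall)

# Reported review time per monthly biopsy file: 30 hours before, 0.5 hours after.
speedup = 30 / 0.5

print(f"F1 = {f1:.2f}, speedup = {speedup:.0f}x")
```

Under that assumption the computed value rounds to the 0.94 given in the abstract, and the reported time saving corresponds to a 60-fold reduction.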
Affiliation(s)
- Soora Wi
- From Kaiser Permanente, TPMG Regional Laboratories, Berkeley, California (Wi, Goldhoff, Fuller, Grewal, Lorey)
- Patricia E Goldhoff
- From Kaiser Permanente, TPMG Regional Laboratories, Berkeley, California (Wi, Goldhoff, Fuller, Grewal, Lorey)
- Laurie A Fuller
- From Kaiser Permanente, TPMG Regional Laboratories, Berkeley, California (Wi, Goldhoff, Fuller, Grewal, Lorey)
- Kiranjit Grewal
- From Kaiser Permanente, TPMG Regional Laboratories, Berkeley, California (Wi, Goldhoff, Fuller, Grewal, Lorey)
- Nicolas Wentzensen
- From the Division of Cancer Epidemiology and Genetics, National Cancer Institute, Bethesda, Maryland (Wentzensen, Clarke)
- Megan A Clarke
- From the Division of Cancer Epidemiology and Genetics, National Cancer Institute, Bethesda, Maryland (Wentzensen, Clarke)
- Thomas S Lorey
- From Kaiser Permanente, TPMG Regional Laboratories, Berkeley, California (Wi, Goldhoff, Fuller, Grewal, Lorey)
|
16
|
Li C, Weng Y, Zhang Y, Wang B. A Systematic Review of Application Progress on Machine Learning-Based Natural Language Processing in Breast Cancer over the Past 5 Years. Diagnostics (Basel) 2023; 13:537. [PMID: 36766641 PMCID: PMC9913934 DOI: 10.3390/diagnostics13030537] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/20/2023] [Accepted: 01/24/2023] [Indexed: 02/04/2023] Open
Abstract
Artificial intelligence (AI) has been steadily developing in the medical field in the past few years, and AI-based applications have advanced cancer diagnosis. Within oncology, breast cancer generates a massive amount of data. There has been a high level of research enthusiasm to apply AI techniques to assist in breast cancer diagnosis and improve doctors' efficiency. However, the wise utilization of tedious breast cancer-related medical care is still challenging. Over the past few years, AI-based natural language processing (NLP) applications have been increasingly proposed in breast cancer. This systematic review follows the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines and investigates the past five years of literature on NLP-based AI applications. It aims to uncover the recent trends in this area, close the research gap, and help doctors better understand the NLP application pipeline. We first conduct an initial literature search of 202 publications from Scopus, Web of Science, PubMed, Google Scholar, and the Association for Computational Linguistics (ACL) Anthology. Then, we screen the literature based on inclusion and exclusion criteria. Next, we categorize and analyze the advantages and disadvantages of the different machine learning models. We also discuss the current challenges, such as the lack of a public dataset. Furthermore, we suggest some promising future directions, including semi-supervised learning, active learning, and transfer learning.
Affiliation(s)
- Chengtai Li
- School of Computer Science, Faculty of Science and Engineering, University of Nottingham Ningbo China, Ningbo 315100, China
- Ying Weng
- School of Computer Science, Faculty of Science and Engineering, University of Nottingham Ningbo China, Ningbo 315100, China
- Correspondence:
- Yiming Zhang
- School of Computer Science, Faculty of Science and Engineering, University of Nottingham Ningbo China, Ningbo 315100, China
- Boding Wang
- Hwa Mei Hospital, University of Chinese Academy of Sciences, Ningbo 315010, China
|
17
|
Natural Language Processing Applications for Computer-Aided Diagnosis in Oncology. Diagnostics (Basel) 2023; 13:286. [PMID: 36673096 PMCID: PMC9857980 DOI: 10.3390/diagnostics13020286] [Citation(s) in RCA: 18] [Impact Index Per Article: 9.0] [Received: 10/15/2022] [Revised: 12/24/2022] [Accepted: 01/05/2023] [Indexed: 01/15/2023] Open
Abstract
In the era of big data, text-based medical data, such as electronic health records (EHR) and electronic medical records (EMR), are growing rapidly. EHR and EMR are collected from patients to record their basic information, lab tests, vital signs, clinical notes, and reports. EHR and EMR contain helpful information to assist oncologists in computer-aided diagnosis and decision making. However, it is time-consuming for doctors to extract and analyze the valuable information they need from the EHR and EMR data. Recently, a growing number of studies have applied natural language processing (NLP) techniques, i.e., rule-based, machine learning-based, and deep learning-based techniques, to EHR and EMR data for computer-aided diagnosis in oncology. The objective of this narrative review is to survey the recent progress in NLP applications for computer-aided diagnosis in oncology. Moreover, we intend to reduce the research gap between artificial intelligence (AI) experts and clinical specialists to design better NLP applications. We originally identified 295 articles from three electronic databases: PubMed, Google Scholar, and ACL Anthology; then, we removed the duplicated papers and manually screened out the irrelevant papers based on the content of the abstract; finally, we included a total of 23 articles after the screening process. Furthermore, we provided an in-depth analysis and categorized these studies into seven cancer types: breast cancer, lung cancer, liver cancer, prostate cancer, pancreatic cancer, colorectal cancer, and brain tumors. Additionally, we identified the current limitations of NLP applications in supporting clinical practice and suggest some promising future research directions.
|
18
|
Chan RC, To CKC, Cheng KCT, Yoshikazu T, Yan LLA, Tse GM. Artificial intelligence in breast cancer histopathology. Histopathology 2023; 82:198-210. [PMID: 36482271 DOI: 10.1111/his.14820] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/01/2022] [Revised: 09/22/2022] [Accepted: 09/28/2022] [Indexed: 12/13/2022]
Abstract
This is a review of the use of artificial intelligence for digital breast pathology. A systematic search on PubMed was conducted, identifying 17,324 research papers related to breast cancer pathology. Following a semimanual screening, 664 papers were retrieved and pursued. The papers are grouped into six major tasks performed by pathologists, namely molecular and hormonal analysis, grading, mitotic figure counting, Ki-67 indexing, tumour-infiltrating lymphocyte assessment, and lymph node metastases identification. Under each task, open-source datasets for research to build artificial intelligence (AI) tools are also listed. Many AI tools showed promise and demonstrated feasibility in the automation of routine pathology investigations. We expect continued growth of AI in this field as new algorithms mature.
Affiliation(s)
- Ronald Ck Chan
- Department of Anatomical and Cellular Pathology, Faculty of Medicine, The Chinese University of Hong Kong, Hong Kong, Hong Kong
- Chun Kit Curtis To
- Department of Anatomical and Cellular Pathology, Faculty of Medicine, The Chinese University of Hong Kong, Hong Kong, Hong Kong
- Ka Chuen Tom Cheng
- Department of Anatomical and Cellular Pathology, Faculty of Medicine, The Chinese University of Hong Kong, Hong Kong, Hong Kong
- Tada Yoshikazu
- Department of Anatomical and Cellular Pathology, Faculty of Medicine, The Chinese University of Hong Kong, Hong Kong, Hong Kong
- Lai Ling Amy Yan
- Department of Anatomical and Cellular Pathology, Faculty of Medicine, The Chinese University of Hong Kong, Hong Kong, Hong Kong
- Gary M Tse
- Department of Anatomical and Cellular Pathology, Faculty of Medicine, The Chinese University of Hong Kong, Hong Kong, Hong Kong
|
19
|
Pan LC, Wu XR, Lu Y, Zhang HQ, Zhou YL, Liu X, Liu SL, Yan QY. Artificial intelligence empowered digital health technologies in cancer survivorship care: A scoping review. Asia Pac J Oncol Nurs 2022; 9:100127. [PMID: 36176267 PMCID: PMC9513729 DOI: 10.1016/j.apjon.2022.100127] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 05/26/2022] [Accepted: 07/29/2022] [Indexed: 12/03/2022] Open
Abstract
Objective: The objectives of this systematic review are to describe the features and specific application scenarios of artificial intelligence (AI)-driven digital health technologies (DHTs) in current cancer survivorship care services, and to explore their acceptance and briefly evaluate their feasibility in the application process. Methods: We systematically searched MEDLINE, IEEE Xplore, PubMed, Embase, the Cochrane Central Register of Controlled Trials, and Scopus for literature published from 2010 to 2022. Eligible study types included original research, descriptive studies, randomized controlled trials, pilot studies, and feasibility or acceptability studies. The included literature described the current status and effectiveness of AI-based digital medical technologies used in cancer survivorship care services. Additionally, we used the QuADS quality assessment tool to evaluate the quality of the literature included in this review. Results: 43 studies that met the inclusion criteria were analyzed and qualitatively synthesized. The current status and results related to the application of AI-driven DHTs in cancer survivorship care were reviewed. Most of these studies were designed specifically for breast cancer survivors' care and focused on recurrence or secondary cancer prediction, clinical decision support, cancer survivability prediction, population or treatment stratification, prediction of anti-cancer treatment-induced adverse reactions, and so on. Applying AI-based DHTs to cancer survivors has shown some positive outcomes, including increased motivation for patient-reported outcomes (PROs), reduced fatigue and pain levels, and improved quality of life and physical function. However, current research has mostly explored the technology development and formation (testing) phases, with limited-scale populations and single-center trials. Therefore, it is not yet appropriate to draw conclusions about the effectiveness of AI-based DHTs in supportive cancer care, as most applications are still in the early stages of development and feasibility testing. Conclusions: While digital therapies are promising in the care of cancer patients, more high-quality studies are still needed to demonstrate their effectiveness in cancer care. Studies should explore how to develop uniform standards for measuring patient-related outcomes, ensure the scientific validity of research methods, and emphasize patient and health practitioner involvement in the development and use of technology.
Affiliation(s)
- Lu-Chen Pan
- Department of Nursing, Union Hospital, Tongji Medical College, Huazhong University of Science and Technology, Wuhan 430022, China
- Xiao-Ru Wu
- School of Nursing, Tongji Medical College, Huazhong University of Science and Technology, Wuhan 430030, China
- Ying Lu
- Department of Nursing, Union Hospital, Tongji Medical College, Huazhong University of Science and Technology, Wuhan 430022, China
- School of Nursing, Tongji Medical College, Huazhong University of Science and Technology, Wuhan 430030, China
- Han-Qing Zhang
- Health Science Center, Yangtze University, Jinzhou 434023, China
- Yao-Ling Zhou
- Department of Nursing, Union Hospital, Tongji Medical College, Huazhong University of Science and Technology, Wuhan 430022, China
- School of Nursing, Tongji Medical College, Huazhong University of Science and Technology, Wuhan 430030, China
- Xue Liu
- School of Nursing, Tongji Medical College, Huazhong University of Science and Technology, Wuhan 430030, China
- Sheng-Lin Liu
- Department of Medical Engineering, Union Hospital, Tongji Medical College, Huazhong University of Science and Technology, Wuhan 430022, China
- Qiao-Yuan Yan
- Department of Nursing, Union Hospital, Tongji Medical College, Huazhong University of Science and Technology, Wuhan 430022, China
|
20
|
Li Y, Wu X, Yang P, Jiang G, Luo Y. Machine Learning for Lung Cancer Diagnosis, Treatment, and Prognosis. Genomics Proteomics Bioinformatics 2022; 20:850-866. [PMID: 36462630 PMCID: PMC10025752 DOI: 10.1016/j.gpb.2022.11.003] [Citation(s) in RCA: 66] [Impact Index Per Article: 22.0] [Received: 03/04/2022] [Revised: 10/03/2022] [Accepted: 11/17/2022] [Indexed: 12/03/2022]
Abstract
The recent development of imaging and sequencing technologies enables systematic advances in the clinical study of lung cancer. Meanwhile, the human mind is limited in effectively handling and fully utilizing the accumulation of such enormous amounts of data. Machine learning-based approaches play a critical role in integrating and analyzing these large and complex datasets, which have extensively characterized lung cancer through the use of different perspectives from these accrued data. In this review, we provide an overview of machine learning-based approaches that strengthen the varying aspects of lung cancer diagnosis and therapy, including early detection, auxiliary diagnosis, prognosis prediction, and immunotherapy practice. Moreover, we highlight the challenges and opportunities for future applications of machine learning in lung cancer.
Affiliation(s)
- Yawei Li
- Department of Preventive Medicine, Feinberg School of Medicine, Northwestern University, Chicago, IL 60611, USA
- Xin Wu
- Department of Medicine, University of Illinois at Chicago, Chicago, IL 60612, USA
- Ping Yang
- Department of Quantitative Health Sciences, Mayo Clinic, Rochester, MN 55905 / Scottsdale, AZ 85259, USA
- Guoqian Jiang
- Department of Artificial Intelligence and Informatics, Mayo Clinic, Rochester, MN 55905, USA
- Yuan Luo
- Department of Preventive Medicine, Feinberg School of Medicine, Northwestern University, Chicago, IL 60611, USA
|
21
|
Jiang X, Xu C. Deep Learning and Machine Learning with Grid Search to Predict Later Occurrence of Breast Cancer Metastasis Using Clinical Data. J Clin Med 2022; 11:5772. [PMID: 36233640 PMCID: PMC9570670 DOI: 10.3390/jcm11195772] [Citation(s) in RCA: 31] [Impact Index Per Article: 10.3] [Received: 06/22/2022] [Revised: 07/30/2022] [Accepted: 09/21/2022] [Indexed: 11/16/2022] Open
Abstract
Background: It is important to be able to predict, for each individual patient, the likelihood of later metastatic occurrence, because the prediction can guide treatment plans tailored to a specific patient to prevent metastasis and to help avoid under-treatment or over-treatment. Deep neural network (DNN) learning, commonly referred to as deep learning, has become popular due to its success in image detection and prediction, but questions such as whether deep learning outperforms other machine learning methods when using non-image clinical data remain unanswered. Grid search has been introduced to deep learning hyperparameter tuning for the purpose of improving its prediction performance, but the effect of grid search on other machine learning methods are under-studied. In this research, we take the empirical approach to study the performance of deep learning and other machine learning methods when using non-image clinical data to predict the occurrence of breast cancer metastasis (BCM) 5, 10, or 15 years after the initial treatment. We developed prediction models using the deep feedforward neural network (DFNN) methods, as well as models using nine other machine learning methods, including naïve Bayes (NB), logistic regression (LR), support vector machine (SVM), LASSO, decision tree (DT), k-nearest neighbor (KNN), random forest (RF), AdaBoost (ADB), and XGBoost (XGB). We used grid search to tune hyperparameters for all methods. We then compared our feedforward deep learning models to the models trained using the nine other machine learning methods. Results: Based on the mean test AUC (Area under the ROC Curve) results, DFNN ranks 6th, 4th, and 3rd when predicting 5-year, 10-year, and 15-year BCM, respectively, out of 10 methods. The top performing methods in predicting 5-year BCM are XGB (1st), RF (2nd), and KNN (3rd). For predicting 10-year BCM, the top performers are XGB (1st), RF (2nd), and NB (3rd). 
Finally, for 15-year BCM, the top performers are SVM (1st), LR and LASSO (tied for 2nd), and DFNN (3rd). The ensemble methods RF and XGB outperform other methods when data are less balanced, while SVM, LR, LASSO, and DFNN outperform other methods when data are more balanced. Our statistical testing results show that at a significance level of 0.05, DFNN overall performs comparably to other machine learning methods when predicting 5-year, 10-year, and 15-year BCM. Conclusions: Our results show that deep learning with grid search overall performs at least as well as other machine learning methods when using non-image clinical data. It is interesting to note that some of the other machine learning methods, such as XGB, RF, and SVM, are very strong competitors of DFNN when incorporating grid search. It is also worth noting that the computation time required to do grid search with DFNN is much more than that required to do grid search with the other nine machine learning methods.
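The grid search described in this abstract can be sketched in a few lines. The block below is a minimal illustration only, not the authors' pipeline: the toy data, the one-feature threshold classifier, and the names `grid_search`, `toy_accuracy`, and `TOY_DATA` are all invented for the example, and plain accuracy on a single toy set stands in for the mean test AUC the study reports.

```python
from itertools import product

def grid_search(param_grid, evaluate):
    """Score every combination in param_grid and return the best one."""
    names = sorted(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = evaluate(params)
        if score > best_score:  # ties keep the earlier combination
            best_params, best_score = params, score
    return best_params, best_score

# Toy validation data (invented): two features per patient and a 0/1 label.
TOY_DATA = [((0.9, 0.2), 1), ((0.8, 0.4), 1), ((0.3, 0.9), 0),
            ((0.2, 0.7), 0), ((0.6, 0.1), 1), ((0.4, 0.8), 0)]

def toy_accuracy(params):
    """Accuracy of a hypothetical one-feature threshold classifier."""
    correct = 0
    for features, label in TOY_DATA:
        predicted = 1 if features[params["feature"]] >= params["threshold"] else 0
        correct += predicted == label
    return correct / len(TOY_DATA)

best_params, best_score = grid_search(
    {"feature": [0, 1], "threshold": [0.25, 0.5, 0.75]}, toy_accuracy
)
```

The cost grows with the product of the grid sizes, which is consistent with the abstract's remark that grid search is far more expensive for models that are slow to train.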
Affiliation(s)
- Xia Jiang
- Correspondence: Tel.: +412-648-9310
22
Nandish S, R J P, N M N. Natural Language Processing Approaches for Automated Multilevel and Multiclass Classification of Breast Lesions on Free-Text Cytopathology Reports. JCO Clin Cancer Inform 2022; 6:e2200036. [PMID: 36103641 DOI: 10.1200/cci.22.00036] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/20/2022]
Abstract
PURPOSE The extensive growth and use of electronic health records (EHRs) and extending medical literature have led to huge opportunities to automate the extraction of relevant clinical information that helps in concise and effective clinical decision support. However, processing such information has traditionally been dependent on labor-intensive processes with human errors such as fatigue, oversight, and interobserver variability. Hence, this study aims at the processing of EHRs and performing multilevel and multiclass classification by fetching dominant characteristic features that are sufficient to detect and differentiate various types of breast lesions. PATIENTS AND METHODS In this study, unstructured EHRs on breast lesions obtained through fine-needle aspiration cytology technique are considered. The raw text was normalized into structured tabular form and converted to scores by performing sentiment analysis that helps to decide the total polarity or class label of the EHR. Supervised machine learning approaches, namely random forest and feed-forward neural network trained using Levenberg-Marquardt training function, are used for classification of the collected EHR data set containing 2,879 records that are split in the ratio of 80:20 as training and testing data sets, respectively. RESULTS Random forest and feed-forward neural network classifiers gave the best performance with an accuracy of 99.36%, an overall receiver operating characteristic-area under the curve of 99.2%, a correlation with ground truth of 98.3%, and a histopathologic correlation of 98.6%. CONCLUSION Natural language processing has huge potential to automate the extraction of clinical features from breast lesions. The proposed multilevel and multiclass classification approach is used to classify 13 different types of breast lesions with 20 different labels into five classes to decide the type of treatment that should be given to patients by a physician or oncologist.
Affiliation(s)
- Sonali Nandish
- Department of Computer Science and Engineering, JSS Science and Technology University, Mysuru, Karnataka, India
- Prathibha R J
- Department of Information Science and Engineering, JSS Science and Technology University, Mysuru, Karnataka, India
- Nandini N M
- Department of Pathology, JSS Academy of Higher Education and Research, Mysuru, Karnataka, India
23
Wang L, Fu S, Wen A, Ruan X, He H, Liu S, Moon S, Mai M, Riaz IB, Wang N, Yang P, Xu H, Warner JL, Liu H. Assessment of Electronic Health Record for Cancer Research and Patient Care Through a Scoping Review of Cancer Natural Language Processing. JCO Clin Cancer Inform 2022; 6:e2200006. [PMID: 35917480 PMCID: PMC9470142 DOI: 10.1200/cci.22.00006] [Citation(s) in RCA: 16] [Impact Index Per Article: 5.3] [Received: 01/19/2022] [Revised: 03/18/2022] [Accepted: 06/15/2022] [Indexed: 11/20/2022]
Abstract
PURPOSE The advancement of natural language processing (NLP) has promoted the use of detailed textual data in electronic health records (EHRs) to support cancer research and to facilitate patient care. In this review, we aim to assess EHR for cancer research and patient care by using the Minimal Common Oncology Data Elements (mCODE), which is a community-driven effort to define a minimal set of data elements for cancer research and practice. Specifically, we aim to assess the alignment of NLP-extracted data elements with mCODE and review existing NLP methodologies for extracting said data elements. METHODS Published literature studies were searched to retrieve cancer-related NLP articles that were written in English and published between January 2010 and September 2020 from main literature databases. After the retrieval, articles with EHRs as the data source were manually identified. A charting form was developed for relevant study analysis and used to categorize data including four main topics: metadata, EHR data and targeted cancer types, NLP methodology, and oncology data elements and standards. RESULTS A total of 123 publications were finally selected and included in our analysis. We found that cancer research and patient care require some data elements beyond mCODE, as expected. Transparency and reproducibility are not sufficient in NLP methods, and inconsistency in NLP evaluation exists. CONCLUSION We conducted a comprehensive review of cancer NLP for research and patient care using EHR data. Issues and barriers for wide adoption of cancer NLP were identified and discussed.
Affiliation(s)
- Liwei Wang
- Department of Artificial Intelligence and Informatics, Mayo Clinic, Rochester, MN
- Sunyang Fu
- Department of Artificial Intelligence and Informatics, Mayo Clinic, Rochester, MN
- Andrew Wen
- Department of Artificial Intelligence and Informatics, Mayo Clinic, Rochester, MN
- Xiaoyang Ruan
- Department of Artificial Intelligence and Informatics, Mayo Clinic, Rochester, MN
- Huan He
- Department of Artificial Intelligence and Informatics, Mayo Clinic, Rochester, MN
- Sijia Liu
- Department of Artificial Intelligence and Informatics, Mayo Clinic, Rochester, MN
- Sungrim Moon
- Department of Artificial Intelligence and Informatics, Mayo Clinic, Rochester, MN
- Michelle Mai
- Department of Artificial Intelligence and Informatics, Mayo Clinic, Rochester, MN
- Irbaz B. Riaz
- Department of Hematology/Oncology, Mayo Clinic, Scottsdale, AZ
- Nan Wang
- Department of Computer Science and Engineering, College of Science and Engineering, University of Minnesota, Minneapolis, MN
- Ping Yang
- Department of Quantitative Health Sciences, Mayo Clinic, Scottsdale, AZ
- Hua Xu
- School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX
- Jeremy L. Warner
- Departments of Medicine (Hematology/Oncology), Vanderbilt University, Nashville, TN
- Department of Biomedical Informatics, Vanderbilt University, Nashville, TN
- Hongfang Liu
- Department of Artificial Intelligence and Informatics, Mayo Clinic, Rochester, MN
24
Khair S, Dort JC, Quan ML, Cheung WY, Sauro KM, Nakoneshny SC, Popowich BL, Liu P, Wu G, Xu Y. Validated algorithms for identifying timing of second event of oropharyngeal squamous cell carcinoma using real-world data. Head Neck 2022; 44:1909-1917. [PMID: 35653151 DOI: 10.1002/hed.27109] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/12/2021] [Revised: 04/29/2022] [Accepted: 05/18/2022] [Indexed: 11/07/2022]
Abstract
BACKGROUND Understanding the occurrence and timing of second events (recurrence and second primary cancer) is essential for cancer-specific survival analysis. However, this information is not readily available in administrative data. METHODS Alberta Cancer Registry, physician claims, and other administrative data were used. Timing of second event was estimated based on our developed algorithm. For validation, the difference in days between the algorithm-estimated and the chart-reviewed timing of second event was calculated. Further, the result of Cox regression modeling of cancer-free survival was compared to chart review data. RESULTS The majority (74.3%) of the patients had a difference between the chart-reviewed and algorithm-estimated timing of second event falling within the 0-60 days window. Kaplan-Meier curves generated from the estimated data and chart review data were comparable, with a 5-year second-event-free survival rate of 75.4% versus 72.5%. CONCLUSION The algorithm provided an estimated timing of second event similar to that of the chart review.
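The validation step in this abstract (comparing algorithm-estimated and chart-reviewed dates) amounts to a date difference and a threshold. The sketch below is a hedged illustration with invented dates; the function name `within_window_share` is hypothetical, and treating the window as an absolute difference of at most 60 days is an assumption, since the abstract does not say how the direction of the difference was handled.

```python
from datetime import date

def within_window_share(pairs, window_days=60):
    """Return (share within window, list of absolute day differences).

    pairs: iterable of (estimated_date, chart_reviewed_date).
    """
    diffs = [abs((estimated - reviewed).days) for estimated, reviewed in pairs]
    hits = sum(1 for d in diffs if d <= window_days)
    return hits / len(diffs), diffs

# Invented example dates, not study data.
example_pairs = [
    (date(2015, 3, 11), date(2015, 3, 1)),   # 10 days apart
    (date(2016, 1, 1), date(2016, 2, 29)),   # 59 days apart
    (date(2017, 5, 1), date(2017, 7, 1)),    # 61 days apart
    (date(2018, 6, 15), date(2018, 6, 15)),  # same day
]
share, day_diffs = within_window_share(example_pairs)
```

On real data the same share would be compared against the 74.3% reported above.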
Affiliation(s)
- Shahreen Khair
- Department of Community Health Sciences, Cumming School of Medicine, University of Calgary, Calgary, Alberta, Canada
- Joseph C Dort
- Department of Community Health Sciences, Cumming School of Medicine, University of Calgary, Calgary, Alberta, Canada; Department of Surgery, Cumming School of Medicine, University of Calgary, North Tower, Foothills Medical Centre, Calgary, Alberta, Canada
- May Lynn Quan
- Department of Community Health Sciences, Cumming School of Medicine, University of Calgary, Calgary, Alberta, Canada; Department of Surgery, Cumming School of Medicine, University of Calgary, North Tower, Foothills Medical Centre, Calgary, Alberta, Canada; Department of Oncology, Cumming School of Medicine, University of Calgary, Tom Baker Cancer Centre, Calgary, Alberta, Canada
- Winson Y Cheung
- Department of Community Health Sciences, Cumming School of Medicine, University of Calgary, Calgary, Alberta, Canada; Department of Surgery, Cumming School of Medicine, University of Calgary, North Tower, Foothills Medical Centre, Calgary, Alberta, Canada
- Khara M Sauro
- Department of Community Health Sciences, Cumming School of Medicine, University of Calgary, Calgary, Alberta, Canada; Department of Surgery, Cumming School of Medicine, University of Calgary, North Tower, Foothills Medical Centre, Calgary, Alberta, Canada; Department of Oncology, Cumming School of Medicine, University of Calgary, Tom Baker Cancer Centre, Calgary, Alberta, Canada
- Steven C Nakoneshny
- The Ohlson Research Initiative, Arnie Charbonneau Cancer Institute, University of Calgary, Calgary, Alberta, Canada
- Brittany Lynn Popowich
- Centre for Health Informatics, Cumming School of Medicine, University of Calgary, Teaching Research and Wellness (TRW), Calgary, Alberta, Canada
- Ping Liu
- Department of Community Health Sciences, Cumming School of Medicine, University of Calgary, Calgary, Alberta, Canada
- Guosong Wu
- Department of Community Health Sciences, Cumming School of Medicine, University of Calgary, Calgary, Alberta, Canada; Centre for Health Informatics, Cumming School of Medicine, University of Calgary, Teaching Research and Wellness (TRW), Calgary, Alberta, Canada
- Yuan Xu
- Department of Community Health Sciences, Cumming School of Medicine, University of Calgary, Calgary, Alberta, Canada; Department of Surgery, Cumming School of Medicine, University of Calgary, North Tower, Foothills Medical Centre, Calgary, Alberta, Canada; Department of Oncology, Cumming School of Medicine, University of Calgary, Tom Baker Cancer Centre, Calgary, Alberta, Canada; Centre for Health Informatics, Cumming School of Medicine, University of Calgary, Teaching Research and Wellness (TRW), Calgary, Alberta, Canada
25
Chen Y, Hao L, Zou VZ, Hollander Z, Ng RT, Isaac KV. Automated medical chart review for breast cancer outcomes research: a novel natural language processing extraction system. BMC Med Res Methodol 2022; 22:136. [PMID: 35549854 PMCID: PMC9101856 DOI: 10.1186/s12874-022-01583-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/11/2021] [Accepted: 03/15/2022] [Indexed: 11/10/2022]
Abstract
BACKGROUND Manually extracted data points from health records are collated on an institutional, provincial, and national level to facilitate clinical research. However, the labour-intensive clinical chart review process puts an increasing burden on healthcare system budgets. Therefore, an automated information extraction system is needed to ensure the timeliness and scalability of research data. METHODS We used a dataset of 100 synoptic operative and 100 pathology reports, evenly split into 50 reports in training and test sets for each report type. The training set guided our development of a Natural Language Processing (NLP) extraction pipeline system, which accepts scanned images of operative and pathology reports. The system uses a combination of rule-based and transfer learning methods to extract numeric encodings from text. We also developed visualization tools to compare the manual and automated extractions. The code for this paper was made available on GitHub. RESULTS A test set of 50 operative and 50 pathology reports were used to evaluate the extraction accuracies of the NLP pipeline. Gold standard, defined as manual extraction by expert reviewers, yielded accuracies of 90.5% for operative reports and 96.0% for pathology reports, while the NLP system achieved overall 91.9% (operative) and 95.4% (pathology) accuracy. The pipeline successfully extracted outcomes data pertinent to breast cancer tumor characteristics (e.g. presence of invasive carcinoma, size, histologic type), prognostic factors (e.g. number of lymph nodes with micro-metastases and macro-metastases, pathologic stage), and treatment-related variables (e.g. margins, neo-adjuvant treatment, surgical indication) with high accuracy. Out of the 48 variables across operative and pathology codebooks, NLP yielded 43 variables with F-scores of at least 0.90; in comparison, a trained human annotator yielded 44 variables with F-scores of at least 0.90. 
CONCLUSIONS The NLP system achieves near-human-level accuracy in both operative and pathology reports using a minimal curated dataset. This system uniquely provides a robust solution for transparent, adaptable, and scalable automation of data extraction from patient health records. It may serve to advance breast cancer clinical research by facilitating collection of vast amounts of valuable health data at a population level.
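The per-variable F-score comparison reported in this abstract can be illustrated with a short calculation. This is a sketch under stated assumptions: each variable is simplified to a binary present/absent label, the variable names and label lists are invented, and `f_score` is a hypothetical helper rather than anything from the authors' published code.

```python
def f_score(gold, predicted):
    """F1 score for binary labels (1 = present, 0 = absent)."""
    tp = sum(1 for g, p in zip(gold, predicted) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, predicted) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, predicted) if g == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Invented gold-standard and extracted labels for two illustrative variables.
extractions = {
    "invasive_carcinoma": ([1, 1, 0, 1, 0, 1], [1, 1, 0, 1, 0, 1]),
    "positive_margins":   ([1, 1, 0, 1, 0, 1], [1, 0, 0, 1, 1, 1]),
}
scores = {name: f_score(gold, pred) for name, (gold, pred) in extractions.items()}
high_scoring = [name for name, score in scores.items() if score >= 0.90]
```

Counting the variables that clear the 0.90 threshold mirrors how the abstract summarizes 43 of 48 variables for the NLP system against 44 for the human annotator.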
Affiliation(s)
- Yifu Chen
- Department of Computer Science, University of British Columbia, Faculty of Science, 201-2366 Main Mall, Vancouver, BC, V6T 1Z4, Canada
- Prevention of Organ Failure (PROOF) Centre of Excellence, 1190 Hornby Street, Vancouver, BC, V6Z 2K5, Canada
- Lucy Hao
- Department of Computer Science, University of British Columbia, Faculty of Science, 201-2366 Main Mall, Vancouver, BC, V6T 1Z4, Canada
- Prevention of Organ Failure (PROOF) Centre of Excellence, 1190 Hornby Street, Vancouver, BC, V6Z 2K5, Canada
- Vito Z Zou
- Department of Surgery, University of British Columbia, Faculty of Medicine, 2221 Wesbrook Mall, Vancouver, BC, V5Z 1M9, Canada
- Zsuzsanna Hollander
- Department of Computer Science, University of British Columbia, Faculty of Science, 201-2366 Main Mall, Vancouver, BC, V6T 1Z4, Canada
- Prevention of Organ Failure (PROOF) Centre of Excellence, 1190 Hornby Street, Vancouver, BC, V6Z 2K5, Canada
- Raymond T Ng
- Department of Computer Science, University of British Columbia, Faculty of Science, 201-2366 Main Mall, Vancouver, BC, V6T 1Z4, Canada
- Prevention of Organ Failure (PROOF) Centre of Excellence, 1190 Hornby Street, Vancouver, BC, V6Z 2K5, Canada
- Kathryn V Isaac
- Department of Surgery, University of British Columbia, Faculty of Medicine, 2221 Wesbrook Mall, Vancouver, BC, V5Z 1M9, Canada.
26
Wu Q, Deng L, Jiang Y, Zhang H. Application of the Machine-Learning Model to Improve Prediction of Non-Sentinel Lymph Node Metastasis Status Among Breast Cancer Patients. Front Surg 2022; 9:797377. [PMID: 35548185 PMCID: PMC9082647 DOI: 10.3389/fsurg.2022.797377] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 10/18/2021] [Accepted: 03/18/2022] [Indexed: 11/13/2022]
Abstract
Background: Performing axillary lymph node dissection (ALND) is the current standard option after a positive sentinel lymph node (SLN). However, whether 1–2 metastatic SLNs require ALND is debatable. The probability of metastasis in non-sentinel lymph nodes (NSLNs) can be calculated using nomograms. In this study, we developed an individualized model using machine-learning (ML) methods to select potential variables, which influence NSLN metastasis. Materials and Methods: Cohorts of patients with early breast cancer who underwent SLN biopsy and ALND between 2012 and 2021 were created (training cohort, N = 157 and validation cohort, N = 58) for the development of the nomogram. Three ML methods were trained in the training set to create a strong predictive model. Finally, the multiple iterations of the least absolute shrinkage and selection operator regression method were used to determine the variables associated with NSLN status. Results: Four independent variables (positive SLN number, absence of lymph node hilum, lymphovascular invasion (LVI), and total number of SLNs harvested) were combined to generate the nomogram. The area under the receiver operating characteristic curve (AUC) value of 0.759 was obtained in the entire set. The AUC values for the training set and the test set were 0.782 and 0.705, respectively. The Hosmer-Lemeshow test of the model fit accuracy was identified with p = 0.759. Conclusion: This study developed a nomogram that incorporates ultrasound (US)-related variables using the ML method and serves to clinically predict the non-metastatic status of NSLN and help in the selection of the appropriate treatment option.
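The AUC values quoted in this abstract measure how often a model ranks a positive case above a negative one. A minimal sketch of that calculation follows; the function name `auc` and the labels and scores are invented for illustration and are unrelated to the study's cohorts.

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison.

    Each (positive, negative) pair counts 1 if the positive case has the
    higher score, 0.5 on a tie, and 0 otherwise.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Invented predicted probabilities of NSLN metastasis for four patients.
example_auc = auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1])
```

The pairwise form is quadratic in the number of cases, which is acceptable for cohorts of the size described here (157 and 58 patients).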
Affiliation(s)
- Qian Wu
- Department of General Surgery, Shanghai Public Health Center, Shanghai, China
- Li Deng
- Department of General Surgery, Shanghai Public Health Center, Shanghai, China
- Ying Jiang
- Department of General Surgery, Zhongshan Hospital, Fudan University, Shanghai, China
- Hongwei Zhang
- Department of General Surgery, Zhongshan Hospital, Fudan University, Shanghai, China
- *Correspondence: Hongwei Zhang
27
Kaur I, Doja M, Ahmad T. Data Mining and Machine Learning in Cancer Survival Research: An Overview and Future Recommendations. J Biomed Inform 2022; 128:104026. [DOI: 10.1016/j.jbi.2022.104026] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 10/13/2021] [Revised: 02/07/2022] [Accepted: 02/09/2022] [Indexed: 12/29/2022]
28
Warren JL, Noone AM, Stevens J, Wu XC, Hseih MC, Mumphrey B, Schmidt R, Coyle L, Shields R, Mariotto AB. The Utility of Pathology Reports to Identify Persons With Cancer Recurrence. Med Care 2022; 60:44-49. [PMID: 34812787 PMCID: PMC8720471 DOI: 10.1097/mlr.0000000000001669] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/03/2023]
Abstract
BACKGROUND Cancer recurrence is an important measure of the impact of cancer treatment. However, no population-based data on recurrence are available. Pathology reports could potentially identify cancer recurrences. Their utility to capture recurrences is unknown. OBJECTIVE This analysis assesses the sensitivity of pathology reports to identify patients with cancer recurrence and the stage at recurrence. SUBJECTS The study includes patients with recurrent breast (n=214) or colorectal (n=203) cancers. RESEARCH DESIGN This retrospective analysis included patients from a population-based cancer registry who were part of the Patient-Centered Outcomes Research (PCOR) Study, a project that followed cancer patients in-depth for 5 years after diagnosis to identify recurrences. MEASURES Information abstracted from pathology reports for patients with recurrence was compared with their PCOR data (gold standard) to determine what percent had a pathology report at the time of recurrence, the sensitivity of text in the report to identify recurrence, and if the stage at recurrence could be determined from the pathology report. RESULTS One half of cancer patients had a pathology report near the time of recurrence. For patients with a pathology report, the report's sensitivity to identify recurrence was 98.1% for breast cancer cases and 95.7% for colorectal cancer cases. The specific stage at recurrence from the pathology report had a moderate agreement with gold-standard data. CONCLUSIONS Pathology reports alone cannot measure population-based recurrence of solid cancers but can identify specific cohorts of recurrent cancer patients. As electronic submission of pathology reports increases, these reports may identify specific recurrent patients in near real-time.
Affiliation(s)
- Joan L. Warren
- National Cancer Institute/Division of Cancer Control and Population Science, Bethesda, Maryland 20892
- Anne-Michelle Noone
- National Cancer Institute/Division of Cancer Control and Population Science, Bethesda, Maryland 20892
- Xiao-Cheng Wu
- Louisiana Tumor Registry, School of Public Health, Louisiana State University Health Sciences Center, New Orleans, Louisiana 70112
- Mei-chin Hseih
- Louisiana Tumor Registry, School of Public Health, Louisiana State University Health Sciences Center, New Orleans, Louisiana 70112
- Brent Mumphrey
- Louisiana Tumor Registry, School of Public Health, Louisiana State University Health Sciences Center, New Orleans, Louisiana 70112
- Linda Coyle
- Information Management Services, Calverton, Maryland 20705
- Rusty Shields
- Information Management Services, Calverton, Maryland 20705
- Angela B. Mariotto
- National Cancer Institute/Division of Cancer Control and Population Science, Bethesda, Maryland 20892
29
Jing X. The Unified Medical Language System at 30 Years and How It Is Used and Published: Systematic Review and Content Analysis. JMIR Med Inform 2021; 9:e20675. [PMID: 34236337 PMCID: PMC8433943 DOI: 10.2196/20675] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.0] [Received: 05/25/2020] [Revised: 11/25/2020] [Accepted: 07/02/2021] [Indexed: 01/22/2023]
Abstract
BACKGROUND The Unified Medical Language System (UMLS) has been a critical tool in biomedical and health informatics, and the year 2021 marks its 30th anniversary. The UMLS brings together many broadly used vocabularies and standards in the biomedical field to facilitate interoperability among different computer systems and applications. OBJECTIVE Despite its longevity, there is no comprehensive publication analysis of the use of the UMLS. Thus, this review and analysis is conducted to provide an overview of the UMLS and its use in English-language peer-reviewed publications, with the objective of providing a comprehensive understanding of how the UMLS has been used in English-language peer-reviewed publications over the last 30 years. METHODS PubMed, ACM Digital Library, and the Nursing & Allied Health Database were used to search for studies. The primary search strategy was as follows: UMLS was used as a Medical Subject Headings term or a keyword or appeared in the title or abstract. Only English-language publications were considered. The publications were screened first, then coded and categorized iteratively, following the grounded theory. The review process followed the PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) guidelines. RESULTS A total of 943 publications were included in the final analysis. Moreover, 32 publications were categorized into 2 categories; hence the total number of publications before duplicates are removed is 975. 
After analysis and categorization of the publications, UMLS was found to be used in the following emerging themes or areas (the number of publications and their respective percentages are given in parentheses): natural language processing (230/975, 23.6%), information retrieval (125/975, 12.8%), terminology study (90/975, 9.2%), ontology and modeling (80/975, 8.2%), medical subdomains (76/975, 7.8%), other language studies (53/975, 5.4%), artificial intelligence tools and applications (46/975, 4.7%), patient care (35/975, 3.6%), data mining and knowledge discovery (25/975, 2.6%), medical education (20/975, 2.1%), degree-related theses (13/975, 1.3%), digital library (5/975, 0.5%), and the UMLS itself (150/975, 15.4%), as well as the UMLS for other purposes (27/975, 2.8%). CONCLUSIONS The UMLS has been used successfully in patient care, medical education, digital libraries, and software development, as originally planned, as well as in degree-related theses, the building of artificial intelligence tools, data mining and knowledge discovery, foundational work in methodology, and middle layers that may lead to advanced products. Natural language processing, the UMLS itself, and information retrieval are the 3 most common themes that emerged among the included publications. The results, although largely related to academia, demonstrate that UMLS achieves its intended uses successfully, in addition to achieving uses broadly beyond its original intentions.
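The theme counts in this abstract can be checked for internal consistency with a few lines of arithmetic. The sketch below copies the counts from the abstract; the dictionary keys are shortened labels and the variable names are chosen for the example.

```python
# Publication counts per theme, as listed in the abstract.
theme_counts = {
    "natural language processing": 230,
    "information retrieval": 125,
    "terminology study": 90,
    "ontology and modeling": 80,
    "medical subdomains": 76,
    "other language studies": 53,
    "AI tools and applications": 46,
    "patient care": 35,
    "data mining and knowledge discovery": 25,
    "medical education": 20,
    "degree-related theses": 13,
    "digital library": 5,
    "the UMLS itself": 150,
    "other purposes": 27,
}

total = sum(theme_counts.values())
# 943 included publications, 32 of which were counted in two categories.
expected_total = 943 + 32
shares = {theme: round(100 * n / total, 1) for theme, n in theme_counts.items()}
```

The counts sum to 975, matching 943 publications plus the 32 double-counted ones, and the rounded shares agree with the percentages quoted in the abstract.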
Affiliation(s)
- Xia Jing
- Department of Public Health Sciences, College of Behavioral, Social and Health Sciences, Clemson University, Clemson, SC, United States
30
Holmes B, Chitale D, Loving J, Tran M, Subramanian V, Berry A, Rioth M, Warrier R, Brown T. Customizable Natural Language Processing Biomarker Extraction Tool. JCO Clin Cancer Inform 2021; 5:833-841. [PMID: 34406803 DOI: 10.1200/cci.21.00017] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/20/2022]
Abstract
PURPOSE Natural language processing (NLP) in pathology reports to extract biomarker information is an ongoing area of research. MetaMap is a natural language processing tool developed and funded by the National Library of Medicine to map biomedical text to the Unified Medical Language System Metathesaurus by applying specific tags to clinically relevant terms. Although results are useful without additional postprocessing, these tags lack important contextual information. METHODS Our novel method takes terminology-driven semantic tags and incorporates those into a semantic frame that is task-specific to add necessary context to MetaMap. We use important contextual information to capture biomarker results to support Community Health System's use of Precision Medicine treatments for patients with cancer. For each biomarker, the name, type, numeric quantifiers, non-numeric qualifiers, and the time frame are extracted. These fields then associate biomarkers with their context in the pathology report such as test type, probe intensity, copy-number changes, and even failed results. A selection of 6,713 relevant reports contained the following standard-of-care biomarkers for metastatic breast cancer: breast cancer gene 1 and 2, estrogen receptor, progesterone receptor, human epidermal growth factor receptor 2, and programmed death-ligand 1. RESULTS The method was tested on pathology reports from the internal pathology laboratory at Henry Ford Health System. A certified tumor registrar reviewed 400 tests, which showed > 95% accuracy for all extracted biomarker types. CONCLUSION Using this new method, it is possible to extract high-quality, contextual biomarker information, and this represents a significant advance in biomarker extraction.
31
Gupta R, Srivastava D, Sahu M, Tiwari S, Ambasta RK, Kumar P. Artificial intelligence to deep learning: machine intelligence approach for drug discovery. Mol Divers 2021; 25:1315-1360. [PMID: 33844136 PMCID: PMC8040371 DOI: 10.1007/s11030-021-10217-3] [Citation(s) in RCA: 424] [Impact Index Per Article: 106.0] [Received: 01/29/2021] [Accepted: 03/22/2021] [Indexed: 02/06/2023]
Abstract
Drug designing and development is an important area of research for pharmaceutical companies and chemical scientists. However, low efficacy, off-target delivery, time consumption, and high cost impose a hurdle and challenges that impact drug design and discovery. Further, complex and big data from genomics, proteomics, microarray data, and clinical trials also impose an obstacle in the drug discovery pipeline. Artificial intelligence and machine learning technology play a crucial role in drug discovery and development. In other words, artificial neural networks and deep learning algorithms have modernized the area. Machine learning and deep learning algorithms have been implemented in several drug discovery processes such as peptide synthesis, structure-based virtual screening, ligand-based virtual screening, toxicity prediction, drug monitoring and release, pharmacophore modeling, quantitative structure-activity relationship, drug repositioning, polypharmacology, and physiochemical activity. Evidence from the past strengthens the implementation of artificial intelligence and deep learning in this field. Moreover, novel data mining, curation, and management techniques provided critical support to recently developed modeling algorithms. In summary, artificial intelligence and deep learning advancements provide an excellent opportunity for rational drug design and discovery process, which will eventually impact mankind. The primary concern associated with drug design and development is time consumption and production cost. Further, inefficiency, inaccurate target delivery, and inappropriate dosage are other hurdles that inhibit the process of drug delivery and development. With advancements in technology, computer-aided drug design integrating artificial intelligence algorithms can eliminate the challenges and hurdles of traditional drug design and development. 
Artificial intelligence is referred to as superset comprising machine learning, whereas machine learning comprises supervised learning, unsupervised learning, and reinforcement learning. Further, deep learning, a subset of machine learning, has been extensively implemented in drug design and development. The artificial neural network, deep neural network, support vector machines, classification and regression, generative adversarial networks, symbolic learning, and meta-learning are examples of the algorithms applied to the drug design and discovery process. Artificial intelligence has been applied to different areas of drug design and development process, such as from peptide synthesis to molecule design, virtual screening to molecular docking, quantitative structure-activity relationship to drug repositioning, protein misfolding to protein-protein interactions, and molecular pathway identification to polypharmacology. Artificial intelligence principles have been applied to the classification of active and inactive, monitoring drug release, pre-clinical and clinical development, primary and secondary drug screening, biomarker development, pharmaceutical manufacturing, bioactivity identification and physiochemical properties, prediction of toxicity, and identification of mode of action.
Affiliation(s)
- Rohan Gupta
- Molecular Neuroscience and Functional Genomics Laboratory, Department of Biotechnology, Delhi Technological University (Formerly DCE), Shahbad Daulatpur, Bawana Road, Delhi, 110042, India
| | - Devesh Srivastava
- Molecular Neuroscience and Functional Genomics Laboratory, Department of Biotechnology, Delhi Technological University (Formerly DCE), Shahbad Daulatpur, Bawana Road, Delhi, 110042, India
| | - Mehar Sahu
- Molecular Neuroscience and Functional Genomics Laboratory, Department of Biotechnology, Delhi Technological University (Formerly DCE), Shahbad Daulatpur, Bawana Road, Delhi, 110042, India
| | - Swati Tiwari
- Molecular Neuroscience and Functional Genomics Laboratory, Department of Biotechnology, Delhi Technological University (Formerly DCE), Shahbad Daulatpur, Bawana Road, Delhi, 110042, India
| | - Rashmi K Ambasta
- Molecular Neuroscience and Functional Genomics Laboratory, Department of Biotechnology, Delhi Technological University (Formerly DCE), Shahbad Daulatpur, Bawana Road, Delhi, 110042, India
| | - Pravir Kumar
- Molecular Neuroscience and Functional Genomics Laboratory, Department of Biotechnology, Delhi Technological University (Formerly DCE), Shahbad Daulatpur, Bawana Road, Delhi, 110042, India.
| |
|
32
|
Gupta M, Wu H, Arora S, Gupta A, Chaudhary G, Hua Q. Gene Mutation Classification through Text Evidence Facilitating Cancer Tumour Detection. JOURNAL OF HEALTHCARE ENGINEERING 2021; 2021:8689873. [PMID: 34367540 PMCID: PMC8337154 DOI: 10.1155/2021/8689873] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 05/27/2021] [Revised: 06/26/2021] [Accepted: 07/13/2021] [Indexed: 12/03/2022]
Abstract
A cancer tumour consists of thousands of genetic mutations. Despite advances in technology, the task of distinguishing driver mutations, which promote tumour growth, from passenger (neutral) mutations is still done manually. This is a time-consuming process in which pathologists interpret every genetic mutation from the clinical evidence by hand. This clinical evidence belongs to a total of nine classes, but the classification criterion is still unknown. The main aim of this research is to propose a multiclass classifier that classifies genetic mutations based on clinical evidence (i.e., the text description of these genetic mutations) using Natural Language Processing (NLP) techniques. The dataset for this research is taken from Kaggle and is provided by the Memorial Sloan Kettering Cancer Center (MSKCC); world-class researchers and oncologists contributed to it. Three text transformation models, namely, CountVectorizer, TfidfVectorizer, and Word2Vec, are utilized for the conversion of text to a matrix of token counts. Three machine learning classification models, namely, Logistic Regression (LR), Random Forest (RF), and XGBoost (XGB), along with the Recurrent Neural Network (RNN) model of deep learning, are applied to the sparse matrix (keyword count representation) of text descriptions. The accuracy of all the proposed classifiers is evaluated using the confusion matrix. Finally, the empirical results show that the RNN model performed better than the other proposed classifiers, with the highest accuracy of 70%.
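To make the "matrix of token counts" concrete, the following is a minimal standard-library sketch of a count representation and a textbook TF-IDF weighting (term count times log(N/df)). It is not the authors' pipeline and does not reproduce the exact smoothing used by library vectorizers; the three example sentences and all function names are hypothetical.

```python
import math
from collections import Counter


def tokenize(text):
    # Simplistic whitespace tokenizer; real pipelines also strip punctuation.
    return text.lower().split()


def count_matrix(docs):
    # One row per document, one column per vocabulary term (sorted).
    tokenized = [tokenize(d) for d in docs]
    vocab = sorted({t for doc in tokenized for t in doc})
    rows = []
    for doc in tokenized:
        counts = Counter(doc)
        rows.append([counts[t] for t in vocab])
    return vocab, rows


def tfidf_matrix(vocab, counts):
    # Textbook weighting: raw count * log(N / document frequency).
    n_docs = len(counts)
    df = [sum(1 for row in counts if row[j] > 0) for j in range(len(vocab))]
    idf = [math.log(n_docs / d) for d in df]
    return [[row[j] * idf[j] for j in range(len(vocab))] for row in counts]


# Hypothetical text evidence, for illustration only.
docs = [
    "mutation activates kinase signalling",
    "mutation is neutral",
    "kinase mutation drives tumour growth",
]
vocab, counts = count_matrix(docs)
tfidf = tfidf_matrix(vocab, counts)
```

In this toy corpus the word "mutation" occurs in every document, so its TF-IDF weight is zero, while rarer terms such as "kinase" keep a positive weight. That is the usual motivation for preferring TF-IDF over raw counts when some words are ubiquitous.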
Affiliation(s)
- Meenu Gupta
- Department of Computer Science and Engineering, Chandigarh University, Ajitgarh, Punjab, India
| | - Hao Wu
- Digital Zhejiang Technology Operations Co., Ltd., Hangzhou, China
| | - Simrann Arora
- Bharati Vidyapeeth's College of Engineering, New Delhi, India
| | - Akash Gupta
- Bharati Vidyapeeth's College of Engineering, New Delhi, India
| | - Gopal Chaudhary
- Bharati Vidyapeeth's College of Engineering, New Delhi, India
| | - Qiaozhi Hua
- Computer School, Hubei University of Arts and Science, Xiangyang 441000, China
| |
|
33
|
Lambert P, Pitz M, Singh H, Decker K. Evaluation of algorithms using administrative health and structured electronic medical record data to determine breast and colorectal cancer recurrence in a Canadian province: Using algorithms to determine breast and colorectal cancer recurrence. BMC Cancer 2021; 21:763. [PMID: 34210266 PMCID: PMC8252227 DOI: 10.1186/s12885-021-08526-9] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 09/22/2020] [Accepted: 06/21/2021] [Indexed: 11/19/2022] Open
Abstract
BACKGROUND Algorithms that use administrative health and electronic medical record (EMR) data to determine cancer recurrence have the potential to replace chart reviews. This study evaluated algorithms to determine breast and colorectal cancer recurrence in a Canadian province with a universal health care system. METHODS Individuals diagnosed with stage I-III breast or colorectal cancer from 2004 to 2012 in Manitoba, Canada were included. Pre-specified and conditional inference tree algorithms using administrative health and structured EMR data were developed. Sensitivity, specificity, positive predictive value (PPV), negative predictive value (NPV), correct classification, and scaled Brier scores were measured. RESULTS The weighted pre-specified variable algorithm for the breast cancer validation cohort (N = 1181, 167 recurrences) demonstrated 81.1% sensitivity, 93.2% specificity, 61.4% PPV, 97.4% NPV, 91.8% correct classification, and scaled Brier score of 0.21. The weighted conditional inference tree algorithm demonstrated 68.5% sensitivity, 97.0% specificity, 75.4% PPV, 95.8% NPV, 93.6% correct classification, and scaled Brier score of 0.39. The weighted pre-specified variable algorithm for the colorectal validation cohort (N = 693, 136 recurrences) demonstrated 77.7% sensitivity, 92.8% specificity, 70.7% PPV, 94.9% NPV, 90.1% correct classification, and scaled Brier score of 0.33. The conditional inference tree algorithm demonstrated 62.6% sensitivity, 97.8% specificity, 86.4% PPV, 92.2% NPV, 91.4% correct classification, and scaled Brier score of 0.42. CONCLUSIONS Algorithms developed in this study using administrative health and structured EMR data to determine breast and colorectal cancer recurrence had moderate sensitivity and PPV, high specificity, NPV, and correct classification, but low accuracy. The accuracy is similar to other algorithms developed to classify recurrence only (i.e., distinguished from second primary) and inferior to algorithms that do not make this distinction. The accuracy of algorithms for determining cancer recurrence only must improve before replacing chart reviews.
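The measures reported above all derive from a 2x2 confusion matrix. The sketch below shows the standard definitions; the counts are hypothetical and are not the study's data (the reported figures are weighted, so they cannot be recomputed from N and the number of recurrences alone).

```python
def diagnostic_metrics(tp, fp, fn, tn):
    # Standard definitions from a 2x2 table of algorithm vs. chart review.
    total = tp + fp + fn + tn
    return {
        "sensitivity": tp / (tp + fn),   # recurrences correctly flagged
        "specificity": tn / (tn + fp),   # non-recurrences correctly cleared
        "ppv": tp / (tp + fp),           # flagged cases that truly recurred
        "npv": tn / (tn + fn),           # cleared cases that truly did not
        "correct_classification": (tp + tn) / total,
    }


# Hypothetical counts: 100 recurrences among 1000 patients.
metrics = diagnostic_metrics(tp=80, fp=20, fn=20, tn=880)
```

Because recurrences are a minority of each cohort, a small false-positive rate among the many non-recurrent patients can still pull PPV well below specificity, while NPV and correct classification stay high. This is consistent with the pattern in the reported results.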
Affiliation(s)
- Pascal Lambert
- CancerCare Manitoba Research Institute, 675 McDermot Avenue, Winnipeg, Manitoba, R3E 0V9, Canada
- Department of Epidemiology and Cancer Registry, CancerCare Manitoba, 675 McDermot Avenue, Winnipeg, Manitoba, R3E 0V9, Canada
| | - Marshall Pitz
- CancerCare Manitoba Research Institute, 675 McDermot Avenue, Winnipeg, Manitoba, R3E 0V9, Canada
- Department of Medical Oncology, CancerCare Manitoba, 675 McDermot Avenue, Winnipeg, Manitoba, R3E 0V9, Canada
- Department of Internal Medicine, University of Manitoba, 820 Sherbrook Street, Winnipeg, Manitoba, R3A 1R9, Canada
- Department of Community Health Sciences, University of Manitoba, 750 Bannatyne Avenue, Winnipeg, Manitoba, R3E 0W3, Canada
| | - Harminder Singh
- CancerCare Manitoba Research Institute, 675 McDermot Avenue, Winnipeg, Manitoba, R3E 0V9, Canada
- Department of Internal Medicine, University of Manitoba, 820 Sherbrook Street, Winnipeg, Manitoba, R3A 1R9, Canada
- Department of Community Health Sciences, University of Manitoba, 750 Bannatyne Avenue, Winnipeg, Manitoba, R3E 0W3, Canada
| | - Kathleen Decker
- CancerCare Manitoba Research Institute, 675 McDermot Avenue, Winnipeg, Manitoba, R3E 0V9, Canada.
- Department of Epidemiology and Cancer Registry, CancerCare Manitoba, 675 McDermot Avenue, Winnipeg, Manitoba, R3E 0V9, Canada.
- Department of Community Health Sciences, University of Manitoba, 750 Bannatyne Avenue, Winnipeg, Manitoba, R3E 0W3, Canada.
| |
|
34
|
Integrating Speculation Detection and Deep Learning to Extract Lung Cancer Diagnosis from Clinical Notes. APPLIED SCIENCES-BASEL 2021. [DOI: 10.3390/app11020865] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.8] [Indexed: 12/24/2022]
Abstract
Despite efforts to develop models for extracting medical concepts from clinical notes, challenges remain, in particular in relating concepts to dates. The high number of clinical notes written for each patient, the use of negation and speculation, and the variety of date formats cause ambiguity that has to be resolved to reconstruct the patient’s natural history. In this paper, we concentrate on extracting the cancer diagnosis from clinical narratives and relating it to the diagnosis date. To address this challenge, a hybrid approach that combines deep learning-based and rule-based methods is proposed. The approach integrates three steps: (i) lung cancer named entity recognition, (ii) negation and speculation detection, and (iii) relating the cancer diagnosis to a valid date. In particular, we apply the proposed approach to extract the lung cancer diagnosis and its diagnosis date from clinical narratives written in Spanish. The results show an F-score of 90% in the named entity recognition task and an F-score of 89% in the task of relating the cancer diagnosis to the diagnosis date. Our findings suggest that speculation detection, together with negation detection, is a key component for properly extracting cancer diagnoses from clinical notes.
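Step (ii) can be illustrated with a small cue-based rule. This is only a sketch of the general idea and not the paper's method, which combines deep learning with rules for Spanish text. The English cue lists, the six-token window, and the function names are all assumptions made for illustration.

```python
import re

# Hypothetical English cue lists; a real system would use a curated lexicon.
NEGATION_CUES = ("no evidence of", "negative for", "without", "denies")
SPECULATION_CUES = ("possible", "probable", "suspicious for", "cannot exclude")


def _has_cue(cues, context):
    # Whole-word match so that a cue never fires inside a longer word.
    return any(re.search(r"\b" + re.escape(c) + r"\b", context) for c in cues)


def classify_mention(sentence, entity, window=6):
    # Inspect only the few tokens that precede the entity mention.
    text = sentence.lower()
    pos = text.find(entity.lower())
    if pos == -1:
        return "not_mentioned"
    preceding = " ".join(text[:pos].split()[-window:])
    if _has_cue(NEGATION_CUES, preceding):
        return "negated"
    if _has_cue(SPECULATION_CUES, preceding):
        return "speculated"
    return "affirmed"
```

For example, `classify_mention("Findings suspicious for lung cancer.", "lung cancer")` returns `"speculated"`. Only affirmed mentions would then be passed on to the date-linking step.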
|
35
|
Kersloot MG, van Putten FJP, Abu-Hanna A, Cornet R, Arts DL. Natural language processing algorithms for mapping clinical text fragments onto ontology concepts: a systematic review and recommendations for future studies. J Biomed Semantics 2020; 11:14. [PMID: 33198814 PMCID: PMC7670625 DOI: 10.1186/s13326-020-00231-z] [Citation(s) in RCA: 29] [Impact Index Per Article: 5.8] [Received: 07/17/2020] [Accepted: 11/03/2020] [Indexed: 12/23/2022] Open
Abstract
BACKGROUND Free-text descriptions in electronic health records (EHRs) can be of interest for clinical research and care optimization. However, free text cannot be readily interpreted by a computer and, therefore, has limited value. Natural Language Processing (NLP) algorithms can make free text machine-interpretable by attaching ontology concepts to it. However, implementations of NLP algorithms are not evaluated consistently. Therefore, the objective of this study was to review the current methods used for developing and evaluating NLP algorithms that map clinical text fragments onto ontology concepts. To standardize the evaluation of algorithms and reduce heterogeneity between studies, we propose a list of recommendations. METHODS Two reviewers examined publications indexed by Scopus, IEEE, MEDLINE, EMBASE, the ACM Digital Library, and the ACL Anthology. Publications reporting on NLP for mapping clinical text from EHRs to ontology concepts were included. Year, country, setting, objective, evaluation and validation methods, NLP algorithms, terminology systems, dataset size and language, performance measures, reference standard, generalizability, operational use, and source code availability were extracted. The studies' objectives were categorized by way of induction. These results were used to define recommendations. RESULTS Two thousand three hundred fifty five unique studies were identified. Two hundred fifty six studies reported on the development of NLP algorithms for mapping free text to ontology concepts. Seventy-seven described development and evaluation. Twenty-two studies did not perform a validation on unseen data and 68 studies did not perform external validation. Of 23 studies that claimed that their algorithm was generalizable, 5 tested this by external validation. 
A list of sixteen recommendations regarding the usage of NLP systems and algorithms, usage of data, evaluation and validation, presentation of results, and generalizability of results was developed. CONCLUSION We found many heterogeneous approaches to the reporting on the development and evaluation of NLP algorithms that map clinical text to ontology concepts. Over one-fourth of the identified publications did not perform an evaluation. In addition, over one-fourth of the included studies did not perform a validation, and 88% did not perform external validation. We believe that our recommendations, alongside an existing reporting standard, will increase the reproducibility and reusability of future studies and NLP algorithms in medicine.
Affiliation(s)
- Martijn G. Kersloot
- Amsterdam UMC, University of Amsterdam, Department of Medical Informatics, Amsterdam Public Health Research Institute Castor EDC, Room J1B-109, PO Box 22700, 1100 DE Amsterdam, The Netherlands
- Castor EDC, Amsterdam, The Netherlands
| | - Florentien J. P. van Putten
- Amsterdam UMC, University of Amsterdam, Department of Medical Informatics, Amsterdam Public Health Research Institute Castor EDC, Room J1B-109, PO Box 22700, 1100 DE Amsterdam, The Netherlands
| | - Ameen Abu-Hanna
- Amsterdam UMC, University of Amsterdam, Department of Medical Informatics, Amsterdam Public Health Research Institute Castor EDC, Room J1B-109, PO Box 22700, 1100 DE Amsterdam, The Netherlands
| | - Ronald Cornet
- Amsterdam UMC, University of Amsterdam, Department of Medical Informatics, Amsterdam Public Health Research Institute Castor EDC, Room J1B-109, PO Box 22700, 1100 DE Amsterdam, The Netherlands
| | - Derk L. Arts
- Amsterdam UMC, University of Amsterdam, Department of Medical Informatics, Amsterdam Public Health Research Institute Castor EDC, Room J1B-109, PO Box 22700, 1100 DE Amsterdam, The Netherlands
- Castor EDC, Amsterdam, The Netherlands
| |
|
36
|
Jiang H, Mao H, Lu H, Lin P, Garry W, Lu H, Yang G, Rainer TH, Chen X. Machine learning-based models to support decision-making in emergency department triage for patients with suspected cardiovascular disease. Int J Med Inform 2020; 145:104326. [PMID: 33197878 DOI: 10.1016/j.ijmedinf.2020.104326] [Citation(s) in RCA: 30] [Impact Index Per Article: 6.0] [Received: 08/23/2020] [Revised: 10/16/2020] [Accepted: 10/30/2020] [Indexed: 12/23/2022]
Abstract
BACKGROUND Accurate differentiation and prioritization in emergency department (ED) triage is important to identify high-risk patients and to efficiently allocate finite resources. Using data available from patients with suspected cardiovascular disease presenting at ED triage, this study aimed to train and compare the performance of four common machine learning models to assist in decision making on triage levels. METHODS This cross-sectional study was conducted in the Second Affiliated Hospital of Guangzhou Medical University from August 2015 to December 2018 inclusive. Demographic information, vital signs, blood glucose, and other available triage scores were collected. Four machine learning models - multinomial logistic regression (multinomial LR), eXtreme gradient boosting (XGBoost), random forest (RF) and gradient-boosted decision tree (GBDT) - were compared. For each model, 80% of the data set was used for training and 20% was used for testing. The area under the receiver operating characteristic curve (AUC), accuracy and macro-F1 were calculated for each model. RESULTS In 17,661 patients presenting with suspected cardiovascular disease, the proportions of triage level 1, level 2, level 3 and level 4 were 1.3%, 18.6%, 76.5%, and 3.6%, respectively. The AUCs were: XGBoost (0.937), GBDT (0.921), RF (0.919) and multinomial LR (0.908). Based on feature importance generated by XGBoost, blood pressure, pulse rate, oxygen saturation, and age were the most significant variables for making decisions at triage. CONCLUSION All four machine learning models showed good discriminative ability for triage. XGBoost demonstrated a slight advantage over the other models. These models could be used for differential triage of low-risk patients and high-risk patients as a strategy to improve efficiency and allocation of finite resources.
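With triage levels as skewed as those reported (about three quarters of patients at level 3), accuracy alone can look good for a model that ignores the rare levels, which is presumably why macro-F1 is reported alongside it. The following sketch computes both from scratch; the ten labels are hypothetical and only loosely mimic the imbalance.

```python
def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    # Unweighted mean of per-class F1, so rare classes count as much as common ones.
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical triage levels; the "model" always predicts the majority level.
y_true = [3, 3, 3, 3, 3, 3, 2, 2, 1, 4]
y_pred = [3] * 10

acc = accuracy(y_true, y_pred)
f1 = macro_f1(y_true, y_pred)
```

Here the majority-class predictor reaches 60% accuracy, but its macro-F1 is far lower because three of the four levels are never predicted.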
Affiliation(s)
- Huilin Jiang
- Emergency Department, The Second Affiliated Hospital of Guangzhou Medical University, Guangzhou, China.
| | - Haifeng Mao
- Emergency Department, The Second Affiliated Hospital of Guangzhou Medical University, Guangzhou, China.
| | - Huimin Lu
- Emergency Department, The Second Affiliated Hospital of Guangzhou Medical University, Guangzhou, China.
| | - Peiyi Lin
- Emergency Department, The Second Affiliated Hospital of Guangzhou Medical University, Guangzhou, China.
| | - Wei Garry
- Goodwill Hessian Health Technology Co., Ltd, Beijing, China.
| | - Huijing Lu
- Emergency Department, The Second Affiliated Hospital of Guangzhou Medical University, Guangzhou, China.
| | - Guangqian Yang
- Emergency Department, The Second Affiliated Hospital of Guangzhou Medical University, Guangzhou, China.
| | - Timothy H Rainer
- Accident and Emergency Medicine Academic Unit, Chinese University of Hong Kong, Prince of Wales Hospital, Hong Kong, China.
| | - Xiaohui Chen
- Emergency Department, The Second Affiliated Hospital of Guangzhou Medical University, Guangzhou, China.
| |
|
37
|
Zeng Z, Amin A, Roy A, Pulliam NE, Karavites LC, Espino S, Helenowski I, Li X, Luo Y, Khan SA. Preoperative magnetic resonance imaging use and oncologic outcomes in premenopausal breast cancer patients. NPJ Breast Cancer 2020; 6:49. [PMID: 33083528 PMCID: PMC7532157 DOI: 10.1038/s41523-020-00192-7] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.0] [Received: 03/04/2020] [Accepted: 08/21/2020] [Indexed: 12/21/2022] Open
Abstract
Breast magnetic resonance imaging (MRI) delineates disease extent sensitively in newly diagnosed breast cancer patients, but improved cancer outcomes are uncertain. Young women, for whom mammography is less sensitive, are expected to benefit from MRI-based resection. We identified 512 women aged ≤50 years, undergoing breast-conserving treatment (BCT: tumor-free resection margins and radiotherapy) during 2006–2013 through Northwestern Medicine database queries; 64.5% received preoperative MRI and 35.5% did not. Tumor and treatment parameters were similar between groups. We estimated the adjusted hazard ratios (aHR) for local and distant recurrences (LR and DR), using multivariable regression models, accounting for important therapeutic and prognostic parameters. The LR rate was 7.9% with MRI vs. 8.2% without MRI, aHR = 1.03 (95% CI 0.53–1.99). The DR rate was 6.4% vs. 6.6%, aHR = 0.89 (95% CI 0.43–1.84). In 119 women aged ≤40, results were similar: LR aHR = 1.82 (95% CI 0.43–7.76) and DR aHR = 0.93 (95% CI 0.26–3.34). Sensitivity analyses showed similar results. The use of preoperative MRI in women aged ≤50 years should be reconsidered until there is proof of benefit.
Affiliation(s)
- Zexian Zeng
- Department of Preventive Medicine, Northwestern University Feinberg School of Medicine, Chicago, IL USA.,Department of Data Sciences, Dana-Farber Cancer Institute, Harvard T.H.Chan School of Public Health, Boston, MA USA
| | - Amanda Amin
- Department of Surgery, Kansas University Medical Center, Kansas City, KS USA
| | - Ankita Roy
- Department of Surgery, Northwestern University Feinberg School of Medicine, Chicago, IL USA
| | - Natalie E Pulliam
- Department of Surgery, Northwestern University Feinberg School of Medicine, Chicago, IL USA
| | - Lindsey C Karavites
- Department of Surgery, University of Illinois College of Medicine at Mt. Sinai Hospital, Chicago, IL USA
| | - Sasa Espino
- Department of Surgery, Kansas University Medical Center, Kansas City, KS USA
| | - Irene Helenowski
- Department of Preventive Medicine, Northwestern University Feinberg School of Medicine, Chicago, IL USA
| | | | - Yuan Luo
- Department of Preventive Medicine, Northwestern University Feinberg School of Medicine, Chicago, IL USA
| | - Seema A Khan
- Department of Surgery, Northwestern University Feinberg School of Medicine, Chicago, IL USA
| |
|
38
|
Banerjee I, Bozkurt S, Caswell-Jin JL, Kurian AW, Rubin DL. Natural Language Processing Approaches to Detect the Timeline of Metastatic Recurrence of Breast Cancer. JCO Clin Cancer Inform 2020; 3:1-12. [PMID: 31584836 DOI: 10.1200/cci.19.00034] [Citation(s) in RCA: 36] [Impact Index Per Article: 7.2] [Indexed: 12/30/2022] Open
Abstract
PURPOSE Electronic medical records (EMRs) and population-based cancer registries contain information on cancer outcomes and treatment, yet rarely capture information on the timing of metastatic cancer recurrence, which is essential to understand cancer survival outcomes. We developed a natural language processing (NLP) system to identify patient-specific timelines of metastatic breast cancer recurrence. PATIENTS AND METHODS We used the OncoSHARE database, which includes merged data from the California Cancer Registry and EMRs of 8,956 women diagnosed with breast cancer in 2000 to 2018. We curated a comprehensive vocabulary by interviewing expert clinicians and processing radiology and pathology reports and progress notes. We developed and evaluated the following two distinct NLP approaches to analyze free-text notes: a traditional rule-based model, using rules for metastatic detection from the literature and curated by domain experts; and a contemporary neural network model. For each 3-month period (quarter) from 2000 to 2018, we applied both models to infer recurrence status for that quarter. We trained the NLP models using 894 randomly selected patient records that were manually reviewed by clinical experts and evaluated model performance using 179 hold-out patients (20%) as a test set. RESULTS The median follow-up time was 19 quarters (5 years) for the training set and 15 quarters (4 years) for the test set. The neural network model predicted the timing of distant metastatic recurrence with a sensitivity of 0.83 and specificity of 0.73, outperforming the rule-based model, which had a specificity of 0.35 and sensitivity of 0.88 (P < .001). CONCLUSION We developed an NLP method that enables identification of the occurrence and timing of metastatic breast cancer recurrence from EMRs. This approach may be adaptable to other cancer sites and could help to unlock the potential of EMRs for research on real-world cancer outcomes.
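The quarter-by-quarter framing described above can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the authors' system: note-level predictions are assumed to be available as (date, flag) pairs, a quarter is marked positive if any note in it is flagged, and all function names are hypothetical.

```python
from datetime import date


def quarter_index(d, start_year=2000):
    # Q1 of start_year is index 0, Q2 is 1, and so on.
    return (d.year - start_year) * 4 + (d.month - 1) // 3


def quarterly_timeline(note_predictions, start_year=2000, end_year=2018):
    # note_predictions: iterable of (note_date, is_recurrence) pairs.
    n_quarters = (end_year - start_year + 1) * 4
    timeline = [False] * n_quarters
    for note_date, is_recurrence in note_predictions:
        q = quarter_index(note_date, start_year)
        if is_recurrence and 0 <= q < n_quarters:
            timeline[q] = True  # assumption: any flagged note marks the quarter
    return timeline


def first_recurrence_quarter(timeline):
    # Index of the earliest positive quarter, or None if there is none.
    for q, flagged in enumerate(timeline):
        if flagged:
            return q
    return None


# Hypothetical note-level outputs for one patient.
notes = [
    (date(2005, 2, 10), False),
    (date(2007, 8, 1), True),
    (date(2007, 11, 15), True),
]
timeline = quarterly_timeline(notes)
onset = first_recurrence_quarter(timeline)
```

Sensitivity and specificity can then be computed per quarter by comparing each predicted timeline against the manually reviewed one.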
Affiliation(s)
- Imon Banerjee
- Stanford University School of Medicine, Stanford, CA
| | - Selen Bozkurt
- Stanford University School of Medicine, Stanford, CA
| | | | | | | |
|
39
|
Wang J, Deng H, Liu B, Hu A, Liang J, Fan L, Zheng X, Wang T, Lei J. Systematic Evaluation of Research Progress on Natural Language Processing in Medicine Over the Past 20 Years: Bibliometric Study on PubMed. J Med Internet Res 2020; 22:e16816. [PMID: 32012074 PMCID: PMC7005695 DOI: 10.2196/16816] [Citation(s) in RCA: 52] [Impact Index Per Article: 10.4] [Received: 10/28/2019] [Revised: 12/05/2019] [Accepted: 12/15/2019] [Indexed: 12/15/2022] Open
Abstract
BACKGROUND Natural language processing (NLP) is an important traditional field in computer science, but its application in medical research has faced many challenges. With the extensive digitalization of medical information globally and increasing importance of understanding and mining big data in the medical field, NLP is becoming more crucial. OBJECTIVE The goal of the research was to perform a systematic review on the use of NLP in medical research with the aim of understanding the global progress on NLP research outcomes, content, methods, and study groups involved. METHODS A systematic review was conducted using the PubMed database as a search platform. All published studies on the application of NLP in medicine (except biomedicine) during the 20 years between 1999 and 2018 were retrieved. The data obtained from these published studies were cleaned and structured. Excel (Microsoft Corp) and VOSviewer (Nees Jan van Eck and Ludo Waltman) were used to perform bibliometric analysis of publication trends, author orders, countries, institutions, collaboration relationships, research hot spots, diseases studied, and research methods. RESULTS A total of 3498 articles were obtained during initial screening, and 2336 articles were found to meet the study criteria after manual screening. The number of publications increased every year, with a significant growth after 2012 (number of publications ranged from 148 to a maximum of 302 annually). The United States has occupied the leading position since the inception of the field, with the largest number of articles published. The United States contributed to 63.01% (1472/2336) of all publications, followed by France (5.44%, 127/2336) and the United Kingdom (3.51%, 82/2336). The author with the largest number of articles published was Hongfang Liu (70), while Stéphane Meystre (17) and Hua Xu (33) published the largest number of articles as the first and corresponding authors. 
Among first-author affiliations, Columbia University published the largest number of articles, accounting for 4.54% (106/2336) of the total. Approximately one-fifth (17.68%, 413/2336) of the articles involved research on specific diseases, and the subject areas primarily focused on mental illness (16.46%, 68/413), breast cancer (5.81%, 24/413), and pneumonia (4.12%, 17/413). CONCLUSIONS NLP is in a period of robust development in the medical field, with an average of approximately 100 publications annually. Electronic medical records were the most used research materials, but social media such as Twitter have become important research materials since 2015. Cancer (24.94%, 103/413) was the most common subject area in NLP-assisted medical research on diseases, with breast cancers (23.30%, 24/103) and lung cancers (14.56%, 15/103) accounting for the highest proportions of studies. Columbia University and the researchers trained there were the most active and prolific research forces on NLP in the medical field.
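The percentages quoted in this abstract can be recomputed from the fractions given alongside them. The short check below does so; the labels are paraphrases and the helper name `pct` is an assumption.

```python
def pct(part, whole):
    # Percentage rounded to two decimals, matching the abstract's precision.
    return round(100 * part / whole, 2)


# (label, numerator, denominator, percentage as reported in the abstract)
reported = [
    ("United States share of publications", 1472, 2336, 63.01),
    ("France share of publications", 127, 2336, 5.44),
    ("United Kingdom share of publications", 82, 2336, 3.51),
    ("Columbia University first-author share", 106, 2336, 4.54),
    ("Articles on specific diseases", 413, 2336, 17.68),
    ("Mental illness among disease articles", 68, 413, 16.46),
    ("Breast cancer among disease articles", 24, 413, 5.81),
    ("Pneumonia among disease articles", 17, 413, 4.12),
    ("Cancer among disease articles", 103, 413, 24.94),
    ("Breast cancer among cancer articles", 24, 103, 23.30),
    ("Lung cancer among cancer articles", 15, 103, 14.56),
]

mismatches = [
    (label, pct(num, den), value)
    for label, num, den, value in reported
    if pct(num, den) != value
]
```

All eleven reported percentages agree with their fractions. For comparison, 2336 articles over the 20-year window is about 117 per year, which the abstract rounds to "approximately 100".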
Affiliation(s)
- Jing Wang
- School of Medical Informatics and Engineering, Southwest Medical University, Luzhou, China
| | - Huan Deng
- School of Medical Informatics and Engineering, Southwest Medical University, Luzhou, China
| | - Bangtao Liu
- School of Medical Informatics and Engineering, Southwest Medical University, Luzhou, China
| | - Anbin Hu
- School of Medical Informatics and Engineering, Southwest Medical University, Luzhou, China
| | - Jun Liang
- IT Center, Second Affiliated Hospital, School of Medicine, Zhejiang University, Hangzhou, China
| | - Lingye Fan
- Affiliated Hospital, Southwest Medical University, Luzhou, China
| | - Xu Zheng
- Center for Medical Informatics, Peking University, Beijing, China
| | - Tong Wang
- School of Public Health, Jilin University, Jilin, China
| | - Jianbo Lei
- School of Medical Informatics and Engineering, Southwest Medical University, Luzhou, China.,Center for Medical Informatics, Peking University, Beijing, China.,Institute of Medical Technology, Health Science Center, Peking University, Beijing, China
| |
|
40
|
Hughes KS, Zhou J, Bao Y, Singh P, Wang J, Yin K. Natural language processing to facilitate breast cancer research and management. Breast J 2019; 26:92-99. [PMID: 31854067 DOI: 10.1111/tbj.13718] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.3] [Received: 10/01/2019] [Accepted: 10/02/2019] [Indexed: 12/23/2022]
Abstract
The medical literature has been growing exponentially, and its size has become a barrier for physicians to locate and extract clinically useful information. Natural language processing (NLP), especially machine learning (ML)-based NLP, is a technology that potentially provides a promising solution. ML-based NLP is based on training a computational algorithm with a large number of annotated examples to allow the computer to "learn" and "predict" the meaning of human language. Although NLP has been widely applied in industry and business, most physicians are still not aware of the huge potential of this technology in medicine, and the implementation of NLP in breast cancer research and management is fairly limited. Using a successful real-world project on identifying penetrance papers for breast and other cancer susceptibility genes, this review illustrates how to train and evaluate an NLP-based medical abstract classifier, incorporate it into a semiautomatic meta-analysis procedure, and validate the effectiveness of this procedure. Other implementations of NLP technology in breast cancer research, such as parsing pathology reports and mining electronic healthcare records, are also discussed. We hope this review will help breast cancer physicians and researchers to recognize, understand, and apply this technology to meet their own clinical or research needs.
Affiliation(s)
- Kevin S Hughes
- Division of Surgical Oncology, Massachusetts General Hospital and Harvard Medical School, Boston, MA
| | - Jingan Zhou
- Division of Surgical Oncology, Massachusetts General Hospital and Harvard Medical School, Boston, MA.,Department of General Surgery, Beijing Anzhen Hospital, Capital Medical University, Beijing, China
| | - Yujia Bao
- Computer Science & Artificial Intelligence, Massachusetts Institute of Technology, Boston, MA
| | - Preeti Singh
- Division of Surgical Oncology, Massachusetts General Hospital and Harvard Medical School, Boston, MA
| | - Jin Wang
- Division of Surgical Oncology, Massachusetts General Hospital and Harvard Medical School, Boston, MA.,Department of Breast Oncology, Sun Yat-sen University Cancer Center, State Key Laboratory of Oncology in South China, Collaborative Innovation Center of Cancer Medicine, Guangzhou, China
| | - Kanhua Yin
- Division of Surgical Oncology, Massachusetts General Hospital and Harvard Medical School, Boston, MA
| |
|
41
|
Petch J, Batt J, Murray J, Mamdani M. Extracting Clinical Features From Dictated Ambulatory Consult Notes Using a Commercially Available Natural Language Processing Tool: Pilot, Retrospective, Cross-Sectional Validation Study. JMIR Med Inform 2019; 7:e12575. [PMID: 31682579 PMCID: PMC6913750 DOI: 10.2196/12575] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.8] [Received: 10/22/2018] [Revised: 05/12/2019] [Accepted: 08/29/2019] [Indexed: 11/13/2022] Open
Abstract
Background The increasing adoption of electronic health records (EHRs) in clinical practice holds the promise of improving care and advancing research by serving as a rich source of data, but most EHRs allow clinicians to enter data in a text format without much structure. Natural language processing (NLP) may reduce reliance on manual abstraction of these text data by extracting clinical features directly from unstructured clinical digital text data and converting them into structured data. Objective This study aimed to assess the performance of a commercially available NLP tool for extracting clinical features from free-text consult notes. Methods We conducted a pilot, retrospective, cross-sectional study of the accuracy of NLP from dictated consult notes from our tuberculosis clinic with manual chart abstraction as the reference standard. Consult notes for 130 patients were extracted and processed using NLP. We extracted 15 clinical features from these consult notes and grouped them a priori into categories of simple, moderate, and complex for analysis. Results For the primary outcome of overall accuracy, NLP performed best for features classified as simple, achieving an overall accuracy of 96% (95% CI 94.3-97.6). Performance was slightly lower for features of moderate clinical and linguistic complexity at 93% (95% CI 91.1-94.4), and lowest for complex features at 91% (95% CI 87.3-93.1). Conclusions The findings of this study support the use of NLP for extracting clinical features from dictated consult notes in the setting of a tuberculosis clinic. Further research is needed to fully establish the validity of NLP for this and other purposes.
Affiliation(s)
- Jeremy Petch
- Institute of Health Policy, Management and Evaluation, Dalla Lana School of Public Health, University of Toronto, Toronto, ON, Canada; Centre for Data Science and Digital Health, Hamilton Health Sciences, Hamilton, ON, Canada
- Jane Batt
- Division of Respirology, Department of Medicine, University of Toronto, Toronto, ON, Canada; Keenan Research Centre for Biomedical Science, St. Michael's Hospital, Toronto, ON, Canada; Department of Medicine, St. Michael's Hospital, Toronto, ON, Canada
- Joshua Murray
- Li Ka Shing Centre for Healthcare Analytics Research and Training, St. Michael's Hospital, Toronto, ON, Canada; Department of Statistical Sciences, Faculty of Arts and Sciences, University of Toronto, Toronto, ON, Canada
- Muhammad Mamdani
- Institute of Health Policy, Management and Evaluation, Dalla Lana School of Public Health, University of Toronto, Toronto, ON, Canada; Li Ka Shing Centre for Healthcare Analytics Research and Training, St. Michael's Hospital, Toronto, ON, Canada; Leslie Dan Faculty of Pharmacy, University of Toronto, Toronto, ON, Canada; Department of Medicine, Faculty of Medicine, University of Toronto, Toronto, ON, Canada
|
42
|
Kim E, Rubinstein SM, Nead KT, Wojcieszynski AP, Gabriel PE, Warner JL. The Evolving Use of Electronic Health Records (EHR) for Research. Semin Radiat Oncol 2019; 29:354-361. [DOI: 10.1016/j.semradonc.2019.05.010] [Citation(s) in RCA: 29] [Impact Index Per Article: 4.8] [Indexed: 02/08/2023]
|
43
|
Zeng Z, Zhao Y, Sun M, Vo AH, Starren J, Luo Y. Rich Text Formatted EHR Narratives: A Hidden and Ignored Trove. Stud Health Technol Inform 2019; 264:472-476. [PMID: 31437968 PMCID: PMC8060951 DOI: 10.3233/shti190266] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Indexed: 02/05/2023]
Abstract
This study presents an approach for mining structured information from clinical narratives in Electronic Health Records (EHRs) by using Rich Text Formatted (RTF) records. RTF is adopted by many medical information management systems. These files contain rich structural information that can be extracted and interpreted, yet such information is largely ignored. We investigate multiple types of EHR narratives in the Enterprise Data Warehouse of a large multisite healthcare chain consisting of both an academic medical center and community hospitals. We focus on the RTF constructs related to tables and sections that are not available in plain text EHR narratives. We show how to parse these RTF constructs and analyze their prevalence and characteristics in the context of multiple types of EHR narratives. Our case study demonstrates the additional utility of the features derived from RTF constructs over plain-text-oriented NLP.
Affiliation(s)
- Zexian Zeng
- Department of Preventive Medicine, Northwestern University Feinberg School of Medicine, Chicago, IL, USA
- Yuan Zhao
- Department of Preventive Medicine, Northwestern University Feinberg School of Medicine, Chicago, IL, USA
- Mengxin Sun
- Hospital Medicine, Northwestern Memorial Hospital, Chicago, IL, USA
- Andy H Vo
- Committee on Developmental Biology and Regenerative Medicine, The University of Chicago, Chicago, IL, USA
- Justin Starren
- Department of Preventive Medicine, Northwestern University Feinberg School of Medicine, Chicago, IL, USA
- Yuan Luo
- Department of Preventive Medicine, Northwestern University Feinberg School of Medicine, Chicago, IL, USA
|
44
|
Jiang X, Wells A, Brufsky A, Neapolitan R. A clinical decision support system learned from data to personalize treatment recommendations towards preventing breast cancer metastasis. PLoS One 2019; 14:e0213292. [PMID: 30849111 PMCID: PMC6407919 DOI: 10.1371/journal.pone.0213292] [Citation(s) in RCA: 23] [Impact Index Per Article: 3.8] [Received: 10/16/2018] [Accepted: 02/18/2019] [Indexed: 12/18/2022]
Abstract
OBJECTIVE: A Clinical Decision Support System (CDSS) that can amass Electronic Health Record (EHR) and other patient data holds promise to provide accurate classification and guide treatment choices. Our objective is to develop the Decision Support System for Making Personalized Assessments and Recommendations Concerning Breast Cancer Patients (DPAC), which is a CDSS learned from data that recommends the optimal treatment decisions based on a patient's features. METHOD: We developed a Bayesian network architecture called Causal Modeling with Internal Layers (CAMIL), and an algorithm called Treatment Feature Interactions (TFI), which learns from data the interactions needed in a CAMIL model. Using the TFI algorithm, we learned interactions for six treatments from the LSDS-5YDM dataset. We created a CAMIL model using these interactions, resulting in a DPAC which recommends treatments towards preventing 5-year breast cancer metastasis. RESULTS: In a 5-fold cross-validation analysis, we compared the probability of being metastasis free in 5 years for patients who made decisions recommended by DPAC to those who did not. These probabilities are (the probability for those making the decisions appears first): chemotherapy (.938, .872); breast/chest wall radiation (.939, .902); nodal field radiation (.940, .784); antihormone (.941, .906); HER2 inhibitors (.934, .880); neoadjuvant therapy (.931, .837). In an application of DPAC to the independent METABRIC dataset, the probabilities for chemotherapy were (.845, .788). DISCUSSION: Patients who took the advice of DPAC had, as a group, notably better outcomes than those who did not. We conclude that DPAC is effective at amassing and analyzing data towards treatment recommendations. Some of the findings in DPAC are controversial. For example, DPAC says that chemotherapy increases the chances of metastasis for many node negative patients. This controversy shows the importance of developing a conclusive version of DPAC to ensure we provide patients with the best patient-specific treatment recommendations.
Affiliation(s)
- Xia Jiang
- Department of Biomedical Informatics, University of Pittsburgh, Pittsburgh, Pennsylvania, United States of America
- Alan Wells
- Department of Pathology, University of Pittsburgh and Pittsburgh VA Health System, Pittsburgh, Pennsylvania, United States of America; UPMC Hillman Cancer Center, University of Pittsburgh School of Medicine, Pittsburgh, Pennsylvania, United States of America
- Adam Brufsky
- UPMC Hillman Cancer Center, University of Pittsburgh School of Medicine, Pittsburgh, Pennsylvania, United States of America; Division of Hematology/Oncology, University of Pittsburgh School of Medicine, Pittsburgh, Pennsylvania, United States of America
- Richard Neapolitan
- Department of Preventive Medicine, Northwestern University Feinberg School of Medicine, Chicago, Illinois, United States of America
|
45
|
Liu X, Xie L, Wu Z, Wang K, Zhao Z, Ruan J, Zhi D. The International Conference on Intelligent Biology and Medicine (ICIBM) 2018: bioinformatics towards translational applications. BMC Bioinformatics 2018; 19:492. [PMID: 30591012 PMCID: PMC6309051 DOI: 10.1186/s12859-018-2460-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/21/2022]
Abstract
The 2018 International Conference on Intelligent Biology and Medicine (ICIBM 2018) was held on June 10–12, 2018, in Los Angeles, California, USA. The conference consisted of a total of eleven scientific sessions, four tutorials, one poster session, four keynote talks and four eminent scholar talks, which covered a wide range of aspects of bioinformatics, medical informatics, systems biology and intelligent computing. Here, we summarize nine research articles selected for publishing in BMC Bioinformatics.
Affiliation(s)
- Xiaoming Liu
- Human Genetics Center, School of Public Health, The University of Texas Health Science Center at Houston, Houston, TX, 77030, USA; Present address: College of Public Health, University of South Florida, Tampa, FL, 33612, USA
- Lei Xie
- Department of Computer Science, Hunter College & The Graduate Center, The City University of New York, New York, NY, 10065, USA
- Zhijin Wu
- Department of Biostatistics, Brown University, Providence, RI, 02912, USA
- Kai Wang
- Raymond G. Perelman Center for Cellular and Molecular Therapeutics, Children's Hospital of Philadelphia, Philadelphia, PA, 19104, USA; Department of Pathology and Laboratory Medicine, University of Pennsylvania, Philadelphia, PA, 19104, USA
- Zhongming Zhao
- Center for Precision Health, School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX, 77030, USA
- Jianhua Ruan
- Department of Computer Science, The University of Texas at San Antonio, San Antonio, TX, 78249, USA
- Degui Zhi
- Center for Precision Health, School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX, 77030, USA
|