1
Itzhak N, Jaroszewicz S, Moskovitch R. Event prediction by estimating continuously the completion of a single temporal pattern's instances. J Biomed Inform 2024:104665. [PMID: 38852777 DOI: 10.1016/j.jbi.2024.104665] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/08/2023] [Revised: 05/10/2024] [Accepted: 06/03/2024] [Indexed: 06/11/2024]
Abstract
OBJECTIVE Develop a new method for continuous prediction that utilizes a single temporal pattern ending with an event of interest and its multiple instances detected in the temporal data. METHODS Use temporal abstraction to transform time series, instantaneous events, and time intervals into a uniform representation using symbolic time intervals (STIs). Introduce a new approach to event prediction using a single time intervals-related pattern (TIRP), which can learn models to predict whether and when an event of interest will occur, based on multiple instances of a pattern that end with the event. RESULTS The proposed methods achieved an average improvement of 5% in AUROC over LSTM-FCN, the best-performing of the evaluated baseline models (RawXGB, ResNet, LSTM-FCN, and ROCKET) applied to real-life datasets. CONCLUSION The proposed methods for predicting events continuously have the potential to be used in a wide range of real-world and real-time applications in diverse domains with heterogeneous multivariate temporal data. For example, they could be used to predict panic attacks early using wearable devices or to predict complications early in intensive care unit patients.
Affiliation(s)
- Nevo Itzhak
  - Software and Information Systems Engineering, Ben-Gurion University of the Negev, Beer Sheva, Israel.
- Szymon Jaroszewicz
  - Institute of Computer Science, Polish Academy of Sciences, Warsaw, Poland; Faculty of Mathematics and Information Science, Warsaw University of Technology, Warsaw, Poland.
- Robert Moskovitch
  - Software and Information Systems Engineering, Ben-Gurion University of the Negev, Beer Sheva, Israel.
2
Bennis FC, Aussems C, Korevaar JC, Hoogendoorn M. The added value of temporal data and the best way to handle it: A use-case for atrial fibrillation using general practitioner data. Comput Biol Med 2024; 171:108097. [PMID: 38412689 DOI: 10.1016/j.compbiomed.2024.108097] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/25/2023] [Revised: 01/29/2024] [Accepted: 02/01/2024] [Indexed: 02/29/2024]
Abstract
INTRODUCTION Temporal data poses numerous challenges for deep learning, such as irregularity of sampling. New algorithms are being developed that can handle these temporal challenges better. However, it is unclear how performance varies between classical non-temporal models and newly developed algorithms. Therefore, this study compares different non-temporal and temporal algorithms for a relevant use case, the prediction of atrial fibrillation (AF) using general practitioner (GP) data. METHODS Three datasets with a 365-day observation window and prediction windows of 14, 180 and 360 days were used. Data consisted of medication, lab, symptom, and chronic disease codings registered by the GP. The benchmark discarded temporality and used logistic regression, XGBoost models and neural networks on the presence of codings over the whole year. For the pattern data, common patterns of GP codings were extracted and tested using the same algorithms. LSTM and CKConv models were trained as models incorporating temporality. RESULTS Algorithms that incorporated temporality (LSTM and CKConv; max AUC 0.734 at the 360-day prediction window) outperformed both benchmark and pattern algorithms (max AUC 0.723), with a significant improvement using the 360-day prediction window (p = 0.04). The difference between the benchmark and the LSTM or CKConv algorithm decreased with smaller prediction windows, indicating temporal importance for longer prediction windows. The CKConv and LSTM algorithms performed similarly, possibly due to limited sequence length. CONCLUSION Temporal models outperformed non-temporal models for the prediction of AF. For temporal models, CKConv is a promising algorithm for handling temporal GP data, as it can handle irregular data.
Affiliation(s)
- Frank C Bennis
  - Quantitative Data Analytics Group, Department of Computer Science, VU Amsterdam, Amsterdam, the Netherlands; Nivel, Netherlands Institute for Health Services Research, Utrecht, the Netherlands.
- Claire Aussems
  - Nivel, Netherlands Institute for Health Services Research, Utrecht, the Netherlands.
- Joke C Korevaar
  - Nivel, Netherlands Institute for Health Services Research, Utrecht, the Netherlands.
- Mark Hoogendoorn
  - Quantitative Data Analytics Group, Department of Computer Science, VU Amsterdam, Amsterdam, the Netherlands.
3
Yin Y, Chou CA. Multi-event survival analysis through dynamic multi-modal learning for ICU mortality prediction. COMPUTER METHODS AND PROGRAMS IN BIOMEDICINE 2023; 235:107545. [PMID: 37062155 DOI: 10.1016/j.cmpb.2023.107545] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/15/2022] [Revised: 01/03/2023] [Accepted: 04/08/2023] [Indexed: 05/08/2023]
Abstract
BACKGROUND AND OBJECTIVE Survival analysis is widely applied in the healthcare domain for assessing the expected duration of patient status until an event such as mortality occurs, which is generally treated as a time-to-event problem. Patients with multiple complications have high mortality risks and often require specific intensive care and clinical treatments. The progression of complications varies over time with disease development, and the intrinsic interactions between complications with respect to mortality are uncertain. Classical methods for mortality prediction and survival analysis in critical care, such as risk scoring systems and cause-specific survival models, were not designed for this multi-event survival analysis problem and able to measure the competing risks of death for mutually exclusive events. In addition, multivariate temporal information on complications is not taken into consideration when estimating differentiated mortality risks in the early stage. METHODS In this paper, we propose a novel multi-event survival analysis solution using a tree-based autoregressive survival model of multi-modal electronic health record data. Specifically, we focus on modeling the temporal trajectory of complications and estimating the mortality risk associated with multiple potential complications simultaneously. In dynamic modeling, no assumptions are made about the relationships between time-dependent variables and risk transition over time. RESULTS Validated with the eICU database, our model achieves better prediction performance, with a C-index ranging from 74% to 80%, than state-of-the-art machine learning methods in the literature for the complications of acute respiratory distress syndrome and cardiovascular disease. CONCLUSIONS Our model provides distinguishable mortality risk curves over time for specific complications and tracks risk development, which could potentially support ICU resource reallocation.
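To make the C-index reported above concrete, here is a minimal Python sketch of Harrell's concordance index for right-censored data. It is not the authors' implementation: the function name `concordance_index`, the convention that a higher risk score means an earlier expected event, and the choice to skip pairs with tied times are all assumptions made for illustration.

```python
from itertools import combinations


def concordance_index(times, events, risks):
    """Harrell's C: share of comparable pairs that the risk scores order correctly.

    times  -- observed time per subject (event time or censoring time)
    events -- 1 if the event was observed, 0 if the subject was censored
    risks  -- predicted risk score (assumed: higher means earlier event)
    """
    concordant = 0.0
    comparable = 0
    for i, j in combinations(range(len(times)), 2):
        if times[i] == times[j]:
            continue  # simplification: pairs with tied times are skipped
        # Let `a` be the subject with the shorter observed time.
        a, b = (i, j) if times[i] < times[j] else (j, i)
        if not events[a]:
            continue  # shorter time is censored, so the pair is not comparable
        comparable += 1
        if risks[a] > risks[b]:
            concordant += 1.0
        elif risks[a] == risks[b]:
            concordant += 0.5  # tied predictions count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable
```

A value of 0.5 corresponds to random ordering and 1.0 to perfect ordering, so the reported range of 74% to 80% sits between the two.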
Affiliation(s)
- Yilin Yin
  - Mechanical and Industrial Engineering, Northeastern University, 360 Huntington Ave, Boston, MA 02115, USA.
- Chun-An Chou
  - Mechanical and Industrial Engineering, Northeastern University, 360 Huntington Ave, Boston, MA 02115, USA.
4
Prediction of acute hypertensive episodes in critically ill patients. Artif Intell Med 2023; 139:102525. [PMID: 37100504 DOI: 10.1016/j.artmed.2023.102525] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/22/2022] [Revised: 01/19/2023] [Accepted: 03/06/2023] [Indexed: 03/16/2023]
Abstract
Prevention and treatment of complications are the backbone of medical care, particularly in critical care settings. Early detection and prompt intervention can potentially prevent complications from occurring and improve outcomes. In this study, we use four longitudinal vital-sign variables of intensive care unit patients, focusing on predicting acute hypertensive episodes (AHEs). These episodes represent elevations in blood pressure and may result in clinical damage or indicate a change in a patient's clinical situation, such as an elevation in intracranial pressure or kidney failure. Prediction of AHEs may allow clinicians to anticipate changes in the patient's condition and respond early to prevent them from occurring. Temporal abstraction was employed to transform the multivariate temporal data into a uniform representation of symbolic time intervals, from which frequent time-intervals-related patterns (TIRPs) are mined and used as features for AHE prediction. A novel TIRP metric for classification, called coverage, is introduced that measures the coverage of a TIRP's instances in a time window. For comparison, several baseline models, including logistic regression and sequential deep learning models, were applied to the raw time series data. Our results show that using frequent TIRPs as features outperforms the baseline models, and that the coverage metric outperforms other TIRP metrics. Two approaches to predicting AHEs in real-life application conditions are evaluated. Using a sliding window to continuously predict whether a patient would experience an AHE within a specific prediction time period ahead, our models produced an AUC-ROC of 82%, but with low AUPRC. Alternatively, predicting whether an AHE would generally occur during the entire admission resulted in an AUC-ROC of 74%.
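The abstract does not give a formula for the coverage metric, so the following Python sketch rests on an assumed reading: coverage is the fraction of the time window overlapped by at least one instance of the TIRP. The function name `coverage` and the `(start, end)` tuple representation of an instance are hypothetical and not taken from the paper.

```python
def coverage(instances, window_start, window_end):
    """Fraction of [window_start, window_end] covered by the union of instances.

    instances -- iterable of (start, end) pairs, one per TIRP instance
                 (assumed representation, for illustration only)
    """
    if window_end <= window_start:
        raise ValueError("window must have positive length")
    # Clip each instance to the window and drop those entirely outside it.
    clipped = sorted(
        (max(start, window_start), min(end, window_end))
        for start, end in instances
        if end > window_start and start < window_end
    )
    covered = 0.0
    current_start = current_end = None
    for start, end in clipped:
        if current_end is None or start > current_end:
            # Disjoint from the running block: close it and open a new one.
            if current_end is not None:
                covered += current_end - current_start
            current_start, current_end = start, end
        else:
            # Overlaps the running block: extend it so time is not counted twice.
            current_end = max(current_end, end)
    if current_end is not None:
        covered += current_end - current_start
    return covered / (window_end - window_start)
```

Merging overlapping instances before summing keeps the value within 0 and 1, which a plain sum of instance durations would not guarantee.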
5
Qiu P, Gong Y, Zhao Y, Cao L, Zhang C, Dong X. An Efficient Method for Modeling Nonoccurring Behaviors by Negative Sequential Patterns With Loose Constraints. IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS 2023; 34:1864-1878. [PMID: 33729957 DOI: 10.1109/tnnls.2021.3063162] [Citation(s) in RCA: 3] [Impact Index Per Article: 3.0] [Indexed: 06/12/2023]
Abstract
The sequence analysis handles sequential discrete events and behaviors, which can be represented by temporal point processes (TPPs). However, TPP models only occurring events and behaviors. This article explores an efficient method for the negative sequential pattern (NSP) mining to leverage TPP in modeling both frequently occurring and nonoccurring events and behaviors. NSP mining is good at the challenging modeling of nonoccurrences of events and behaviors and their combinations with occurring events, with existing methods built on incorporating various constraints into NSP representations, e.g., simplifying NSP formulations and reducing computational costs. Such constraints restrict the flexibility of NSPs, and nonoccurring behaviors (NOBs) cannot be comprehensively exposed. This article addresses this issue by loosening some inflexible constraints in NSP mining and solves a series of consequent challenges. First, we provide a new definition of negative containment with the set theory according to the loose constraints. Second, an efficient method quickly calculates the supports of negative sequences. Our method only uses the information about the corresponding positive sequential patterns (PSPs) and avoids additional database scans. Finally, a novel and efficient algorithm, NegI-NSP, is proposed to efficiently identify highly valuable NSPs. Theoretical analyses, comparisons, and experiments on four synthetic and two real-life data sets clearly show that NegI-NSP can efficiently discover more useful NOBs.
6
Novitski P, Cohen CM, Karasik A, Hodik G, Moskovitch R. Temporal patterns selection for All-Cause Mortality prediction in T2D with ANNs. J Biomed Inform 2022; 134:104198. [PMID: 36100163 DOI: 10.1016/j.jbi.2022.104198] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 05/17/2022] [Revised: 08/10/2022] [Accepted: 09/03/2022] [Indexed: 01/02/2023]
Abstract
Mortality prevention in the elderly T2D population with Chronic Kidney Disease (CKD) may be possible through risk assessment and predictive modeling. In this study we investigate the ability to predict mortality using heterogeneous Electronic Health Records data. Temporal abstraction is employed to transform the heterogeneous multivariate temporal data into a uniform representation of symbolic time intervals, from which frequent Time Intervals Related Patterns (TIRPs) are then discovered. In this study a novel representation of the TIRPs is introduced, which enables incorporating them in Deep Learning Networks. We describe here the use of iTirps and bTirps, in which the TIRPs are represented over time by an integer vector and a binary vector, respectively. While bTirp represents whether a TIRP's instance was present, iTirp represents whether multiple instances were present. While the framework showed encouraging results, a major challenge is often the large number of TIRPs, which may cause the models to under-perform. We introduce a novel TIRP selection method, called TIRP Ranking Criteria (TRC), which is based on the TIRP's metrics, such as the differences in its recurrences, its frequencies, and the average duration difference between the classes. Additionally, we introduce an advanced version, called TRC Redundant TIRP Removal (TRC-RTR), in which TIRPs that highly correlate are candidates for removal. Then the selected subset of iTirps/bTirps is fed into a Deep Learning architecture such as a Recurrent Neural Network or a Convolutional Neural Network. Furthermore, a predictive committee is utilized in which raw data and iTirp data are both used as input. Our results show that iTirps-based models that use a subset of iTirps based on the TRC-RTR method outperform models that use raw data or models that use the full set of discovered iTirps.
Affiliation(s)
- Pavel Novitski
  - Software and Information Systems Engineering, Ben Gurion University, Beer-Sheva, Israel.
- Cheli Melzer Cohen
  - Maccabi Data Science Institute, Maccabi Healthcare Services, Tel-Aviv, Israel.
- Avraham Karasik
  - Maccabi Data Science Institute, Maccabi Healthcare Services, Tel-Aviv, Israel.
- Gabriel Hodik
  - Maccabi Data Science Institute, Maccabi Healthcare Services, Tel-Aviv, Israel.
- Robert Moskovitch
  - Software and Information Systems Engineering, Ben Gurion University, Beer-Sheva, Israel; Population Health and Science, Icahn Medical School at Mount Sinai, NYC, USA.
7
Bennis FC, Hoogendoorn M, Aussems C, Korevaar JC. Prediction of heart failure 1 year before diagnosis in general practitioner patients using machine learning algorithms: a retrospective case-control study. BMJ Open 2022; 12:e060458. [PMID: 36041765 PMCID: PMC9438066 DOI: 10.1136/bmjopen-2021-060458] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/29/2022]
Abstract
OBJECTIVES Heart failure (HF) is a commonly occurring health problem with high mortality and morbidity. If potential cases could be detected earlier, it may be possible to intervene earlier, which may slow progression in some patients. Preferably, already measured data, such as general practitioner (GP) data, are reused for screening all persons in an age group. Furthermore, it is essential to evaluate the number of people needed to screen to find one patient using true incidence rates, as this indicates the generalisability in the true population. Therefore, we aim to create a machine learning model for the prediction of HF using GP data and evaluate the number needed to screen with true incidence rates. DESIGN, SETTINGS AND PARTICIPANTS GP data from 8543 patients (-2 to -1 year before diagnosis) and controls aged 70+ years were obtained retrospectively from 01 January 2012 to 31 December 2019 from the Nivel Primary Care Database. Codes about chronic illness, complaints, diagnostics and medication were obtained. Data were split into train and test sets. Datasets describing demographics, the presence of codes (non-sequential) and codes following one another (sequential) were created. Logistic regression, random forest and XGBoost models were trained. The predicted outcome was the presence of HF after 1 year. The case:control ratio in the test set matched true incidence rates (1:45). RESULTS Demographics alone performed moderately (area under the curve (AUC) 0.692, CI 0.677 to 0.706). Adding non-sequential information combined with a logistic regression model performed best and significantly improved performance (AUC 0.772, CI 0.759 to 0.785, p<0.001). Further adding sequential information did not alter performance significantly (AUC 0.767, CI 0.754 to 0.780, p=0.07). The number needed to screen dropped from 14.11 to 5.99 false positives per true positive. CONCLUSION This study created a model able to identify patients with pending HF a year before diagnosis.
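The dependence of "false positives per true positive" on the case:control ratio can be illustrated with a short Python sketch. Only the 1:45 ratio comes from the abstract; the function names and the sensitivity and specificity values in the usage note are hypothetical and are not figures reported by the study.

```python
def false_positives_per_true_positive(sensitivity, specificity, controls_per_case):
    """Expected false positives per true positive at a given case:control ratio.

    For each case there are `controls_per_case` controls, so per case we
    expect `sensitivity` true positives and
    (1 - specificity) * controls_per_case false positives.
    """
    if not 0.0 < sensitivity <= 1.0:
        raise ValueError("sensitivity must be in (0, 1]")
    if not 0.0 <= specificity <= 1.0:
        raise ValueError("specificity must be in [0, 1]")
    false_positives = (1.0 - specificity) * controls_per_case
    return false_positives / sensitivity


def positive_predictive_value(fp_per_tp):
    """Convert a false-positive-per-true-positive ratio into precision."""
    return 1.0 / (1.0 + fp_per_tp)
```

With a hypothetical sensitivity of 0.5 and specificity of 0.9 at the 1:45 ratio, `false_positives_per_true_positive(0.5, 0.9, 45)` is about 9 false positives per true positive. Applying `positive_predictive_value` to the reported ratios shows that moving from 14.11 to 5.99 roughly doubles precision.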
Affiliation(s)
- Frank C Bennis
  - Department of Computer Science, Vrije Universiteit Amsterdam, Amsterdam, The Netherlands
  - Netherlands Institute for Health Services Research (Nivel), Utrecht, The Netherlands
- Mark Hoogendoorn
  - Department of Computer Science, Vrije Universiteit Amsterdam, Amsterdam, The Netherlands
- Claire Aussems
  - Netherlands Institute for Health Services Research (Nivel), Utrecht, The Netherlands
- Joke C Korevaar
  - Netherlands Institute for Health Services Research (Nivel), Utrecht, The Netherlands
8
Shitrit G, Tractinsky N, Moskovitch R. Visualization of Frequent Temporal Patterns in Single or Two Populations. J Biomed Inform 2022; 134:104169. [PMID: 36038065 DOI: 10.1016/j.jbi.2022.104169] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/30/2021] [Revised: 08/11/2022] [Accepted: 08/13/2022] [Indexed: 10/15/2022]
Abstract
Temporal knowledge discovery in clinical problems is crucial for investigating problems in the data science era. Meaningful progress has been made computationally in the discovery of frequent temporal patterns, which may hold potentially meaningful knowledge. However, for temporal knowledge discovery and acquisition, effective visualization is essential and still leaves much room for contributions. While visualization of frequent temporal patterns has been relatively under-researched, it holds meaningful opportunities for providing usable ways to assist domain experts, or researchers, in exploring and acquiring temporal knowledge. In this paper, a novel approach for the visualization of an enumeration tree of frequent temporal patterns is introduced, whether mined from a single population or for the comparison of patterns that were discovered in two separate populations. While this approach is relevant to any sequence-based patterns, we demonstrate its use on the most complex scenario of time intervals related patterns (TIRPs). The interface enables users to browse an enumeration tree of frequent patterns, or search for specific patterns, as well as discover the most discriminating TIRPs between two populations. For that, a novel visualization of the temporal patterns is introduced using a bubble chart, in which each bubble represents a temporal pattern, and the chart axes represent the various metrics of the patterns, such as their frequency, reoccurrence, and more, which provides a fast overview of the patterns as a whole, as well as access to specific ones. We present a comprehensive and rigorous user study on two real-life datasets, demonstrating the usability advantages of the novel approaches.
Affiliation(s)
- Guy Shitrit
  - Software and Information Systems Engineering, Ben-Gurion University of the Negev, Beer Sheva, Israel.
- Noam Tractinsky
  - Software and Information Systems Engineering, Ben-Gurion University of the Negev, Beer Sheva, Israel.
- Robert Moskovitch
  - Software and Information Systems Engineering, Ben-Gurion University of the Negev, Beer Sheva, Israel; Population Health and Science, Icahn Medical School at Mount Sinai, NYC, USA.
9
Estiri H, Strasser ZH, Murphy SN. High-throughput phenotyping with temporal sequences. J Am Med Inform Assoc 2021; 28:772-781. [PMID: 33313899 DOI: 10.1093/jamia/ocaa288] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 07/01/2020] [Accepted: 11/04/2020] [Indexed: 12/15/2022]
Abstract
OBJECTIVE High-throughput electronic phenotyping algorithms can accelerate translational research using data from electronic health record (EHR) systems. The temporal information buried in EHRs is often underutilized in developing computational phenotypic definitions. This study aims to develop a high-throughput phenotyping method, leveraging temporal sequential patterns from EHRs. MATERIALS AND METHODS We develop a representation mining algorithm to extract 5 classes of representations from EHR diagnosis and medication records: the aggregated vector of the records (aggregated vector representation), the standard sequential patterns (sequential pattern mining), the transitive sequential patterns (transitive sequential pattern mining), and 2 hybrid classes. Using EHR data on 10 phenotypes from the Mass General Brigham Biobank, we train and validate phenotyping algorithms. RESULTS Phenotyping with temporal sequences resulted in a superior classification performance across all 10 phenotypes compared with the standard representations in electronic phenotyping. The high-throughput algorithm's classification performance was superior or similar to the performance of previously published electronic phenotyping algorithms. We characterize and evaluate the top transitive sequences of diagnosis records paired with the records of risk factors, symptoms, complications, medications, or vaccinations. DISCUSSION The proposed high-throughput phenotyping approach enables seamless discovery of sequential record combinations that may be difficult to assume from raw EHR data. Transitive sequences offer more accurate characterization of the phenotype, compared with its individual components, and reflect the actual lived experiences of the patients with that particular disease. CONCLUSION Sequential data representations provide a precise mechanism for incorporating raw EHR records into downstream machine learning. Our approach starts with user interpretability and works backward to the technology.
Affiliation(s)
- Hossein Estiri
  - Harvard Medical School, Boston, Massachusetts, USA; Massachusetts General Hospital, Boston, Massachusetts, USA; Mass General Brigham, Boston, Massachusetts, USA.
- Zachary H Strasser
  - Harvard Medical School, Boston, Massachusetts, USA; Massachusetts General Hospital, Boston, Massachusetts, USA; Mass General Brigham, Boston, Massachusetts, USA.
- Shawn N Murphy
  - Harvard Medical School, Boston, Massachusetts, USA; Massachusetts General Hospital, Boston, Massachusetts, USA; Mass General Brigham, Boston, Massachusetts, USA.
10
Oei RW, Fang HSA, Tan WY, Hsu W, Lee ML, Tan NC. Using Domain Knowledge and Data-Driven Insights for Patient Similarity Analytics. J Pers Med 2021; 11:jpm11080699. [PMID: 34442343 PMCID: PMC8398126 DOI: 10.3390/jpm11080699] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 06/15/2021] [Revised: 07/15/2021] [Accepted: 07/21/2021] [Indexed: 12/23/2022]
Abstract
Patient similarity analytics has emerged as an essential tool to identify cohorts of patients who have similar clinical characteristics to a specific patient of interest. In this study, we propose a patient similarity measure called D3K that incorporates domain knowledge and data-driven insights. Using the electronic health records (EHRs) of 169,434 patients with either diabetes, hypertension or dyslipidaemia (DHL), we construct patient feature vectors containing demographics, vital signs, laboratory test results, and prescribed medications. We discretize the variables of interest into bins based on domain knowledge and align the patient similarity computation with clinical guidelines. Key findings from this study are: (1) D3K outperforms baseline approaches in all seven sub-cohorts; (2) our domain knowledge-based binning strategy outperformed the traditional percentile-based binning in all seven sub-cohorts; (3) there is substantial agreement between D3K and physicians (κ = 0.746), indicating that D3K can be applied to facilitate shared decision making. This is the first study to use patient similarity analytics on a cardiometabolic syndrome-related dataset sourced from medical institutions in Singapore. We consider patient similarity among patient cohorts with the same medical conditions to develop localized models for personalized decision support to improve the outcomes of a target patient.
Affiliation(s)
- Ronald Wihal Oei
  - Institute of Data Science, National University of Singapore, Singapore 117602, Singapore
- Hao Sen Andrew Fang
  - SingHealth Polyclinics, SingHealth, Singapore 150167, Singapore
- Wei-Ying Tan
  - Institute of Data Science, National University of Singapore, Singapore 117602, Singapore
- Wynne Hsu
  - Institute of Data Science, National University of Singapore, Singapore 117602, Singapore
  - School of Computing, National University of Singapore, Singapore 117417, Singapore
- Mong-Li Lee
  - Institute of Data Science, National University of Singapore, Singapore 117602, Singapore
  - School of Computing, National University of Singapore, Singapore 117417, Singapore
- Ngiap-Chuan Tan
  - SingHealth Polyclinics, SingHealth, Singapore 150167, Singapore
11
Schvetz M, Fuchs L, Novack V, Moskovitch R. Outcomes prediction in longitudinal data: Study designs evaluation, use case in ICU acquired sepsis. J Biomed Inform 2021; 117:103734. [PMID: 33711544 DOI: 10.1016/j.jbi.2021.103734] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 01/18/2020] [Revised: 02/27/2021] [Accepted: 03/01/2021] [Indexed: 12/23/2022]
Abstract
Outcome prediction in Electronic Health Records (EHR), and specifically in Critical Care, is attracting increasing exploration and research. In this study, we used clinical data from the Intensive Care Unit (ICU), focusing on ICU-acquired sepsis. In the current literature, several evaluation approaches are reported, inspired by epidemiological designs, some of which do not always reflect real-life application conditions. This problem seems relevant to outcome prediction in longitudinal EHR data, or longitudinal data generally, although in this study we focused on ICU data. Unlike most previous studies, which investigated all sepsis admissions, we focused specifically on ICU-acquired sepsis. Due to the sparse nature of the longitudinal data, we employed Temporal Abstraction and Time Interval-Related Patterns discovery, the results of which are further used as classification features. Two experiments were designed using three different outcomes prediction study designs from the literature, implementing various levels of real-life conditions to evaluate the prediction models. The first experiment focused on predicting whether a patient would suffer from ICU-acquired sepsis, and when during the admission, given a sliding observation time window, and on comparing the behavior of the three study designs. The second experiment focused only on predicting whether the patient would suffer from ICU-acquired sepsis, based on data taken relative to the admission start time. Our results show that using Temporal Discretization for Classification (TD4C) led to better performance than using Equal-Width Discretization, Knowledge-Based discretization, or SAX. Also, using a two-state abstraction was better than three or four states. Using the default Binary TIRP representation method performed better than Mean Duration, Horizontal Support, and horizontally normalized horizontal support. Using XGBoost as a classifier performed better than Logistic Regression, Neural Net, or Random Forest. Additionally, it is demonstrated why the use of case-crossover-control is most appropriate for evaluation under real-life application conditions, unlike other incomplete designs that may even result in "better performance".
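As a rough illustration of the temporal abstraction step mentioned above, the Python sketch below applies equal-width discretization to a single time series and merges consecutive samples that share a state into symbolic time intervals. It is a simplified sketch, not the pipeline used in the study: the function names, the `(start, end, state)` tuple layout, and the absence of any maximal-gap rule between samples are all assumptions.

```python
def equal_width_states(values, n_states):
    """Map each value to a state index 0..n_states-1 using equal-width bins."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0] * len(values)  # constant series: a single state
    width = (hi - lo) / n_states
    # The maximum value would fall just past the last bin, so clamp it.
    return [min(int((v - lo) / width), n_states - 1) for v in values]


def to_symbolic_intervals(timestamps, values, n_states=2):
    """Merge runs of equal states into (start, end, state) intervals."""
    states = equal_width_states(values, n_states)
    intervals = []
    for t, state in zip(timestamps, states):
        if intervals and intervals[-1][2] == state:
            start, _, _ = intervals[-1]
            intervals[-1] = (start, t, state)  # extend the current run
        else:
            intervals.append((t, t, state))  # start a new run
    return intervals
```

For timestamps `[0, 1, 2, 3, 4, 5]` and values `[1, 2, 9, 10, 8, 1]` with two states, this yields a low interval, a high interval, and a final low point. The resulting intervals are the kind of input from which time interval-related patterns would then be mined.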
Affiliation(s)
- Maya Schvetz
  - Department of Software and Information Systems Engineering, Ben Gurion University of the Negev, Beer-Sheva, Israel.
- Lior Fuchs
  - Medical Intensive Care Unit and Clinical Research Center, Soroka University Medical Center, Faculty of Health Sciences, Ben-Gurion University of the Negev, Beer-Sheva, Israel.
- Victor Novack
  - Clinical Research Center, Soroka University Medical Center, Faculty of Health Sciences, Ben-Gurion University of the Negev, Beer-Sheva, Israel.
- Robert Moskovitch
  - Department of Software and Information Systems Engineering, Ben Gurion University of the Negev, Beer-Sheva, Israel.
12
Lee JM, Hauskrecht M. Modeling multivariate clinical event time-series with recurrent temporal mechanisms. Artif Intell Med 2021; 112:102021. [PMID: 33581828 PMCID: PMC7943294 DOI: 10.1016/j.artmed.2021.102021] [Citation(s) in RCA: 7] [Impact Index Per Article: 2.3] [Received: 01/22/2020] [Revised: 12/26/2020] [Accepted: 01/10/2021] [Indexed: 12/18/2022]
Abstract
In this work, we propose a novel autoregressive event time-series model that can predict future occurrences of multivariate clinical events. Our model represents multivariate event time-series using different temporal mechanisms designed to fit different temporal characteristics of the time-series. In particular, information about the distant past is modeled through the hidden state space defined by an LSTM-based model, information on recently observed clinical events is modeled through discriminative projections, and information about periodic (repeated) events is modeled using a special recurrent mechanism based on probability distributions of inter-event gaps compiled from past data. We evaluate our proposed model on electronic health record (EHR) data derived from the MIMIC-III dataset. We show that our new model, equipped with the above temporal mechanisms, leads to improved prediction performance compared to multiple baselines.
Affiliation(s)
- Jeong Min Lee
- Department of Computer Science, University of Pittsburgh, Pittsburgh, PA 15260, USA.
| | - Milos Hauskrecht
- Department of Computer Science, University of Pittsburgh, Pittsburgh, PA 15260, USA.
|
13
|
Jane YN, Nehemiah HK, Kannan A. Classifying unevenly spaced clinical time series data using forecast error approximation based bottom-up (FeAB) segmented time delay neural network. COMPUTER METHODS IN BIOMECHANICS AND BIOMEDICAL ENGINEERING: IMAGING & VISUALIZATION 2021. [DOI: 10.1080/21681163.2020.1817791] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 10/22/2022]
Affiliation(s)
- Y. Nancy Jane
- Department of Computer Technology, Anna University, Chennai, India
| | | | - Arputharaj Kannan
- School of Computer Science and Engineering, Vellore Institute of Technology, Vellore, India
|
14
|
Morid MA, Sheng ORL, Kawamoto K, Abdelrahman S. Learning hidden patterns from patient multivariate time series data using convolutional neural networks: A case study of healthcare cost prediction. J Biomed Inform 2020; 111:103565. [DOI: 10.1016/j.jbi.2020.103565] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.8] [Received: 12/20/2019] [Revised: 08/27/2020] [Accepted: 09/07/2020] [Indexed: 01/20/2023]
|
15
|
Quantitative and temporal approach to utilising electronic medical records from general practices in mental health prediction. Comput Biol Med 2020; 125:103973. [DOI: 10.1016/j.compbiomed.2020.103973] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 05/17/2020] [Revised: 08/11/2020] [Accepted: 08/11/2020] [Indexed: 01/06/2023]
|
16
|
Pokharel S, Zuccon G, Li X, Utomo CP, Li Y. Temporal tree representation for similarity computation between medical patients. Artif Intell Med 2020; 108:101900. [DOI: 10.1016/j.artmed.2020.101900] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Received: 09/25/2019] [Revised: 05/15/2020] [Accepted: 06/03/2020] [Indexed: 02/01/2023]
|
17
|
Dagliati A, Geifman N, Peek N, Holmes JH, Sacchi L, Bellazzi R, Sajjadi SE, Tucker A. Using topological data analysis and pseudo time series to infer temporal phenotypes from electronic health records. Artif Intell Med 2020; 108:101930. [PMID: 32972659 PMCID: PMC7536308 DOI: 10.1016/j.artmed.2020.101930] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.0] [Received: 01/06/2020] [Revised: 05/21/2020] [Accepted: 07/11/2020] [Indexed: 11/17/2022]
Abstract
Topological Data and Pseudo Time Series to discover Type 2 Diabetes temporal phenotypes. Temporal phenotypes inferred from state-space model based on hidden-states transitions. Study of states continuous transitions visually delivered in an easily explainable way. Mined phenotypes characterized by significant differences in disease deterioration.
Temporal phenotyping enables clinicians to better understand observable characteristics of a disease as it progresses. Modelling disease progression that captures interactions between phenotypes is inherently challenging. Temporal models that capture change in disease over time can identify the key features that characterize disease subtypes that underpin these trajectories. These models will enable clinicians to identify early warning signs of progression in specific sub-types and therefore to make informed decisions tailored to individual patients. In this paper, we explore two approaches to building temporal phenotypes based on the topology of data: topological data analysis and pseudo time-series. Using type 2 diabetes data, we show that the topological data analysis approach is able to identify disease trajectories and that pseudo time-series can infer a state space model characterized by transitions between hidden states that represent distinct temporal phenotypes. Both approaches highlight lipid profiles as key factors in distinguishing the phenotypes.
Affiliation(s)
- Arianna Dagliati
- Centre for Health Informatics, University of Manchester, Manchester, United Kingdom; Manchester Molecular Pathology Innovation Centre, University of Manchester, United Kingdom; Department of Electrical, Computer & Biomedical Engineering University of Pavia, Italy.
| | - Nophar Geifman
- Centre for Health Informatics, University of Manchester, Manchester, United Kingdom
| | - Niels Peek
- Centre for Health Informatics, University of Manchester, Manchester, United Kingdom; NIHR Manchester Biomedical Research Centre, University of Manchester, United Kingdom
| | - John H Holmes
- Department of Biostatistics, Epidemiology, and Informatics, Penn Institute for Biomedical Informatics, University of Pennsylvania Perelman School of Medicine, USA
| | - Lucia Sacchi
- Department of Electrical, Computer & Biomedical Engineering University of Pavia, Italy
| | - Riccardo Bellazzi
- Department of Electrical, Computer & Biomedical Engineering University of Pavia, Italy
| | | | - Allan Tucker
- Department of Computer Science, Brunel University London, United Kingdom
|
18
|
Estiri H, Strasser ZH, Klann JG, McCoy TH, Wagholikar KB, Vasey S, Castro VM, Murphy ME, Murphy SN. Transitive Sequencing Medical Records for Mining Predictive and Interpretable Temporal Representations. PATTERNS (NEW YORK, N.Y.) 2020; 1:100051. [PMID: 32835307 PMCID: PMC7301790 DOI: 10.1016/j.patter.2020.100051] [Citation(s) in RCA: 15] [Impact Index Per Article: 3.8] [Received: 02/13/2020] [Revised: 04/27/2020] [Accepted: 05/26/2020] [Indexed: 12/13/2022]
Abstract
Electronic health records (EHRs) contain important temporal information about the progression of disease and treatment outcomes. This paper proposes a transitive sequencing approach for constructing temporal representations from EHR observations for downstream machine learning. Using clinical data from a cohort of patients with congestive heart failure, we mined temporal representations by transitive sequencing of EHR medication and diagnosis records for classification and prediction tasks. We compared the classification and prediction performances of the transitive sequential representations (bag-of-sequences approach) with the conventional approach of using aggregated vectors of EHR data (aggregated vector representation) across different classifiers. We found that the transitive sequential representations are better phenotype "differentiators" and predictors than the "atemporal" EHR records. Our results also demonstrated that data representations obtained from transitive sequencing of EHR observations can present novel insights about the progression of the disease that are difficult to discern when clinical data are treated independently of the patient's history.
Affiliation(s)
- Hossein Estiri
- Laboratory of Computer Science, Massachusetts General Hospital, Boston, MA 02144, USA
- Research Information Science and Computing, Mass General Brigham, Somerville, MA 02145, USA
- Harvard Medical School, Boston, MA 02115, USA
| | - Zachary H. Strasser
- Laboratory of Computer Science, Massachusetts General Hospital, Boston, MA 02144, USA
- Research Information Science and Computing, Mass General Brigham, Somerville, MA 02145, USA
- Harvard Medical School, Boston, MA 02115, USA
- Department of Biomedical Informatics, Harvard Medical School, Boston, MA 02115, USA
| | - Jeffery G. Klann
- Laboratory of Computer Science, Massachusetts General Hospital, Boston, MA 02144, USA
- Research Information Science and Computing, Mass General Brigham, Somerville, MA 02145, USA
- Harvard Medical School, Boston, MA 02115, USA
| | - Thomas H. McCoy
- Harvard Medical School, Boston, MA 02115, USA
- Center for Quantitative Health, Massachusetts General Hospital, Boston, MA 02114, USA
| | - Kavishwar B. Wagholikar
- Laboratory of Computer Science, Massachusetts General Hospital, Boston, MA 02144, USA
- Research Information Science and Computing, Mass General Brigham, Somerville, MA 02145, USA
- Harvard Medical School, Boston, MA 02115, USA
| | - Sebastien Vasey
- Department of Mathematics, Harvard University, Cambridge, MA 02138, USA
| | - Victor M. Castro
- Research Information Science and Computing, Mass General Brigham, Somerville, MA 02145, USA
| | - MaryKate E. Murphy
- Research Information Science and Computing, Mass General Brigham, Somerville, MA 02145, USA
| | - Shawn N. Murphy
- Laboratory of Computer Science, Massachusetts General Hospital, Boston, MA 02144, USA
- Research Information Science and Computing, Mass General Brigham, Somerville, MA 02145, USA
- Harvard Medical School, Boston, MA 02115, USA
- Department of Biomedical Informatics, Harvard Medical School, Boston, MA 02115, USA
- Department of Neurology, Massachusetts General Hospital, Boston, MA 02114, USA
|
19
|
Morid MA, Sheng ORL, Del Fiol G, Facelli JC, Bray BE, Abdelrahman S. Temporal Pattern Detection to Predict Adverse Events in Critical Care: Case Study With Acute Kidney Injury. JMIR Med Inform 2020; 8:e14272. [PMID: 32181753 PMCID: PMC7109618 DOI: 10.2196/14272] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.5] [Received: 04/04/2019] [Revised: 11/23/2019] [Accepted: 01/22/2020] [Indexed: 12/21/2022] Open
Abstract
BACKGROUND More than 20% of patients admitted to the intensive care unit (ICU) develop an adverse event (AE). No previous study has leveraged patients' data to extract the temporal features using their structural temporal patterns, that is, trends. OBJECTIVE This study aimed to improve AE prediction methods by using structural temporal pattern detection that captures global and local temporal trends and to demonstrate these improvements in the detection of acute kidney injury (AKI). METHODS Using the Medical Information Mart for Intensive Care dataset, containing 22,542 patients, we extracted both global and local trends using structural pattern detection methods to predict AKI (ie, binary prediction). Classifiers were built on 17 input features consisting of vital signs and laboratory test results using state-of-the-art models; the optimal classifier was selected for comparisons with previous approaches. The classifier with structural pattern detection features was compared with two baseline classifiers that used different temporal feature extraction approaches commonly used in the literature: (1) symbolic temporal pattern detection, which is the most common approach for multivariate time series classification; and (2) the last recorded value before the prediction point, which is the most common approach to extract temporal data in the AKI prediction literature. Moreover, we assessed the individual contribution of global and local trends. Classifier performance was measured in terms of accuracy (primary outcome), area under the curve, and F-measure. For all experiments, we employed 20-fold cross-validation. RESULTS Random forest was the best classifier using structural temporal pattern detection. The accuracy of the classifier with local and global trend features was significantly higher than that while using symbolic temporal pattern detection and the last recorded value (81.3% vs 70.6% vs 58.1%; P<.001). Excluding local or global features reduced the accuracy to 74.4% or 78.1%, respectively (P<.001). CONCLUSIONS Classifiers using features obtained from structural temporal pattern detection significantly improved the prediction of AKI onset in ICU patients over two baselines based on common previous approaches. The proposed method is a generalizable approach to predict AEs in critical care that may be used to help clinicians intervene in a timely manner to prevent or mitigate AEs.
Affiliation(s)
- Mohammad Amin Morid
- Department of Information Systems and Analytics, Leavey School of Business, Santa Clara University, Santa Clara, CA, United States
| | - Olivia R Liu Sheng
- Department of Operations and Information Systems, David Eccles School of Business, University of Utah, Salt Lake City, UT, United States
| | - Guilherme Del Fiol
- Department of Biomedical Informatics, School of Medicine, University of Utah, Salt Lake City, UT, United States
| | - Julio C Facelli
- Department of Biomedical Informatics, School of Medicine, University of Utah, Salt Lake City, UT, United States
- Center for Clinical and Translational Science, University of Utah, Salt Lake City, UT, United States
| | - Bruce E Bray
- Department of Biomedical Informatics, School of Medicine, University of Utah, Salt Lake City, UT, United States
- Division of Cardiovascular Medicine, School of Medicine, University of Utah, Salt Lake City, UT, United States
| | - Samir Abdelrahman
- Department of Biomedical Informatics, School of Medicine, University of Utah, Salt Lake City, UT, United States
- Computer Science Department, Faculty of Computers and Information, Cairo University, Cairo, Egypt
|
20
|
Estiri H, Vasey S, Murphy SN. Transitive Sequential Pattern Mining for Discrete Clinical Data. Artif Intell Med 2020. [DOI: 10.1007/978-3-030-59137-3_37] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Indexed: 12/04/2022]
|
21
|
Lee EW, Ho JC. FuzzyGap: Sequential Pattern Mining for Predicting Chronic Heart Failure in Clinical Pathways. AMIA JOINT SUMMITS ON TRANSLATIONAL SCIENCE PROCEEDINGS. AMIA JOINT SUMMITS ON TRANSLATIONAL SCIENCE 2019; 2019:222-231. [PMID: 31258974 PMCID: PMC6568087] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/09/2023]
Abstract
The rapid growth of electronic health records (EHRs) facilitates the use of clinical pathways, an actionable plan for patients which is represented as sequences of diagnostic records ordered by visit dates. We propose to extract discriminative and representative clinical pathways from EHRs using sequential pattern mining. However, existing sequential patterns cannot efficiently extract patterns due to patient variations in length and time period between visits. To resolve this problem, we propose FuzzyGap, a sequential pattern mining-based framework that extracts a discriminative subsequent pattern from the proper representation of the sequence of encounters which also emphasizes the last visit that is more significant than others. We demonstrate FuzzyGap using a case study of heart failure and show the effectiveness of sequential pattern mining.
Affiliation(s)
- Eric W Lee
- Department of Computer Science, Emory University, Atlanta, GA, United States
| | - Joyce C Ho
- Department of Computer Science, Emory University, Atlanta, GA, United States
|
22
|
Luo G. A roadmap for semi-automatically extracting predictive and clinically meaningful temporal features from medical data for predictive modeling. GLOBAL TRANSITIONS 2019; 1:61-82. [PMID: 31032483 PMCID: PMC6482973 DOI: 10.1016/j.glt.2018.11.001] [Citation(s) in RCA: 13] [Impact Index Per Article: 2.6] [Indexed: 05/27/2023]
Abstract
Predictive modeling based on machine learning with medical data has great potential to improve healthcare and reduce costs. However, two hurdles, among others, impede its widespread adoption in healthcare. First, medical data are by nature longitudinal. Pre-processing them, particularly for feature engineering, is labor intensive and often takes 50-80% of the model building effort. Predictive temporal features are the basis of building accurate models, but are difficult to identify. This is problematic. Healthcare systems have limited resources for model building, while inaccurate models produce sub-optimal outcomes and are often useless. Second, most machine learning models provide no explanation of their prediction results. However, offering such explanations is essential for a model to be used in usual clinical practice. To address these two hurdles, this paper outlines: 1) a data-driven method for semi-automatically extracting predictive and clinically meaningful temporal features from medical data for predictive modeling; and 2) a method of using these features to automatically explain machine learning prediction results and suggest tailored interventions. This provides a roadmap for future research.
Affiliation(s)
- Gang Luo
- Department of Biomedical Informatics and Medical Education, University of Washington, UW Medicine South Lake Union, 850 Republican Street, Building C, Box 358047, Seattle, WA, 98109, USA
|
23
|
Georga EI, Tachos NS, Sakellarios AI, Kigka VI, Exarchos TP, Pelosi G, Parodi O, Michalis LK, Fotiadis DI. Artificial Intelligence and Data Mining Methods for Cardiovascular Risk Prediction. ACTA ACUST UNITED AC 2019. [DOI: 10.1007/978-981-10-5092-3_14] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.2] [Indexed: 12/11/2022]
|
24
|
Despins LA, Kim JH, Deroche C, Song X. Factors Influencing How Intensive Care Unit Nurses Allocate Their Time. West J Nurs Res 2019; 41:1551-1575. [DOI: 10.1177/0193945918824070] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Indexed: 11/15/2022]
Abstract
Spending time with the patient is essential for intensive care unit (ICU) nurses to detect clinical change. This article reports on an examination of factors influencing nurses’ activity time allocation. Data were analyzed from a prospective time and motion study of medical ICU nurses. Nurse demographic data and observation, electronic locator technology, and electronic medical record log data were collected over 12 days from 11 registered nurses. Charlson Co-Morbidity Index and Sequential Organ Failure Assessment scores were calculated for patient assignments. Nurses averaged 78.04 (SD = 47.85) min per patient on activities in the patient room. Years of ICU nursing experience and the patient’s Charlson Co-Morbidity Index were significantly associated with time spent in the patient’s room. Neither nursing education nor specialty certification was found to influence time spent in a patient’s room. Using technology can advance understanding of nurses’ time allocation leading to interventions optimizing time spent with the patient.
|
25
|
Dagliati A, Geifman N, Peek N, Holmes JH, Sacchi L, Sajjadi SE, Tucker A. Inferring Temporal Phenotypes with Topological Data Analysis and Pseudo Time-Series. Artif Intell Med 2019. [DOI: 10.1007/978-3-030-21642-9_50] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/26/2022]
|
26
|
Selby PJ, Banks RE, Gregory W, Hewison J, Rosenberg W, Altman DG, Deeks JJ, McCabe C, Parkes J, Sturgeon C, Thompson D, Twiddy M, Bestall J, Bedlington J, Hale T, Dinnes J, Jones M, Lewington A, Messenger MP, Napp V, Sitch A, Tanwar S, Vasudev NS, Baxter P, Bell S, Cairns DA, Calder N, Corrigan N, Del Galdo F, Heudtlass P, Hornigold N, Hulme C, Hutchinson M, Lippiatt C, Livingstone T, Longo R, Potton M, Roberts S, Sim S, Trainor S, Welberry Smith M, Neuberger J, Thorburn D, Richardson P, Christie J, Sheerin N, McKane W, Gibbs P, Edwards A, Soomro N, Adeyoju A, Stewart GD, Hrouda D. Methods for the evaluation of biomarkers in patients with kidney and liver diseases: multicentre research programme including ELUCIDATE RCT. PROGRAMME GRANTS FOR APPLIED RESEARCH 2018. [DOI: 10.3310/pgfar06030] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.5] [Indexed: 12/11/2022]
Abstract
BACKGROUND Protein biomarkers with associations with the activity and outcomes of diseases are being identified by modern proteomic technologies. They may be simple, accessible, cheap and safe tests that can inform diagnosis, prognosis, treatment selection, monitoring of disease activity and therapy and may substitute for complex, invasive and expensive tests. However, their potential is not yet being realised. DESIGN AND METHODS The study consisted of three workstreams to create a framework for research: workstream 1, methodology – to define current practice and explore methodology innovations for biomarkers for monitoring disease; workstream 2, clinical translation – to create a framework of research practice, high-quality samples and related clinical data to evaluate the validity and clinical utility of protein biomarkers; and workstream 3, the ELF to Uncover Cirrhosis as an Indication for Diagnosis and Action for Treatable Event (ELUCIDATE) randomised controlled trial (RCT) – an exemplar RCT of an established test, the ADVIA Centaur® Enhanced Liver Fibrosis (ELF) test (Siemens Healthcare Diagnostics Ltd, Camberley, UK) [consisting of a panel of three markers – (1) serum hyaluronic acid, (2) amino-terminal propeptide of type III procollagen and (3) tissue inhibitor of metalloproteinase 1], for liver cirrhosis to determine its impact on diagnostic timing and the management of cirrhosis and the process of care and improving outcomes. RESULTS The methodology workstream evaluated the quality of recommendations for using prostate-specific antigen to monitor patients, systematically reviewed RCTs of monitoring strategies and reviewed the monitoring biomarker literature and how monitoring can have an impact on outcomes. Simulation studies were conducted to evaluate monitoring and improve the merits of health care. The monitoring biomarker literature is modest and robust conclusions are infrequent. We recommend improvements in research practice. Patients strongly endorsed the need for robust and conclusive research in this area. The clinical translation workstream focused on analytical and clinical validity. Cohorts were established for renal cell carcinoma (RCC) and renal transplantation (RT), with samples and patient data from multiple centres, as a rapid-access resource to evaluate the validity of biomarkers. Candidate biomarkers for RCC and RT were identified from the literature and their quality was evaluated and selected biomarkers were prioritised. The duration of follow-up was a limitation but biomarkers were identified that may be taken forward for clinical utility. In the third workstream, the ELUCIDATE trial registered 1303 patients and randomised 878 patients out of a target of 1000. The trial started late and recruited slowly initially but ultimately recruited with good statistical power to answer the key questions. ELF monitoring altered the patient process of care and may show benefits from the early introduction of interventions with further follow-up. The ELUCIDATE trial was an ‘exemplar’ trial that has demonstrated the challenges of evaluating biomarker strategies in ‘end-to-end’ RCTs and will inform future study designs. CONCLUSIONS The limitations in the programme were principally that, during the collection and curation of the cohorts of patients with RCC and RT, the pace of discovery of new biomarkers in commercial and non-commercial research was slower than anticipated and so conclusive evaluations using the cohorts are few; however, access to the cohorts will be sustained for future new biomarkers. The ELUCIDATE trial was slow to start and recruit to, with a late surge of recruitment, and so final conclusions about the impact of the ELF test on long-term outcomes await further follow-up. The findings from the three workstreams were used to synthesise a strategy and framework for future biomarker evaluations incorporating innovations in study design, health economics and health informatics. TRIAL REGISTRATION Current Controlled Trials ISRCTN74815110, UKCRN ID 9954 and UKCRN ID 11930. FUNDING This project was funded by the NIHR Programme Grants for Applied Research programme and will be published in full in Programme Grants for Applied Research; Vol. 6, No. 3. See the NIHR Journals Library website for further project information.
Affiliation(s)
- Peter J Selby
- Clinical and Biomedical Proteomics Group, Leeds Institute of Cancer and Pathology, University of Leeds, Leeds, UK
- Leeds Teaching Hospitals NHS Trust, Leeds, UK
| | - Rosamonde E Banks
- Clinical and Biomedical Proteomics Group, Leeds Institute of Cancer and Pathology, University of Leeds, Leeds, UK
| | - Walter Gregory
- Leeds Institute of Clinical Trials Research, University of Leeds, Leeds, UK
| | - Jenny Hewison
- Leeds Institute of Health Sciences, University of Leeds, Leeds, UK
| | - William Rosenberg
- Institute for Liver and Digestive Health, Division of Medicine, University College London, London, UK
| | - Douglas G Altman
- Centre for Statistics in Medicine, University of Oxford, Oxford, UK
| | - Jonathan J Deeks
- Institute of Applied Health Research, University of Birmingham, Birmingham, UK
| | - Christopher McCabe
- Department of Emergency Medicine, University of Alberta Hospital, Edmonton, AB, Canada
| | - Julie Parkes
- Primary Care and Population Sciences Academic Unit, University of Southampton, Southampton, UK
| | | | | | - Maureen Twiddy
- Leeds Institute of Health Sciences, University of Leeds, Leeds, UK
| | - Janine Bestall
- Leeds Institute of Health Sciences, University of Leeds, Leeds, UK
| | | | - Tilly Hale
- LIVErNORTH Liver Patient Support, Newcastle upon Tyne, UK
| | - Jacqueline Dinnes
- Institute of Applied Health Research, University of Birmingham, Birmingham, UK
| | - Marc Jones
- Leeds Institute of Clinical Trials Research, University of Leeds, Leeds, UK
| | | | | | - Vicky Napp
- Leeds Institute of Clinical Trials Research, University of Leeds, Leeds, UK
| | - Alice Sitch
- Institute of Applied Health Research, University of Birmingham, Birmingham, UK
| | - Sudeep Tanwar
- Institute for Liver and Digestive Health, Division of Medicine, University College London, London, UK
| | - Naveen S Vasudev
- Clinical and Biomedical Proteomics Group, Leeds Institute of Cancer and Pathology, University of Leeds, Leeds, UK
- Leeds Teaching Hospitals NHS Trust, Leeds, UK
| | - Paul Baxter
- Leeds Institute of Cardiovascular and Metabolic Medicine, University of Leeds, Leeds, UK
| | - Sue Bell
- Leeds Institute of Clinical Trials Research, University of Leeds, Leeds, UK
| | - David A Cairns
- Clinical and Biomedical Proteomics Group, Leeds Institute of Cancer and Pathology, University of Leeds, Leeds, UK
| | | | - Neil Corrigan
- Leeds Institute of Clinical Trials Research, University of Leeds, Leeds, UK
| | - Francesco Del Galdo
- Leeds Institute of Rheumatic and Musculoskeletal Medicine, University of Leeds, Leeds, UK
| | - Peter Heudtlass
- Leeds Institute of Clinical Trials Research, University of Leeds, Leeds, UK
| | - Nick Hornigold
- Clinical and Biomedical Proteomics Group, Leeds Institute of Cancer and Pathology, University of Leeds, Leeds, UK
| | - Claire Hulme
- Leeds Institute of Health Sciences, University of Leeds, Leeds, UK
| | - Michelle Hutchinson
- Clinical and Biomedical Proteomics Group, Leeds Institute of Cancer and Pathology, University of Leeds, Leeds, UK
| | - Carys Lippiatt
- Department of Specialist Laboratory Medicine, Leeds Teaching Hospitals NHS Trust, Leeds, UK
| | | | - Roberta Longo
- Leeds Institute of Health Sciences, University of Leeds, Leeds, UK
| | - Matthew Potton
- Leeds Institute of Clinical Trials Research, University of Leeds, Leeds, UK
| | - Stephanie Roberts
- Clinical and Biomedical Proteomics Group, Leeds Institute of Cancer and Pathology, University of Leeds, Leeds, UK
| | - Sheryl Sim
- Clinical and Biomedical Proteomics Group, Leeds Institute of Cancer and Pathology, University of Leeds, Leeds, UK
| | - Sebastian Trainor
- Clinical and Biomedical Proteomics Group, Leeds Institute of Cancer and Pathology, University of Leeds, Leeds, UK
| | - Matthew Welberry Smith
- Clinical and Biomedical Proteomics Group, Leeds Institute of Cancer and Pathology, University of Leeds, Leeds, UK
- Leeds Teaching Hospitals NHS Trust, Leeds, UK
| | - James Neuberger
- University Hospitals Birmingham NHS Foundation Trust, Birmingham, UK
| | | | - Paul Richardson
- Royal Liverpool and Broadgreen University Hospitals NHS Trust, Liverpool, UK
| | - John Christie
- Royal Devon and Exeter NHS Foundation Trust, Exeter, UK
| | - Neil Sheerin
- Newcastle upon Tyne Hospitals NHS Foundation Trust, Newcastle upon Tyne, UK
| | - William McKane
- Sheffield Teaching Hospitals NHS Foundation Trust, Sheffield, UK
| | - Paul Gibbs
- Portsmouth Hospitals NHS Trust, Portsmouth, UK
| | | | - Naeem Soomro
- Newcastle upon Tyne Hospitals NHS Foundation Trust, Newcastle upon Tyne, UK
| | | | - Grant D Stewart
- NHS Lothian, Edinburgh, UK
- Academic Urology Group, University of Cambridge, Cambridge, UK
| | - David Hrouda
- Charing Cross Hospital, Imperial College Healthcare NHS Trust, London, UK
|
27
|
Incorporating repeating temporal association rules in Naïve Bayes classifiers for coronary heart disease diagnosis. J Biomed Inform 2018; 81:74-82. [PMID: 29555443 DOI: 10.1016/j.jbi.2018.03.002] [Citation(s) in RCA: 20] [Impact Index Per Article: 3.3] [Received: 03/01/2017] [Revised: 02/14/2018] [Accepted: 03/07/2018] [Indexed: 01/08/2023]
Abstract
In this paper, we develop a Naïve Bayes classification model integrated with temporal association rules (TARs). A temporal pattern mining algorithm is used to detect TARs by identifying the most frequent temporal relationships among the derived basic temporal abstractions (TA). We develop and compare three classifiers that use as features the most frequent TARs as follows: (i) representing the most frequent TARs detected within the target class ('Disease = Present'), (ii) representing the most frequent TARs from both classes ('Disease = Present', 'Disease = Absent'), (iii) representing the most frequent TARs, after removing the ones that are low-risk predictors for the disease. These classifiers incorporate the horizontal support of TARs, which defines the number of times that a particular temporal pattern is found in some patient's record, as their features. All of the developed classifiers are applied for diagnosis of coronary heart disease (CHD) using a longitudinal dataset. We compare two ways of feature representation, using horizontal support or the mean duration of each TAR, on a single patient. The results obtained from this comparison show that the horizontal support representation outperforms the mean duration. The main effort of our research is to demonstrate that where long time periods are of significance in some medical domain, such as the CHD domain, the detection of the repeated occurrences of the most frequent TARs can yield better performances. We compared the classifier that uses the horizontal support representation and has the best performance with a Baseline Classifier which uses the binary representation of the most frequent TARs. The results obtained illustrate the comparatively high performance of the classifier representing the horizontal support, over the Baseline Classifier.
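The difference between the horizontal support and binary feature representations described in this abstract can be illustrated with a minimal sketch. The function names, the TAR labels, and the flat list-of-labels record format are assumptions made for illustration; none of them are taken from the paper.

```python
from collections import Counter


def horizontal_support(detected_tars, feature_tars):
    """Count how often each feature TAR occurs in one patient's record."""
    counts = Counter(detected_tars)
    return [counts[tar] for tar in feature_tars]


def binary_representation(detected_tars, feature_tars):
    """Mark only whether each feature TAR occurs at least once."""
    present = set(detected_tars)
    return [1 if tar in present else 0 for tar in feature_tars]


# Hypothetical TAR labels and one hypothetical patient record.
feature_tars = [
    "HighBP before HighLDL",
    "HighLDL overlaps HighBMI",
    "LowHDL before HighBP",
]
patient_record = [
    "HighBP before HighLDL",
    "HighLDL overlaps HighBMI",
    "HighBP before HighLDL",
    "HighBP before HighLDL",
]

support_features = horizontal_support(patient_record, feature_tars)
binary_features = binary_representation(patient_record, feature_tars)
print(support_features)  # [3, 1, 0]
print(binary_features)   # [1, 1, 0]
```

In this toy record the binary vector cannot distinguish a pattern seen once from one seen three times, which is the information the horizontal support representation retains.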
|
28
|
Liu L, Wang S, Su G, Hu B, Peng Y, Xiong Q, Wen J. A framework of mining semantic-based probabilistic event relations for complex activity recognition. Inf Sci (N Y) 2017. [DOI: 10.1016/j.ins.2017.07.022] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.9] [Indexed: 10/19/2022]
|
29
|
Shknevsky A, Shahar Y, Moskovitch R. Consistent discovery of frequent interval-based temporal patterns in chronic patients' data. J Biomed Inform 2017; 75:83-95. [PMID: 28987378 DOI: 10.1016/j.jbi.2017.10.002] [Citation(s) in RCA: 24] [Impact Index Per Article: 3.4] [Received: 03/11/2017] [Revised: 08/23/2017] [Accepted: 10/02/2017] [Indexed: 11/24/2022]
Abstract
Increasingly, frequent temporal patterns discovered in longitudinal patient records are proposed as features for classification and prediction, and as means to cluster patient clinical trajectories. However, to justify that, we must demonstrate that most frequent temporal patterns are indeed consistently discoverable within the records of different patient subsets within similar patient populations. We have developed several measures for the consistency of the discovery of temporal patterns. We focus on time-interval relations patterns (TIRPs) that can be discovered within different subsets of the same patient population. We expect the discovered TIRPs (1) to be frequent in each subset, (2) to preserve their "local" metrics - the absolute frequency of each pattern, measured by a Proportion Test, and (3) to preserve their "global" characteristics - their overall distribution, measured by a Kolmogorov-Smirnov test. We also wanted to examine the effect on consistency, over a variety of settings, of varying the minimal frequency threshold for TIRP discovery, and of using a TIRP-filtering criterion that we previously introduced, the Semantic Adjacency Criterion (SAC). We applied our methodology to three medical domains (oncology, infectious hepatitis, and diabetes). We found that, within the minimal frequency ranges we had examined, 70-95% of the discovered TIRPs were consistently discoverable; 40-48% of them maintained their local frequency. TIRP global distribution similarity varied widely, from 0% to 65%. Increasing the threshold usually increased the percentage of TIRPs that were repeatedly discovered across different patient subsets within the same domain, and the probability of a similar TIRP distribution. Using the SAC principle enhanced, for most minimal support levels, the percentage of repeating TIRPs, their local consistency, and their global consistency. The effect of using the SAC was further strengthened as the minimal frequency threshold was raised.
Affiliation(s)
- Alexander Shknevsky
- Software and Information Systems Engineering, Ben-Gurion University, Beer Sheva, Israel.
- Yuval Shahar
- Software and Information Systems Engineering, Ben-Gurion University, Beer Sheva, Israel.
- Robert Moskovitch
- Software and Information Systems Engineering, Ben-Gurion University, Beer Sheva, Israel.
|
30
|
Moskovitch R, Polubriaginof F, Weiss A, Ryan P, Tatonetti N. Procedure prediction from symbolic Electronic Health Records via time intervals analytics. J Biomed Inform 2017; 75:70-82. [PMID: 28823923 DOI: 10.1016/j.jbi.2017.07.018] [Citation(s) in RCA: 23] [Impact Index Per Article: 3.3] [Received: 05/15/2016] [Revised: 06/19/2017] [Accepted: 07/25/2017] [Indexed: 11/18/2022]
Abstract
Prediction of medical events, such as clinical procedures, is essential for preventing disease, understanding disease mechanism, and increasing patient quality of care. Although longitudinal clinical data from Electronic Health Records provides opportunities to develop predictive models, the use of these data faces significant challenges. Primarily, while the data are longitudinal and represent thousands of conceptual events having duration, they are also sparse, complicating the application of traditional analysis approaches. Furthermore, the framework presented here takes advantage of the events' durations and gaps. International standards for electronic healthcare data represent data elements, such as procedures, conditions, and drug exposures, using eras, or time intervals. Such eras contain both an event and a duration and enable the application of time intervals mining - a relatively new subfield of data mining. In this study, we present Maitreya, a framework for time intervals analytics in longitudinal clinical data. Maitreya discovers frequent time intervals related patterns (TIRPs), which we use as prognostic markers for modelling clinical events. We introduce three novel TIRP metrics that are normalized versions of the horizontal support, which represents the number of TIRP instances per patient. We evaluate Maitreya on 28 frequent and clinically important procedures, using the three novel TIRP representation metrics in comparison to no temporal representation and previous TIRP metrics. We also evaluate the epsilon value that makes Allen's relations more flexible, with settings of 30, 60, 90 and 180 days in comparison to the default of zero. For twenty-two of these procedures, the use of temporal patterns as predictors was superior to non-temporal features, and the use of the vertically normalized horizontal support metric to represent TIRPs as features was most effective. An epsilon value of thirty days performed slightly better than zero.
Affiliation(s)
- Robert Moskovitch
- Department of Biomedical Informatics, Columbia University, NY, USA; Department of Systems Biology, Columbia University, NY, USA; Department of Medicine, Columbia University, NY, USA; Observational Health Data Sciences and Informatics (OHDSI), NY, USA; Department of Software and Information Systems Engineering, Ben Gurion University, Beer Sheva, Israel.
- Fernanda Polubriaginof
- Department of Biomedical Informatics, Columbia University, NY, USA; Department of Systems Biology, Columbia University, NY, USA; Department of Medicine, Columbia University, NY, USA; Observational Health Data Sciences and Informatics (OHDSI), NY, USA.
- Aviram Weiss
- Department of Software and Information Systems Engineering, Ben Gurion University, Beer Sheva, Israel.
- Patrick Ryan
- Department of Biomedical Informatics, Columbia University, NY, USA.
- Nicholas Tatonetti
- Department of Biomedical Informatics, Columbia University, NY, USA; Department of Systems Biology, Columbia University, NY, USA; Department of Medicine, Columbia University, NY, USA; Observational Health Data Sciences and Informatics (OHDSI), NY, USA.
|
31
|
Ghosh S. Predicting short-term ICU outcomes using a sequential contrast motif based classification framework. ANNUAL INTERNATIONAL CONFERENCE OF THE IEEE ENGINEERING IN MEDICINE AND BIOLOGY SOCIETY. IEEE ENGINEERING IN MEDICINE AND BIOLOGY SOCIETY. ANNUAL INTERNATIONAL CONFERENCE 2017; 2016:5612-5615. [PMID: 28269527 DOI: 10.1109/embc.2016.7591999] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Indexed: 11/09/2022]
Abstract
Critical ICU events like acute hypotension and septic shock are dangerous complications, leading to multiple organ failures and eventual death. Previously, pattern mining algorithms have been employed for extracting interesting rules in various clinical domains. However, the extracted rules are directly investigated by clinicians for diagnosing a disease. Towards this purpose, there is a need to develop advanced prediction models which integrate dynamic patterns to learn a patient's physiological condition. In this study, a sequential contrast patterns-based classification framework is presented for detecting critical patient events, like hypotension and septic shock. Initially, a set of sequential patterns is obtained by using a contrast mining algorithm. Later, these patterns undergo post-processing, for conversion to two novel representations: (1) a frequency-based feature space and (2) ordered sequences of patterns, which conserve positional information of a pattern in a time series sequence. Each of these representations is automatically used for developing classification models using SVM and HMM methods. Our results on hypotension and septic shock datasets from a large scale ICU database demonstrate better predictive capabilities when sequential patterns are used as features.
|
32
|
|
33
|
Kostakis O, Papapetrou P. On searching and indexing sequences of temporal intervals. Data Min Knowl Discov 2017. [DOI: 10.1007/s10618-016-0489-3] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.0] [Indexed: 11/30/2022]
|
34
|
|
35
|
Sacchi L, Holmes JH. Progress in Biomedical Knowledge Discovery: A 25-year Retrospective. Yearb Med Inform 2016; Suppl 1:S117-29. [PMID: 27488403 PMCID: PMC5171499 DOI: 10.15265/iys-2016-s033] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.8] [Indexed: 12/28/2022] Open
Abstract
OBJECTIVES We sought to explore, via a systematic review of the literature, the state of the art of knowledge discovery in biomedical databases as it existed in 1992 and as it exists now, 25 years later, focusing mainly on supervised learning. METHODS We performed a rigorous systematic search of PubMed and applied latent Dirichlet allocation to identify themes in the literature and trends in the science of knowledge discovery in and between time periods and compare these trends. We restricted the result set using a bracket of five years previous, such that the 1992 result set was restricted to articles published between 1987 and 1992, and the 2015 set between 2011 and 2015. This was to reflect the current literature available at the time to researchers and others at the target dates of 1992 and 2015. The search term was framed as: Knowledge Discovery OR Data Mining OR Pattern Discovery OR Pattern Recognition, Automated. RESULTS A total of 538 and 18,172 documents were retrieved for 1992 and 2015, respectively. The number and type of data sources increased dramatically over the observation period, primarily due to the advent of electronic clinical systems. The period 1992-2015 saw the emergence of new areas of research in knowledge discovery, and the refinement and application of machine learning approaches that were nascent or unknown in 1992. CONCLUSIONS Over the 25 years of the observation period, we identified numerous developments that impacted the science of knowledge discovery, including the availability of new forms of data, new machine learning algorithms, and new application domains. Through a bibliometric analysis we examine the striking changes in the availability of highly heterogeneous data resources and the evolution of new algorithmic approaches to knowledge discovery, and we consider from legal, social, and political perspectives possible explanations of the growth of the field. Finally, we reflect on the achievements of the past 25 years to consider what the next 25 years will bring with regard to the availability of even more complex data and to the methods that could be, and are now being, developed for the discovery of new knowledge in biomedical data.
Affiliation(s)
- J H Holmes
- John H Holmes, Institute for Biomedical Informatics, University of Pennsylvania School of Medicine, 717 Blockley Hall, 423 Guardian Drive, Philadelphia, PA 19104, USA, Tel: 215-898-4833, Fax: 215-573-5325, E-Mail:
|
36
|
Li C, Rana S, Phung D, Venkatesh S. Hierarchical Bayesian nonparametric models for knowledge discovery from electronic medical records. Knowl Based Syst 2016. [DOI: 10.1016/j.knosys.2016.02.005] [Citation(s) in RCA: 17] [Impact Index Per Article: 2.1] [Indexed: 11/29/2022]
|
37
|
Ouyang L, Apley DW, Mehrotra S. A design of experiments approach to validation sampling for logistic regression modeling with error-prone medical records. J Am Med Inform Assoc 2016; 23:e71-8. [PMID: 26374705 PMCID: PMC4954627 DOI: 10.1093/jamia/ocv132] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.8] [Received: 01/23/2015] [Revised: 07/16/2015] [Accepted: 07/17/2015] [Indexed: 01/18/2023] Open
Abstract
BACKGROUND AND OBJECTIVE Electronic medical record (EMR) databases offer significant potential for developing clinical hypotheses and identifying disease risk associations by fitting statistical models that capture the relationship between a binary response variable and a set of predictor variables that represent clinical, phenotypical, and demographic data for the patient. However, EMR response data may be error prone for a variety of reasons. Performing a manual chart review to validate data accuracy is time consuming, which limits the number of chart reviews in a large database. The authors' objective is to develop a new design-of-experiments-based systematic chart validation and review (DSCVR) approach that is more powerful than the random validation sampling used in existing approaches. METHODS The DSCVR approach judiciously and efficiently selects the cases to validate (i.e., validate whether the response values are correct for those cases) for maximum information content, based only on their predictor variable values. The final predictive model will be fit using only the validation sample, ignoring the remainder of the unvalidated and unreliable error-prone data. A Fisher information based D-optimality criterion is used, and an algorithm for optimizing it is developed. RESULTS The authors' method is tested in a simulation comparison that is based on a sudden cardiac arrest case study with 23 041 patients' records. This DSCVR approach, using the Fisher information based D-optimality criterion, results in a fitted model with much better predictive performance, as measured by the receiver operating characteristic curve and the accuracy in predicting whether a patient will experience the event, than a model fitted using a random validation sample. 
CONCLUSIONS The simulation comparisons demonstrate that this DSCVR approach can produce predictive models that are significantly better than those produced from random validation sampling, especially when the event rate is low.
Affiliation(s)
- Liwen Ouyang
- Department of Industrial Engineering and Management Sciences, Northwestern University, Evanston, IL 60208, USA
- Daniel W Apley
- Department of Industrial Engineering and Management Sciences, Northwestern University, Evanston, IL 60208, USA
- Sanjay Mehrotra
- Department of Industrial Engineering and Management Sciences, Northwestern University, Evanston, IL 60208, USA
|
38
|
Hoogendoorn M, Szolovits P, Moons LMG, Numans ME. Utilizing uncoded consultation notes from electronic medical records for predictive modeling of colorectal cancer. Artif Intell Med 2016; 69:53-61. [PMID: 27085847 DOI: 10.1016/j.artmed.2016.03.003] [Citation(s) in RCA: 19] [Impact Index Per Article: 2.4] [Received: 11/06/2015] [Accepted: 03/23/2016] [Indexed: 12/15/2022]
Abstract
OBJECTIVE Machine learning techniques can be used to extract predictive models for diseases from electronic medical records (EMRs). However, the nature of EMRs makes it difficult to apply off-the-shelf machine learning techniques while still exploiting the rich content of the EMRs. In this paper, we explore the usage of a range of natural language processing (NLP) techniques to extract valuable predictors from uncoded consultation notes and study whether they can help to improve predictive performance. METHODS We study a number of existing techniques for the extraction of predictors from the consultation notes, namely a bag of words based approach and topic modeling. In addition, we develop a dedicated technique to match the uncoded consultation notes with a medical ontology. We apply these techniques as an extension to an existing pipeline to extract predictors from EMRs. We evaluate them in the context of predictive modeling for colorectal cancer (CRC), a disease known to be difficult to diagnose before performing an endoscopy. RESULTS Our results show that we are able to extract useful information from the consultation notes. The predictive performance of the ontology-based extraction method moves significantly beyond the benchmark of age and gender alone (area under the receiver operating characteristic curve (AUC) of 0.870 versus 0.831). We also observe more accurate predictive models by adding features derived from processing the consultation notes compared to solely using coded data (AUC of 0.896 versus 0.882), although the difference is not significant. The extracted features from the notes are shown to be equally predictive (i.e. there is no significant difference in performance) compared to the coded data of the consultations. CONCLUSION It is possible to extract useful predictors from uncoded consultation notes that improve predictive performance. Techniques linking text to concepts in medical ontologies to derive these predictors are shown to perform best for predicting CRC in our EMR dataset.
Affiliation(s)
- Mark Hoogendoorn
- Department of Computer Science, VU University Amsterdam, De Boelelaan 1081, 1081 HV Amsterdam, The Netherlands; Computer Science and Artificial Intelligence Lab, Massachusetts Institute of Technology, 32 Vassar Street, Cambridge, MA 02139, USA.
- Peter Szolovits
- Computer Science and Artificial Intelligence Lab, Massachusetts Institute of Technology, 32 Vassar Street, Cambridge, MA 02139, USA.
- Leon M G Moons
- Department of Gastroenterology and Hepatology, Utrecht University Medical Center, Heidelberglaan 100, 3584 CX Utrecht, The Netherlands.
- Mattijs E Numans
- Department of Public Health and Primary Care, Leiden University Medical Center, Hippocratespad 21, 2333 ZD Leiden, The Netherlands.
|
39
|
Analyzing health insurance claims on different timescales to predict days in hospital. J Biomed Inform 2016; 60:187-96. [PMID: 26827621 DOI: 10.1016/j.jbi.2016.01.002] [Citation(s) in RCA: 18] [Impact Index Per Article: 2.3] [Received: 04/16/2015] [Revised: 01/05/2016] [Accepted: 01/05/2016] [Indexed: 11/21/2022]
Abstract
Health insurers maintain large databases containing information on medical services utilized by claimants, often spanning several healthcare services and providers. Proper use of these databases could facilitate better clinical and administrative decisions. In these data sets, there exist many unequally spaced events, such as hospital visits. However, data mining of temporal data and point processes is still a developing research area and extracting useful information from such data series is a challenging task. In this paper, we developed a time series data mining approach to predict the number of days in hospital in the coming year for individuals from a general insured population based on their insurance claim data. In the proposed method, the data were windowed at four different timescales (bi-monthly, quarterly, half-yearly and yearly) to construct regularly spaced time series features extracted from such events, resulting in four associated prediction models. A comparison of these models indicates that models using a half-yearly windowing scheme deliver the best performance on all three populations (the whole population, a senior sub-population and a non-senior sub-population). The superiority of the half-yearly model was found to be particularly pronounced in the senior sub-population. A bagged decision tree approach was able to predict 'no hospitalization' versus 'at least one day in hospital' with a Matthews correlation coefficient (MCC) of 0.426. This was significantly better than the corresponding yearly model, which achieved 0.375 for this group of customers. Further reducing the length of the analysis windows to three or two months did not produce further improvements.
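The windowing step described in this abstract, which turns unequally spaced events into regularly spaced features, can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function name, the day-offset event encoding, and the 90-, 180- and 360-day window lengths are assumptions chosen for simplicity.

```python
def window_counts(event_days, window_days, horizon_days):
    """Count events in consecutive fixed-length windows covering [0, horizon_days)."""
    n_windows = horizon_days // window_days
    counts = [0] * n_windows
    for day in event_days:
        idx = day // window_days
        # Ignore events that fall outside the observation horizon.
        if 0 <= idx < n_windows:
            counts[idx] += 1
    return counts


# Hypothetical hospital-visit days, measured from the start of the observation year.
visits = [3, 40, 45, 100, 200, 359]

quarterly = window_counts(visits, window_days=90, horizon_days=360)
half_yearly = window_counts(visits, window_days=180, horizon_days=360)
```

Each window length yields a feature vector of a different size over the same events, which is what allows one prediction model per timescale to be trained and compared.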
|
40
|
Jane NY, Nehemiah KH, Arputharaj K. A Temporal Mining Framework for Classifying Un-Evenly Spaced Clinical Data: An Approach for Building Effective Clinical Decision-Making System. Appl Clin Inform 2016; 7:1-21. [PMID: 27081403 DOI: 10.4338/aci-2015-08-ra-0102] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.0] [Received: 08/19/2015] [Accepted: 11/08/2015] [Indexed: 11/23/2022] Open
Abstract
BACKGROUND Clinical time-series data acquired from electronic health records (EHR) are liable to temporal complexities such as irregular observations, missing values and time constrained attributes that make the knowledge discovery process challenging. OBJECTIVE This paper presents a temporal rough set induced neuro-fuzzy (TRiNF) mining framework that handles these complexities and builds an effective clinical decision-making system. TRiNF provides two functionalities, namely temporal data acquisition (TDA) and temporal classification. METHOD In TDA, a time-series forecasting model is constructed by adopting an improved double exponential smoothing method. The forecasting model is used in missing value imputation and temporal pattern extraction. The relevant attributes are selected using a temporal pattern based rough set approach. In temporal classification, a classification model is built with the selected attributes using a temporal pattern induced neuro-fuzzy classifier. RESULT For experimentation, this work uses two clinical time-series datasets, of hepatitis and thrombosis patients. The experimental results show that with the proposed TRiNF framework, there is a significant reduction in the error rate, yielding an average classification accuracy of 92.59% for the hepatitis dataset and 91.69% for the thrombosis dataset. CONCLUSION The obtained classification results prove the efficiency of the proposed framework in terms of its improved classification accuracy.
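As a rough illustration of forecasting-based imputation, the sketch below uses the standard Holt linear (double) exponential smoothing method. It is not the improved variant the abstract refers to, whose details are not given here; the function names, the smoothing parameters, and the choice to initialise the trend from the first two observations are assumptions.

```python
def holt_forecast(values, alpha=0.5, beta=0.5):
    """One-step-ahead forecast with standard Holt double exponential smoothing."""
    if len(values) < 2:
        raise ValueError("need at least two observations")
    level = values[0]
    trend = values[1] - values[0]  # assumed initialisation
    for x in values[1:]:
        prev_level = level
        level = alpha * x + (1 - alpha) * (level + trend)
        trend = beta * (level - prev_level) + (1 - beta) * trend
    return level + trend


def impute_missing(series, alpha=0.5, beta=0.5):
    """Replace each None with the forecast from the values seen (or imputed) so far."""
    filled = []
    for x in series:
        if x is None:
            x = holt_forecast(filled, alpha, beta)
        filled.append(x)
    return filled


# Hypothetical lab measurements with one missing observation.
completed = impute_missing([1, 2, 3, None, 5])
```

This sketch requires at least two observed values before the first gap and treats the series as evenly spaced, which the irregular clinical data discussed in the abstract generally are not.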
Affiliation(s)
- Kannan Arputharaj
- Department of Information Science and Technology, Anna University, Chennai, India
|
41
|
Batal I, Cooper G, Fradkin D, Harrison J, Moerchen F, Hauskrecht M. An Efficient Pattern Mining Approach for Event Detection in Multivariate Temporal Data. Knowl Inf Syst 2016; 46:115-150. [PMID: 26752800 PMCID: PMC4704806 DOI: 10.1007/s10115-015-0819-6] [Citation(s) in RCA: 31] [Impact Index Per Article: 3.9] [Received: 11/30/2013] [Revised: 08/31/2014] [Accepted: 12/06/2014] [Indexed: 11/27/2022]
Abstract
This work proposes a pattern mining approach to learn event detection models from complex multivariate temporal data, such as electronic health records. We present Recent Temporal Pattern mining, a novel approach for efficiently finding predictive patterns for event detection problems. This approach first converts the time series data into time-interval sequences of temporal abstractions. It then constructs more complex time-interval patterns backward in time using temporal operators. We also present the Minimal Predictive Recent Temporal Patterns framework for selecting a small set of predictive and non-spurious patterns. We apply our methods for predicting adverse medical events in real-world clinical data. The results demonstrate the benefits of our methods in learning accurate event detection models, which is a key step for developing intelligent patient monitoring and decision support systems.
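The first step mentioned in this abstract, converting a time series into time-interval sequences of temporal abstractions, can be sketched in Python as follows. This is a simplified illustration under stated assumptions: the function names are hypothetical, the cutoffs and state labels are made up for the example and carry no clinical meaning, and gaps between samples are ignored.

```python
def abstract_state(value, cutoffs, labels):
    """Map a numeric value to a symbolic state using ascending cutoffs."""
    for cutoff, label in zip(cutoffs, labels):
        if value < cutoff:
            return label
    # Values at or above the last cutoff get the final label.
    return labels[-1]


def to_intervals(samples, cutoffs, labels):
    """Merge consecutive samples sharing a state into (state, start, end) intervals."""
    intervals = []
    for t, v in samples:
        state = abstract_state(v, cutoffs, labels)
        if intervals and intervals[-1][0] == state:
            # Extend the current interval to the new time stamp.
            intervals[-1] = (state, intervals[-1][1], t)
        else:
            intervals.append((state, t, t))
    return intervals


# Hypothetical (time, value) samples and illustrative cutoffs.
samples = [(0, 65), (1, 68), (2, 95), (3, 110), (4, 190), (5, 200)]
intervals = to_intervals(samples, cutoffs=[70, 180], labels=["low", "normal", "high"])
```

The resulting state intervals are the kind of symbolic input over which time-interval patterns can then be mined.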
Affiliation(s)
- Gregory Cooper
- Department of Biomedical Informatics, University of Pittsburgh,
- James Harrison
- Department of Public Health Sciences, University of Virginia,
|
42
|
|
43
|
Ullah MZ, Aono M, Seddiqui MH. Estimating a Ranked List of Human Genetic Diseases by Associating Phenotype-Gene with Gene-Disease Bipartite Graphs. ACM T INTEL SYST TEC 2015. [DOI: 10.1145/2700487] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Indexed: 12/16/2022]
Abstract
With vast amounts of medical knowledge available on the Internet, it is becoming increasingly practical to help doctors in clinical diagnostics by suggesting plausible diseases predicted by applying data and text mining technologies. Recently, Genome-Wide Association Studies (GWAS) have proved useful as a method for exploring phenotypic associations with diseases. However, since genetic diseases are difficult to diagnose because of their low prevalence, large number, and broad diversity of symptoms, genetic disease patients are often misdiagnosed or experience long diagnostic delays. In this article, we propose a method for ranking genetic diseases for a set of clinical phenotypes. In this regard, we associate a phenotype-gene bipartite graph (PGBG) with a gene-disease bipartite graph (GDBG) by producing a phenotype-disease bipartite graph (PDBG), and we estimate the candidate weights of diseases. In our approach, all paths from a phenotype to a disease are explored by considering causative genes to assign a weight based on path frequency, and the phenotype is linked to the disease in a new PDBG. We introduce the Bidirectionally induced Importance Weight (BIW) prediction method to PDBG for approximating the weights of the edges of diseases with phenotypes by considering link information from both sides of the bipartite graph. The performance of our system is compared to that of other known related systems by estimating Normalized Discounted Cumulative Gain (NDCG), Mean Average Precision (MAP), and Kendall’s tau metrics. Further experiments are conducted with well-known TF · IDF, BM25, and Jensen-Shannon divergence as baselines. The result shows that our proposed method outperforms the known related tool Phenomizer in terms of NDCG@10, NDCG@20, MAP@10, and MAP@20; however, it performs worse than Phenomizer in terms of Kendall’s tau-b metric at the top-10 ranks. It also turns out that our proposed method has overall better performance than the baseline methods.
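For readers unfamiliar with the NDCG@k metric reported in this abstract, a minimal Python sketch follows. It uses the common linear-gain formulation with a log2 position discount and assumes the ideal ranking is obtained by sorting the same list of relevance scores; the function names are hypothetical, and the paper's exact evaluation setup may differ.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the top-k ranked items."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))


def ndcg_at_k(relevances, k):
    """DCG normalised by the DCG of the ideal (descending) ordering."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


# Relevance of each ranked disease for a query (1 = correct diagnosis), illustrative only.
perfect = ndcg_at_k([1, 0, 0], k=3)
second_place = ndcg_at_k([0, 1], k=2)
```

A correct disease ranked first gives a score of 1, while the same disease ranked second is discounted by the logarithm of its position.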
|
44
|
Antonelli D, Baralis E, Bruno G, Cagliero L, Cerquitelli T, Chiusano S, Garza P, Mahoto NA. MeTA. ACM T INTEL SYST TEC 2015. [DOI: 10.1145/2700479] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.7] [Indexed: 10/23/2022]
Abstract
Physicians and health care organizations always collect large amounts of data during patient care. These large and high-dimensional datasets are usually characterized by an inherent sparseness. Hence, analyzing these datasets to figure out interesting and hidden knowledge is a challenging task. This article proposes a new data mining framework based on generalized association rules to discover multiple-level correlations among patient data. Specifically, correlations among prescribed examinations, drugs, and patient profiles are discovered and analyzed at different abstraction levels. The rule extraction process is driven by a taxonomy to generalize examinations and drugs into their corresponding categories. To ease the manual inspection of the result, a worthwhile subset of rules (i.e., nonredundant generalized rules) is considered. Furthermore, rules are classified according to the involved data features (medical treatments or patient profiles) and then explored in a top-down fashion: from the small subset of high-level rules, a drill-down is performed to target more specific rules. The experiments, performed on a real diabetic patient dataset, demonstrate the effectiveness of the proposed approach in discovering interesting rule groups at different abstraction levels.
Affiliation(s)
- Naeem A. Mahoto
- Mehran University of Engineering and Technology, Jamshoro, Pakistan
|
45
|
Liu Z, Hauskrecht M. A Regularized Linear Dynamical System Framework for Multivariate Time Series Analysis. PROCEEDINGS OF THE ... AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE. AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE 2015; 2015:1798-1804. [PMID: 25905027 PMCID: PMC4402162] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/17/2023]
Abstract
Linear Dynamical System (LDS) is an elegant mathematical framework for modeling and learning Multivariate Time Series (MTS). However, in general, it is difficult to set the dimension of an LDS's hidden state space. A small number of hidden states may not be able to model the complexities of a MTS, while a large number of hidden states can lead to overfitting. In this paper, we study learning methods that impose various regularization penalties on the transition matrix of the LDS model and propose a regularized LDS learning framework (rLDS) which aims to (1) automatically shut down LDSs' spurious and unnecessary dimensions, and consequently, address the problem of choosing the optimal number of hidden states; (2) prevent the overfitting problem given a small amount of MTS data; and (3) support accurate MTS forecasting. To learn the regularized LDS from data we incorporate a second order cone program and a generalized gradient descent method into the Maximum a Posteriori framework and use Expectation Maximization to obtain a low-rank transition matrix of the LDS model. We propose two priors for modeling the matrix which lead to two instances of our rLDS. We show that our rLDS is able to recover well the intrinsic dimensionality of the time series dynamics and it improves the predictive performance when compared to baselines on both synthetic and real-world MTS datasets.
Affiliation(s)
- Zitao Liu
- Computer Science Department, University of Pittsburgh, 210 South Bouquet St., Pittsburgh, PA, 15260 USA
- Milos Hauskrecht
- Computer Science Department, University of Pittsburgh, 210 South Bouquet St., Pittsburgh, PA, 15260 USA
|
46
|
|
47
|
Classification of multivariate time series via temporal abstraction and time intervals mining. Knowl Inf Syst 2014. [DOI: 10.1007/s10115-014-0784-5] [Citation(s) in RCA: 67] [Impact Index Per Article: 6.7] [Indexed: 12/01/2022]
|