1
Assaad CK, Devijver E, Gaussier E. Entropy-Based Discovery of Summary Causal Graphs in Time Series. ENTROPY (BASEL, SWITZERLAND) 2022; 24:1156. [PMID: 36010820 PMCID: PMC9407574 DOI: 10.3390/e24081156] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/30/2022] [Revised: 08/05/2022] [Accepted: 08/14/2022] [Indexed: 06/15/2023]
Abstract
This study addresses the problem of learning a summary causal graph on time series with potentially different sampling rates. To do so, we first propose a new causal temporal mutual information measure for time series. We then show how this measure relates to an entropy reduction principle that can be seen as a special case of the probability raising principle. We finally combine these two ingredients in PC-like and FCI-like algorithms to construct the summary causal graph. These algorithms are evaluated on several datasets, which shows both their efficacy and efficiency.
Affiliation(s)
- Charles K. Assaad
  - R&D Department, EasyVista, 38000 Grenoble, France
  - Department of Mathematics, Information and Communication Sciences, University of Grenoble Alpes, CNRS, Grenoble INP, LIG, 38000 Grenoble, France
- Emilie Devijver
  - Department of Mathematics, Information and Communication Sciences, University of Grenoble Alpes, CNRS, Grenoble INP, LIG, 38000 Grenoble, France
- Eric Gaussier
  - Department of Mathematics, Information and Communication Sciences, University of Grenoble Alpes, CNRS, Grenoble INP, LIG, 38000 Grenoble, France

2
Seri R, Martinoli M. Asymptotic Properties of the Plug-in Estimator of the Discrete Entropy Under Dependence. IEEE TRANSACTIONS ON INFORMATION THEORY 2021; 67:7659-7683. [DOI: 10.1109/tit.2021.3109307] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 09/02/2023]

3
Estiri H, Strasser ZH, Murphy SN. High-throughput phenotyping with temporal sequences. J Am Med Inform Assoc 2021; 28:772-781. [PMID: 33313899 DOI: 10.1093/jamia/ocaa288] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 07/01/2020] [Accepted: 11/04/2020] [Indexed: 12/15/2022] Open
Abstract
OBJECTIVE High-throughput electronic phenotyping algorithms can accelerate translational research using data from electronic health record (EHR) systems. The temporal information buried in EHRs is often underutilized in developing computational phenotypic definitions. This study aims to develop a high-throughput phenotyping method, leveraging temporal sequential patterns from EHRs. MATERIALS AND METHODS We develop a representation mining algorithm to extract 5 classes of representations from EHR diagnosis and medication records: the aggregated vector of the records (aggregated vector representation), the standard sequential patterns (sequential pattern mining), the transitive sequential patterns (transitive sequential pattern mining), and 2 hybrid classes. Using EHR data on 10 phenotypes from the Mass General Brigham Biobank, we train and validate phenotyping algorithms. RESULTS Phenotyping with temporal sequences resulted in a superior classification performance across all 10 phenotypes compared with the standard representations in electronic phenotyping. The high-throughput algorithm's classification performance was superior or similar to the performance of previously published electronic phenotyping algorithms. We characterize and evaluate the top transitive sequences of diagnosis records paired with the records of risk factors, symptoms, complications, medications, or vaccinations. DISCUSSION The proposed high-throughput phenotyping approach enables seamless discovery of sequential record combinations that may be difficult to assume from raw EHR data. Transitive sequences offer more accurate characterization of the phenotype, compared with its individual components, and reflect the actual lived experiences of the patients with that particular disease. CONCLUSION Sequential data representations provide a precise mechanism for incorporating raw EHR records into downstream machine learning. 
Our approach starts with user interpretability and works backward to the technology.
Affiliation(s)
- Hossein Estiri
  - Harvard Medical School, Boston, Massachusetts, USA
  - Massachusetts General Hospital, Boston, Massachusetts, USA
  - Mass General Brigham, Boston, Massachusetts, USA
- Zachary H Strasser
  - Harvard Medical School, Boston, Massachusetts, USA
  - Massachusetts General Hospital, Boston, Massachusetts, USA
  - Mass General Brigham, Boston, Massachusetts, USA
- Shawn N Murphy
  - Harvard Medical School, Boston, Massachusetts, USA
  - Massachusetts General Hospital, Boston, Massachusetts, USA
  - Mass General Brigham, Boston, Massachusetts, USA

4
Between-day repeatability of sensor-based in-home gait assessment among older adults: assessing the effect of frailty. Aging Clin Exp Res 2021; 33:1529-1537. [PMID: 32930988 DOI: 10.1007/s40520-020-01686-x] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 04/25/2020] [Accepted: 08/14/2020] [Indexed: 01/10/2023]
Abstract
BACKGROUND While sensor-based daily physical activity (DPA) gait assessment has been demonstrated to be an effective measure of physical frailty and fall-risk, the repeatability of DPA gait parameters between different days of measurement is not clear. AIMS To evaluate test-retest reliability (repeatability) of DPA gait performance parameters, representing the quality of walking, and quantitative gait measures (e.g. number of steps) between two separate days of assessment among older adults. METHODS DPA was acquired for 48 h from older adults (age ≥ 65 years) using a tri-axial accelerometer. Continuous walking bouts (≥ 60 s) were identified from acceleration data and used to extract gait performance parameters, including time- and frequency-domain gait parameters, representing walking speed, variability, and irregularity. To assess repeatability, the intraclass correlation coefficient (ICC) was calculated using two-way mixed effects F-test models for day-1 vs. day-2 as the independent random effect. Repeatability tests were performed for all participants and also within frailty groups (non-frail and pre-frail/frail identified using Fried phenotype). RESULTS Data were analyzed from 63 older adults (29 non-frail and 34 pre-frail/frail). Most of the time- and frequency-domain gait performance parameters showed good to excellent repeatability (ICC ≥ 0.70), while quantitative parameters, including number of steps and walking duration, showed poor repeatability (ICC < 0.30). For the majority of the gait performance parameters, we observed higher repeatability among the pre-frail/frail group (ICC > 0.78) compared to non-frail individuals (0.39 < ICC < 0.55). CONCLUSION Gait performance parameters showed higher repeatability compared to quantitative measures. Higher repeatability among pre-frail/frail individuals may be attributed to a reduced functional capacity for performing more intense and variable physical tasks.
TRIAL REGISTRATION The clinical trial was retrospectively registered on June 18th, 2013 with ClinicalTrials.gov, identifier NCT01880229.
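As a rough illustration of the repeatability statistic named in the abstract above, the sketch below computes a two-way mixed effects ICC for day-1 vs. day-2 values. The abstract does not say which ICC form was used, so the consistency, single-measurement form (often written ICC(3,1)) is an assumption here, and the function name and all numbers are made up for illustration.

```python
def icc_consistency(scores):
    """ICC(3,1): two-way mixed effects, consistency, single measurement.

    scores: one row per participant, one column per measurement day.
    """
    n = len(scores)          # participants
    k = len(scores[0])       # days
    grand = sum(sum(row) for row in scores) / (n * k)
    row_means = [sum(row) / k for row in scores]
    col_means = [sum(row[j] for row in scores) / n for j in range(k)]

    # Two-way ANOVA sums of squares (no replication).
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_error = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))
    return (ms_rows - ms_error) / (ms_rows + (k - 1) * ms_error)


# Hypothetical gait-variability values (day 1, day 2) for five participants.
gait_variability = [
    [0.041, 0.043],
    [0.060, 0.058],
    [0.035, 0.039],
    [0.072, 0.070],
    [0.050, 0.047],
]
print(round(icc_consistency(gait_variability), 3))
```

Because this form measures consistency rather than absolute agreement, a constant shift between the two days does not lower the coefficient.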

5
Estiri H, Strasser ZH, Klann JG, McCoy TH, Wagholikar KB, Vasey S, Castro VM, Murphy ME, Murphy SN. Transitive Sequencing Medical Records for Mining Predictive and Interpretable Temporal Representations. PATTERNS (NEW YORK, N.Y.) 2020; 1:100051. [PMID: 32835307 PMCID: PMC7301790 DOI: 10.1016/j.patter.2020.100051] [Citation(s) in RCA: 15] [Impact Index Per Article: 3.8] [Received: 02/13/2020] [Revised: 04/27/2020] [Accepted: 05/26/2020] [Indexed: 12/13/2022]
Abstract
Electronic health records (EHRs) contain important temporal information about the progression of disease and treatment outcomes. This paper proposes a transitive sequencing approach for constructing temporal representations from EHR observations for downstream machine learning. Using clinical data from a cohort of patients with congestive heart failure, we mined temporal representations by transitive sequencing of EHR medication and diagnosis records for classification and prediction tasks. We compared the classification and prediction performances of the transitive sequential representations (bag-of-sequences approach) with the conventional approach of using aggregated vectors of EHR data (aggregated vector representation) across different classifiers. We found that the transitive sequential representations are better phenotype "differentiators" and predictors than the "atemporal" EHR records. Our results also demonstrated that data representations obtained from transitive sequencing of EHR observations can present novel insights about the progression of the disease that are difficult to discern when clinical data are treated independently of the patient's history.
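The idea of sequencing records in temporal order can be sketched as follows. This is only an illustration under an assumption: "transitive" is taken to mean that each record is paired with every later record of the same patient, not just the next one. The function name and the record codes are hypothetical and are not taken from the paper.

```python
from itertools import combinations


def transitive_sequences(records):
    """Return the set of (earlier_code, later_code) pairs for one patient.

    records: list of (time, code) tuples in any order.
    Pairs are formed between every record and every strictly later record.
    """
    ordered = sorted(records, key=lambda r: r[0])
    pairs = set()
    for (t1, first), (t2, second) in combinations(ordered, 2):
        if t1 < t2:  # skip records that share a timestamp
            pairs.add((first, second))
    return pairs


# Hypothetical patient history: (day, record code).
patient = [(3, "furosemide"), (1, "hypertension"), (2, "heart_failure")]
for pair in sorted(transitive_sequences(patient)):
    print(pair)
```

A bag-of-sequences feature vector could then be built by counting, across patients, which pairs occur; an aggregated vector representation would instead count the individual codes and discard their order.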
Affiliation(s)
- Hossein Estiri
  - Laboratory of Computer Science, Massachusetts General Hospital, Boston, MA 02144, USA
  - Research Information Science and Computing, Mass General Brigham, Somerville, MA 02145, USA
  - Harvard Medical School, Boston, MA 02115, USA
- Zachary H. Strasser
  - Laboratory of Computer Science, Massachusetts General Hospital, Boston, MA 02144, USA
  - Research Information Science and Computing, Mass General Brigham, Somerville, MA 02145, USA
  - Harvard Medical School, Boston, MA 02115, USA
  - Department of Biomedical Informatics, Harvard Medical School, Boston, MA 02115, USA
- Jeffery G. Klann
  - Laboratory of Computer Science, Massachusetts General Hospital, Boston, MA 02144, USA
  - Research Information Science and Computing, Mass General Brigham, Somerville, MA 02145, USA
  - Harvard Medical School, Boston, MA 02115, USA
- Thomas H. McCoy
  - Harvard Medical School, Boston, MA 02115, USA
  - Center for Quantitative Health, Massachusetts General Hospital, Boston, MA 02114, USA
- Kavishwar B. Wagholikar
  - Laboratory of Computer Science, Massachusetts General Hospital, Boston, MA 02144, USA
  - Research Information Science and Computing, Mass General Brigham, Somerville, MA 02145, USA
  - Harvard Medical School, Boston, MA 02115, USA
- Sebastien Vasey
  - Department of Mathematics, Harvard University, Cambridge, MA 02138, USA
- Victor M. Castro
  - Research Information Science and Computing, Mass General Brigham, Somerville, MA 02145, USA
- MaryKate E. Murphy
  - Research Information Science and Computing, Mass General Brigham, Somerville, MA 02145, USA
- Shawn N. Murphy
  - Laboratory of Computer Science, Massachusetts General Hospital, Boston, MA 02144, USA
  - Research Information Science and Computing, Mass General Brigham, Somerville, MA 02145, USA
  - Harvard Medical School, Boston, MA 02115, USA
  - Department of Biomedical Informatics, Harvard Medical School, Boston, MA 02115, USA
  - Department of Neurology, Massachusetts General Hospital, Boston, MA 02114, USA

6
Pradeep Kumar D, Toosizadeh N, Mohler J, Ehsani H, Mannier C, Laksari K. Sensor-based characterization of daily walking: a new paradigm in pre-frailty/frailty assessment. BMC Geriatr 2020; 20:164. [PMID: 32375700 PMCID: PMC7203790 DOI: 10.1186/s12877-020-01572-1] [Citation(s) in RCA: 20] [Impact Index Per Article: 5.0] [Received: 09/17/2019] [Accepted: 04/28/2020] [Indexed: 12/14/2022] Open
Abstract
BACKGROUND Frailty is a highly recognized geriatric syndrome resulting in decline in reserve across multiple physiological systems. Impaired physical function is one of the major indicators of frailty. The goal of this study was to evaluate an algorithm that discriminates between frailty groups (non-frail and pre-frail/frail) based on gait performance parameters derived from unsupervised daily physical activity (DPA). METHODS DPA was acquired for 48 h from older adults (≥65 years) using a tri-axial accelerometer motion-sensor. Continuous bouts of walking for 20s, 30s, 40s, 50s and 60s without pauses were identified from acceleration data. These were then used to extract qualitative measures (gait variability, gait asymmetry, and gait irregularity) and quantitative measures (total continuous walking duration and maximum number of continuous steps) to characterize gait performance. Association between frailty and gait performance parameters was assessed using multinomial logistic models with frailty as the dependent variable, and gait performance parameters along with demographic parameters as independent variables. RESULTS One hundred twenty-six older adults (44 non-frail, 60 pre-frail, and 22 frail, based on the Fried index) were recruited. Step- and stride-times, frequency domain gait variability, and continuous walking quantitative measures were significantly different between non-frail and pre-frail/frail groups (p < 0.05). Among the five different durations (20s, 30s, 40s, 50s and 60s), gait performance parameters extracted from 60s continuous walks provided the best frailty assessment results. Using the 60s gait performance parameters in the logistic model, pre-frail/frail group (vs. non-frail) was identified with 76.8% sensitivity and 80% specificity. DISCUSSION Everyday walking characteristics were found to be associated with frailty. 
Along with quantitative measures of physical activity, qualitative measures are critical elements representing the early stages of frailty. In-home gait assessment offers an opportunity to screen for and monitor frailty. TRIAL REGISTRATION The clinical trial was retrospectively registered on June 18th, 2013 with ClinicalTrials.gov, identifier NCT01880229.
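For readers less familiar with the two performance figures quoted in the abstract above, the sketch below shows how sensitivity and specificity are computed from binary labels. The labels are invented for illustration (1 = pre-frail/frail, 0 = non-frail) and do not reproduce the study's data or its reported values.

```python
def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity) for binary labels coded 0/1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    sensitivity = tp / (tp + fn)  # share of true positives detected
    specificity = tn / (tn + fp)  # share of true negatives detected
    return sensitivity, specificity


# Hypothetical reference labels and model predictions.
frailty_true = [1, 1, 1, 1, 0, 0, 0, 0, 0]
frailty_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1]
sens, spec = sensitivity_specificity(frailty_true, frailty_pred)
print(sens, spec)
```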
Affiliation(s)
- Danya Pradeep Kumar
  - Department of Biomedical Engineering, University of Arizona, Tucson, AZ, USA
- Nima Toosizadeh
  - Department of Biomedical Engineering, University of Arizona, Tucson, AZ, USA
  - Arizona Center on Aging, Department of Medicine, University of Arizona, Tucson, AZ, USA
- Jane Mohler
  - Department of Biomedical Engineering, University of Arizona, Tucson, AZ, USA
  - Arizona Center on Aging, Department of Medicine, University of Arizona, Tucson, AZ, USA
- Hossein Ehsani
  - Arizona Center on Aging, Department of Medicine, University of Arizona, Tucson, AZ, USA
- Cassidy Mannier
  - Department of Biomedical Engineering, University of Arizona, Tucson, AZ, USA
- Kaveh Laksari
  - Department of Biomedical Engineering, University of Arizona, Tucson, AZ, USA
  - Department of Aerospace and Mechanical Engineering, University of Arizona, Tucson, AZ, USA

7
Estiri H, Vasey S, Murphy SN. Transitive Sequential Pattern Mining for Discrete Clinical Data. Artif Intell Med 2020. [DOI: 10.1007/978-3-030-59137-3_37] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Indexed: 12/04/2022]

8
Levine ME, Albers DJ, Hripcsak G. Methodological variations in lagged regression for detecting physiologic drug effects in EHR data. J Biomed Inform 2018; 86:149-159. [PMID: 30172760 DOI: 10.1016/j.jbi.2018.08.014] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.8] [Received: 01/03/2018] [Revised: 07/20/2018] [Accepted: 08/29/2018] [Indexed: 12/22/2022]
Abstract
We studied how lagged linear regression can be used to detect the physiologic effects of drugs from data in the electronic health record (EHR). We systematically examined the effect of methodological variations ((i) time series construction, (ii) temporal parameterization, (iii) intra-subject normalization, (iv) differencing (lagged rates of change achieved by taking differences between consecutive measurements), (v) explanatory variables, and (vi) regression models) on performance of lagged linear methods in this context. We generated two gold standards (one knowledge-base derived, one expert-curated) for expected pairwise relationships between 7 drugs and 4 labs, and evaluated how the 64 unique combinations of methodological perturbations reproduce the gold standards. Our 28 cohorts included patients in the Columbia University Medical Center/NewYork-Presbyterian Hospital clinical database, and ranged from 2820 to 79,514 patients with between 8 and 209 average time points per patient. The most accurate methods achieved AUROC of 0.794 for knowledge-base derived gold standard (95% CI [0.741, 0.847]) and 0.705 for expert-curated gold standard (95% CI [0.629, 0.781]). We observed a mean AUROC of 0.633 (95% CI [0.610, 0.657], expert-curated gold standard) across all methods that re-parameterize time according to sequence and use either a joint autoregressive model with time-series differencing or an independent lag model without differencing. The complement of this set of methods achieved a mean AUROC close to 0.5, indicating the importance of these choices. We conclude that time-series analysis of EHR data will likely rely on some of the beneficial pre-processing and modeling methodologies identified, and will certainly benefit from continued careful analysis of methodological perturbations.
This study found that methodological variations, such as pre-processing and representations, have a large effect on results, exposing the importance of thoroughly evaluating these components when comparing machine-learning methods.
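Two of the building blocks named in the abstract above, differencing and a lagged linear fit, can be sketched in a few lines. This is a deliberately reduced illustration: it fits a single lag with one explanatory variable by ordinary least squares, which is simpler than the joint autoregressive or multi-lag models the abstract refers to. The function names and the drug and lab values are hypothetical.

```python
def difference(series):
    """Differences between consecutive measurements."""
    return [b - a for a, b in zip(series, series[1:])]


def lagged_slope(x, y, lag):
    """OLS slope of y[t] regressed on x[t - lag], with an intercept."""
    xs = x[: len(x) - lag]
    ys = y[lag:]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((a - mean_x) ** 2 for a in xs)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(xs, ys))
    return sxy / sxx


# Hypothetical series: a drug indicator and a lab value that rises by 2
# units one time step after each dose.
drug = [0, 1, 0, 0, 1, 0, 1, 0]
lab = [5.0, 5.0, 7.0, 5.0, 5.0, 7.0, 5.0, 7.0]

print(lagged_slope(drug, lab, lag=1))  # slope at the true lag
print(lagged_slope(drug, lab, lag=0))  # slope with no lag, for contrast
print(difference(lab))                 # differenced lab series
```

Comparing the slope across lags is what lets a lagged model separate a delayed drug effect from a coincidental same-time association.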
Affiliation(s)
- Matthew E Levine
  - Department of Biomedical Informatics, Columbia University Medical Center, 622 W. 168th Street, Presbyterian Building 20th Floor, New York, NY 10032, United States
  - Observational Health Data Sciences and Informatics (OHDSI), New York, NY, United States
- David J Albers
  - Department of Biomedical Informatics, Columbia University Medical Center, 622 W. 168th Street, Presbyterian Building 20th Floor, New York, NY 10032, United States
  - Observational Health Data Sciences and Informatics (OHDSI), New York, NY, United States
- George Hripcsak
  - Department of Biomedical Informatics, Columbia University Medical Center, 622 W. 168th Street, Presbyterian Building 20th Floor, New York, NY 10032, United States
  - Observational Health Data Sciences and Informatics (OHDSI), New York, NY, United States
  - NewYork-Presbyterian Hospital, 622 W. 168th Street, New York, NY 10032, United States

9
Hripcsak G, Albers DJ. High-fidelity phenotyping: richness and freedom from bias. J Am Med Inform Assoc 2018; 25:289-294. [PMID: 29040596 PMCID: PMC7282504 DOI: 10.1093/jamia/ocx110] [Citation(s) in RCA: 40] [Impact Index Per Article: 6.7] [Received: 06/16/2017] [Revised: 08/07/2017] [Accepted: 09/06/2017] [Indexed: 01/14/2023] Open
Abstract
Electronic health record phenotyping is the use of raw electronic health record data to assert characterizations about patients. Researchers have been doing it since the beginning of biomedical informatics, under different names. Phenotyping will benefit from an increasing focus on fidelity, both in the sense of increasing richness, such as measured levels, degree or severity, timing, probability, or conceptual relationships, and in the sense of reducing bias. Research agendas should shift from merely improving binary assignment to studying and improving richer representations. The field is actively researching new temporal directions and abstract representations, including deep learning. The field would benefit from research in nonlinear dynamics, in combining mechanistic models with empirical data, including data assimilation, and in topology. The health care process produces substantial bias, and studying that bias explicitly rather than treating it as merely another source of noise would facilitate addressing it.
Affiliation(s)
- George Hripcsak
  - Department of Biomedical Informatics, Columbia University Medical Center, New York, NY, USA
- David J Albers
  - Department of Biomedical Informatics, Columbia University Medical Center, New York, NY, USA

10
Albers DJ, Elhadad N, Claassen J, Perotte R, Goldstein A, Hripcsak G. Estimating summary statistics for electronic health record laboratory data for use in high-throughput phenotyping algorithms. J Biomed Inform 2018; 78:87-101. [PMID: 29369797 PMCID: PMC5856130 DOI: 10.1016/j.jbi.2018.01.004] [Citation(s) in RCA: 13] [Impact Index Per Article: 2.2] [Received: 08/14/2017] [Revised: 12/05/2017] [Accepted: 01/14/2018] [Indexed: 01/12/2023]
Abstract
We study the question of how to represent or summarize raw laboratory data taken from an electronic health record (EHR) using parametric model selection to reduce or cope with biases induced through clinical care. It has been previously demonstrated that the health care process (Hripcsak and Albers, 2012, 2013), as defined by measurement context (Hripcsak and Albers, 2013; Albers et al., 2012) and measurement patterns (Albers and Hripcsak, 2010, 2012), can influence how EHR data are distributed statistically (Kohane and Weber, 2013; Pivovarov et al., 2014). We construct an algorithm, PopKLD, which is based on information criterion model selection (Burnham and Anderson, 2002; Claeskens and Hjort, 2008) and is intended to reduce and cope with health care process biases and to produce an intuitively understandable continuous summary. The PopKLD algorithm can be automated and is designed to be applicable in high-throughput settings; for example, the output of the PopKLD algorithm can be used as input for phenotyping algorithms. Moreover, we develop the PopKLD-CAT algorithm that transforms the continuous PopKLD summary into a categorical summary useful for applications that require categorical data such as topic modeling. We evaluate our methodology in two ways. First, we apply the method to laboratory data collected in two different health care contexts, primary versus intensive care. We show that the PopKLD preserves known physiologic features in the data that are lost when summarizing the data using more common laboratory data summaries such as mean and standard deviation. Second, for three disease-laboratory measurement pairs, we perform a phenotyping task: we use the PopKLD and PopKLD-CAT algorithms to define high and low values of the laboratory variable that are used for defining a disease state.
We then compare the relationship between the PopKLD-CAT summary disease predictions and the same predictions using empirically estimated mean and standard deviation to a gold standard generated by clinical review of patient records. We find that the PopKLD laboratory data summary is substantially better at predicting disease state. The PopKLD or PopKLD-CAT algorithms are not meant to be used as phenotyping algorithms, but we use the phenotyping task to show what information can be gained when using a more informative laboratory data summary. In the process of evaluating our method we show that the different clinical contexts and laboratory measurements necessitate different statistical summaries. Similarly, leveraging the principle of maximum entropy we argue that while some laboratory data only have sufficient information to estimate a mean and standard deviation, other laboratory data captured in an EHR contain substantially more information than can be captured in higher-parameter models.
Affiliation(s)
- D J Albers
  - Department of Biomedical Informatics, Columbia University, 622 West 168th Street, New York, NY, USA
- N Elhadad
  - Department of Biomedical Informatics, Columbia University, 622 West 168th Street, New York, NY, USA
- J Claassen
  - Department of Neurology, Columbia University, 710 West 168th Street, New York, NY 10032, USA
- R Perotte
  - Value Institute, New York Presbyterian Hospital, 601 West 168th Street New York, NY 10032, USA
- A Goldstein
  - Department of Biomedical Informatics, Columbia University, 622 West 168th Street, New York, NY, USA
- G Hripcsak
  - Department of Biomedical Informatics, Columbia University, 622 West 168th Street, New York, NY, USA

11
Levine ME, Albers DJ, Hripcsak G. Comparing lagged linear correlation, lagged regression, Granger causality, and vector autoregression for uncovering associations in EHR data. AMIA ... ANNUAL SYMPOSIUM PROCEEDINGS. AMIA SYMPOSIUM 2017; 2016:779-788. [PMID: 28269874 PMCID: PMC5333294] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/06/2023]
Abstract
Time series analysis methods have been shown to reveal clinical and biological associations in data collected in the electronic health record. We wish to develop reliable high-throughput methods for identifying adverse drug effects that are easy to implement and produce readily interpretable results. To move toward this goal, we used univariate and multivariate lagged regression models to investigate associations between twenty pairs of drug orders and laboratory measurements. Multivariate lagged regression models exhibited higher sensitivity and specificity than univariate lagged regression in the 20 examples, and incorporating autoregressive terms for labs and drugs produced more robust signals in cases of known associations among the 20 example pairings. Moreover, including inpatient admission terms in the model attenuated the signals for some cases of unlikely associations, demonstrating how multivariate lagged regression models' explicit handling of context-based variables can provide a simple way to probe for health-care processes that confound analyses of EHR data.
Affiliation(s)
- Matthew E Levine
  - Department of Biomedical Informatics, Columbia University, New York, New York, USA
- David J Albers
  - Department of Biomedical Informatics, Columbia University, New York, New York, USA
- George Hripcsak
  - Department of Biomedical Informatics, Columbia University, New York, New York, USA

12
Predictability Bounds of Electronic Health Records. Sci Rep 2015; 5:11865. [PMID: 26148751 PMCID: PMC4493571 DOI: 10.1038/srep11865] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.4] [Received: 12/11/2013] [Accepted: 06/04/2015] [Indexed: 01/25/2023] Open
Abstract
The ability to intervene in disease progression given a person's disease history has the potential to solve one of society's most pressing issues: advancing health care delivery and reducing its cost. Controlling disease progression is inherently associated with the ability to predict possible future diseases given a patient's medical history. We invoke an information-theoretic methodology to quantify the level of predictability inherent in disease histories of a large electronic health records dataset with over half a million patients. In our analysis, we progress from zeroth order through temporal informed statistics, both from an individual patient's standpoint and also considering the collective effects. Our findings confirm our intuition that knowledge of common disease progressions results in higher predictability bounds than treating disease histories independently. We complement this result by showing the point at which the temporal dependence structure vanishes with increasing orders of the time-correlated statistic. Surprisingly, we also show that shuffling individual disease histories only marginally degrades the predictability bounds. This apparent contradiction with respect to the importance of time-ordered information is indicative of the complexities involved in capturing the health-care process and the difficulties associated with utilising this information in universal prediction algorithms.

13
Hripcsak G, Albers DJ, Perotte A. Parameterizing time in electronic health record studies. J Am Med Inform Assoc 2015; 22:794-804. [PMID: 25725004 PMCID: PMC6169471 DOI: 10.1093/jamia/ocu051] [Citation(s) in RCA: 42] [Impact Index Per Article: 4.7] [Received: 05/15/2014] [Revised: 11/08/2014] [Accepted: 12/22/2014] [Indexed: 02/07/2023] Open
Abstract
BACKGROUND Fields like nonlinear physics offer methods for analyzing time series, but many methods require that the time series be stationary (no change in properties over time). OBJECTIVE Medicine is far from stationary, but the challenge may be ameliorated by reparameterizing time, because clinicians tend to measure patients more frequently when they are ill and their values are more likely to vary. METHODS We compared time parameterizations, measuring variability of rate of change and magnitude of change, and looking for homogeneity of bins of temporal separation between pairs of time points. We studied four common laboratory tests drawn from 25 years of electronic health records on 4 million patients. RESULTS We found that sequence time (that is, simply counting the number of measurements from some start) produced more stationary time series, better explained the variation in values, and had more homogeneous bins than either traditional clock time or a recently proposed intermediate parameterization. Sequence time produced more accurate predictions in a single Gaussian process model experiment. CONCLUSIONS Of the three parameterizations, sequence time appeared to produce the most stationary series, possibly because clinicians adjust their sampling to the acuity of the patient. Parameterizing by sequence time may be applicable to association and clustering experiments on electronic health record data. A limitation of this study is that laboratory data were derived from only one institution. Sequence time appears to be an important potential parameterization.
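The sequence-time parameterization described above can be sketched very simply: each measurement's clock time is replaced by its position in the patient's ordered series. The function name and the sample timestamps below are hypothetical, and the intermediate parameterization mentioned in the abstract is not attempted here.

```python
def to_sequence_time(timestamps):
    """Map clock times to sequence time: 0 for the earliest measurement,
    1 for the next, and so on, regardless of the gaps between them."""
    order = sorted(range(len(timestamps)), key=lambda i: timestamps[i])
    sequence = [0] * len(timestamps)
    for rank, index in enumerate(order):
        sequence[index] = rank
    return sequence


# Hypothetical measurement times in days: dense sampling during an acute
# episode, followed by sparse follow-up.
clock_days = [0.0, 0.5, 1.0, 30.0, 400.0]
print(to_sequence_time(clock_days))
```

In sequence time the half-day gap and the 370-day gap both count as one step, which matches the abstract's rationale that clinicians sample more densely when a patient is ill.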
Affiliation(s)
- George Hripcsak
  - Department of Biomedical Informatics, Columbia University Medical Center, New York, USA
  - Medical Informatics Services, NewYork-Presbyterian Hospital, New York, USA
- David J Albers
  - Department of Biomedical Informatics, Columbia University Medical Center, New York, USA
- Adler Perotte
  - Department of Biomedical Informatics, Columbia University Medical Center, New York, USA

15
Hagar Y, Albers D, Pivovarov R, Chase H, Dukic V, Elhadad N. Survival Analysis with Electronic Health Record Data: Experiments with Chronic Kidney Disease. Stat Anal Data Min 2014; 7:385-403. [PMID: 33981381 PMCID: PMC8112603 DOI: 10.1002/sam.11236] [Citation(s) in RCA: 21] [Impact Index Per Article: 2.1] [Indexed: 01/26/2023]
Abstract
This paper presents a detailed survival analysis for chronic kidney disease (CKD). The analysis is based on the EHR data comprising almost two decades of clinical observations collected at New York-Presbyterian, a large hospital in New York City with one of the oldest electronic health records in the United States. Our survival analysis approach centers around Bayesian multiresolution hazard modeling, with an objective to capture the changing hazard of CKD over time, adjusted for patient clinical covariates and kidney-related laboratory tests. Special attention is paid to statistical issues common to all EHR data, such as cohort definition, missing data and censoring, variable selection, and potential for joint survival and longitudinal modeling, all of which are discussed alone and within the EHR CKD context.
Affiliation(s)
- Yolanda Hagar
- Yolanda Hagar is a postdoctoral researcher in applied mathematics at the University of Colorado at Boulder.
| | - David Albers
- David Albers is an associate research scientist in biomedical informatics at Columbia University.
| | - Rimma Pivovarov
- Rimma Pivovarov is a doctoral candidate in biomedical informatics at Columbia University.
| | - Herbert Chase
- Herbert Chase is a professor of clinical medicine in biomedical informatics at Columbia University.
| | - Vanja Dukic
- Vanja Dukic is an associate professor in applied mathematics at the University of Colorado at Boulder.
| | - Noémie Elhadad
- Noémie Elhadad is an assistant professor in biomedical informatics at Columbia University.
| |
|
16
|
Pivovarov R, Albers DJ, Sepulveda JL, Elhadad N. Identifying and mitigating biases in EHR laboratory tests. J Biomed Inform 2014; 51:24-34. [PMID: 24727481 DOI: 10.1016/j.jbi.2014.03.016] [Citation(s) in RCA: 48] [Impact Index Per Article: 4.8] [Received: 11/22/2013] [Revised: 03/27/2014] [Accepted: 03/30/2014] [Indexed: 02/08/2023]
Abstract
Electronic health record (EHR) data show promise for deriving new ways of modeling human disease states. Although EHR researchers often use numerical values of laboratory tests as features in disease models, a great deal of information is contained in the context within which a laboratory test is taken. For example, the same numerical value of a creatinine test has a different interpretation for a chronic kidney disease patient and a patient with acute kidney injury. We study whether EHR research studies are subject to biased results and interpretations if laboratory measurements taken in different contexts are not explicitly separated. We show that the context of a laboratory test measurement can often be captured by the way the test is measured through time. We perform three tasks to study the properties of these temporal measurement patterns. In the first task, we confirm that laboratory test measurement patterns provide additional information to the stand-alone numerical value. The second task identifies three measurement pattern motifs across a set of 70 laboratory tests performed for over 14,000 patients. Of these, one motif exhibits properties that can lead to biased research results. In the third task, we demonstrate the potential for biased results on a specific example. We conduct an association study of lipase test values to acute pancreatitis. We observe a diluted signal when using only a lipase value threshold, whereas the full association is recovered when properly accounting for lipase measurements in different contexts (leveraging the lipase measurement patterns to separate the contexts). Aggregating EHR data without separating distinct laboratory test measurement patterns can intermix patients with different diseases, leading to the confounding of signals in large-scale EHR analyses. This paper presents a methodology for leveraging measurement frequency to identify and reduce laboratory test biases.
Affiliation(s)
- Rimma Pivovarov
- Department of Biomedical Informatics, Columbia University, 622 W. 168th Street, New York, NY, USA.
| | - David J Albers
- Department of Biomedical Informatics, Columbia University, 622 W. 168th Street, New York, NY, USA.
| | - Jorge L Sepulveda
- Department of Pathology and Cell Biology, Columbia University, 630 W. 168th Street, New York, NY, USA.
| | - Noémie Elhadad
- Department of Biomedical Informatics, Columbia University, 622 W. 168th Street, New York, NY, USA.
| |
|
17
|
Albers DJ, Hripcsak G, Schmidt M. Population physiology: leveraging electronic health record data to understand human endocrine dynamics. PLoS One 2012; 7:e48058. [PMID: 23272040 PMCID: PMC3522687 DOI: 10.1371/journal.pone.0048058] [Citation(s) in RCA: 18] [Impact Index Per Article: 1.5] [Received: 10/11/2011] [Accepted: 09/25/2012] [Indexed: 11/19/2022]
Abstract
Studying physiology and pathophysiology over a broad population for long periods of time is difficult primarily because collecting human physiologic data can be intrusive, dangerous, and expensive. One solution is to use data that have been collected for a different purpose. Electronic health record (EHR) data promise to support the development and testing of mechanistic physiologic models on diverse populations and allow correlation with clinical outcomes, but limitations in the data have thus far thwarted such use. For example, using uncontrolled population-scale EHR data to verify the outcome of time-dependent behavior of mechanistic, constructive models can be difficult because: (i) aggregation of the population can obscure or generate a signal, (ii) there is often no control population with a well understood health state, and (iii) diversity in how the population is measured can make the data difficult to fit into conventional analysis techniques. This paper shows that it is possible to use EHR data to test a physiological model for a population and over long time scales. Specifically, a methodology is developed and demonstrated for testing a mechanistic, time-dependent, physiological model of serum glucose dynamics with uncontrolled, population-scale, physiological patient data extracted from an EHR repository. It is shown that there is no observable daily variation in the normalized mean glucose for any EHR subpopulations. In contrast, a derived value, daily variation in nonlinear correlation quantified by the time-delayed mutual information (TDMI), did reveal the intuitively expected diurnal variation in glucose levels amongst a random population of humans. Moreover, in a population of continuously (tube) fed patients, there was no observable TDMI-based diurnal signal. These TDMI-based signals, via a glucose insulin model, were then connected with human feeding patterns. In particular, a constructive physiological model was shown to correctly predict the difference between the general uncontrolled population and a subpopulation whose feeding was controlled.
Affiliation(s)
- D J Albers
- Department of Biomedical Informatics, Columbia University, New York, New York, United States of America.
| | | | | |
|