1
Cohen AM, Kaner J, Miller R, Kopesky JW, Hersh W. Automatically pre-screening patients for the rare disease aromatic l-amino acid decarboxylase deficiency using knowledge engineering, natural language processing, and machine learning on a large EHR population. J Am Med Inform Assoc 2024; 31:692-704. [PMID: 38134953 PMCID: PMC10873832 DOI: 10.1093/jamia/ocad244] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/05/2023] [Revised: 11/28/2023] [Accepted: 12/01/2023] [Indexed: 12/24/2023]
Abstract
OBJECTIVES Electronic health record (EHR) data may facilitate the identification of rare diseases in patients, such as aromatic l-amino acid decarboxylase deficiency (AADCd), an autosomal recessive disease caused by pathogenic variants in the dopa decarboxylase gene. Deficiency of the AADC enzyme results in combined severe reductions in monoamine neurotransmitters: dopamine, serotonin, epinephrine, and norepinephrine. This leads to widespread neurological complications affecting motor, behavioral, and autonomic function. The goal of this study was to use EHR data to identify previously undiagnosed patients who may have AADCd without available training cases for the disease. MATERIALS AND METHODS A multiple symptom and related disease annotated dataset was created and used to train individual concept classifiers on annotated sentence data. A multistep algorithm was then used to combine concept predictions into a single patient rank value. RESULTS Using an 8000-patient dataset that the algorithms had not seen before ranking, the top and bottom 200 ranked patients were manually reviewed for clinical indications of performing an AADCd diagnostic screening test. The top-ranked patients were 22.5% positively assessed for diagnostic screening, with 0% for the bottom-ranked patients. This result is statistically significant at P < .0001. CONCLUSION This work validates the approach that large-scale rare-disease screening can be accomplished by combining predictions for relevant individual symptoms and related conditions which are much more common and for which training data is easier to create.
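The significance claim in the abstract above can be sanity-checked from the reported proportions. The sketch below assumes that 22.5% of 200 means 45 positively assessed top-ranked patients against 0 of 200 bottom-ranked patients, and it assumes a one-sided Fisher exact test; the abstract does not say which test the authors used, and the function name is illustrative.

```python
from math import comb


def fisher_one_sided(a, b, c, d):
    """One-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]].

    Returns P(top-left cell >= a) under the hypergeometric null in which
    row and column totals are fixed.
    """
    row1 = a + b          # size of the first group (top-ranked patients)
    col1 = a + c          # total number of positive assessments
    n = a + b + c + d     # all reviewed patients
    upper = min(row1, col1)
    numerator = sum(
        comb(col1, k) * comb(n - col1, row1 - k) for k in range(a, upper + 1)
    )
    return numerator / comb(n, row1)


# Assumed counts derived from the abstract: 45/200 top-ranked, 0/200 bottom-ranked.
p_value = fisher_one_sided(45, 155, 0, 200)
print(f"one-sided p = {p_value:.3g}")
```

Under these assumptions the p-value falls far below the .0001 threshold quoted in the abstract.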
Affiliation(s)
- Aaron M Cohen
  - Department of Medical Informatics and Clinical Epidemiology, School of Medicine, Oregon Health & Science University, Portland, OR 97239, United States
- Jolie Kaner
  - Department of Medical Informatics and Clinical Epidemiology, School of Medicine, Oregon Health & Science University, Portland, OR 97239, United States
- Ryan Miller
  - PTC Therapeutics, South Plainfield, NJ 07080, United States
- William Hersh
  - Department of Medical Informatics and Clinical Epidemiology, School of Medicine, Oregon Health & Science University, Portland, OR 97239, United States
2
Rajendran S, Pan W, Sabuncu MR, Chen Y, Zhou J, Wang F. Learning across diverse biomedical data modalities and cohorts: Challenges and opportunities for innovation. Patterns (New York, N.Y.) 2024; 5:100913. [PMID: 38370129 PMCID: PMC10873158 DOI: 10.1016/j.patter.2023.100913] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 02/20/2024]
Abstract
In healthcare, machine learning (ML) shows significant potential to augment patient care, improve population health, and streamline healthcare workflows. Realizing its full potential is, however, often hampered by concerns about data privacy, diversity in data sources, and suboptimal utilization of different data modalities. This review studies the utility of cross-cohort cross-category (C4) integration in such contexts: the process of combining information from diverse datasets distributed across distinct, secure sites. We argue that C4 approaches could pave the way for ML models that are both holistic and widely applicable. This paper provides a comprehensive overview of C4 in health care, including its present stage, potential opportunities, and associated challenges.
Affiliation(s)
- Suraj Rajendran
  - Tri-Institutional Computational Biology & Medicine Program, Cornell University, Ithaca, NY, USA
- Weishen Pan
  - Division of Health Informatics, Department of Population Health Sciences, Weill Cornell Medicine, New York, NY, USA
- Mert R. Sabuncu
  - School of Electrical and Computer Engineering, Cornell University, Ithaca, NY, USA
  - Cornell Tech, Cornell University, New York, NY, USA
  - Department of Radiology, Weill Cornell Medical School, New York, NY, USA
- Yong Chen
  - Department of Biostatistics, Epidemiology, and Informatics, University of Pennsylvania, Philadelphia, PA, USA
- Jiayu Zhou
  - Department of Computer Science and Engineering, Michigan State University, East Lansing, MI, USA
- Fei Wang
  - Division of Health Informatics, Department of Population Health Sciences, Weill Cornell Medicine, New York, NY, USA
3
Klann JG, Henderson DW, Morris M, Estiri H, Weber GM, Visweswaran S, Murphy SN. A broadly applicable approach to enrich electronic-health-record cohorts by identifying patients with complete data: a multisite evaluation. J Am Med Inform Assoc 2023; 30:1985-1994. [PMID: 37632234 PMCID: PMC10654861 DOI: 10.1093/jamia/ocad166] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/30/2022] [Revised: 07/25/2023] [Accepted: 08/08/2023] [Indexed: 08/27/2023]
Abstract
OBJECTIVE Patients who receive most care within a single healthcare system (colloquially called a "loyalty cohort" since they typically return to the same providers) have mostly complete data within that organization's electronic health record (EHR). Loyalty cohorts have low data missingness, which can unintentionally bias research results. Using proxies of routine care and healthcare utilization metrics, we compute a per-patient score that identifies a loyalty cohort. MATERIALS AND METHODS We implemented a computable program for the widely adopted i2b2 platform that identifies loyalty cohorts in EHRs based on a machine-learning model, which was previously validated using linked claims data. We developed a novel validation approach, which tests, using only EHR data, whether patients returned to the same healthcare system after the training period. We evaluated these tools at 3 institutions using data from 2017 to 2019. RESULTS Loyalty cohort calculations to identify patients who returned during a 1-year follow-up yielded a mean area under the receiver operating characteristic curve of 0.77 using the original model and 0.80 after calibrating the model at individual sites. Factors such as multiple medications or visits contributed significantly at all sites. Screening tests' contributions (eg, colonoscopy) varied across sites, likely due to coding and population differences. DISCUSSION This open-source implementation of a "loyalty score" algorithm had good predictive power. Enriching research cohorts by utilizing these low-missingness patients is a way to obtain the data completeness necessary for accurate causal analysis. CONCLUSION i2b2 sites can use this approach to select cohorts with mostly complete EHR data.
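The validation idea described above (score each patient, then check whether they returned during follow-up) is summarized by the area under the ROC curve. A minimal sketch of that metric follows, using the pairwise-comparison definition; the scores, labels, and function name are hypothetical and this is not the authors' i2b2 implementation.

```python
def auroc(scores, returned):
    """Area under the ROC curve via pairwise comparison.

    Equals the probability that a randomly chosen returning patient has a
    higher loyalty score than a randomly chosen non-returning patient,
    counting ties as one half.
    """
    pos = [s for s, r in zip(scores, returned) if r]
    neg = [s for s, r in zip(scores, returned) if not r]
    if not pos or not neg:
        raise ValueError("need at least one returning and one non-returning patient")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Hypothetical loyalty scores and 1-year follow-up outcomes (1 = returned).
example_scores = [0.9, 0.8, 0.4, 0.3]
example_returned = [1, 1, 0, 1]
print(auroc(example_scores, example_returned))
```

The quadratic pairwise loop is chosen for readability; a rank-based formulation would be preferable for cohorts of realistic size.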
Affiliation(s)
- Jeffrey G Klann
  - Department of Medicine, Massachusetts General Hospital, Boston, MA 02114, United States
  - Department of Medicine, Harvard Medical School, Boston, MA 02115, United States
- Darren W Henderson
  - Institute of Biomedical Informatics, University of Kentucky, Lexington, KY 40506, United States
- Michele Morris
  - Department of Biomedical Informatics, University of Pittsburgh, Pittsburgh, PA 15260, United States
- Hossein Estiri
  - Department of Medicine, Massachusetts General Hospital, Boston, MA 02114, United States
  - Department of Medicine, Harvard Medical School, Boston, MA 02115, United States
- Griffin M Weber
  - Beth Israel Deaconess Medical Center, Boston, MA 02115, United States
  - Department of Biomedical Informatics, Harvard Medical School, Boston, MA 02115, United States
- Shyam Visweswaran
  - Department of Biomedical Informatics, University of Pittsburgh, Pittsburgh, PA 15260, United States
- Shawn N Murphy
  - Department of Biomedical Informatics, Harvard Medical School, Boston, MA 02115, United States
  - Department of Neurology, Massachusetts General Hospital, Boston, MA 02114, United States
  - Research Information Science and Computing, Mass General Brigham, Somerville, MA 02145, United States
4
Scheible R, Thomczyk F, Blum M, Rautenberg M, Prunotto A, Yazijy S, Boeker M. Integrating row level security in i2b2: segregation of medical records into data marts without data replication and synchronization. JAMIA Open 2023; 6:ooad068. [PMID: 37583654 PMCID: PMC10425194 DOI: 10.1093/jamiaopen/ooad068] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/03/2023] [Revised: 07/28/2023] [Accepted: 08/03/2023] [Indexed: 08/17/2023]
Abstract
Objective i2b2 offers the possibility to store biomedical data of different projects in subject oriented data marts of the data warehouse, which potentially requires data replication between different projects and also data synchronization in case of data changes. We present an approach that can save this effort and assess its query performance in a case study that reflects real-world scenarios. Material and Methods For data segregation, we used PostgreSQL's row level security (RLS) feature, the unit test framework pgTAP for validation and testing as well as the i2b2 application. No change of the i2b2 code was required. Instead, to leverage orchestration and deployment, we additionally implemented a command line interface (CLI). We evaluated performance using 3 different queries generated by i2b2, which we performed on an enlarged Harvard demo dataset. Results We introduce the open source Python CLI i2b2rls, which orchestrates and manages security roles to implement data marts so that they do not need to be replicated and synchronized as different i2b2 projects. Our evaluation showed that our approach is on average 3.55 and on median 2.71 times slower compared to classic i2b2 data marts, but has more flexibility and easier setup. Conclusion The RLS-based approach is particularly useful in a scenario with many projects, where data is constantly updated, user and group requirements change frequently or complex user authorization requirements have to be defined. The approach applies to both the i2b2 interface and direct database access.
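The row level security mechanism mentioned above can be illustrated with the two PostgreSQL statements it rests on: enabling RLS on a table and attaching a policy that filters rows for a role. The sketch below only builds those statements as strings. It is not the i2b2rls CLI or its API; the role name, the mapping table `mart_a_patients`, and the helper name are hypothetical, and `observation_fact` with `patient_num` is used as an example of an i2b2 fact table.

```python
def rls_statements(table, role, mapping_table, key_column="patient_num"):
    """Build PostgreSQL statements that restrict `role` to rows of `table`
    whose key appears in `mapping_table` (a hypothetical per-mart patient list).
    """
    # Rough guard: only plain identifiers are interpolated into the SQL text.
    for name in (table, role, mapping_table, key_column):
        if not name.isidentifier():
            raise ValueError(f"unsafe identifier: {name!r}")
    policy = f"{role}_{table}_select"
    return [
        f"ALTER TABLE {table} ENABLE ROW LEVEL SECURITY;",
        f"CREATE POLICY {policy} ON {table} FOR SELECT TO {role} "
        f"USING ({key_column} IN (SELECT {key_column} FROM {mapping_table}));",
    ]


for statement in rls_statements("observation_fact", "mart_a", "mart_a_patients"):
    print(statement)
```

With a policy of this shape, one physical table serves several data marts, which is why no replication or synchronization is needed. Note that in PostgreSQL the table owner bypasses policies unless `FORCE ROW LEVEL SECURITY` is also set.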
Affiliation(s)
- Raphael Scheible
  - Institute of Artificial Intelligence and Informatics in Medicine (AIIM), Chair of Medical Informatics, University Hospital rechts der Isar, School of Medicine, Technical University of Munich, Munich, Germany
  - Center for Chronic Immunodeficiency (CCI), Medical Center, Faculty of Medicine, University of Freiburg, Freiburg, Germany
- Fabian Thomczyk
  - Data Integration Center (DIC), University of Freiburg, Freiburg, Germany
- Marco Blum
  - Data Integration Center (DIC), University of Freiburg, Freiburg, Germany
- Micha Rautenberg
  - Institute of Medical Biometry and Statistics, Medical Center, Faculty of Medicine, University of Freiburg, Freiburg, Germany
  - Zentrum für Digitalisierung und Informationstechnologie (ZDI), Medical Center, University of Freiburg, Freiburg, Germany
- Andrea Prunotto
  - Data Integration Center (DIC), University of Freiburg, Freiburg, Germany
- Suhail Yazijy
  - Institute of Medical Biometry and Statistics, Medical Center, Faculty of Medicine, University of Freiburg, Freiburg, Germany
- Martin Boeker
  - Institute of Artificial Intelligence and Informatics in Medicine (AIIM), Chair of Medical Informatics, University Hospital rechts der Isar, School of Medicine, Technical University of Munich, Munich, Germany
5
Sinaci AA, Gencturk M, Teoman HA, Laleci Erturkmen GB, Alvarez-Romero C, Martinez-Garcia A, Poblador-Plou B, Carmona-Pírez J, Löbe M, Parra-Calderon CL. A Data Transformation Methodology to Create Findable, Accessible, Interoperable, and Reusable Health Data: Software Design, Development, and Evaluation Study. J Med Internet Res 2023; 25:e42822. [PMID: 36884270 PMCID: PMC10034606 DOI: 10.2196/42822] [Citation(s) in RCA: 3] [Impact Index Per Article: 3.0] [Received: 09/20/2022] [Revised: 01/04/2023] [Accepted: 01/31/2023] [Indexed: 03/09/2023]
Abstract
BACKGROUND Sharing health data is challenging because of several technical, ethical, and regulatory issues. The Findable, Accessible, Interoperable, and Reusable (FAIR) guiding principles have been conceptualized to enable data interoperability. Many studies provide implementation guidelines, assessment metrics, and software to achieve FAIR-compliant data, especially for health data sets. Health Level 7 (HL7) Fast Healthcare Interoperability Resources (FHIR) is a health data content modeling and exchange standard. OBJECTIVE Our goal was to devise a new methodology to extract, transform, and load existing health data sets into HL7 FHIR repositories in line with FAIR principles, develop a Data Curation Tool to implement the methodology, and evaluate it on health data sets from 2 different but complementary institutions. We aimed to increase the level of compliance with FAIR principles of existing health data sets through standardization and facilitate health data sharing by eliminating the associated technical barriers. METHODS Our approach automatically processes the capabilities of a given FHIR end point and directs the user while configuring mappings according to the rules enforced by FHIR profile definitions. Code system mappings can be configured for terminology translations through automatic use of FHIR resources. The validity of the created FHIR resources can be automatically checked, and the software does not allow invalid resources to be persisted. At each stage of our data transformation methodology, we used particular FHIR-based techniques so that the resulting data set could be evaluated as FAIR. We performed a data-centric evaluation of our methodology on health data sets from 2 different institutions. RESULTS Through an intuitive graphical user interface, users are prompted to configure the mappings into FHIR resource types with respect to the restrictions of selected profiles. 
Once the mappings are developed, our approach can syntactically and semantically transform existing health data sets into HL7 FHIR without loss of data utility according to our privacy-concerned criteria. In addition to the mapped resource types, behind the scenes, we create additional FHIR resources to satisfy several FAIR criteria. According to the data maturity indicators and evaluation methods of the FAIR Data Maturity Model, we achieved the maximum level (level 5) for being Findable, Accessible, and Interoperable and level 3 for being Reusable. CONCLUSIONS We developed and extensively evaluated our data transformation approach to unlock the value of existing health data residing in disparate data silos to make them available for sharing according to the FAIR principles. We showed that our method can successfully transform existing health data sets into HL7 FHIR without loss of data utility, and the result is FAIR in terms of the FAIR Data Maturity Model. We support institutional migration to HL7 FHIR, which not only leads to FAIR data sharing but also eases the integration with different research networks.
Affiliation(s)
- A Anil Sinaci
  - Software Research & Development and Consultancy Corporation (SRDC), Cankaya, Turkey
- Mert Gencturk
  - Software Research & Development and Consultancy Corporation (SRDC), Cankaya, Turkey
  - Department of Computer Engineering, Middle East Technical University, Cankaya, Turkey
- Huseyin Alper Teoman
  - Software Research & Development and Consultancy Corporation (SRDC), Cankaya, Turkey
  - Department of Computer Engineering, Middle East Technical University, Cankaya, Turkey
- Celia Alvarez-Romero
  - Group of Computational Health Informatics, Institute of Biomedicine of Seville, Virgen del Rocío University Hospital, Spanish National Research Council, University of Seville, Seville, Spain
- Alicia Martinez-Garcia
  - Group of Computational Health Informatics, Institute of Biomedicine of Seville, Virgen del Rocío University Hospital, Spanish National Research Council, University of Seville, Seville, Spain
- Beatriz Poblador-Plou
  - EpiChron Research Group, Aragon Health Sciences Institute (IACS), Aragon Health Research Institute (IIS Aragon), Zaragoza, Spain
- Jonás Carmona-Pírez
  - EpiChron Research Group, Aragon Health Sciences Institute (IACS), Aragon Health Research Institute (IIS Aragon), Zaragoza, Spain
- Matthias Löbe
  - Institute for Medical Informatics, Statistics and Epidemiology (IMISE), University of Leipzig, Leipzig, Germany
- Carlos Luis Parra-Calderon
  - Group of Computational Health Informatics, Institute of Biomedicine of Seville, Virgen del Rocío University Hospital, Spanish National Research Council, University of Seville, Seville, Spain
6
Synthetic data in health care: A narrative review. PLOS Digital Health 2023; 2:e0000082. [PMID: 36812604 PMCID: PMC9931305 DOI: 10.1371/journal.pdig.0000082] [Citation(s) in RCA: 24] [Impact Index Per Article: 24.0] [Received: 06/29/2022] [Accepted: 12/06/2022] [Indexed: 01/09/2023]
Abstract
Data are central to research, public health, and in developing health information technology (IT) systems. Nevertheless, access to most data in health care is tightly controlled, which may limit innovation, development, and efficient implementation of new research, products, services, or systems. Using synthetic data is one of the many innovative ways that can allow organizations to share datasets with broader users. However, only a limited set of literature is available that explores its potentials and applications in health care. In this review paper, we examined existing literature to bridge the gap and highlight the utility of synthetic data in health care. We searched PubMed, Scopus, and Google Scholar to identify peer-reviewed articles, conference papers, reports, and thesis/dissertations articles related to the generation and use of synthetic datasets in health care. The review identified seven use cases of synthetic data in health care: a) simulation and prediction research, b) hypothesis, methods, and algorithm testing, c) epidemiology/public health research, d) health IT development, e) education and training, f) public release of datasets, and g) linking data. The review also identified readily and publicly accessible health care datasets, databases, and sandboxes containing synthetic data with varying degrees of utility for research, education, and software development. The review provided evidence that synthetic data are helpful in different aspects of health care and research. While the original real data remains the preferred choice, synthetic data hold possibilities in bridging data access gaps in research and evidence-based policymaking.
7
Wagholikar KB, Ainsworth L, Zelle D, Chaney K, Mendis M, Klann J, Blood AJ, Miller A, Chulyadyo R, Oates M, Gordon WJ, Aronson SJ, Scirica BM, Murphy SN. I2b2-etl: Python application for importing electronic health data into the informatics for integrating biology and the bedside platform. Bioinformatics 2022; 38:4833-4836. [PMID: 36053173 PMCID: PMC9563689 DOI: 10.1093/bioinformatics/btac595] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/08/2022] [Revised: 07/15/2022] [Accepted: 08/31/2022] [Indexed: 11/14/2022]
Abstract
MOTIVATION The i2b2 platform is used at major academic health institutions and research consortia for querying electronic health data. However, a major obstacle to wider utilization of the platform is the complexity of data loading, which entails a steep learning curve for the platform's complex data schemas. To address this problem, we have developed the i2b2-etl package that simplifies the data loading process, which will facilitate wider deployment and utilization of the platform. RESULTS We have implemented i2b2-etl as a Python application that imports ontology and patient data using simplified input file schemas and provides inbuilt record number de-identification and data validation. We describe a real-world deployment of i2b2-etl for a population-management initiative at Mass General Brigham. AVAILABILITY AND IMPLEMENTATION i2b2-etl is a free, open-source application implemented in Python available under the Mozilla 2 license. The application can be downloaded as compiled docker images. A live demo is available at https://i2b2clinical.org/demo-i2b2etl/ (username: demo, password: Etl@2021). SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Kavishwar B Wagholikar
  - Harvard Medical School, Boston, MA 02115, USA
  - Massachusetts General Hospital, Boston, MA 02114, USA
- David Zelle
  - Brigham and Women's Hospital, Boston, MA 02115, USA
- Kira Chaney
  - Brigham and Women's Hospital, Boston, MA 02115, USA
- Jeffery Klann
  - Harvard Medical School, Boston, MA 02115, USA
  - Massachusetts General Hospital, Boston, MA 02114, USA
- Alexander J Blood
  - Harvard Medical School, Boston, MA 02115, USA
  - Brigham and Women's Hospital, Boston, MA 02115, USA
- William J Gordon
  - Harvard Medical School, Boston, MA 02115, USA
  - Mass General Brigham, Boston, MA 02199, USA
  - Brigham and Women's Hospital, Boston, MA 02115, USA
- Benjamin M Scirica
  - Harvard Medical School, Boston, MA 02115, USA
  - Brigham and Women's Hospital, Boston, MA 02115, USA
- Shawn N Murphy
  - Harvard Medical School, Boston, MA 02115, USA
  - Massachusetts General Hospital, Boston, MA 02114, USA
8
Lenert LA, Zhu V, Jennings L, McCauley JL, Obeid JS, Ward R, Hassanpour S, Marsch LA, Hogarth M, Shipman P, Harris DR, Talbert JC. Enhancing research data infrastructure to address the opioid epidemic: the Opioid Overdose Network (O2-Net). JAMIA Open 2022; 5:ooac055. [PMID: 35783072 PMCID: PMC9243402 DOI: 10.1093/jamiaopen/ooac055] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 10/08/2021] [Revised: 02/11/2022] [Accepted: 06/17/2022] [Indexed: 02/05/2023]
Abstract
Opioid Overdose Network is an effort to generalize and adapt an existing research data network, the Accrual to Clinical Trials (ACT) Network, to support design of trials for survivors of opioid overdoses presenting to emergency departments (ED). Four institutions (Medical University of South Carolina [MUSC], Dartmouth Medical School [DMS], University of Kentucky [UK], and University of California San Diego [UCSD]) worked to adapt the ACT network. The approach that was taken to enhance the ACT network focused on 4 activities: cloning and extending the ACT infrastructure, developing an e-phenotype and corresponding registry, developing portable natural language processing tools to enhance data capture, and developing automated documentation templates to enhance extended data capture. Overall, initial results suggest that tailoring of existing multipurpose federated research networks to specific tasks is feasible; however, substantial efforts are required for coordination of the subnetwork and development of new tools for extension of available data. The initial output of the project was a new approach to decision support for the prescription of naloxone for home use in the ED, which is under further study within the network.
Affiliation(s)
- Leslie A Lenert
  - Biomedical Informatics Center, Medical University of South Carolina, Charleston, South Carolina, USA
- Vivienne Zhu
  - Biomedical Informatics Center, Medical University of South Carolina, Charleston, South Carolina, USA
- Lindsey Jennings
  - Department of Emergency Medicine, Medical University of South Carolina, Charleston, South Carolina, USA
- Jenna L McCauley
  - Department of Psychiatry, Medical University of South Carolina, Charleston, South Carolina, USA
- Jihad S Obeid
  - Biomedical Informatics Center, Medical University of South Carolina, Charleston, South Carolina, USA
- Ralph Ward
  - Department of Public Health Sciences, Medical University of South Carolina, Charleston, South Carolina, USA
- Saeed Hassanpour
  - Biomedical Data Science Department, Geisel School of Medicine, Dartmouth College, Lebanon, New Hampshire, USA
- Lisa A Marsch
  - Center for Technology and Behavioral Health, Geisel School of Medicine, Dartmouth College, Lebanon, New Hampshire, USA
- Michael Hogarth
  - Department of Biomedical Informatics, University of California San Diego, San Diego, California, USA
- Perry Shipman
  - Altman Clinical and Translational Research Institute, University of California San Diego, San Diego, California, USA
- Daniel R Harris
  - Institute for Biomedical Informatics, University of Kentucky, Lexington, Kentucky, USA
- Jeffery C Talbert
  - Institute for Biomedical Informatics, University of Kentucky, Lexington, Kentucky, USA
9
Yu Y, Zong N, Wen A, Liu S, Stone DJ, Knaack D, Chamberlain AM, Pfaff E, Gabriel D, Chute CG, Shah N, Jiang G. Developing an ETL tool for converting the PCORnet CDM into the OMOP CDM to facilitate the COVID-19 data integration. J Biomed Inform 2022; 127:104002. [PMID: 35077901 PMCID: PMC8791245 DOI: 10.1016/j.jbi.2022.104002] [Citation(s) in RCA: 6] [Impact Index Per Article: 3.0] [Received: 10/14/2021] [Revised: 01/17/2022] [Accepted: 01/18/2022] [Indexed: 11/01/2022]
Abstract
OBJECTIVE The large-scale collection of observational data and digital technologies could help curb the COVID-19 pandemic. However, the coexistence of multiple Common Data Models (CDMs) and the lack of an extract, transform, and load (ETL) tool between different CDMs cause potential interoperability issues between different data systems. The objective of this study is to design, develop, and evaluate an ETL tool that transforms the PCORnet CDM format data into the OMOP CDM. METHODS We developed an open-source ETL tool to facilitate the data conversion from the PCORnet CDM to the OMOP CDM. The ETL tool was evaluated using a dataset with 1000 patients randomly selected from the PCORnet CDM at Mayo Clinic. Information loss, data mapping accuracy, and gap analysis approaches were used to assess the performance of the ETL tool. We designed an experiment to conduct a real-world COVID-19 surveillance task to assess the feasibility of the ETL tool. We also assessed the capacity of the ETL tool for the COVID-19 data surveillance using data collection criteria of the MN EHR Consortium COVID-19 project. RESULTS After the ETL process, all the records of 1000 patients from 18 PCORnet CDM tables were successfully transformed into 12 OMOP CDM tables. The information loss for all the concept mapping was less than 0.61%. The string mapping process for the unit concepts lost 2.84% of records. Almost all the fields in the manual mapping process achieved 0% information loss, except the specialty concept mapping. Moreover, the mapping accuracy for all the fields was 100%. The COVID-19 surveillance task collected almost the same set of cases (99.3% overlap) from the original PCORnet CDM and target OMOP CDM separately. Finally, all the data elements for MN EHR Consortium COVID-19 project could be captured from both the PCORnet CDM and the OMOP CDM. CONCLUSION We demonstrated that our ETL tool could satisfy the data conversion requirements between the PCORnet CDM and the OMOP CDM. The outcome of the work would facilitate the data retrieval, communication, sharing, and analysis between different institutions, not only for COVID-19-related projects but also for other real-world evidence-based observational studies.
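The evaluation metrics reported above (information loss per mapping, and overlap between case sets retrieved from the source and target models) can be sketched as simple ratios. The definitions below are assumptions, since the abstract does not give formulas: information loss is taken as the share of source records with no target mapping, and overlap as the Jaccard index of the two case sets. All values, concept IDs, and function names are hypothetical.

```python
def information_loss(source_values, mapping):
    """Fraction of source records whose value has no entry in `mapping`."""
    total = len(source_values)
    if total == 0:
        raise ValueError("no source records to evaluate")
    unmapped = sum(1 for value in source_values if value not in mapping)
    return unmapped / total


def case_overlap(source_cases, target_cases):
    """Jaccard overlap between case sets found in the source and target CDMs."""
    source_set, target_set = set(source_cases), set(target_cases)
    union = source_set | target_set
    if not union:
        raise ValueError("both case sets are empty")
    return len(source_set & target_set) / len(union)


# Hypothetical unit strings and placeholder target concept IDs (not real OMOP IDs).
unit_records = ["mg", "mg", "mL", "mg/dL", "??"]
unit_mapping = {"mg": 1, "mL": 2, "mg/dL": 3}
print(information_loss(unit_records, unit_mapping))   # one of five records is unmapped
print(case_overlap({101, 102, 103}, {101, 102, 104}))
```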
Affiliation(s)
- Yue Yu
  - Mayo Clinic, Rochester, MN, USA
- Emily Pfaff
  - University of North Carolina, Chapel Hill, NC, USA
10
Wagholikar KB, Zelle D, Ainsworth L, Chaney K, Blood AJ, Miller A, Chulyadyo R, Oates M, Gordon WJ, Aronson SJ, Scirica BM, Murphy SN. Use of automatic SQL generation interface to enhance transparency and validity of health-data analysis. Informatics in Medicine Unlocked 2022; 31. [PMID: 35874460 PMCID: PMC9306316 DOI: 10.1016/j.imu.2022.100996] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/24/2022]
Abstract
Analysis of health data typically requires a data analyst to develop queries in structured query language (SQL). As the SQL queries are manually created, they are prone to errors. In addition, accurate implementation of the queries depends on effective communication with clinical experts, which makes the analysis even more error prone. As a potential resolution, we explore an alternative approach wherein a graphical interface that automatically generates the SQL queries is used to perform the analysis. This interface allows clinical experts to directly perform complex queries on the data, despite their unfamiliarity with SQL syntax. The interface provides an intuitive understanding of the query logic, which makes the analysis transparent and comprehensible to the clinical study staff, thereby enhancing the transparency and validity of the analysis. This study demonstrates the feasibility of using a user-friendly interface that automatically generates SQL for analysis of health data. It outlines challenges that will be useful for designing user-friendly tools to improve transparency and reproducibility of data analysis.
11
Bahmani A, Alavi A, Buergel T, Upadhyayula S, Wang Q, Ananthakrishnan SK, Alavi A, Celis D, Gillespie D, Young G, Xing Z, Nguyen MHH, Haque A, Mathur A, Payne J, Mazaheri G, Li JK, Kotipalli P, Liao L, Bhasin R, Cha K, Rolnik B, Celli A, Dagan-Rosenfeld O, Higgs E, Zhou W, Berry CL, Van Winkle KG, Contrepois K, Ray U, Bettinger K, Datta S, Li X, Snyder MP. A scalable, secure, and interoperable platform for deep data-driven health management. Nat Commun 2021; 12:5757. [PMID: 34599181 PMCID: PMC8486823 DOI: 10.1038/s41467-021-26040-1] [Citation(s) in RCA: 17] [Impact Index Per Article: 5.7] [Received: 08/24/2020] [Accepted: 08/23/2021] [Indexed: 11/08/2022]
Abstract
The large amount of biomedical data derived from wearable sensors, electronic health records, and molecular profiling (e.g., genomics data) is rapidly transforming our healthcare systems. The increasing scale and scope of biomedical data not only is generating enormous opportunities for improving health outcomes but also raises new challenges ranging from data acquisition and storage to data analysis and utilization. To meet these challenges, we developed the Personal Health Dashboard (PHD), which utilizes state-of-the-art security and scalability technologies to provide an end-to-end solution for big biomedical data analytics. The PHD platform is an open-source software framework that can be easily configured and deployed to any big data health project to store, organize, and process complex biomedical data sets, support real-time data analysis at both the individual level and the cohort level, and ensure participant privacy at every step. In addition to presenting the system, we illustrate the use of the PHD framework for large-scale applications in emerging multi-omics disease studies, such as collecting and visualization of diverse data types (wearable, clinical, omics) at a personal level, investigation of insulin resistance, and an infrastructure for the detection of presymptomatic COVID-19.
Affiliation(s)
- Amir Bahmani
- Department of Genetics, Stanford University, Stanford, CA, USA
- Stanford Center for Genomics and Personalized Medicine, Stanford University, Stanford, CA, USA
- Stanford Healthcare Innovation Lab, Stanford University, Stanford, CA, USA
| | - Arash Alavi
- Department of Genetics, Stanford University, Stanford, CA, USA
- Stanford Center for Genomics and Personalized Medicine, Stanford University, Stanford, CA, USA
- Stanford Healthcare Innovation Lab, Stanford University, Stanford, CA, USA
| | - Thore Buergel
- Stanford Healthcare Innovation Lab, Stanford University, Stanford, CA, USA
| | - Sushil Upadhyayula
- Department of Genetics, Stanford University, Stanford, CA, USA
- Stanford Healthcare Innovation Lab, Stanford University, Stanford, CA, USA
- Department of Computer Science, Stanford University, Stanford, CA, USA
| | - Qiwen Wang
- Department of Genetics, Stanford University, Stanford, CA, USA
- Stanford Healthcare Innovation Lab, Stanford University, Stanford, CA, USA
- Department of Computer Science, Stanford University, Stanford, CA, USA
| | | | - Amir Alavi
- Stanford Healthcare Innovation Lab, Stanford University, Stanford, CA, USA
| | - Diego Celis
- Department of Genetics, Stanford University, Stanford, CA, USA
- Stanford Healthcare Innovation Lab, Stanford University, Stanford, CA, USA
- Department of Computer Science, Stanford University, Stanford, CA, USA
| | - Dan Gillespie
- Stanford Healthcare Innovation Lab, Stanford University, Stanford, CA, USA
| | - Gregory Young
- Department of Genetics, Stanford University, Stanford, CA, USA
- Stanford Healthcare Innovation Lab, Stanford University, Stanford, CA, USA
| | - Ziye Xing
- Department of Genetics, Stanford University, Stanford, CA, USA
- Stanford Center for Genomics and Personalized Medicine, Stanford University, Stanford, CA, USA
| | - Minh Hoang Huynh Nguyen
- Department of Genetics, Stanford University, Stanford, CA, USA
- Stanford Center for Genomics and Personalized Medicine, Stanford University, Stanford, CA, USA
| | - Audrey Haque
- Department of Genetics, Stanford University, Stanford, CA, USA
- Stanford Center for Genomics and Personalized Medicine, Stanford University, Stanford, CA, USA
| | - Ankit Mathur
- Department of Genetics, Stanford University, Stanford, CA, USA
- Stanford Healthcare Innovation Lab, Stanford University, Stanford, CA, USA
- Department of Computer Science, Stanford University, Stanford, CA, USA
| | - Josh Payne
- Department of Genetics, Stanford University, Stanford, CA, USA
- Stanford Healthcare Innovation Lab, Stanford University, Stanford, CA, USA
- Department of Computer Science, Stanford University, Stanford, CA, USA
| | - Ghazal Mazaheri
- Department of Genetics, Stanford University, Stanford, CA, USA
- Stanford Healthcare Innovation Lab, Stanford University, Stanford, CA, USA
| | - Jason Kenichi Li
- Department of Genetics, Stanford University, Stanford, CA, USA
- Stanford Healthcare Innovation Lab, Stanford University, Stanford, CA, USA
- Department of Computer Science, Stanford University, Stanford, CA, USA
| | - Pramod Kotipalli
- Department of Genetics, Stanford University, Stanford, CA, USA
- Stanford Healthcare Innovation Lab, Stanford University, Stanford, CA, USA
- Department of Computer Science, Stanford University, Stanford, CA, USA
| | - Lisa Liao
- Department of Genetics, Stanford University, Stanford, CA, USA
- Stanford Healthcare Innovation Lab, Stanford University, Stanford, CA, USA
- Department of Computer Science, Stanford University, Stanford, CA, USA
| | - Rajat Bhasin
- Stanford Healthcare Innovation Lab, Stanford University, Stanford, CA, USA
| | - Kexin Cha
- Department of Genetics, Stanford University, Stanford, CA, USA
- Stanford Healthcare Innovation Lab, Stanford University, Stanford, CA, USA
| | - Benjamin Rolnik
- Department of Genetics, Stanford University, Stanford, CA, USA
- Stanford Healthcare Innovation Lab, Stanford University, Stanford, CA, USA
| | | | | | - Emily Higgs
- Department of Genetics, Stanford University, Stanford, CA, USA
| | - Wenyu Zhou
- Department of Genetics, Stanford University, Stanford, CA, USA
- Stanford Center for Genomics and Personalized Medicine, Stanford University, Stanford, CA, USA
| | - Camille Lauren Berry
- Department of Genetics, Stanford University, Stanford, CA, USA
- Stanford Healthcare Innovation Lab, Stanford University, Stanford, CA, USA
| | - Katherine Grace Van Winkle
- Department of Genetics, Stanford University, Stanford, CA, USA
- Stanford Healthcare Innovation Lab, Stanford University, Stanford, CA, USA
| | | | - Utsab Ray
- Department of Genetics, Stanford University, Stanford, CA, USA
- Stanford Center for Genomics and Personalized Medicine, Stanford University, Stanford, CA, USA
- Stanford Healthcare Innovation Lab, Stanford University, Stanford, CA, USA
| | - Keith Bettinger
- Department of Genetics, Stanford University, Stanford, CA, USA
- Stanford Center for Genomics and Personalized Medicine, Stanford University, Stanford, CA, USA
| | - Somalee Datta
- Technology and Digital Solutions, Stanford Medicine, Stanford, CA, USA
| | - Xiao Li
- Department of Genetics, Stanford University, Stanford, CA, USA
- Department of Biochemistry, The Center for RNA Science and Therapeutics, Department of Computer and Data Sciences, Case Western Reserve University, Cleveland, OH, USA
| | - Michael P Snyder
- Department of Genetics, Stanford University, Stanford, CA, USA
- Stanford Center for Genomics and Personalized Medicine, Stanford University, Stanford, CA, USA
- Stanford Healthcare Innovation Lab, Stanford University, Stanford, CA, USA
|
12
|
Lenert LA, Ilatovskiy AV, Agnew J, Rudisill P, Jacobs J, Weatherston D, Deans KR. Automated production of research data marts from a canonical fast healthcare interoperability resource data repository: applications to COVID-19 research. J Am Med Inform Assoc 2021; 28:1605-1611. [PMID: 33993254 PMCID: PMC8243354 DOI: 10.1093/jamia/ocab108] [Citation(s) in RCA: 10] [Impact Index Per Article: 3.3] [Received: 03/03/2021] [Accepted: 05/14/2021] [Indexed: 11/12/2022] Open
Abstract
OBJECTIVE The rapidly evolving COVID-19 pandemic has created a need for timely data from the healthcare systems for research. To meet this need, several large new data consortia have been developed that require frequent updating and sharing of electronic health record (EHR) data in different common data models (CDMs) to create multi-institutional databases for research. Traditionally, each CDM has had a custom pipeline for extract, transform, and load operations for production and incremental updates of data feeds to the networks from raw EHR data. However, the demands of COVID-19 research for timely data are far higher, and the requirements for updating faster than previous collaborative research using national data networks have increased. New approaches need to be developed to address these demands. METHODS In this article, we describe the use of the Fast Healthcare Interoperability Resource (FHIR) data model as a canonical data model and the automated transformation of clinical data to the Patient-Centered Outcomes Research Network (PCORnet) and Observational Medical Outcomes Partnership (OMOP) CDMs for data sharing and research collaboration on COVID-19. RESULTS FHIR data resources could be transformed to operational PCORnet and OMOP CDMs with minimal production delays through a combination of real-time and postprocessing steps, leveraging the FHIR data subscription feature. CONCLUSIONS The approach leverages evolving standards for the availability of EHR data developed to facilitate data exchange under the 21st Century Cures Act and could greatly enhance the availability of standardized datasets for research.
Affiliation(s)
- Leslie A Lenert
- Biomedical Informatics Center, Medical University of South Carolina, Charleston, South Carolina, USA
- Health Sciences South Carolina, Columbia, South Carolina, USA
| | - Andrey V Ilatovskiy
- Biomedical Informatics Center, Medical University of South Carolina, Charleston, South Carolina, USA
- Health Sciences South Carolina, Columbia, South Carolina, USA
| | | | - Patricia Rudisill
- Biomedical Informatics Center, Medical University of South Carolina, Charleston, South Carolina, USA
- Health Sciences South Carolina, Columbia, South Carolina, USA
| | - Jeff Jacobs
- Health Sciences South Carolina, Columbia, South Carolina, USA
| | | | - Kenneth R Deans
- Health Sciences South Carolina, Columbia, South Carolina, USA
|
13
|
Shang Y, Tian Y, Zhou M, Zhou T, Lyu K, Wang Z, Xin R, Liang T, Zhu S, Li J. EHR-Oriented Knowledge Graph System: Toward Efficient Utilization of Non-Used Information Buried in Routine Clinical Practice. IEEE J Biomed Health Inform 2021; 25:2463-2475. [PMID: 34057901 DOI: 10.1109/jbhi.2021.3085003] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 11/07/2022]
Abstract
Non-used clinical information has negative implications on healthcare quality. Clinicians pay priority attention to clinical information relevant to their specialties during routine clinical practices but may be insensitive or less concerned about information showing disease risks beyond their specialties, resulting in delayed and missed diagnoses or improper management. In this study, we introduced an electronic health record (EHR)-oriented knowledge graph system to efficiently utilize non-used information buried in EHRs. EHR data were transformed into a semantic patient-centralized information model under the ontology structure of a knowledge graph. The knowledge graph then creates an EHR data trajectory and performs reasoning through semantic rules to identify important clinical findings within EHR data. A graphical reasoning pathway illustrates the reasoning footage and explains the clinical significance for clinicians to better understand the neglected information. An application study was performed to evaluate unconsidered chronic kidney disease (CKD) reminding for non-nephrology clinicians to identify important neglected information. The study covered 71,679 patients in non-nephrology departments. The system identified 2,774 patients meeting CKD diagnosis criteria and 10,377 patients requiring high attention. A follow-up study of 5,439 patients showed that 82.1% of patients who met the diagnosis criteria and 61.4% of patients requiring high attention were confirmed to be CKD positive during follow-up research. The application demonstrated that the proposed approach is feasible and effective in clinical information utilization. Additionally, it's valuable as an explainable artificial intelligence to provide interpretable recommendations for specialist physicians to understand the importance of non-used data and make comprehensive decisions.
|
14
|
Kang B, Yoon J, Kim HY, Jo SJ, Lee Y, Kam HJ. Deep-learning-based automated terminology mapping in OMOP-CDM. J Am Med Inform Assoc 2021; 28:1489-1496. [PMID: 33987667 DOI: 10.1093/jamia/ocab030] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 11/26/2020] [Revised: 01/07/2021] [Accepted: 02/05/2021] [Indexed: 11/14/2022] Open
Abstract
OBJECTIVE Accessing medical data from multiple institutions is difficult owing to the interinstitutional diversity of vocabularies. Standardization schemes, such as the common data model, have been proposed as solutions to this problem, but such schemes require expensive human supervision. This study aims to construct a trainable system that can automate the process of semantic interinstitutional code mapping. MATERIALS AND METHODS To automate mapping between source and target codes, we compute the embedding-based semantic similarity between corresponding descriptive sentences. We also implement a systematic approach for preparing training data for similarity computation. Experimental results are compared to traditional word-based mappings. RESULTS The proposed model is compared against the state-of-the-art automated matching system, which is called Usagi, of the Observational Medical Outcomes Partnership common data model. By incorporating multiple negative training samples per positive sample, our semantic matching method significantly outperforms Usagi. Its matching accuracy is at least 10% greater than that of Usagi, and this trend is consistent across various top-k measurements. DISCUSSION The proposed deep learning-based mapping approach outperforms previous simple word-level matching algorithms because it can account for contextual and semantic information. Additionally, we demonstrate that the manner in which negative training samples are selected significantly affects the overall performance of the system. CONCLUSION Incorporating the semantics of code descriptions more significantly increases matching accuracy compared to traditional text co-occurrence-based approaches. The negative training sample collection methodology is also an important component of the proposed trainable system that can be adopted in both present and future related systems.
Affiliation(s)
- Byungkon Kang
- Department of Computer Science, State University of New York, Incheon, South Korea
| | - Jisang Yoon
- Graduate School of Information, Yonsei University, Seoul, South Korea
| | - Ha Young Kim
- Graduate School of Information, Yonsei University, Seoul, South Korea
| | - Sung Jin Jo
- Department of Industrial and Management Engineering, Pohang University of Science and Technology, Pohang, North Gyeongsang, South Korea
| | - Yourim Lee
- RWE Analytics, EvidNet, Seongnam-si, Gyeonggi-do, South Korea
| | - Hye Jin Kam
- Healthcare, Life Solution Cluster, New Business Unit, Hanwha Life, Seoul, South Korea
|
15
|
Lenert LA, Ilatovskiy AV, Agnew J, Rudisill P, Jacobs J, Weatherston D, Deans K. Automated Production of Research Data Marts from a Canonical Fast Healthcare Interoperability Resource (FHIR) Data Repository: Applications to COVID-19 Research. MEDRXIV : THE PREPRINT SERVER FOR HEALTH SCIENCES 2021. [PMID: 33758877 DOI: 10.1101/2021.03.11.21253384] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Indexed: 11/25/2022]
Abstract
Objective The COVID-19 pandemic has enhanced the need for timely real-world data (RWD) for research. To meet this need, several large clinical consortia have developed networks for access to RWD from electronic health records (EHR), each with its own common data model (CDM) and custom pipeline for extraction, transformation, and load operations for production and incremental updating. However, the demands of COVID-19 research for timely RWD (e.g., 2-week delay) make this less feasible. Methods and Materials We describe the use of the Fast Healthcare Interoperability Resource (FHIR) data model as a canonical model for representation of clinical data for automated transformation to the Patient-Centered Outcomes Research Network (PCORnet) and Observational Medical Outcomes Partnership (OMOP) CDMs and the near automated production of linked clinical data repositories (CDRs) for COVID-19 research using the FHIR subscription standard. The approach was applied to healthcare data from a large academic institution and was evaluated using published quality assessment tools. Results Six years of data (1.07M patients, 10.1M encounters, 137M laboratory results) were loaded into the FHIR CDR, producing 3 linked real-time repositories: FHIR, PCORnet, and OMOP. PCORnet and OMOP databases were refined in subsequent post-processing steps into production releases and met published quality standards. The approach greatly reduced CDM production efforts. Conclusions FHIR and FHIR CDRs can play an important role in enhancing the availability of RWD from EHR systems. The above approach leverages standards mandated by the 21st Century Cures Act and could greatly enhance the availability of datasets for research.
|
16
|
Syed S, Baghal A, Prior F, Zozus M, Al-Shukri S, Syeda HB, Garza M, Begum S, Gates K, Syed M, Sexton KW. Toolkit to Compute Time-Based Elixhauser Comorbidity Indices and Extension to Common Data Models. Healthc Inform Res 2020; 26:193-200. [PMID: 32819037 PMCID: PMC7438698 DOI: 10.4258/hir.2020.26.3.193] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Received: 02/05/2020] [Accepted: 04/17/2020] [Indexed: 01/02/2023] Open
Abstract
Objectives The time-dependent study of comorbidities provides insight into disease progression and trajectory. We hypothesize that understanding longitudinal disease characteristics can lead to more timely intervention and improve clinical outcomes. As a first step, we developed an efficient and easy-to-install toolkit, the Time-based Elixhauser Comorbidity Index (TECI), which pre-calculates time-based Elixhauser comorbidities and can be extended to common data models (CDMs). Methods A Structured Query Language (SQL)-based toolkit, TECI, was built to pre-calculate time-specific Elixhauser comorbidity indices using data from a clinical data repository (CDR). Then it was extended to the Informatics for Integrating Biology and the Bedside (I2B2) and Observational Medical Outcomes Partnership (OMOP) CDMs. Results At the University of Arkansas for Medical Sciences (UAMS), the TECI toolkit was successfully installed to compute the indices from CDR data, and the scores were integrated into the I2B2 and OMOP CDMs. Comorbidity scores calculated by TECI were validated against scores available in the 2015 quarter 1–3 Nationwide Readmissions Database (NRD) and against scores calculated from the comorbidities using a previously validated algorithm on the 2015 quarter 4 NRD. Furthermore, TECI identified 18,846 UAMS patients who had changes in comorbidity scores over time (2013 to 2019). Comorbidities for a random sample of patients were independently reviewed, and the results were found to be accurate in all cases. Conclusions TECI facilitates the study of comorbidities within a time-dependent context, allowing better understanding of disease associations and trajectories, which has the potential to improve clinical outcomes.
Affiliation(s)
- Shorabuddin Syed
- Department of Biomedical Informatics, University of Arkansas for Medical Sciences, Little Rock, AR, USA
| | - Ahmad Baghal
- Department of Biomedical Informatics, University of Arkansas for Medical Sciences, Little Rock, AR, USA
| | - Fred Prior
- Department of Biomedical Informatics, University of Arkansas for Medical Sciences, Little Rock, AR, USA
| | - Meredith Zozus
- Department of Population Health Sciences, University of Texas Health Science Center at San Antonio, San Antonio, TX, USA
| | - Shaymaa Al-Shukri
- Department of Biomedical Informatics, University of Arkansas for Medical Sciences, Little Rock, AR, USA
| | - Hafsa Bareen Syeda
- Department of Biomedical Informatics, University of Arkansas for Medical Sciences, Little Rock, AR, USA
| | - Maryam Garza
- Department of Biomedical Informatics, University of Arkansas for Medical Sciences, Little Rock, AR, USA
| | - Salma Begum
- Department of Information Technology, University of Arkansas for Medical Sciences, Little Rock, AR, USA
| | - Kim Gates
- Department of Biomedical Informatics, University of Arkansas for Medical Sciences, Little Rock, AR, USA
| | - Mahanazuddin Syed
- Department of Biomedical Informatics, University of Arkansas for Medical Sciences, Little Rock, AR, USA
| | - Kevin W Sexton
- Department of Biomedical Informatics, University of Arkansas for Medical Sciences, Little Rock, AR, USA
- Department of Surgery, University of Arkansas for Medical Sciences, Little Rock, AR, USA
- Department of Health Policy and Management, University of Arkansas for Medical Sciences, Little Rock, AR, USA
|
17
|
The promise of big data for precision population health management in the US. Public Health 2020; 185:110-116. [PMID: 32615477 DOI: 10.1016/j.puhe.2020.04.040] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.8] [Received: 07/24/2019] [Revised: 02/16/2020] [Accepted: 04/30/2020] [Indexed: 11/23/2022]
Abstract
OBJECTIVES As we enter the year 2020, health data in the United States (US) is still in the process of being curated into a usable format. With coordinated data systems, it becomes possible to answer, with relative certainty, what preventive and medical interventions work in the real world and for whom they might work. STUDY DESIGN This is a non-systematic expert review. METHODS A non-systematic expert review was undertaken to identify relevant scientific and gray literature on the current state and the limitations of evaluation of health interventions and the health data infrastructure in the US. This review also included the literature on nations with unified data systems. We coupled this review with non-structured interviews of data scientists to gain insight into the progress in establishing the components necessary to support a unified data system and to facilitate data exchange for evaluations, as well as further guide our review. Our goal was to produce a critical analysis of the existing attempts to standardize and use data collected during patient encounters with physicians for public health purposes. RESULTS Data obtained from electronic health records are produced in a way that is challenging to use and difficult to compile across platforms in the US. One response to this problem has been to encourage the exchange and standardization of health record information through Distributed Research Networks and Common Data Models (CDMs). These data can be combined with mobile health, social media, and other sources of data to radically transform what we know about the prevention and management of disease. However, issues with the variety of CDMs and growing sense of distrust of institutions that maintain data continue to impede medical progress. CONCLUSIONS We present a framework for data use that will allow public health to answer a swath of unanswered research questions that can improve public health practice.
|
18
|
Ci B, Yang DM, Krailo M, Xia C, Yao B, Luo D, Zhou Q, Xiao G, Xu L, Skapek SX, Murray MJ, Amatruda JF, Klosterkemper L, Shaikh F, Faure-Conter C, Fresneau B, Volchenboum SL, Stoneham S, Lopes LF, Nicholson J, Frazier AL, Xie Y. Development of a Data Model and Data Commons for Germ Cell Tumors. JCO Clin Cancer Inform 2020; 4:555-566. [PMID: 32568554 PMCID: PMC7328105 DOI: 10.1200/cci.20.00025] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Accepted: 04/29/2020] [Indexed: 11/20/2022] Open
Abstract
Germ cell tumors (GCTs) are considered a rare disease but are the most common solid tumors in adolescents and young adults, accounting for 15% of all malignancies in this age group. The rarity of GCTs in some groups, particularly children, has impeded progress in treatment and biologic understanding. The most effective GCT research will result from the interrogation of data sets from historical and prospective trials across institutions. However, inconsistent use of terminology among groups, different sample-labeling rules, and lack of data standards have hampered researchers' efforts in data sharing and across-study validation. To overcome the low interoperability of data and facilitate future clinical trials, we worked with the Malignant Germ Cell International Consortium (MaGIC) and developed a GCT clinical data model as a uniform standard to curate and harmonize GCT data sets. This data model will also be the standard for prospective data collection in future trials. Using the GCT data model, we developed a GCT data commons with data sets from both MaGIC and public domains as an integrated research platform. The commons supports functions, such as data query, management, sharing, visualization, and analysis of the harmonized data, as well as patient cohort discovery. This GCT data commons will facilitate future collaborative research to advance the biologic understanding and treatment of GCTs. Moreover, the framework of the GCT data model and data commons will provide insights for other rare disease research communities into developing similar collaborative research platforms.
Affiliation(s)
- Bo Ci
- Quantitative Biomedical Research Center, Department of Population and Data Sciences, University of Texas Southwestern Medical Center, Dallas, TX
| | - Donghan M. Yang
- Quantitative Biomedical Research Center, Department of Population and Data Sciences, University of Texas Southwestern Medical Center, Dallas, TX
| | - Mark Krailo
- Keck School of Medicine, University of Southern California, Los Angeles, CA
- Children’s Oncology Group, Monrovia, CA
| | | | - Bo Yao
- Quantitative Biomedical Research Center, Department of Population and Data Sciences, University of Texas Southwestern Medical Center, Dallas, TX
| | - Danni Luo
- Quantitative Biomedical Research Center, Department of Population and Data Sciences, University of Texas Southwestern Medical Center, Dallas, TX
| | - Qinbo Zhou
- Quantitative Biomedical Research Center, Department of Population and Data Sciences, University of Texas Southwestern Medical Center, Dallas, TX
| | - Guanghua Xiao
- Quantitative Biomedical Research Center, Department of Population and Data Sciences, University of Texas Southwestern Medical Center, Dallas, TX
- Department of Bioinformatics, University of Texas Southwestern Medical Center, Dallas, TX
| | - Lin Xu
- Quantitative Biomedical Research Center, Department of Population and Data Sciences, University of Texas Southwestern Medical Center, Dallas, TX
| | - Stephen X. Skapek
- Department of Pediatrics, University of Texas Southwestern Medical Center, Dallas, TX
- Simmons Comprehensive Cancer Center, University of Texas Southwestern Medical Center, Dallas, TX
| | - Matthew J. Murray
- Department of Pathology, University of Cambridge, Cambridge, United Kingdom
| | - James F. Amatruda
- Keck School of Medicine, University of Southern California, Los Angeles, CA
- Cancer and Blood Disease Institute, Children’s Hospital Los Angeles, Los Angeles, CA
| | | | - Furqan Shaikh
- Hospital for Sick Children, University of Toronto, Toronto, ON, Canada
| | | | - Brice Fresneau
- Department of Pediatric Oncology, Gustave Roussy, University of Paris-Saclay, Villejuif, France
| | - Samuel L. Volchenboum
- Center for Research Informatics, Division of Medicine and Biological Sciences, University of Chicago, Chicago, IL
| | - Sara Stoneham
- Department of Paediatrics, University College London Hospitals, London, United Kingdom
| | | | - James Nicholson
- Department of Paediatric Haematology and Oncology, Cambridge University Hospitals National Health Service Foundation Trust, Cambridge, United Kingdom
| | - A. Lindsay Frazier
- Dana-Farber/Boston Children’s Blood and Cancer Disorders Center, Boston, MA
| | - Yang Xie
- Quantitative Biomedical Research Center, Department of Population and Data Sciences, University of Texas Southwestern Medical Center, Dallas, TX
- Department of Bioinformatics, University of Texas Southwestern Medical Center, Dallas, TX
- Simmons Comprehensive Cancer Center, University of Texas Southwestern Medical Center, Dallas, TX
|
19
|
Danese MD, Halperin M, Duryea J, Duryea R. The Generalized Data Model for clinical research. BMC Med Inform Decis Mak 2019; 19:117. [PMID: 31234921 PMCID: PMC6591926 DOI: 10.1186/s12911-019-0837-5] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.4] [Received: 11/12/2017] [Accepted: 06/10/2019] [Indexed: 11/23/2022] Open
Abstract
BACKGROUND Most healthcare data sources store information within their own unique schemas, making reliable and reproducible research challenging. Consequently, researchers have adopted various data models to improve the efficiency of research. Transforming and loading data into these models is a labor-intensive process that can alter the semantics of the original data. Therefore, we created a data model with a hierarchical structure that simplifies the transformation process and minimizes data alteration. METHODS There were two design goals in constructing the tables and table relationships for the Generalized Data Model (GDM). The first was to focus on clinical codes in their original vocabularies to retain the original semantic representation of the data. The second was to retain hierarchical information present in the original data while retaining provenance. The model was tested by transforming synthetic Medicare data; Surveillance, Epidemiology, and End Results data linked to Medicare claims; and electronic health records from the Clinical Practice Research Datalink. We also tested a subsequent transformation from the GDM into the Sentinel data model. RESULTS The resulting data model contains 19 tables, with the Clinical Codes, Contexts, and Collections tables serving as the core of the model, and containing most of the clinical, provenance, and hierarchical information. In addition, a Mapping table allows users to apply an arbitrarily complex set of relationships among vocabulary elements to facilitate automated analyses. CONCLUSIONS The GDM offers researchers a simpler process for transforming data, clear data provenance, and a path for users to transform their data into other data models. The GDM is designed to retain hierarchical relationships among data elements as well as the original semantic representation of the data, ensuring consistency in protocol implementation as part of a complete data pipeline for researchers.
Affiliation(s)
- Mark D. Danese
- Outcomes Insights, Inc., 2801 Townsgate Road, Suite 330, Westlake Village, CA 91361 USA
| | - Marc Halperin
- Outcomes Insights, Inc., 2801 Townsgate Road, Suite 330, Westlake Village, CA 91361 USA
| | - Jennifer Duryea
- Outcomes Insights, Inc., 2801 Townsgate Road, Suite 330, Westlake Village, CA 91361 USA
| | - Ryan Duryea
- Outcomes Insights, Inc., 2801 Townsgate Road, Suite 330, Westlake Village, CA 91361 USA
|
20
|
Kirkendall ES, Ni Y, Lingren T, Leonard M, Hall ES, Melton K. Data Challenges With Real-Time Safety Event Detection And Clinical Decision Support. J Med Internet Res 2019; 21:e13047. [PMID: 31120022 PMCID: PMC6549472 DOI: 10.2196/13047] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.2] [Received: 12/06/2018] [Revised: 03/04/2019] [Accepted: 04/05/2019] [Indexed: 12/03/2022] Open
Abstract
Background The continued digitization and maturation of health care information technology has made access to real-time data easier and feasible for more health care organizations. With this increased availability, the promise of using data to algorithmically detect health care–related events in real-time has become more of a reality. However, as more researchers and clinicians utilize real-time data delivery capabilities, it has become apparent that simply gaining access to the data is not a panacea, and some unique data challenges have come to the forefront in the process. Objective The aim of this viewpoint was to highlight some of the challenges that are germane to real-time processing of health care system–generated data and the accurate interpretation of the results. Methods Distinct challenges related to the use and processing of real-time data for safety event detection were compiled and reported by several informatics and clinical experts at a quaternary pediatric academic institution. The challenges were collated from the experiences of the researchers implementing real-time event detection on more than half a dozen distinct projects. The challenges have been presented in a challenge category-specific challenge-example format. Results In total, 8 major challenge categories were reported, with 13 specific challenges and 9 specific examples detailed to provide a context for the challenges. The examples reported are anchored to a specific project using medication order, medication administration record, and smart infusion pump data to detect discrepancies and errors between the 3 datasets. Conclusions The use of real-time data to drive safety event detection and clinical decision support is extremely powerful, but it presents its own set of challenges that include data quality and technical complexity. These challenges must be recognized and accommodated if the full promise of accurate, real-time safety event clinical decision support is to be realized.
Affiliation(s)
- Eric Steven Kirkendall
- Department of Biomedical Informatics, Cincinnati Children's Hospital Medical Center, Cincinnati, OH, United States
- Department of Pediatrics, College of Medicine, University of Cincinnati, Cincinnati, OH, United States
- Division of Hospital Medicine, Cincinnati Children's Hospital Medical Center, Cincinnati, OH, United States
- James M Anderson Center for Health Systems Excellence, Cincinnati Children's Hospital Medical Center, Cincinnati, OH, United States
- Department of Pediatrics, Wake Forest School of Medicine, Winston-Salem, NC, United States
| | - Yizhao Ni
- Department of Biomedical Informatics, Cincinnati Children's Hospital Medical Center, Cincinnati, OH, United States
- Department of Pediatrics, College of Medicine, University of Cincinnati, Cincinnati, OH, United States
| | - Todd Lingren
- Department of Biomedical Informatics, Cincinnati Children's Hospital Medical Center, Cincinnati, OH, United States
| | - Matthew Leonard
- Division of Neonatology and Pulmonary Biology, Cincinnati Children's Hospital Medical Center, Cincinnati, OH, United States
| | - Eric S Hall
- Department of Biomedical Informatics, Cincinnati Children's Hospital Medical Center, Cincinnati, OH, United States
- Department of Pediatrics, College of Medicine, University of Cincinnati, Cincinnati, OH, United States
- Division of Neonatology and Pulmonary Biology, Cincinnati Children's Hospital Medical Center, Cincinnati, OH, United States
| | - Kristin Melton
- Department of Pediatrics, College of Medicine, University of Cincinnati, Cincinnati, OH, United States
- Division of Neonatology and Pulmonary Biology, Cincinnati Children's Hospital Medical Center, Cincinnati, OH, United States
|