176
|
Ding K, Shameer K, Jouni H, Masys DR, Jarvik GP, Kho AN, Ritchie MD, McCarty CA, Chute CG, Manolio TA, Kullo IJ. Genetic Loci implicated in erythroid differentiation and cell cycle regulation are associated with red blood cell traits. Mayo Clin Proc 2012; 87:461-74. [PMID: 22560525 PMCID: PMC3538470 DOI: 10.1016/j.mayocp.2012.01.016] [Citation(s) in RCA: 28] [Impact Index Per Article: 2.3] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Submit a Manuscript] [Subscribe] [Scholar Register] [Received: 11/09/2011] [Revised: 01/10/2012] [Accepted: 01/19/2012] [Indexed: 01/16/2023]
Abstract
OBJECTIVE To identify common genetic variants influencing red blood cell (RBC) traits. PATIENTS AND METHODS We performed a genomewide association study from June 2008 through July 2011 of hemoglobin, hematocrit, RBC count, mean corpuscular volume, mean corpuscular hemoglobin, and mean corpuscular hemoglobin concentration in 12,486 patients of European ancestry from the electronic MEdical Records and Genomics (eMERGE) network. We developed an electronic medical record-based algorithm that included individuals who had RBC measurements obtained for clinical care and excluded values measured in the setting of hematopoietic disorders, comorbid conditions, or medications known to affect RBC production or a recent history of blood loss. RESULTS We identified 4 new genetic loci and replicated 11 loci previously reported to be associated with one or more RBC traits in individuals of European ancestry. Notably, genes present in 3 of the 4 newly identified loci (THRB, PTPLAD1, CDT1) and in 6 of the 11 replicated loci (KLF1, ALDH8A1, CCND3, SPTA1, FBXO7, TFR2/EPO) are implicated in erythroid differentiation and regulation of cell cycle in hematopoietic stem cells. CONCLUSION Genes in the erythroid differentiation and cell cycle regulation pathways influence interindividual variation in RBC indices. Our results provide insights into the molecular basis underlying variation in RBC traits.
Collapse
Key Words
- emerge, electronic medical records and genomics
- emmax, mixed-model association-expedited
- emr, electronic medical record
- eqtl, expression quantitative trait locus
- ghc, group health cooperative--university of washington
- gwas, genomewide association study
- hct, hematocrit
- hgb, hemoglobin
- ibs, identity-by-state
- ld, linkage disequilibrium
- mc, marshfield clinic
- mch, mean corpuscular hemoglobin
- mchc, mean corpuscular hemoglobin concentration
- mcv, mean corpuscular volume
- mim, mendelian inheritance of man
- nu, northwestern university
- rbc, red blood cell
- snp, single-nucleotide polymorphism
- vumc, vanderbilt university medical center
Collapse
|
177
|
Jiang G, Solbrig HR, Chute CG. Quality evaluation of value sets from cancer study common data elements using the UMLS semantic groups. J Am Med Inform Assoc 2012; 19:e129-36. [PMID: 22511016 PMCID: PMC3392855 DOI: 10.1136/amiajnl-2011-000739] [Citation(s) in RCA: 19] [Impact Index Per Article: 1.6] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/04/2022] Open
Abstract
OBJECTIVE The objective of this study is to develop an approach to evaluate the quality of terminological annotations on the value set (ie, enumerated value domain) components of the common data elements (CDEs) in the context of clinical research using both unified medical language system (UMLS) semantic types and groups. MATERIALS AND METHODS The CDEs of the National Cancer Institute (NCI) Cancer Data Standards Repository, the NCI Thesaurus (NCIt) concepts and the UMLS semantic network were integrated using a semantic web-based framework for a SPARQL-enabled evaluation. First, the set of CDE-permissible values with corresponding meanings in external controlled terminologies were isolated. The corresponding value meanings were then evaluated against their NCI- or UMLS-generated semantic network mapping to determine whether all of the meanings fell within the same semantic group. RESULTS Of the enumerated CDEs in the Cancer Data Standards Repository, 3093 (26.2%) had elements drawn from more than one UMLS semantic group. A random sample (n=100) of this set of elements indicated that 17% of them were likely to have been misclassified. DISCUSSION The use of existing semantic web tools can support a high-throughput mechanism for evaluating the quality of large CDE collections. This study demonstrates that the involvement of multiple semantic groups in an enumerated value domain of a CDE is an effective anchor to trigger an auditing point for quality evaluation activities. CONCLUSION This approach produces a useful quality assurance mechanism for a clinical study CDE repository.
Collapse
|
178
|
Wu ST, Liu H, Li D, Tao C, Musen MA, Chute CG, Shah NH. Unified Medical Language System term occurrences in clinical notes: a large-scale corpus analysis. J Am Med Inform Assoc 2012; 19:e149-56. [PMID: 22493050 PMCID: PMC3392861 DOI: 10.1136/amiajnl-2011-000744] [Citation(s) in RCA: 49] [Impact Index Per Article: 4.1] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/03/2022] Open
Abstract
Objective To characterise empirical instances of Unified Medical Language System (UMLS) Metathesaurus term strings in a large clinical corpus, and to illustrate what types of term characteristics are generalisable across data sources. Design Based on the occurrences of UMLS terms in a 51 million document corpus of Mayo Clinic clinical notes, this study computes statistics about the terms' string attributes, source terminologies, semantic types and syntactic categories. Term occurrences in 2010 i2b2/VA text were also mapped; eight example filters were designed from the Mayo-based statistics and applied to i2b2/VA data. Results For the corpus analysis, negligible numbers of mapped terms in the Mayo corpus had over six words or 55 characters. Of source terminologies in the UMLS, the Consumer Health Vocabulary and Systematized Nomenclature of Medicine—Clinical Terms (SNOMED-CT) had the best coverage in Mayo clinical notes at 106 426 and 94 788 unique terms, respectively. Of 15 semantic groups in the UMLS, seven groups accounted for 92.08% of term occurrences in Mayo data. Syntactically, over 90% of matched terms were in noun phrases. For the cross-institutional analysis, using five example filters on i2b2/VA data reduces the actual lexicon to 19.13% of the size of the UMLS and only sees a 2% reduction in matched terms. Conclusion The corpus statistics presented here are instructive for building lexicons from the UMLS. Features intrinsic to Metathesaurus terms (well formedness, length and language) generalise easily across clinical institutions, but term frequencies should be adapted with caution. The semantic groups of mapped terms may differ slightly from institution to institution, but they differ greatly when moving to the biomedical literature domain.
Collapse
|
179
|
Chute CG, Huff SM, Ferguson JA, Walker JM, Halamka JD. There are important reasons for delaying implementation of the new ICD-10 coding system. Health Aff (Millwood) 2012; 31:836-42. [PMID: 22442180 DOI: 10.1377/hlthaff.2011.1258] [Citation(s) in RCA: 18] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/05/2022]
Abstract
Federal authorities have recently signaled that they would consider delaying some aspects of implementation of the newest version of the International Classification of Diseases, known as ICD-10-CM, a coding system used to define health care charges and diagnoses. Some industry groups have reacted with dismay, and many providers with relief. We are concerned that adopting this new classification system for reimbursement will be disruptive and costly and will offer no material improvement over the current system. Because the health care community is also working to integrate health information technology and federal meaningful-use specifications that require the adoption of other complex coding standardization systems (such as the system called SNOMED CT), we recommend that the Centers for Medicare and Medicaid Services consider delaying the adoption of ICD-10-CM. Policy makers should also begin planning now for ways to make the coming transition to ICD-11 as tolerable as possible for the health care and payment community.
Collapse
|
180
|
Song D, Chute CG, Tao C. Semantator: annotating clinical narratives with semantic web ontologies. AMIA JOINT SUMMITS ON TRANSLATIONAL SCIENCE PROCEEDINGS. AMIA JOINT SUMMITS ON TRANSLATIONAL SCIENCE 2012; 2012:20-29. [PMID: 22779043 PMCID: PMC3392053] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Grants] [Figures] [Subscribe] [Scholar Register] [Indexed: 06/01/2023]
Abstract
To facilitate clinical research, clinical data needs to be stored in a machine processable and understandable way. Manual annotating clinical data is time consuming. Automatic approaches (e.g., Natural Language Processing systems) have been adopted to convert such data into structured formats; however, the quality of such automatically extracted data may not always be satisfying. In this paper, we propose Semantator, a semi-automatic tool for document annotation with Semantic Web ontologies. With a loaded free text document and an ontology, Semantator supports the creation/deletion of ontology instances for any document fragment, linking/disconnecting instances with the properties in the ontology, and also enables automatic annotation by connecting to the NCBO annotator and cTAKES. By representing annotations in Semantic Web standards, Semantator supports reasoning based upon the underlying semantics of the owl:disjointWith and owl:equivalentClass predicates. We present discussions based on user experiences of using Semantator.
Collapse
|
181
|
Pathak J, Kiefer RC, Chute CG. Using semantic web technologies for cohort identification from electronic health records for clinical research. AMIA JOINT SUMMITS ON TRANSLATIONAL SCIENCE PROCEEDINGS. AMIA JOINT SUMMITS ON TRANSLATIONAL SCIENCE 2012; 2012:10-9. [PMID: 22779040 PMCID: PMC3392057] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Indexed: 12/01/2022]
Abstract
The ability to conduct genome-wide association studies (GWAS) has enabled new exploration of how genetic variations contribute to health and disease etiology. One of the key requirements to perform GWAS is the identification of subject cohorts with accurate classification of disease phenotypes. In this work, we study how emerging Semantic Web technologies can be applied in conjunction with clinical data stored in electronic health records (EHRs) to accurately identify subjects with specific diseases for inclusion in cohort studies. In particular, we demonstrate the role of using Resource Description Framework (RDF) for representing EHR data and enabling federated querying and inferencing via standardized Web protocols for identifying subjects with Diabetes Mellitus. Our study highlights the potential of using Web-scale data federation approaches to execute complex queries.
Collapse
|
182
|
Sohn S, Wu S, Chute CG. Dependency Parser-based Negation Detection in Clinical Narratives. AMIA JOINT SUMMITS ON TRANSLATIONAL SCIENCE PROCEEDINGS. AMIA JOINT SUMMITS ON TRANSLATIONAL SCIENCE 2012; 2012:1-8. [PMID: 22779038 PMCID: PMC3392064] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Indexed: 11/02/2022]
Abstract
Negation of clinical named entities is common in clinical documents and is a crucial factor to accurately compile patients' clinical conditions and to further support complex phenotype detection. In 2009, Mayo Clinic released the clinical Text Analysis and Knowledge Extraction System (cTAKES), which includes a negation annotator that identifies negation status of a named entity by searching for negation words within a fixed word distance. However, this negation strategy is not sophisticated enough to correctly identify complicated patterns of negation. This paper aims to investigate whether the dependency structure from the cTAKES dependency parser can improve the negation detection performance. Manually compiled negation rules, derived from dependency paths were tested. Dependency negation rules do not limit the negation scope to word distance; instead, they are based on syntactic context. We found that using a dependency-based negation proved a superior alternative to the current cTAKES negation annotator.
Collapse
|
183
|
Rea S, Pathak J, Savova G, Oniki TA, Westberg L, Beebe CE, Tao C, Parker CG, Haug PJ, Huff SM, Chute CG. Building a robust, scalable and standards-driven infrastructure for secondary use of EHR data: the SHARPn project. J Biomed Inform 2012; 45:763-71. [PMID: 22326800 DOI: 10.1016/j.jbi.2012.01.009] [Citation(s) in RCA: 122] [Impact Index Per Article: 10.2] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/02/2011] [Revised: 01/23/2012] [Accepted: 01/25/2012] [Indexed: 02/08/2023]
Abstract
The Strategic Health IT Advanced Research Projects (SHARP) Program, established by the Office of the National Coordinator for Health Information Technology in 2010 supports research findings that remove barriers for increased adoption of health IT. The improvements envisioned by the SHARP Area 4 Consortium (SHARPn) will enable the use of the electronic health record (EHR) for secondary purposes, such as care process and outcomes improvement, biomedical research and epidemiologic monitoring of the nation's health. One of the primary informatics problem areas in this endeavor is the standardization of disparate health data from the nation's many health care organizations and providers. The SHARPn team is developing open source services and components to support the ubiquitous exchange, sharing and reuse or 'liquidity' of operational clinical data stored in electronic health records. One year into the design and development of the SHARPn framework, we demonstrated end to end data flow and a prototype SHARPn platform, using thousands of patient electronic records sourced from two large healthcare organizations: Mayo Clinic and Intermountain Healthcare. The platform was deployed to (1) receive source EHR data in several formats, (2) generate structured data from EHR narrative text, and (3) normalize the EHR data using common detailed clinical models and Consolidated Health Informatics standard terminologies, which were (4) accessed by a phenotyping service using normalized data specifications. The architecture of this prototype SHARPn platform is presented. The EHR data throughput demonstration showed success in normalizing native EHR data, both structured and narrative, from two independent organizations and EHR systems. Based on the demonstration, observed challenges for standardization of EHR data for interoperable secondary use are discussed.
Collapse
|
184
|
Wei WQ, Leibson CL, Ransom JE, Kho AN, Caraballo PJ, Chai HS, Yawn BP, Pacheco JA, Chute CG. Impact of data fragmentation across healthcare centers on the accuracy of a high-throughput clinical phenotyping algorithm for specifying subjects with type 2 diabetes mellitus. J Am Med Inform Assoc 2012; 19:219-24. [PMID: 22249968 DOI: 10.1136/amiajnl-2011-000597] [Citation(s) in RCA: 71] [Impact Index Per Article: 5.9] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/04/2022] Open
Abstract
OBJECTIVE To evaluate data fragmentation across healthcare centers with regard to the accuracy of a high-throughput clinical phenotyping (HTCP) algorithm developed to differentiate (1) patients with type 2 diabetes mellitus (T2DM) and (2) patients with no diabetes. MATERIALS AND METHODS This population-based study identified all Olmsted County, Minnesota residents in 2007. We used provider-linked electronic medical record data from the two healthcare centers that provide >95% of all care to County residents (ie, Olmsted Medical Center and Mayo Clinic in Rochester, Minnesota, USA). Subjects were limited to residents with one or more encounter January 1, 2006 through December 31, 2007 at both healthcare centers. DM-relevant data on diagnoses, laboratory results, and medication from both centers were obtained during this period. The algorithm was first executed using data from both centers (ie, the gold standard) and then from Mayo Clinic alone. Positive predictive values and false-negative rates were calculated, and the McNemar test was used to compare categorization when data from the Mayo Clinic alone were used with the gold standard. Age and sex were compared between true-positive and false-negative subjects with T2DM. Statistical significance was accepted as p<0.05. RESULTS With data from both medical centers, 765 subjects with T2DM (4256 non-DM subjects) were identified. When single-center data were used, 252 T2DM subjects (1573 non-DM subjects) were missed; an additional false-positive 27 T2DM subjects (215 non-DM subjects) were identified. The positive predictive values and false-negative rates were 95.0% (513/540) and 32.9% (252/765), respectively, for T2DM subjects and 92.6% (2683/2898) and 37.0% (1573/4256), respectively, for non-DM subjects. Age and sex distribution differed between true-positive (mean age 62.1; 45% female) and false-negative (mean age 65.0; 56.0% female) T2DM subjects. CONCLUSION The findings show that application of an HTCP algorithm using data from a single medical center contributes to misclassification. These findings should be considered carefully by researchers when developing and executing HTCP algorithms.
Collapse
|
185
|
Jiang G, Solbrig HR, Chute CG. Quality evaluation of cancer study Common Data Elements using the UMLS Semantic Network. J Biomed Inform 2011; 44 Suppl 1:S78-S85. [DOI: 10.1016/j.jbi.2011.08.001] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/28/2011] [Revised: 07/29/2011] [Accepted: 08/01/2011] [Indexed: 11/27/2022]
|
186
|
Savova GK, Olson JE, Murphy SP, Cafourek VL, Couch FJ, Goetz MP, Ingle JN, Suman VJ, Chute CG, Weinshilboum RM. Automated discovery of drug treatment patterns for endocrine therapy of breast cancer within an electronic medical record. J Am Med Inform Assoc 2011; 19:e83-9. [PMID: 22140207 DOI: 10.1136/amiajnl-2011-000295] [Citation(s) in RCA: 22] [Impact Index Per Article: 1.7] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/29/2023] Open
Abstract
OBJECTIVE To develop an algorithm for the discovery of drug treatment patterns for endocrine breast cancer therapy within an electronic medical record and to test the hypothesis that information extracted using it is comparable to the information found by traditional methods. MATERIALS The electronic medical charts of 1507 patients diagnosed with histologically confirmed primary invasive breast cancer. METHODS The automatic drug treatment classification tool consisted of components for: (1) extraction of drug treatment-relevant information from clinical narratives using natural language processing (clinical Text Analysis and Knowledge Extraction System); (2) extraction of drug treatment data from an electronic prescribing system; (3) merging information to create a patient treatment timeline; and (4) final classification logic. RESULTS Agreement between results from the algorithm and from a nurse abstractor is measured for categories: (0) no tamoxifen or aromatase inhibitor (AI) treatment; (1) tamoxifen only; (2) AI only; (3) tamoxifen before AI; (4) AI before tamoxifen; (5) multiple AIs and tamoxifen cycles in no specific order; and (6) no specific treatment dates. Specificity (all categories): 96.14%-100%; sensitivity (categories (0)-(4)): 90.27%-99.83%; sensitivity (categories (5)-(6)): 0-23.53%; positive predictive values: 80%-97.38%; negative predictive values: 96.91%-99.93%. DISCUSSION Our approach illustrates a secondary use of the electronic medical record. The main challenge is event temporality. CONCLUSION We present an algorithm for automated treatment classification within an electronic medical record to combine information extracted through natural language processing with that extracted from structured databases. The algorithm has high specificity for all categories, high sensitivity for five categories, and low sensitivity for two categories.
Collapse
|
187
|
Kho AN, Hayes MG, Rasmussen-Torvik L, Pacheco JA, Thompson WK, Armstrong LL, Denny JC, Peissig PL, Miller AW, Wei WQ, Bielinski SJ, Chute CG, Leibson CL, Jarvik GP, Crosslin DR, Carlson CS, Newton KM, Wolf WA, Chisholm RL, Lowe WL. Use of diverse electronic medical record systems to identify genetic risk for type 2 diabetes within a genome-wide association study. J Am Med Inform Assoc 2011; 19:212-8. [PMID: 22101970 DOI: 10.1136/amiajnl-2011-000439] [Citation(s) in RCA: 213] [Impact Index Per Article: 16.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/08/2023] Open
Abstract
OBJECTIVE Genome-wide association studies (GWAS) require high specificity and large numbers of subjects to identify genotype-phenotype correlations accurately. The aim of this study was to identify type 2 diabetes (T2D) cases and controls for a GWAS, using data captured through routine clinical care across five institutions using different electronic medical record (EMR) systems. MATERIALS AND METHODS An algorithm was developed to identify T2D cases and controls based on a combination of diagnoses, medications, and laboratory results. The performance of the algorithm was validated at three of the five participating institutions compared against clinician review. A GWAS was subsequently performed using cases and controls identified by the algorithm, with samples pooled across all five institutions. RESULTS The algorithm achieved 98% and 100% positive predictive values for the identification of diabetic cases and controls, respectively, as compared against clinician review. By standardizing and applying the algorithm across institutions, 3353 cases and 3352 controls were identified. Subsequent GWAS using data from five institutions replicated the TCF7L2 gene variant (rs7903146) previously associated with T2D. DISCUSSION By applying stringent criteria to EMR data collected through routine clinical care, cases and controls for a GWAS were identified that subsequently replicated a known genetic variant. The use of standard terminologies to define data elements enabled pooling of subjects and data across five different institutions to achieve the robust numbers required for GWAS. CONCLUSIONS An algorithm using commonly available data from five different EMR can accurately identify T2D cases and controls for genetic study across multiple institutions.
Collapse
|
188
|
Musen MA, Noy NF, Shah NH, Whetzel PL, Chute CG, Story MA, Smith B. The National Center for Biomedical Ontology. J Am Med Inform Assoc 2011; 19:190-5. [PMID: 22081220 DOI: 10.1136/amiajnl-2011-000523] [Citation(s) in RCA: 142] [Impact Index Per Article: 10.9] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/03/2022] Open
Abstract
The National Center for Biomedical Ontology is now in its seventh year. The goals of this National Center for Biomedical Computing are to: create and maintain a repository of biomedical ontologies and terminologies; build tools and web services to enable the use of ontologies and terminologies in clinical and translational research; educate their trainees and the scientific community broadly about biomedical ontology and ontology-based technology and best practices; and collaborate with a variety of groups who develop and use ontologies and terminologies in biomedicine. The centerpiece of the National Center for Biomedical Ontology is a web-based resource known as BioPortal. BioPortal makes available for research in computationally useful forms more than 270 of the world's biomedical ontologies and terminologies, and supports a wide range of web services that enable investigators to use the ontologies to annotate and retrieve data, to generate value sets and special-purpose lexicons, and to perform advanced analytics on a wide range of biomedical data.
Collapse
|
189
|
Chute CG, Pathak J, Savova GK, Bailey KR, Schor MI, Hart LA, Beebe CE, Huff SM. The SHARPn project on secondary use of Electronic Medical Record data: progress, plans, and possibilities. AMIA ... ANNUAL SYMPOSIUM PROCEEDINGS. AMIA SYMPOSIUM 2011; 2011:248-256. [PMID: 22195076 PMCID: PMC3243296] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Subscribe] [Scholar Register] [Indexed: 05/31/2023]
Abstract
SHARPn is a collaboration among 16 academic and industry partners committed to the production and distribution of high-quality software artifacts that support the secondary use of EMR data. Areas of emphasis are data normalization, natural language processing, high-throughput phenotyping, and data quality metrics. Our work avails the industrial scalability afforded by the Unstructured Information Management Architecture (UIMA) from IBM Watson Research labs, the same framework which underpins the Watson Jeopardy demonstration. This descriptive paper outlines our present work and achievements, and presages our trajectory for the remainder of the funding period. The project is one of the four Strategic Health IT Advanced Research Projects (SHARP) projects funded by the Office of the National Coordinator in 2010.
Collapse
|
190
|
Pathak J, Murphy SP, Willaert BN, Kremers HM, Yawn BP, Rocca WA, Chute CG. Using RxNorm and NDF-RT to classify medication data extracted from electronic health records: experiences from the Rochester Epidemiology Project. AMIA ... ANNUAL SYMPOSIUM PROCEEDINGS. AMIA SYMPOSIUM 2011; 2011:1089-1098. [PMID: 22195170 PMCID: PMC3243205] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Subscribe] [Scholar Register] [Indexed: 05/31/2023]
Abstract
RxNorm and NDF-RT published by the National Library of Medicine (NLM) and Veterans Affairs (VA), respectively, are two publicly available federal medication terminologies. In this study, we evaluate the applicability of RxNorm and National Drug File-Reference Terminology (NDF-RT) for extraction and classification of medication data retrieved using structured querying and natural language processing techniques from electronic health records at two different medical centers within the Rochester Epidemiology Project (REP). Specifically, we explore how mappings between RxNorm concept codes and NDF-RT drug classes can be leveraged for hierarchical organization and grouping of REP medication data, identify gaps and coverage issues, and analyze the recently released NLM's NDF-RT Web service API. Our study concludes that RxNorm and NDF-RT can be applied together for classification of medication extracted from multiple EHR systems, although several issues and challenges remain to be addressed. We further conclude that the Web service APIs developed by the NLM provide useful functionalities for such activities.
Collapse
|
191
|
Conway M, Berg RL, Carrell D, Denny JC, Kho AN, Kullo IJ, Linneman JG, Pacheco JA, Peissig P, Rasmussen L, Weston N, Chute CG, Pathak J. Analyzing the heterogeneity and complexity of Electronic Health Record oriented phenotyping algorithms. AMIA ... ANNUAL SYMPOSIUM PROCEEDINGS. AMIA SYMPOSIUM 2011; 2011:274-283. [PMID: 22195079 PMCID: PMC3243189] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Subscribe] [Scholar Register] [Indexed: 05/31/2023]
Abstract
The need for formal representations of eligibility criteria for clinical trials - and for phenotyping more generally - has been recognized for some time. Indeed, the availability of a formal computable representation that adequately reflects the types of data and logic evidenced in trial designs is a prerequisite for the automatic identification of study-eligible patients from Electronic Health Records. As part of the wider process of representation development, this paper reports on an analysis of fourteen Electronic Health Record oriented phenotyping algorithms (developed as part of the eMERGE project) in terms of their constituent data elements, types of logic used and temporal characteristics. We discovered that the majority of eMERGE algorithms analyzed include complex, nested boolean logic and negation, with several dependent on cardinality constraints and complex temporal logic. Insights gained from the study will be used to augment the CDISC Protocol Representation Model.
Collapse
|
192
|
Jiang G, Solbrig HR, Chute CG. ADEpedia: a scalable and standardized knowledge base of Adverse Drug Events using semantic web technology. AMIA ... ANNUAL SYMPOSIUM PROCEEDINGS. AMIA SYMPOSIUM 2011; 2011:607-616. [PMID: 22195116 PMCID: PMC3243176] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Subscribe] [Scholar Register] [Indexed: 05/31/2023]
Abstract
A source of semantically coded Adverse Drug Event (ADE) data can be useful for identifying common phenotypes related to ADEs. We proposed a comprehensive framework for building a standardized ADE knowledge base (called ADEpedia) through combining ontology-based approach with semantic web technology. The framework comprises four primary modules: 1) an XML2RDF transformation module; 2) a data normalization module based on NCBO Open Biomedical Annotator; 3) a RDF store based persistence module; and 4) a front-end module based on a Semantic Wiki for the review and curation. A prototype is successfully implemented to demonstrate the capability of the system to integrate multiple drug data and ontology resources and open web services for the ADE data standardization. A preliminary evaluation is performed to demonstrate the usefulness of the system, including the performance of the NCBO annotator. In conclusion, the semantic web technology provides a highly scalable framework for ADE data source integration and standard query service.
Collapse
|
193
|
Tao C, Parker CG, Oniki TA, Pathak J, Huff SM, Chute CG. An OWL meta-ontology for representing the Clinical Element Model. AMIA ... ANNUAL SYMPOSIUM PROCEEDINGS. AMIA SYMPOSIUM 2011; 2011:1372-1381. [PMID: 22195200 PMCID: PMC3243162] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Subscribe] [Scholar Register] [Indexed: 05/31/2023]
Abstract
The Clinical Element Model (CEM) is a strategy designed to represent logical models for clinical data elements to ensure unambiguous data representation, interpretation, and exchange within and across heterogeneous sources and applications. The current representations of CEMs have limitations on expressing semantics and formal definitions of the structure and the semantics. Here we introduce our initial efforts on representing the CEM in OWL, so that the enrichment with OWL semantics and further semantic processing can be achieved in CEM. The focus of this paper is the CEM meta-ontology where the basic structures, the properties and their relationships, and the constraints are defined. These OWL representation specifications have been reviewed by CEM experts to ensure they capture the intended meaning of the model faithfully.
Collapse
|
194
|
Clark KK, Sharma DK, Chute CG, Tao C. Application of a temporal reasoning framework tool in analysis of medical device adverse events. AMIA ... ANNUAL SYMPOSIUM PROCEEDINGS. AMIA SYMPOSIUM 2011; 2011:1366-1371. [PMID: 22195199 PMCID: PMC3243139] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Subscribe] [Scholar Register] [Indexed: 05/31/2023]
Abstract
The Clinical Narrative Temporal Relation Ontology (CNTRO)1 project offers a semantic-web based reasoning framework, which represents temporal events and relationships within clinical narrative texts, and infer new knowledge over them. In this paper, the CNTRO reasoning framework is applied to temporal analysis of medical device adverse event files. One specific adverse event was used as a test case: late stent thrombosis. Adverse event narratives were obtained from the Food and Drug Administration's (FDA) Manufacturing and User Facility Device Experience (MAUDE) database2. 15 adverse event files in which late stent thrombosis was confirmed were randomly selected across multiple drug eluting stent devices. From these files, 81 events and 72 temporal relations were annotated. 73 temporal questions were generated, of which 65 were correctly answered by the CNTRO system. This results in an overall accuracy of 89%. This system should be pursued further to continue assessing its potential benefits in temporal analysis of medical device adverse events.
Collapse
|
195
|
Denny JC, Crawford DC, Ritchie MD, Bielinski SJ, Basford MA, Bradford Y, Chai HS, Bastarache L, Zuvich R, Peissig P, Carrell D, Ramirez AH, Pathak J, Wilke RA, Rasmussen L, Wang X, Pacheco JA, Kho AN, Hayes MG, Weston N, Matsumoto M, Kopp PA, Newton KM, Jarvik GP, Li R, Manolio TA, Kullo IJ, Chute CG, Chisholm RL, Larson EB, McCarty CA, Masys DR, Roden DM, de Andrade M. Variants near FOXE1 are associated with hypothyroidism and other thyroid conditions: using electronic medical records for genome- and phenome-wide studies. Am J Hum Genet 2011; 89:529-42. [PMID: 21981779 PMCID: PMC3188836 DOI: 10.1016/j.ajhg.2011.09.008] [Citation(s) in RCA: 193] [Impact Index Per Article: 14.8] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/28/2011] [Revised: 09/15/2011] [Accepted: 09/15/2011] [Indexed: 12/20/2022] Open
Abstract
We repurposed existing genotypes in DNA biobanks across the Electronic Medical Records and Genomics network to perform a genome-wide association study for primary hypothyroidism, the most common thyroid disease. Electronic selection algorithms incorporating billing codes, laboratory values, text queries, and medication records identified 1317 cases and 5053 controls of European ancestry within five electronic medical records (EMRs); the algorithms' positive predictive values were 92.4% and 98.5% for cases and controls, respectively. Four single-nucleotide polymorphisms (SNPs) in linkage disequilibrium at 9q22 near FOXE1 were associated with hypothyroidism at genome-wide significance, the strongest being rs7850258 (odds ratio [OR] 0.74, p = 3.96 × 10(-9)). This association was replicated in a set of 263 cases and 1616 controls (OR = 0.60, p = 5.7 × 10(-6)). A phenome-wide association study (PheWAS) that was performed on this locus with 13,617 individuals and more than 200,000 patient-years of billing data identified associations with additional phenotypes: thyroiditis (OR = 0.58, p = 1.4 × 10(-5)), nodular (OR = 0.76, p = 3.1 × 10(-5)) and multinodular (OR = 0.69, p = 3.9 × 10(-5)) goiters, and thyrotoxicosis (OR = 0.76, p = 1.5 × 10(-3)), but not Graves disease (OR = 1.03, p = 0.82). Thyroid cancer, previously associated with this locus, was not significantly associated in the PheWAS (OR = 1.29, p = 0.09). The strongest association in the PheWAS was hypothyroidism (OR = 0.76, p = 2.7 × 10(-13)), which had an odds ratio that was nearly identical to that of the curated case-control population in the primary analysis, providing further validation of the PheWAS method. Our findings indicate that EMR-linked genomic data could allow discovery of genes associated with many diseases without additional genotyping cost.
Collapse
|
196
|
Sohn S, Kocher JPA, Chute CG, Savova GK. Drug side effect extraction from clinical narratives of psychiatry and psychology patients. J Am Med Inform Assoc 2011; 18 Suppl 1:i144-9. [PMID: 21946242 DOI: 10.1136/amiajnl-2011-000351] [Citation(s) in RCA: 76] [Impact Index Per Article: 5.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/31/2022] Open
Abstract
OBJECTIVE To extract physician-asserted drug side effects from electronic medical record clinical narratives. MATERIALS AND METHODS Pattern matching rules were manually developed through examining keywords and expression patterns of side effects to discover an individual side effect and causative drug relationship. A combination of machine learning (C4.5) using side effect keyword features and pattern matching rules was used to extract sentences that contain side effect and causative drug pairs, enabling the system to discover most side effect occurrences. Our system was implemented as a module within the clinical Text Analysis and Knowledge Extraction System. RESULTS The system was tested in the domain of psychiatry and psychology. The rule-based system extracting side effects and causative drugs produced an F score of 0.80 (0.55 excluding allergy section). The hybrid system identifying side effect sentences had an F score of 0.75 (0.56 excluding allergy section) but covered more side effect and causative drug pairs than individual side effect extraction. DISCUSSION The rule-based system was able to identify most side effects expressed by clear indication words. More sophisticated semantic processing is required to handle complex side effect descriptions in the narrative. We demonstrated that our system can be trained to identify sentences with complex side effect descriptions that can be submitted to a human expert for further abstraction. CONCLUSION Our system was able to extract most physician-asserted drug side effects. It can be used in either an automated mode for side effect extraction or semi-automated mode to identify side effect sentences that can significantly simplify abstraction by a human expert.
Collapse
|
197
|
Bielinski SJ, Chai HS, Pathak J, Talwalkar JA, Limburg PJ, Gullerud RE, Sicotte H, Klee EW, Ross JL, Kocher JPA, Kullo IJ, Heit JA, Petersen GM, de Andrade M, Chute CG. Mayo Genome Consortia: a genotype-phenotype resource for genome-wide association studies with an application to the analysis of circulating bilirubin levels. Mayo Clin Proc 2011; 86:606-14. [PMID: 21646302 PMCID: PMC3127556 DOI: 10.4065/mcp.2011.0178] [Citation(s) in RCA: 52] [Impact Index Per Article: 4.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Submit a Manuscript] [Subscribe] [Scholar Register] [Indexed: 12/20/2022]
Abstract
OBJECTIVE To create a cohort for cost-effective genetic research, the Mayo Genome Consortia (MayoGC) has been assembled with participants from research studies across Mayo Clinic with high-throughput genetic data and electronic medical record (EMR) data for phenotype extraction. PARTICIPANTS AND METHODS Eligible participants include those who gave general research consent in the contributing studies to share high-throughput genotyping data with other investigators. Herein, we describe the design of the MayoGC, including the current participating cohorts, expansion efforts, data processing, and study management and organization. A genome-wide association study to identify genetic variants associated with total bilirubin levels was conducted to test the genetic research capability of the MayoGC. RESULTS Genome-wide significant results were observed on 2q37 (top single nucleotide polymorphism, rs4148325; P=5.0 × 10(-62)) and 12p12 (top single nucleotide polymorphism, rs4363657; P=5.1 × 10(-8)) corresponding to a gene cluster of uridine 5'-diphospho-glucuronosyltransferases (the UGT1A cluster) and solute carrier organic anion transporter family, member 1B1 (SLCO1B1), respectively. CONCLUSION Genome-wide association studies have identified genetic variants associated with numerous phenotypes but have been historically limited by inadequate sample size due to costly genotyping and phenotyping. Large consortia with harmonized genotype data have been assembled to attain sufficient statistical power, but phenotyping remains a rate-limiting factor in gene discovery research efforts. The EMR consists of an abundance of phenotype data that can be extracted in a relatively quick and systematic manner. The MayoGC provides a model of a unique collaborative effort in the environment of a common EMR for the investigation of genetic determinants of diseases.
Collapse
|
198
|
Pathak J, Wang J, Kashyap S, Basford M, Li R, Masys DR, Chute CG. Mapping clinical phenotype data elements to standardized metadata repositories and controlled terminologies: the eMERGE Network experience. J Am Med Inform Assoc 2011; 18:376-86. [PMID: 21597104 DOI: 10.1136/amiajnl-2010-000061] [Citation(s) in RCA: 81] [Impact Index Per Article: 6.2] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/04/2022] Open
Abstract
BACKGROUND Systematic study of clinical phenotypes is important for a better understanding of the genetic basis of human diseases and more effective gene-based disease management. A key aspect in facilitating such studies requires standardized representation of the phenotype data using common data elements (CDEs) and controlled biomedical vocabularies. In this study, the authors analyzed how a limited subset of phenotypic data is amenable to common definition and standardized collection, as well as how their adoption in large-scale epidemiological and genome-wide studies can significantly facilitate cross-study analysis. METHODS The authors mapped phenotype data dictionaries from five different eMERGE (Electronic Medical Records and Genomics) Network sites studying multiple diseases such as peripheral arterial disease and type 2 diabetes. For mapping, standardized terminological and metadata repository resources, such as the caDSR (Cancer Data Standards Registry and Repository) and SNOMED CT (Systematized Nomenclature of Medicine), were used. The mapping process comprised both lexical (via searching for relevant pre-coordinated concepts and data elements) and semantic (via post-coordination) techniques. Where feasible, new data elements were curated to enhance the coverage during mapping. A web-based application was also developed to uniformly represent and query the mapped data elements from different eMERGE studies. RESULTS Approximately 60% of the target data elements (95 out of 157) could be mapped using simple lexical analysis techniques on pre-coordinated terms and concepts before any additional curation of terminology and metadata resources was initiated by eMERGE investigators. After curation of 54 new caDSR CDEs and nine new NCI thesaurus concepts and using post-coordination, the authors were able to map the remaining 40% of data elements to caDSR and SNOMED CT. A web-based tool was also implemented to assist in semi-automatic mapping of data elements. CONCLUSION This study emphasizes the requirement for standardized representation of clinical research data using existing metadata and terminology resources and provides simple techniques and software for data element mapping using experiences from the eMERGE Network.
Collapse
|
199
|
Kho AN, Pacheco JA, Peissig PL, Rasmussen L, Newton KM, Weston N, Crane PK, Pathak J, Chute CG, Bielinski SJ, Kullo IJ, Li R, Manolio TA, Chisholm RL, Denny JC. Electronic medical records for genetic research: results of the eMERGE consortium. Sci Transl Med 2011; 3:79re1. [PMID: 21508311 PMCID: PMC3690272 DOI: 10.1126/scitranslmed.3001807] [Citation(s) in RCA: 246] [Impact Index Per Article: 18.9] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/02/2022]
Abstract
Clinical data in electronic medical records (EMRs) are a potential source of longitudinal clinical data for research. The Electronic Medical Records and Genomics Network (eMERGE) investigates whether data captured through routine clinical care using EMRs can identify disease phenotypes with sufficient positive and negative predictive values for use in genome-wide association studies (GWAS). Using data from five different sets of EMRs, we have identified five disease phenotypes with positive predictive values of 73 to 98% and negative predictive values of 98 to 100%. Most EMRs captured key information (diagnoses, medications, laboratory tests) used to define phenotypes in a structured format. We identified natural language processing as an important tool to improve case identification rates. Efforts and incentives to increase the implementation of interoperable EMRs will markedly improve the availability of clinical data for genomics research.
Collapse
|
200
|
Pathak J, Chute CG. Further revamping VA's NDF-RT drug terminology for clinical research. J Am Med Inform Assoc 2011; 18:347-8. [PMID: 21415066 DOI: 10.1136/amiajnl-2011-000161] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/03/2022] Open
|