1
Yap SHA, Philip S, Graveling AJ, Abraham P, Downs D. Creating a SNOMED CT reference set for common endocrine disorders based on routine clinic correspondence. Clin Endocrinol (Oxf) 2024; 100:343-349. [PMID: 37555365 DOI: 10.1111/cen.14951] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/04/2023] [Revised: 06/13/2023] [Accepted: 07/13/2023] [Indexed: 08/10/2023]
Abstract
BACKGROUND Routine clinical coding of clinical outcomes in outpatient consultations still lags behind the coding of episodes of inpatient care. Systematized Nomenclature of Medicine Clinical Terms (SNOMED CT) offers an opportunity for standardised coding of key clinical information. Identifying the most commonly required SNOMED terms and grouping these into a reference set will aid future adoption in routine clinical care. OBJECTIVE To create a common endocrinology reference set to standardise the coding for outcomes of outpatient endocrine consultations, using a semi-automated extraction of information from existing clinical correspondence. METHODS Retrospective review of data from an adult tertiary outpatient endocrine clinic between 2018 and 2019. A total of 1870 patients from postcodes within two regional areas of NHS Grampian (Aberdeen City and Aberdeenshire) attended the clinic. Following consultation, an automated script extracted each problem statement, which was manually coded using the 'disorder' concepts from SNOMED CT (UK edition). RESULTS The review identified 298 relevant endocrine diagnoses, 99 findings and 142 procedures. There were a total of 88 (29.5%) commonly seen endocrine conditions (e.g., Graves' disease, anterior hypopituitarism and Addison's disease) and 210 (70.5%) less commonly seen endocrine conditions. Subsequently, consultant endocrinologists completed a survey regarding the common endocrine conditions; 28 conditions have 100% agreement, 25 have 90%-99% agreement, 31 have 50%-89% agreement and 4 have less than 50% agreement (which were excluded). CONCLUSION Automated text parsing of structured endocrine correspondence allowed the creation of a SNOMED CT reference set for common endocrine disorders. This will facilitate funding and planning of service provision in endocrinology by allowing more accurate characterisation of the patient cohorts needing specialist endocrine care.
Affiliation(s)
- Shao Hao Alan Yap
- JJR Macleod Centre for Diabetes & Endocrinology, Aberdeen Royal Infirmary, Aberdeen, UK
- Sam Philip
- JJR Macleod Centre for Diabetes & Endocrinology, Aberdeen Royal Infirmary, Aberdeen, UK
- Alex J Graveling
- JJR Macleod Centre for Diabetes & Endocrinology, Aberdeen Royal Infirmary, Aberdeen, UK
- Prakash Abraham
- JJR Macleod Centre for Diabetes & Endocrinology, Aberdeen Royal Infirmary, Aberdeen, UK
2
Vinnikov M, Chaudhari V, Geller J. Usability and Recall Evaluation of Virtual Reality Ontology Object Manipulation (VROOM) System. AMIA ... ANNUAL SYMPOSIUM PROCEEDINGS. AMIA SYMPOSIUM 2024; 2023:726-735. [PMID: 38222384 PMCID: PMC10785858] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/16/2024]
Abstract
Biomedical ontologies are repositories of knowledge that encapsulate biomedical terms and the relationships between them. When visualized, ontologies are complex graphs, where each node represents one biomedical concept, and links express binary relationships between pairs of concepts. Such a network can have thousands of nodes, making visualization and manipulation difficult. This paper presents a novel Virtual Reality Ontology Object Manipulation (VROOM) system that supports browsing and interaction with a biomedical ontology in a virtual 3-D space and a complementary usability evaluation of VROOM. VROOM provides editing tools such as scissors and a glue stick that can be used to reconnect concepts by direct manipulation. The study compares the recall process of information in a traditional 2-D ontology editor such as Protégé with the virtual reality setting. Our results show that virtual reality ontology manipulation is preferred over a more conventional graphical ontology browser on many usability aspects.
Affiliation(s)
- James Geller
- New Jersey Institute of Technology (NJIT), Newark, NJ
3
Zheng L, Perl Y, He Y, Ochs C, Geller J, Liu H, Keloth VK. Visual comprehension and orientation into the COVID-19 CIDO ontology. J Biomed Inform 2021; 120:103861. [PMID: 34224898 PMCID: PMC8252699 DOI: 10.1016/j.jbi.2021.103861] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 10/23/2020] [Revised: 05/11/2021] [Accepted: 06/30/2021] [Indexed: 12/12/2022]
Abstract
The current intensive research on potential remedies and vaccinations for COVID-19 would greatly benefit from an ontology of standardized COVID terms. The Coronavirus Infectious Disease Ontology (CIDO) is the largest among several COVID ontologies, and it keeps growing, but it is still a medium sized ontology. Sophisticated CIDO users, who need more than searching for a specific concept, require orientation and comprehension of CIDO. In previous research, we designed a summarization network called "partial-area taxonomy" to support comprehension of ontologies. The partial-area taxonomy for CIDO is of smaller magnitude than CIDO, but is still too large for comprehension. We present here the "weighted aggregate taxonomy" of CIDO, designed to provide compact views at various granularities of our partial-area taxonomy (and the CIDO ontology). Such a compact view provides a "big picture" of the content of an ontology. In previous work, in the visualization patterns used for partial-area taxonomies, the nodes were arranged in levels according to the numbers of relationships of their concepts. Applying this visualization pattern to CIDO's weighted aggregate taxonomy resulted in an overly long and narrow layout that does not support orientation and comprehension since the names of nodes are barely readable. Thus, we introduce in this paper an innovative visualization of the weighted aggregate taxonomy for better orientation and comprehension of CIDO (and other ontologies). A measure for the efficiency of a layout is introduced and is used to demonstrate the advantage of the new layout over the previous one. With this new visualization, the user can "see the forest for the trees" of the ontology. Benefits of this visualization in highlighting insights into CIDO's content are provided. Generality of the new layout is demonstrated.
Affiliation(s)
- Ling Zheng
- Computer Science and Software Engineering Department, Monmouth University, West Long Branch, NJ, USA.
- Yehoshua Perl
- Department of Computer Science, New Jersey Institute of Technology, Newark, NJ, USA
- Yongqun He
- Department of Computational Medicine and Bioinformatics, University of Michigan Medical School, Ann Arbor, MI, USA
- James Geller
- Department of Computer Science, New Jersey Institute of Technology, Newark, NJ, USA
- Hao Liu
- Columbia University Irving Medical Center, New York, NY, USA
- Vipina K Keloth
- Department of Computer Science, New Jersey Institute of Technology, Newark, NJ, USA
4
Zheng L, Min H, Chen Y, Keloth V, Geller J, Perl Y, Hripcsak G. Outlier concepts auditing methodology for a large family of biomedical ontologies. BMC Med Inform Decis Mak 2020; 20:296. [PMID: 33319713 PMCID: PMC7737254 DOI: 10.1186/s12911-020-01311-x] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 10/26/2020] [Accepted: 10/28/2020] [Indexed: 11/10/2022]
Abstract
BACKGROUND Summarization networks are compact summaries of ontologies. The "Big Picture" view offered by summarization networks makes it possible to identify sets of concepts that are more likely to have errors than control concepts. For ontologies that have outgoing lateral relationships, we have developed the "partial-area taxonomy" summarization network. Prior research has identified one kind of outlier concepts: concepts of small partial-areas within partial-area taxonomies. Previously we have shown that the small partial-area technique works successfully for four ontologies (or their hierarchies). METHODS To improve the Quality Assurance (QA) scalability, a family-based QA framework, where one QA technique is potentially applicable to a whole family of ontologies with similar structural features, was developed. The 373 ontologies hosted at the NCBO BioPortal in 2015 were classified into a collection of families based on structural features. A meta-ontology represents this family collection, including one family of ontologies having outgoing lateral relationships. The process of updating the current meta-ontology is described. To conclude that one QA technique is applicable for at least half of the members for a family F, this technique should be demonstrated as successful for six out of six ontologies in F. We describe a hypothesis setting the condition required for a technique to be successful for a given ontology. The process of a study to demonstrate such success is described. This paper intends to prove the scalability of the small partial-area technique. RESULTS We first updated the meta-ontology classifying 566 BioPortal ontologies. There were 371 ontologies in the family with outgoing lateral relationships. We demonstrated the success of the small partial-area technique for two ontology hierarchies which belong to this family, SNOMED CT's Specimen hierarchy and NCIt's Gene hierarchy. Together with the four previous ontologies from the same family, we fulfilled the "six out of six" condition required to show the scalability for the whole family. CONCLUSIONS We have shown that the small partial-area technique can be potentially successful for the family of ontologies with outgoing lateral relationships in BioPortal, thus improving the scalability of this QA technique.
Affiliation(s)
- Ling Zheng
- Computer Science and Software Engineering Department, Monmouth University, West Long Branch, NJ, 07764, USA.
- Hua Min
- Department of Health Administration and Policy, George Mason University, Fairfax, VA, 22030, USA
- Yan Chen
- CIS Department, Borough of Manhattan Community College, CUNY, New York, NY, 10007, USA
- Vipina Keloth
- Department of Computer Science, New Jersey Institute of Technology, Newark, NJ, 07102, USA
- James Geller
- Department of Computer Science, New Jersey Institute of Technology, Newark, NJ, 07102, USA
- Yehoshua Perl
- Department of Computer Science, New Jersey Institute of Technology, Newark, NJ, 07102, USA
- George Hripcsak
- Department of Biomedical Informatics, Columbia University, New York, NY, 10032, USA
5
Zheng L, Chen Y, Min H, Hildebrand PL, Liu H, Halper M, Geller J, de Coronado S, Perl Y. Missing lateral relationships in top-level concepts of an ontology. BMC Med Inform Decis Mak 2020; 20:305. [PMID: 33319709 PMCID: PMC7737264 DOI: 10.1186/s12911-020-01319-3] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 10/29/2020] [Accepted: 11/09/2020] [Indexed: 11/10/2022]
Abstract
BACKGROUND Ontologies house various kinds of domain knowledge in formal structures, primarily in the form of concepts and the associative relationships between them. Ontologies have become integral components of many health information processing environments. Hence, quality assurance of the conceptual content of any ontology is critical. Relationships are foundational to the definition of concepts. Missing relationship errors (i.e., unintended omissions of important definitional relationships) can have a deleterious effect on the quality of an ontology. An abstraction network is a structure that overlays an ontology and provides an alternate, summarization view of its contents. One kind of abstraction network is called an area taxonomy, and a variation of it is called a subtaxonomy. A methodology based on these taxonomies for more readily finding missing relationship errors is explored. METHODS The area taxonomy and the subtaxonomy are deployed to help reveal concepts that have a high likelihood of exhibiting missing relationship errors. A specific top-level grouping unit found within the area taxonomy and subtaxonomy, when deemed to be anomalous, is used as an indicator that missing relationship errors are likely to be found among certain concepts. Two hypotheses pertaining to the effectiveness of our Quality Assurance approach are studied. RESULTS Our Quality Assurance methodology was applied to the Biological Process hierarchy of the National Cancer Institute thesaurus (NCIt) and SNOMED CT's Eye/vision finding subhierarchy within its Clinical finding hierarchy. Many missing relationship errors were discovered and confirmed in our analysis. For both test-bed hierarchies, our Quality Assurance methodology yielded a statistically significantly higher number of concepts with missing relationship errors in comparison to a control sample of concepts. Two hypotheses are confirmed by these findings. 
CONCLUSIONS Quality assurance is a critical part of an ontology's lifecycle, and automated or semi-automated tools for supporting this process are invaluable. We introduced a Quality Assurance methodology targeted at missing relationship errors. Its successful application to the NCIt's Biological Process hierarchy and SNOMED CT's Eye/vision finding subhierarchy indicates that it can be a useful addition to the arsenal of tools available to ontology maintenance personnel.
Affiliation(s)
- Ling Zheng
- Computer Science and Software Engineering Department, Monmouth University, West Long Branch, NJ, 07764, USA.
- Yan Chen
- CIS Department, Borough of Manhattan Community College, CUNY, New York, NY, 10007, USA
- Hua Min
- Department of Health Administration and Policy, George Mason University, Fairfax, VA, 22030, USA
- Hao Liu
- Department of Computer Science, New Jersey Institute of Technology, Newark, NJ, 07102, USA
- Michael Halper
- Department of Informatics, New Jersey Institute of Technology, Newark, NJ, 07102, USA
- James Geller
- Department of Computer Science, New Jersey Institute of Technology, Newark, NJ, 07102, USA
- Sherri de Coronado
- National Cancer Institute, Center for Biomedical Informatics and Information Technology, National Institutes of Health, Rockville, MD, 20850, USA
- Yehoshua Perl
- Department of Computer Science, New Jersey Institute of Technology, Newark, NJ, 07102, USA
6
Liu H, Perl Y, Geller J. Concept placement using BERT trained by transforming and summarizing biomedical ontology structure. J Biomed Inform 2020; 112:103607. [PMID: 33098987 DOI: 10.1016/j.jbi.2020.103607] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 02/18/2020] [Revised: 09/07/2020] [Accepted: 10/17/2020] [Indexed: 11/17/2022]
Abstract
The comprehensive modeling and hierarchical positioning of a new concept in an ontology heavily relies on its set of proper subsumption relationships (IS-As) to other concepts. Identifying a concept's IS-A relationships is a laborious task requiring curators to have both domain knowledge and terminology skills. In this work, we propose a method to automatically predict the presence of IS-A relationships between a new concept and pre-existing concepts based on the language representation model BERT. This method converts the neighborhood network of a concept into "sentences" and harnesses BERT's Next Sentence Prediction (NSP) capability of predicting the adjacency of two sentences. To augment our method's performance, we refined the training data by employing an ontology summarization technique. We trained our model with the two largest hierarchies of the SNOMED CT 2017 July release and applied it to predicting the parents of new concepts added in the SNOMED CT 2018 January release. The results showed that our method achieved an average F1 score of 0.88, and the average Recall score improves slightly from 0.94 to 0.96 by using the ontology summarization technique.
Affiliation(s)
- Hao Liu
- Dept of Computer Science, NJIT, Newark, NJ, USA.
7
Zheng L, Liu H, Perl Y, Geller J. Training a Convolutional Neural Network with Terminology Summarization Data Improves SNOMED CT Enrichment. AMIA ... ANNUAL SYMPOSIUM PROCEEDINGS. AMIA SYMPOSIUM 2020; 2019:972-981. [PMID: 32308894 PMCID: PMC7153126] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/11/2023]
Abstract
As a step toward learning to automatically insert new concepts into a large biomedical ontology, we are studying the easier problem of automatically verifying that an IS-A link should exist between a new child concept and an existing parent concept. We are using a Convolutional Neural Network, a powerful machine learning method. However, results depend on the quality of the training data. We use SNOMED CT (July 2017) for training and the subsequent release for testing. The main problem is to find a good set of negative training data. We experiment with two approaches, based on uncle-nephew (not connected) pairs of concepts. We contrast using the complete Clinical Finding hierarchy of SNOMED CT with using the powerful Area Taxonomy ontology summarization mechanism to constrain the training data. The results for the task of verifying IS-A links are improved by 8.6% when going from the complete hierarchy to the Area Taxonomy.
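The abstract does not spell out how the uncle-nephew pairs are collected, so the following is only a minimal Python sketch under one plausible reading: an "uncle" is a sibling of one of a concept's parents, and the pair counts as a negative training example when no IS-A path connects the two concepts in either direction. The function names and the toy hierarchy are hypothetical and are not taken from the paper or from SNOMED CT.

```python
def ancestors(concept, parents):
    """Collect every ancestor of `concept` by walking IS-A edges upward."""
    seen, stack = set(), list(parents.get(concept, ()))
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(parents.get(node, ()))
    return seen


def uncle_nephew_negatives(parents):
    """Return sorted (nephew, uncle) pairs that have no IS-A path between them."""
    # Invert the child -> parents map so siblings of a parent can be looked up.
    children = {}
    for child, direct_parents in parents.items():
        for parent in direct_parents:
            children.setdefault(parent, set()).add(child)

    pairs = set()
    for nephew, direct_parents in parents.items():
        above_nephew = ancestors(nephew, parents)
        for parent in direct_parents:
            for grandparent in parents.get(parent, ()):
                for uncle in children[grandparent]:
                    if uncle == nephew or uncle in above_nephew:
                        continue  # the parent itself, or already an ancestor
                    if nephew in ancestors(uncle, parents):
                        continue  # connected in the opposite direction
                    pairs.add((nephew, uncle))
    return sorted(pairs)


# Toy hierarchy (illustrative only): child -> list of direct parents.
TOY_PARENTS = {
    "viral pneumonia": ["pneumonia"],
    "pneumonia": ["lung disease"],
    "asthma": ["lung disease"],
}

print(uncle_nephew_negatives(TOY_PARENTS))
# [('viral pneumonia', 'asthma')]
```

Pairs produced this way are structurally close to true parent-child pairs, which is presumably what makes them useful as hard negative examples.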
Affiliation(s)
- Ling Zheng
- CSSE Department, Monmouth University, West Long Branch, NJ, USA
- Hao Liu
- SABOC Center, Department of Computer Science, NJIT, Newark, NJ, USA
- Yehoshua Perl
- SABOC Center, Department of Computer Science, NJIT, Newark, NJ, USA
- James Geller
- SABOC Center, Department of Computer Science, NJIT, Newark, NJ, USA
8
Zheng L, Liu H, Perl Y, Geller J, Ochs C, Case JT. Overlapping Complex Concepts Have More Commission Errors, Especially in Intensive Terminology Auditing. AMIA ... ANNUAL SYMPOSIUM PROCEEDINGS. AMIA SYMPOSIUM 2018; 2018:1157-1166. [PMID: 30815158 PMCID: PMC6371375] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/09/2023]
Abstract
SNOMED CT is a large, complex and widely-used terminology. Auditing is part of the life cycle of terminologies. A review of terminologies' content can identify two error categories: commission errors, such as an incorrect parent or attribute relationship, indicating errors in a concept's modeling, and omission errors, such as missing a parent or attribute relationship, representing incomplete modeling of a concept. According to our experience, terminology curators are mostly interested in commission errors. In recent years, a long-term remodeling project has addressed modeling issues in SNOMED CT's Infectious disease and Congenital disease subhierarchies. In this longitudinal study, we investigated a posteriori the efficacy of complex concepts, called overlapping concepts, to identify commission errors during intensive auditing periods and during maintenance periods over several releases. The algorithmic implication is that when auditing resources are scarce, a methodology of auditing first, or only, the overlapping concepts will obtain a higher auditing yield.
Affiliation(s)
- Ling Zheng
- Monmouth University, West Long Branch, NJ, US
- Hao Liu
- New Jersey Institute of Technology, Newark, NJ, US
- James Geller
- New Jersey Institute of Technology, Newark, NJ, US
9
Zheng L, Chen Y, Elhanan G, Perl Y, Geller J, Ochs C. Complex overlapping concepts: An effective auditing methodology for families of similarly structured BioPortal ontologies. J Biomed Inform 2018; 83:135-149. [PMID: 29852316 DOI: 10.1016/j.jbi.2018.05.015] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.7] [Received: 10/10/2017] [Revised: 05/25/2018] [Accepted: 05/26/2018] [Indexed: 11/30/2022]
Abstract
In previous research, we have demonstrated for a number of ontologies that structurally complex concepts (for different definitions of "complex") in an ontology are more likely to exhibit errors than other concepts. Thus, such complex concepts often become fertile ground for quality assurance (QA) in ontologies. They should be audited first. One example of complex concepts is given by "overlapping concepts" (to be defined below.) Historically, a different auditing methodology had to be developed for every single ontology. For better scalability and efficiency, it is desirable to identify family-wide QA methodologies. Each such methodology would be applicable to a whole family of similar ontologies. In past research, we had divided the 685 ontologies of BioPortal into families of structurally similar ontologies. We showed for four ontologies of the same large family in BioPortal that "overlapping concepts" are indeed statistically significantly more likely to exhibit errors. In order to make an authoritative statement concerning the success of "overlapping concepts" as a methodology for a whole family of similar ontologies (or of large subhierarchies of ontologies), it is necessary to show that "overlapping concepts" have a higher likelihood of errors for six out of six ontologies of the family. In this paper, we are demonstrating for two more ontologies that "overlapping concepts" can successfully predict groups of concepts with a higher error rate than concepts from a control group. The fifth ontology is the Neoplasm subhierarchy of the National Cancer Institute thesaurus (NCIt). The sixth ontology is the Infectious Disease subhierarchy of SNOMED CT. We demonstrate quality assurance results for both of them. Furthermore, in this paper we observe two novel, important, and useful phenomena during quality assurance of "overlapping concepts." First, an erroneous "overlapping concept" can help with discovering other erroneous "non-overlapping concepts" in its vicinity. 
Secondly, correcting erroneous "overlapping concepts" may turn them into "non-overlapping concepts." We demonstrate that this may reduce the complexity of parts of the ontology, which in turn makes the ontology more comprehensible, simplifying maintenance and use of the ontology.
Affiliation(s)
- Ling Zheng
- Department of Computer Science, New Jersey Institute of Technology, Newark, NJ 07102, United States.
- Yan Chen
- CIS Department, Borough of Manhattan Community College, CUNY, NY 10007, United States
- Gai Elhanan
- Applied Innovation Center, Desert Research Institute, Reno, NV 89512, United States
- Yehoshua Perl
- Department of Computer Science, New Jersey Institute of Technology, Newark, NJ 07102, United States
- James Geller
- Department of Computer Science, New Jersey Institute of Technology, Newark, NJ 07102, United States
10
Cui L, Zhu W, Tao S, Case JT, Bodenreider O, Zhang GQ. Mining non-lattice subgraphs for detecting missing hierarchical relations and concepts in SNOMED CT. J Am Med Inform Assoc 2018; 24:788-798. [PMID: 28339775 PMCID: PMC6080685 DOI: 10.1093/jamia/ocw175] [Citation(s) in RCA: 30] [Impact Index Per Article: 5.0] [Received: 06/27/2016] [Accepted: 12/03/2016] [Indexed: 11/14/2022]
Abstract
Objective Quality assurance of large ontological systems such as SNOMED CT is an indispensable part of the terminology management lifecycle. We introduce a hybrid structural-lexical method for scalable and systematic discovery of missing hierarchical relations and concepts in SNOMED CT. Material and Methods All non-lattice subgraphs (the structural part) in SNOMED CT are exhaustively extracted using a scalable MapReduce algorithm. Four lexical patterns (the lexical part) are identified among the extracted non-lattice subgraphs. Non-lattice subgraphs exhibiting such lexical patterns are often indicative of missing hierarchical relations or concepts. Each lexical pattern is associated with a potential specific type of error. Results Applying the structural-lexical method to SNOMED CT (September 2015 US edition), we found 6801 non-lattice subgraphs that matched these lexical patterns, of which 2046 were amenable to visual inspection. We evaluated a random sample of 100 small subgraphs, of which 59 were reviewed in detail by domain experts. All the subgraphs reviewed contained errors confirmed by the experts. The most frequent type of error was missing is-a relations due to incomplete or inconsistent modeling of the concepts. Conclusions Our hybrid structural-lexical method is innovative and proved effective not only in detecting errors in SNOMED CT, but also in suggesting remediation for these errors.
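The abstract does not define "non-lattice" itself. As a rough illustration, the Python sketch below assumes the usual lattice-theoretic reading: two concepts form a non-lattice pair when they share more than one maximal common descendant in the IS-A hierarchy. The function names and the toy graphs are hypothetical, and the sketch makes no attempt at the MapReduce-scale extraction the paper describes.

```python
def descendants(concept, children):
    """Collect every descendant of `concept` by walking IS-A edges downward."""
    seen, stack = set(), list(children.get(concept, ()))
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(children.get(node, ()))
    return seen


def maximal_common_descendants(a, b, children):
    """Common descendants of a and b that lie below no other common descendant."""
    common = descendants(a, children) & descendants(b, children)
    dominated = set()
    for concept in common:
        dominated |= descendants(concept, children)
    return common - dominated


def is_non_lattice_pair(a, b, children):
    """True when the pair has more than one maximal common descendant."""
    return len(maximal_common_descendants(a, b, children)) > 1


# Toy hierarchies (illustrative only): parent -> list of direct children.
NON_LATTICE = {"A": ["C", "D"], "B": ["C", "D"], "C": ["E"], "D": ["E"]}
LATTICE = {"A": ["C"], "B": ["C"], "C": ["D", "E"]}

print(is_non_lattice_pair("A", "B", NON_LATTICE))  # True: C and D are both maximal
print(is_non_lattice_pair("A", "B", LATTICE))      # False: C is the only maximal one
```

In the first toy graph, the fragment spanned by A, B, C and D is the kind of structure that would then be checked against the lexical patterns.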
Affiliation(s)
- Licong Cui
- Department of Computer Science, University of Kentucky, Lexington, KY, USA; Institute for Biomedical Informatics, University of Kentucky
- Wei Zhu
- Institute for Biomedical Informatics, University of Kentucky
- Shiqiang Tao
- Institute for Biomedical Informatics, University of Kentucky; Division of Biomedical Informatics, College of Medicine, University of Kentucky
- Guo-Qiang Zhang
- Institute for Biomedical Informatics, University of Kentucky; Division of Biomedical Informatics, College of Medicine, University of Kentucky
11
Cui L, Bodenreider O, Shi J, Zhang GQ. Auditing SNOMED CT hierarchical relations based on lexical features of concepts in non-lattice subgraphs. J Biomed Inform 2018; 78:177-184. [PMID: 29274386 PMCID: PMC5835197 DOI: 10.1016/j.jbi.2017.12.010] [Citation(s) in RCA: 16] [Impact Index Per Article: 2.7] [Received: 10/16/2017] [Revised: 12/18/2017] [Accepted: 12/19/2017] [Indexed: 11/19/2022]
Abstract
OBJECTIVE We introduce a structural-lexical approach for auditing SNOMED CT using a combination of non-lattice subgraphs of the underlying hierarchical relations and enriched lexical attributes of fully specified concept names. Our goal is to develop a scalable and effective approach that automatically identifies missing hierarchical IS-A relations. METHODS Our approach involves 3 stages. In stage 1, all non-lattice subgraphs of SNOMED CT's IS-A hierarchical relations are extracted. In stage 2, lexical attributes of fully-specified concept names in such non-lattice subgraphs are extracted. For each concept in a non-lattice subgraph, we enrich its set of attributes with attributes from its ancestor concepts within the non-lattice subgraph. In stage 3, subset inclusion relations between the lexical attribute sets of each pair of concepts in each non-lattice subgraph are compared to existing IS-A relations in SNOMED CT. For concept pairs within each non-lattice subgraph, if a subset relation is identified but an IS-A relation is not present in SNOMED CT IS-A transitive closure, then a missing IS-A relation is reported. The September 2017 release of SNOMED CT (US edition) was used in this investigation. RESULTS A total of 14,380 non-lattice subgraphs were extracted, from which we suggested a total of 41,357 missing IS-A relations. For evaluation purposes, 200 non-lattice subgraphs were randomly selected from 996 smaller subgraphs (of size 4, 5, or 6) within the "Clinical Finding" and "Procedure" sub-hierarchies. Two domain experts confirmed 185 (among 223) suggested missing IS-A relations, a precision of 82.96%. CONCLUSIONS Our results demonstrate that analyzing the lexical features of concepts in non-lattice subgraphs is an effective approach for auditing SNOMED CT.
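Stages 2 and 3 described above can be sketched in a few lines of Python. This is a simplified illustration, not the authors' implementation: it assumes that lexical attributes are plain word sets, and that the concept with the larger attribute set is the candidate child. The function names, the toy concepts and their word sets are all hypothetical.

```python
from itertools import permutations


def ancestors(concept, parents):
    """Collect every ancestor of `concept` (the IS-A transitive closure upward)."""
    seen, stack = set(), list(parents.get(concept, ()))
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(parents.get(node, ()))
    return seen


def suggest_missing_isa(subgraph, parents, words):
    """Return sorted (child, parent) pairs whose word sets nest without an IS-A path."""
    closure = {c: ancestors(c, parents) for c in subgraph}
    # Stage 2 (simplified): enrich each concept with the words of its
    # ancestors that also belong to the subgraph.
    enriched = {
        c: set(words[c]).union(*(words[a] for a in closure[c] if a in subgraph))
        for c in subgraph
    }
    suggestions = []
    for child, parent in permutations(subgraph, 2):
        # Stage 3: proper subset of attributes, but no IS-A in the closure.
        if enriched[parent] < enriched[child] and parent not in closure[child]:
            suggestions.append((child, parent))
    return sorted(suggestions)


# Toy data (illustrative only, not actual SNOMED CT content).
TOY_WORDS = {
    "infection": {"infection"},
    "foot infection": {"foot", "infection"},
    "bone infection": {"bone", "infection"},
    "bone infection of foot": {"bone", "foot", "infection"},
}
TOY_PARENTS = {
    "foot infection": ["infection"],
    "bone infection": ["infection"],
    "bone infection of foot": ["bone infection"],  # link to "foot infection" absent
}

print(suggest_missing_isa(list(TOY_WORDS), TOY_PARENTS, TOY_WORDS))
# [('bone infection of foot', 'foot infection')]
```

Each suggestion is only a candidate. In the study, domain experts reviewed the suggested relations, which is where the reported precision of 185/223 (82.96%) comes from.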
Affiliation(s)
- Licong Cui
- Department of Computer Science, University of Kentucky, Lexington, KY, USA; Institute for Biomedical Informatics, University of Kentucky, Lexington, KY, USA.
- Jay Shi
- Department of Internal Medicine, University of Kentucky, Lexington, KY, USA
- Guo-Qiang Zhang
- Department of Computer Science, University of Kentucky, Lexington, KY, USA; Institute for Biomedical Informatics, University of Kentucky, Lexington, KY, USA; Department of Internal Medicine, University of Kentucky, Lexington, KY, USA
12
Callahan TJ, Baumgartner WA, Bada M, Stefanski AL, Tripodi I, White EK, Hunter LE. OWL-NETS: Transforming OWL Representations for Improved Network Inference. PACIFIC SYMPOSIUM ON BIOCOMPUTING. PACIFIC SYMPOSIUM ON BIOCOMPUTING 2018; 23:133-144. [PMID: 29218876 PMCID: PMC5737627] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/07/2023]
Abstract
Our knowledge of the biological mechanisms underlying complex human disease is largely incomplete. While Semantic Web technologies, such as the Web Ontology Language (OWL), provide powerful techniques for representing existing knowledge, well-established OWL reasoners are unable to account for missing or uncertain knowledge. The application of inductive inference methods, like machine learning and network inference, is vital for extending our current knowledge. Therefore, robust methods which facilitate inductive inference on rich OWL-encoded knowledge are needed. Here, we propose OWL-NETS (NEtwork Transformation for Statistical learning), a novel computational method that reversibly abstracts OWL-encoded biomedical knowledge into a network representation tailored for network inference. Using several examples built with the Open Biomedical Ontologies, we show that OWL-NETS can leverage existing ontology-based knowledge representations and network inference methods to generate novel, biologically-relevant hypotheses. Further, the lossless transformation of OWL-NETS allows for seamless integration of inferred edges back into the original knowledge base, extending its coverage and completeness.
Affiliation(s)
- Tiffany J Callahan
- Computational Bioscience Program, University of Colorado Denver Anschutz Medical Campus, Aurora, CO 80045, USA
13
Taxonomy-Based Approaches to Quality Assurance of Ontologies. JOURNAL OF HEALTHCARE ENGINEERING 2017; 2017:3495723. [PMID: 29158885 PMCID: PMC5660792 DOI: 10.1155/2017/3495723] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.4] [Received: 05/07/2017] [Accepted: 08/06/2017] [Indexed: 11/17/2022]
Abstract
Ontologies are important components of health information management systems. As such, the quality of their content is of paramount importance. It has been proven to be practical to develop quality assurance (QA) methodologies based on automated identification of sets of concepts expected to have higher likelihood of errors. Four kinds of such sets (called QA-sets) organized around the themes of complex and uncommonly modeled concepts are introduced. A survey of different methodologies based on these QA-sets and the results of applying them to various ontologies are presented. Overall, following these approaches leads to higher QA yields and better utilization of QA personnel. The formulation of additional QA-set methodologies will further enhance the suite of available ontology QA tools.
|
14
|
Zheng L, Yumak H, Chen L, Ochs C, Geller J, Kapusnik-Uner J, Perl Y. Quality assurance of chemical ingredient classification for the National Drug File - Reference Terminology. J Biomed Inform 2017; 73:30-42. [PMID: 28723580 DOI: 10.1016/j.jbi.2017.07.013] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.6] [Received: 01/27/2017] [Revised: 07/13/2017] [Accepted: 07/14/2017] [Indexed: 02/04/2023]
Abstract
The National Drug File - Reference Terminology (NDF-RT) is a large and complex drug terminology consisting of several classification hierarchies on top of an extensive collection of drug concepts. These hierarchies provide important information about clinical drugs, e.g., their chemical ingredients, mechanisms of action, dosage form and physiological effects. Within NDF-RT such information is represented using tens of thousands of roles connecting drugs to classifications. In previous studies, we have introduced various kinds of Abstraction Networks to summarize the content and structure of terminologies in order to facilitate their visual comprehension, and support quality assurance of terminologies. However, these previous kinds of Abstraction Networks are not appropriate for summarizing the NDF-RT classification hierarchies, due to its unique structure. In this paper, we present the novel Ingredient Abstraction Network (IAbN) to summarize, visualize and support the audit of NDF-RT's Chemical Ingredients hierarchy and its associated drugs. A common theme in our quality assurance framework is to use characterizations of sets of concepts, revealed by the Abstraction Network structure, to capture concepts, the modeling of which is more complex than for other concepts. For the IAbN, we characterize drug ingredient concepts as more complex if they belong to IAbN groups with multiple parent groups. We show that such concepts have a statistically significantly higher rate of errors than a control sample and identify two especially common patterns of errors.
Affiliation(s)
- Ling Zheng: Department of Computer Science, New Jersey Institute of Technology, Newark, NJ 07102, United States
- Hasan Yumak: BMCC, CUNY, New York, NY 10007, United States
- Ling Chen: BMCC, CUNY, New York, NY 10007, United States
- Christopher Ochs: Department of Computer Science, New Jersey Institute of Technology, Newark, NJ 07102, United States
- James Geller: Department of Computer Science, New Jersey Institute of Technology, Newark, NJ 07102, United States
- Yehoshua Perl: Department of Computer Science, New Jersey Institute of Technology, Newark, NJ 07102, United States
|
15
|
Elhanan G, Ochs C, Mejino JLV, Liu H, Mungall CJ, Perl Y. From SNOMED CT to Uberon: Transferability of evaluation methodology between similarly structured ontologies. Artif Intell Med 2017; 79:9-14. [PMID: 28532962 DOI: 10.1016/j.artmed.2017.05.002] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.0] [Received: 09/13/2016] [Revised: 05/03/2017] [Accepted: 05/04/2017] [Indexed: 12/29/2022]
Abstract
OBJECTIVE To examine whether disjoint partial-area taxonomy, a semantically-based evaluation methodology that has been successfully tested in SNOMED CT, will perform with similar effectiveness on Uberon, an anatomical ontology that belongs to a structurally similar family of ontologies as SNOMED CT. METHOD A disjoint partial-area taxonomy was generated for Uberon. One hundred randomly selected test concepts that overlap between partial-areas were matched to a same size control sample of non-overlapping concepts. The samples were blindly inspected for non-critical issues and presumptive errors first by a general domain expert whose results were then confirmed or rejected by a highly experienced anatomical ontology domain expert. Reported issues were subsequently reviewed by Uberon's curators. RESULTS Overlapping concepts in Uberon's disjoint partial-area taxonomy exhibited a significantly higher rate of all issues. Clear-cut presumptive errors trended similarly but did not reach statistical significance. A sub-analysis of overlapping concepts with three or more relationship types indicated a much higher rate of issues. CONCLUSIONS Overlapping concepts from Uberon's disjoint abstraction network are quite likely (up to 28.9%) to exhibit issues. The results suggest that the methodology can transfer well between same family ontologies. Although Uberon exhibited relatively few overlapping concepts, the methodology can be combined with other semantic indicators to expand the process to other concepts within the ontology that will generate high yields of discovered issues.
Affiliation(s)
- Gai Elhanan: Computer Science Department, New Jersey Institute of Technology, Newark, NJ, USA
- Christopher Ochs: Computer Science Department, New Jersey Institute of Technology, Newark, NJ, USA
- Jose L V Mejino: Department of Biological Structure (Structural Informatics Group), University of Washington, Seattle, WA, USA
- Hao Liu: Computer Science Department, New Jersey Institute of Technology, Newark, NJ, USA
- Yehoshua Perl: Computer Science Department, New Jersey Institute of Technology, Newark, NJ, USA
|
16
|
Min H, Zheng L, Perl Y, Halper M, De Coronado S, Ochs C. Relating Complexity and Error Rates of Ontology Concepts. More Complex NCIt Concepts Have More Errors. Methods Inf Med 2017; 56:200-208. [PMID: 28244549 DOI: 10.3414/me16-01-0085] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.1] [Received: 07/13/2016] [Accepted: 01/19/2017] [Indexed: 11/09/2022]
Abstract
OBJECTIVES Ontologies are knowledge structures that lend support to many health-information systems. A study is carried out to assess the quality of ontological concepts based on a measure of their complexity. The results show a relation between complexity of concepts and error rates of concepts. METHODS A measure of lateral complexity defined as the number of exhibited role types is used to distinguish between more complex and simpler concepts. Using a framework called an area taxonomy, a kind of abstraction network that summarizes the structural organization of an ontology, concepts are divided into two groups along these lines. Various concepts from each group are then subjected to a two-phase QA analysis to uncover and verify errors and inconsistencies in their modeling. A hierarchy of the National Cancer Institute thesaurus (NCIt) is used as our test-bed. A hypothesis pertaining to the expected error rates of the complex and simple concepts is tested. RESULTS Our study was done on the NCIt's Biological Process hierarchy. Various errors, including missing roles, incorrect role targets, and incorrectly assigned roles, were discovered and verified in the two phases of our QA analysis. The overall findings confirmed our hypothesis by showing a statistically significant difference between the amounts of errors exhibited by more laterally complex concepts vis-à-vis simpler concepts. CONCLUSIONS QA is an essential part of any ontology's maintenance regimen. In this paper, we reported on the results of a QA study targeting two groups of ontology concepts distinguished by their level of complexity, defined in terms of the number of exhibited role types. The study was carried out on a major component of an important ontology, the NCIt. The findings suggest that more complex concepts tend to have a higher error rate than simpler concepts. These findings can be utilized to guide ongoing efforts in ontology QA.
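As a rough illustration of the lateral complexity measure described in this abstract (the number of role types a concept exhibits), the sketch below counts distinct role types per concept and splits concepts into a simpler and a more complex group. The concept names, role names, and the threshold of 2 are hypothetical; the abstract does not state the cutoff used in the study, and nothing here is taken from NCIt.

```python
from typing import Dict, List, Tuple

# Hypothetical toy data: concept -> list of (role type, target) pairs.
# Names are illustrative only and do not come from NCIt.
ROLES: Dict[str, List[Tuple[str, str]]] = {
    "Process A": [("has_location", "Nucleus")],
    "Process B": [
        ("has_location", "Cytoplasm"),
        ("has_result", "Product X"),
        ("has_location", "Membrane"),  # repeated role type, counted once
    ],
    "Process C": [],
}


def lateral_complexity(roles: List[Tuple[str, str]]) -> int:
    """Number of distinct role types exhibited by a concept."""
    return len({role_type for role_type, _target in roles})


def split_by_complexity(
    concept_roles: Dict[str, List[Tuple[str, str]]],
    threshold: int = 2,  # assumed cutoff, not stated in the abstract
) -> Tuple[List[str], List[str]]:
    """Return (simpler concepts, more complex concepts), each sorted by name."""
    simple, complex_ = [], []
    for concept, roles in concept_roles.items():
        if lateral_complexity(roles) >= threshold:
            complex_.append(concept)
        else:
            simple.append(concept)
    return sorted(simple), sorted(complex_)


simple_group, complex_group = split_by_complexity(ROLES)
print("simpler:", simple_group)
print("more complex:", complex_group)
```

In a QA setting along the lines of the abstract, samples from each group would then be reviewed so their error rates can be compared.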
Affiliation(s)
- Hua Min: Department of Health Administration and Policy, College of Health and Human Services, George Mason University, MS: 1J3, 4400 University Drive, Fairfax, VA 22030-4444, USA
|
17
|
Ochs C, Case JT, Perl Y. Analyzing structural changes in SNOMED CT's Bacterial infectious diseases using a visual semantic delta. J Biomed Inform 2017; 67:101-116. [PMID: 28215561 DOI: 10.1016/j.jbi.2017.02.006] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.4] [Received: 11/29/2016] [Revised: 02/08/2017] [Accepted: 02/09/2017] [Indexed: 12/23/2022]
Abstract
Thousands of changes are applied to SNOMED CT's concepts during each release cycle. These changes are the result of efforts to improve or expand the coverage of health domains in the terminology. Understanding which concepts changed, how they changed, and the overall impact of a set of changes is important for editors and end users. Each SNOMED CT release comes with delta files, which identify all of the individual additions and removals of concepts and relationships. These files typically contain tens of thousands of individual entries, overwhelming users. They also do not identify the editorial processes that were applied to individual concepts and they do not capture the overall impact of a set of changes on a subhierarchy of concepts. In this paper we introduce a methodology and accompanying software tool called a SNOMED CT Visual Semantic Delta ("semantic delta" for short) to enable a comprehensive review of changes in SNOMED CT. The semantic delta displays a graphical list of editing operations that provides semantics and context to the additions and removals in the delta files. However, there may still be thousands of editing operations applied to a set of concepts. To address this issue, a semantic delta includes a visual summary of changes that affected sets of structurally and semantically similar concepts. The software tool for creating semantic deltas offers views of various granularities, allowing a user to control how much change information they view. In this tool a user can select a set of structurally and semantically similar concepts and review the editing operations that affected their modeling. The semantic delta methodology is demonstrated on SNOMED CT's Bacterial infectious disease subhierarchy, which has undergone a significant remodeling effort over the last two years.
Affiliation(s)
- Christopher Ochs: Computer Science Department, New Jersey Institute of Technology, University Heights, Newark, NJ 07102, USA
- James T Case: National Library of Medicine/National Institutes of Health, Bethesda, MD 20894, USA
- Yehoshua Perl: Computer Science Department, New Jersey Institute of Technology, University Heights, Newark, NJ 07102, USA
|
18
|
López-García P, Schulz S. Structural Patterns under X-Rays: Is SNOMED CT Growing Straight? PLoS One 2016; 11:e0165619. [PMID: 27812127 PMCID: PMC5094788 DOI: 10.1371/journal.pone.0165619] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.4] [Received: 08/12/2015] [Accepted: 10/15/2016] [Indexed: 11/18/2022] Open
Abstract
Unprincipled modeling decisions in large-domain ontologies, such as SNOMED CT, are problematic and might act as a barrier for their quality assurance and successful use in electronic health records. Most previous work has focused on clustering problematic concepts, which is helpful for quality control but faces difficulties in pinpointing the origin of those modeling problems. In this study, we examined the underlying structural patterns in SNOMED CT's data model as such patterns directly reflect the modeling strategies of editors. Our results showed that 92% of all structural patterns found accumulated in the Procedure and Clinical finding sub-hierarchies, and pattern reuse was low; over 30% of patterns were only used once. A qualitative analysis of a sample of 50 such singleton patterns revealed modeling problems, including redundancy, omission, and inconsistency. The problems detected in the sample suggest that the analysis of structural patterns is a valuable technique for revealing problematic areas of SNOMED CT and modeling the styles of terminology editors. Furthermore, the patterns that describe the modeling of a large number of concepts could provide insights for template creation and refinement in SNOMED CT.
Affiliation(s)
- Pablo López-García: Institute for Medical Informatics, Statistics and Documentation, Medical University of Graz, Graz, Austria
- Stefan Schulz: Institute for Medical Informatics, Statistics and Documentation, Medical University of Graz, Graz, Austria
|
19
|
Perl Y, Geller J, Halper M, Ochs C, Zheng L, Kapusnik-Uner J. Introducing the Big Knowledge to Use (BK2U) challenge. Ann N Y Acad Sci 2016; 1387:12-24. [PMID: 27750400 DOI: 10.1111/nyas.13225] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.6] [Received: 04/27/2016] [Revised: 07/07/2016] [Accepted: 08/11/2016] [Indexed: 12/26/2022]
Abstract
The purpose of the Big Data to Knowledge initiative is to develop methods for discovering new knowledge from large amounts of data. However, if the resulting knowledge is so large that it resists comprehension, referred to here as Big Knowledge (BK), how can it be used properly and creatively? We call this secondary challenge, Big Knowledge to Use. Without a high-level mental representation of the kinds of knowledge in a BK knowledgebase, effective or innovative use of the knowledge may be limited. We describe summarization and visualization techniques that capture the big picture of a BK knowledgebase, possibly created from Big Data. In this research, we distinguish between assertion BK and rule-based BK (rule BK) and demonstrate the usefulness of summarization and visualization techniques of assertion BK for clinical phenotyping. As an example, we illustrate how a summary of many intracranial bleeding concepts can improve phenotyping, compared to the traditional approach. We also demonstrate the usefulness of summarization and visualization techniques of rule BK for drug-drug interaction discovery.
Affiliation(s)
- Michael Halper: Information Technology Department, New Jersey Institute of Technology, Newark, New Jersey
|
20
|
López-García P, Schulz S. Can SNOMED CT be squeezed without losing its shape? J Biomed Semantics 2016; 7:56. [PMID: 27655655 PMCID: PMC5031277 DOI: 10.1186/s13326-016-0101-1] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.8] [Received: 03/05/2016] [Accepted: 09/09/2016] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND In biomedical applications where the size and complexity of SNOMED CT become problematic, using a smaller subset that can act as a reasonable substitute is usually preferred. In a special class of use cases (such as ontology-based quality assurance, or scaling experiments for real-time performance), it is essential that modules show a shape similar to that of SNOMED CT in terms of concept distribution per sub-hierarchy. Exactly how to extract such balanced modules remains unclear, as most previous work on ontology modularization has focused on other problems. In this study, we investigate to what extent extracting balanced modules that preserve the original shape of SNOMED CT is possible, by presenting and evaluating an iterative algorithm. METHODS We used a graph-traversal modularization approach based on an input signature. To conform to our definition of a balanced module, we implemented an iterative algorithm that carefully bootstrapped and dynamically adjusted the signature at each step. We measured the error for each sub-hierarchy and defined convergence as a residual sum of squares <1. RESULTS Using 2000 concepts as an initial signature, our algorithm converged after seven iterations and extracted a module 4.7% the size of SNOMED CT. Seven sub-hierarchies were either over- or under-represented within a range of 1-8%. CONCLUSIONS Our study shows that balanced modules from large terminologies can be extracted using ontology graph-traversal modularization techniques under certain conditions: that the process is repeated a number of times, the input signature is dynamically adjusted in each iteration, and a moderate under/over-representation of some hierarchies is tolerated. In the case of SNOMED CT, our results conclusively show that it can be squeezed to less than 5% of its size without any sub-hierarchy losing its shape by more than 8%, which is likely sufficient in most use cases.
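The convergence test mentioned in this abstract (a residual sum of squares below 1 over per-sub-hierarchy errors) can be sketched as follows. This is only an illustration under stated assumptions: the abstract does not say how the per-sub-hierarchy error was computed, so the sketch assumes it is the difference, in percentage points, between a sub-hierarchy's share of concepts in the module and its share in the full terminology. The function names and concept counts are hypothetical.

```python
from typing import Dict


def shares(counts: Dict[str, int]) -> Dict[str, float]:
    """Percentage of concepts that fall into each sub-hierarchy."""
    total = sum(counts.values())
    if total == 0:
        return {name: 0.0 for name in counts}
    return {name: 100.0 * n / total for name, n in counts.items()}


def residual_sum_of_squares(full: Dict[str, int], module: Dict[str, int]) -> float:
    """Sum of squared share differences (assumed error definition)."""
    full_pct = shares(full)
    module_pct = shares(module)
    return sum(
        (module_pct.get(name, 0.0) - pct) ** 2 for name, pct in full_pct.items()
    )


def has_converged(full: Dict[str, int], module: Dict[str, int], tol: float = 1.0) -> bool:
    """True when the module's shape is within the tolerance (RSS < tol)."""
    return residual_sum_of_squares(full, module) < tol


# Hypothetical concept counts, not real SNOMED CT figures.
full_counts = {"Procedure": 500, "Clinical finding": 500}
balanced = {"Procedure": 25, "Clinical finding": 25}
skewed = {"Procedure": 40, "Clinical finding": 10}

print(has_converged(full_counts, balanced))
print(has_converged(full_counts, skewed))
```

An iterative extraction loop in the spirit of the abstract would recompute this check after each adjustment of the input signature and stop once it passes.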
Affiliation(s)
- Pablo López-García: Institute for Medical Informatics, Statistics and Documentation, Medical University of Graz, Auenbruggerplatz 2, Graz, Austria
- Stefan Schulz: Institute for Medical Informatics, Statistics and Documentation, Medical University of Graz, Auenbruggerplatz 2, Graz, Austria
|
21
|
Hernández-Chan GS, Ceh-Varela EE, Sanchez-Cervantes JL, Villanueva-Escalante M, Rodríguez-González A, Pérez-Gallardo Y. Collective intelligence in medical diagnosis systems: A case study. Comput Biol Med 2016; 74:45-53. [DOI: 10.1016/j.compbiomed.2016.04.016] [Citation(s) in RCA: 17] [Impact Index Per Article: 2.1] [Received: 06/29/2015] [Revised: 04/26/2016] [Accepted: 04/26/2016] [Indexed: 11/26/2022]
|
22
|
Ochs C, Geller J, Perl Y, Musen MA. A unified software framework for deriving, visualizing, and exploring abstraction networks for ontologies. J Biomed Inform 2016; 62:90-105. [PMID: 27345947 DOI: 10.1016/j.jbi.2016.06.008] [Citation(s) in RCA: 20] [Impact Index Per Article: 2.5] [Received: 04/07/2016] [Revised: 06/02/2016] [Accepted: 06/22/2016] [Indexed: 11/27/2022]
Abstract
Software tools play a critical role in the development and maintenance of biomedical ontologies. One important task that is difficult without software tools is ontology quality assurance. In previous work, we have introduced different kinds of abstraction networks to provide a theoretical foundation for ontology quality assurance tools. Abstraction networks summarize the structure and content of ontologies. One kind of abstraction network that we have used repeatedly to support ontology quality assurance is the partial-area taxonomy. It summarizes structurally and semantically similar concepts within an ontology. However, the use of partial-area taxonomies was ad hoc and not generalizable. In this paper, we describe the Ontology Abstraction Framework (OAF), a unified framework and software system for deriving, visualizing, and exploring partial-area taxonomy abstraction networks. The OAF includes support for various ontology representations (e.g., OWL and SNOMED CT's relational format). A Protégé plugin for deriving "live partial-area taxonomies" is demonstrated.
Affiliation(s)
- Christopher Ochs: Computer Science Department, New Jersey Institute of Technology, Newark, NJ 07102, USA
- James Geller: Computer Science Department, New Jersey Institute of Technology, Newark, NJ 07102, USA
- Yehoshua Perl: Computer Science Department, New Jersey Institute of Technology, Newark, NJ 07102, USA
- Mark A Musen: Stanford Center for Biomedical Informatics Research, Stanford University, Stanford, CA 94305, USA
|
23
|
Utilizing a structural meta-ontology for family-based quality assurance of the BioPortal ontologies. J Biomed Inform 2016; 61:63-76. [PMID: 26988001 DOI: 10.1016/j.jbi.2016.03.007] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.9] [Received: 12/17/2015] [Revised: 02/05/2016] [Accepted: 03/04/2016] [Indexed: 11/22/2022]
Abstract
An Abstraction Network is a compact summary of an ontology's structure and content. In previous research, we showed that Abstraction Networks support quality assurance (QA) of biomedical ontologies. The development of an Abstraction Network and its associated QA methodologies, however, is a labor-intensive process that previously was applicable only to one ontology at a time. To improve the efficiency of the Abstraction-Network-based QA methodology, we introduced a QA framework that uses uniform Abstraction Network derivation techniques and QA methodologies that are applicable to whole families of structurally similar ontologies. For the family-based framework to be successful, it is necessary to develop a method for classifying ontologies into structurally similar families. We now describe a structural meta-ontology that classifies ontologies according to certain structural features that are commonly used in the modeling of ontologies (e.g., object properties) and that are important for Abstraction Network derivation. Each class of the structural meta-ontology represents a family of ontologies with identical structural features, indicating which types of Abstraction Networks and QA methodologies are potentially applicable to all of the ontologies in the family. We derive a collection of 81 families, corresponding to classes of the structural meta-ontology, that enable a flexible, streamlined family-based QA methodology, offering multiple choices for classifying an ontology. The structure of 373 ontologies from the NCBO BioPortal is analyzed and each ontology is classified into multiple families modeled by the structural meta-ontology.
|
24
|
Ochs C, Zheng L, Gu H, Perl Y, Geller J, Kapusnik-Uner J, Zakharchenko A. Drug-drug Interaction Discovery Using Abstraction Networks for "National Drug File - Reference Terminology" Chemical Ingredients. AMIA ... ANNUAL SYMPOSIUM PROCEEDINGS. AMIA SYMPOSIUM 2015; 2015:973-982. [PMID: 26958234 PMCID: PMC4765653] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/05/2023]
Abstract
The National Drug File - Reference Terminology (NDF-RT) is a large and complex drug terminology. NDF-RT provides important information about clinical drugs, e.g., their chemical ingredients, mechanisms of action, dosage form and physiological effects. Within NDF-RT such information is represented using tens of thousands of roles. It is difficult to comprehend large, complex terminologies like NDF-RT. In previous studies, we introduced abstraction networks to summarize the content and structure of terminologies. In this paper, we introduce the Ingredient Abstraction Network to summarize NDF-RT's Chemical Ingredients and their associated drugs. Additionally, we introduce the Aggregate Ingredient Abstraction Network, for controlling the granularity of summarization provided by the Ingredient Abstraction Network. The Ingredient Abstraction Network is used to support the discovery of new candidate drug-drug interactions (DDIs) not appearing in First Databank, Inc.'s DDI knowledgebase.
Affiliation(s)
- Ling Zheng: New Jersey Institute of Technology, Newark, NJ
- Huanying Gu: New York Institute of Technology, New York, NY
|
25
|
Wei D, Helen Gu H, Perl Y, Halper M, Ochs C, Elhanan G, Chen Y. Structural measures to track the evolution of SNOMED CT hierarchies. J Biomed Inform 2015; 57:278-87. [PMID: 26260003 DOI: 10.1016/j.jbi.2015.08.001] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Received: 03/31/2015] [Revised: 08/01/2015] [Accepted: 08/01/2015] [Indexed: 11/28/2022]
Abstract
The Systematized Nomenclature of Medicine Clinical Terms (SNOMED CT) is an extensive reference terminology with an attendant amount of complexity. It has been updated continuously and revisions have been released semi-annually to meet users' needs and to reflect the results of quality assurance (QA) activities. Two measures based on structural features are proposed to track the effects of both natural terminology growth and QA activities based on aspects of the complexity of SNOMED CT. These two measures, called the structural density measure and accumulated structural measure, are derived based on two abstraction networks, the area taxonomy and the partial-area taxonomy. The measures derive from attribute relationship distributions and various concept groupings that are associated with the abstraction networks. They are used to track the trends in the complexity of structures as SNOMED CT changes over time. The measures were calculated for consecutive releases of five SNOMED CT hierarchies, including the Specimen hierarchy. The structural density measure shows that natural growth tends to move a hierarchy's structure toward a more complex state, whereas the accumulated structural measure shows that QA processes tend to move a hierarchy's structure toward a less complex state. It is also observed that both the structural density and accumulated structural measures are useful tools to track the evolution of an entire SNOMED CT hierarchy and reveal internal concept migration within it.
Affiliation(s)
- Duo Wei: Computer Science and Information Systems-BUSN, Stockton University, Galloway, NJ 08205, United States
- Huanying Helen Gu: Computer Science Dept., New York Institute of Technology, New York, NY 10023, United States
- Yehoshua Perl: Computer Science Dept., New Jersey Institute of Technology, Newark, NJ 07102, United States
- Michael Halper: Information Technology Dept., New Jersey Institute of Technology, Newark, NJ 07102, United States
- Christopher Ochs: Computer Science Dept., New Jersey Institute of Technology, Newark, NJ 07102, United States
- Gai Elhanan: Computer Science Dept., New Jersey Institute of Technology, Newark, NJ 07102, United States; Halfpenny Technologies Inc., Blue Bell, PA 19422, United States
- Yan Chen: Computer Information Systems Dept., BMCC, CUNY, New York, NY 10007, United States
|
26
|
Halper M, Gu H, Perl Y, Ochs C. Abstraction networks for terminologies: Supporting management of "big knowledge". Artif Intell Med 2015; 64:1-16. [PMID: 25890687 DOI: 10.1016/j.artmed.2015.03.005] [Citation(s) in RCA: 40] [Impact Index Per Article: 4.4] [Received: 11/07/2014] [Revised: 02/24/2015] [Accepted: 03/25/2015] [Indexed: 11/16/2022]
Abstract
OBJECTIVE Terminologies and terminological systems have assumed important roles in many medical information processing environments, giving rise to the "big knowledge" challenge when terminological content comprises tens of thousands to millions of concepts arranged in a tangled web of relationships. Use and maintenance of knowledge structures on that scale can be daunting. The notion of abstraction network is presented as a means of facilitating the usability, comprehensibility, visualization, and quality assurance of terminologies. METHODS AND MATERIALS An abstraction network overlays a terminology's underlying network structure at a higher level of abstraction. In particular, it provides a more compact view of the terminology's content, avoiding the display of minutiae. General abstraction network characteristics are discussed. Moreover, the notion of meta-abstraction network, existing at an even higher level of abstraction than a typical abstraction network, is described for cases where even the abstraction network itself represents a case of "big knowledge." Various features in the design of abstraction networks are demonstrated in a methodological survey of some existing abstraction networks previously developed and deployed for a variety of terminologies. RESULTS The applicability of the general abstraction-network framework is shown through use-cases of various terminologies, including the Systematized Nomenclature of Medicine - Clinical Terms (SNOMED CT), the Medical Entities Dictionary (MED), and the Unified Medical Language System (UMLS). Important characteristics of the surveyed abstraction networks are provided, e.g., the magnitude of the respective size reduction referred to as the abstraction ratio. Specific benefits of these alternative terminology-network views, particularly their use in terminology quality assurance, are discussed. Examples of meta-abstraction networks are presented. 
CONCLUSIONS The "big knowledge" challenge constitutes the use and maintenance of terminological structures that comprise tens of thousands to millions of concepts and their attendant complexity. The notion of abstraction network has been introduced as a tool in helping to overcome this challenge, thus enhancing the usefulness of terminologies. Abstraction networks have been shown to be applicable to a variety of existing biomedical terminologies, and these alternative structural views hold promise for future expanded use with additional terminologies.
Affiliation(s)
- Michael Halper: Information Technology Department, New Jersey Institute of Technology, Newark, NJ 07102, USA
- Huanying Gu: Computer Science Department, New York Institute of Technology, New York, NY 10023, USA
- Yehoshua Perl: Computer Science Department, New Jersey Institute of Technology, Newark, NJ 07102, USA
- Christopher Ochs: Computer Science Department, New Jersey Institute of Technology, Newark, NJ 07102, USA
|
27
|
Ochs C, Geller J, Perl Y, Chen Y, Xu J, Min H, Case JT, Wei Z. Scalable quality assurance for large SNOMED CT hierarchies using subject-based subtaxonomies. J Am Med Inform Assoc 2014; 22:507-18. [PMID: 25336594 DOI: 10.1136/amiajnl-2014-003151] [Citation(s) in RCA: 30] [Impact Index Per Article: 3.0] [Received: 07/13/2014] [Accepted: 09/27/2014] [Indexed: 11/04/2022] Open
Abstract
OBJECTIVE Standards terminologies may be large and complex, making their quality assurance challenging. Some terminology quality assurance (TQA) methodologies are based on abstraction networks (AbNs), compact terminology summaries. We have tested AbNs and the performance of related TQA methodologies on small terminology hierarchies. However, some standards terminologies, for example, SNOMED, are composed of very large hierarchies. Scaling AbN TQA techniques to such hierarchies poses a significant challenge. We present a scalable subject-based approach for AbN TQA. METHODS An innovative technique is presented for scaling TQA by creating a new kind of subject-based AbN called a subtaxonomy for large hierarchies. New hypotheses about concentrations of erroneous concepts within the AbN are introduced to guide scalable TQA. RESULTS We test the TQA methodology for a subject-based subtaxonomy for the Bleeding subhierarchy in SNOMED's large Clinical finding hierarchy. To test the error concentration hypotheses, three domain experts reviewed a sample of 300 concepts. A consensus-based evaluation identified 87 erroneous concepts. The subtaxonomy-based TQA methodology was shown to uncover statistically significantly more erroneous concepts when compared to a control sample. DISCUSSION The scalability of TQA methodologies is a challenge for large standards systems like SNOMED. We demonstrated innovative subject-based TQA techniques by identifying groups of concepts with a higher likelihood of having errors within the subtaxonomy. Scalability is achieved by reviewing a large hierarchy by subject. CONCLUSIONS An innovative methodology for scaling the derivation of AbNs and a TQA methodology was shown to perform successfully for the largest hierarchy of SNOMED.
Affiliation(s)
- Christopher Ochs: Computer Science Department, New Jersey Institute of Technology, Newark, New Jersey, USA
- James Geller: Computer Science Department, New Jersey Institute of Technology, Newark, New Jersey, USA
- Yehoshua Perl: Computer Science Department, New Jersey Institute of Technology, Newark, New Jersey, USA
- Yan Chen: Computer Information Systems Department, BMCC, CUNY, New York, New York, USA
- Junchuan Xu: Division of Knowledge Informatics, NYU, New York, New York, USA
- Hua Min: Department of Health Administration and Policy, George Mason University, Fairfax, Virginia, USA
- Zhi Wei: Computer Science Department, New Jersey Institute of Technology, Newark, New Jersey, USA
|
28
|
Ochs C, Geller J, Perl Y, Chen Y, Agrawal A, Case JT, Hripcsak G. A tribal abstraction network for SNOMED CT target hierarchies without attribute relationships. J Am Med Inform Assoc 2014; 22:628-39. [PMID: 25332354 DOI: 10.1136/amiajnl-2014-003173] [Citation(s) in RCA: 23] [Impact Index Per Article: 2.3] [Received: 07/18/2014] [Accepted: 09/20/2014] [Indexed: 11/03/2022]
Abstract
OBJECTIVE Large and complex terminologies, such as Systematized Nomenclature of Medicine-Clinical Terms (SNOMED CT), are prone to errors and inconsistencies. Abstraction networks are compact summarizations of the content and structure of a terminology. Abstraction networks have been shown to support terminology quality assurance. In this paper, we introduce an abstraction network derivation methodology which can be applied to SNOMED CT target hierarchies whose classes are defined using only hierarchical relationships (ie, without attribute relationships) and similar description-logic-based terminologies. METHODS We introduce the tribal abstraction network (TAN), based on the notion of a tribe, a subhierarchy rooted at a child of a hierarchy root, assuming only the existence of concepts with multiple parents. The TAN summarizes a hierarchy that does not have attribute relationships using sets of concepts, called tribal units, that belong to exactly the same multiple tribes. Tribal units are further divided into refined tribal units, which contain closely related concepts. A quality assurance methodology that utilizes TAN summarizations is introduced. RESULTS A TAN is derived for the Observable entity hierarchy of SNOMED CT, summarizing its content. A TAN-based quality assurance review of the concepts of the hierarchy is performed, and erroneous concepts are shown to appear more frequently in large refined tribal units than in small refined tribal units. Furthermore, more erroneous concepts appear in large refined tribal units of more tribes than of fewer tribes. CONCLUSIONS In this paper we introduce the TAN for summarizing SNOMED CT target hierarchies. A TAN was derived for the Observable entity hierarchy of SNOMED CT. A quality assurance methodology utilizing the TAN was introduced and demonstrated.
Affiliation(s)
- Christopher Ochs: Computer Science Department, New Jersey Institute of Technology, Newark, New Jersey, USA
- James Geller: Computer Science Department, New Jersey Institute of Technology, Newark, New Jersey, USA
- Yehoshua Perl: Computer Science Department, New Jersey Institute of Technology, Newark, New Jersey, USA
- Yan Chen: Computer Information Systems Department, BMCC, CUNY, New York, New York, USA
- Ankur Agrawal: Department of Computer Science, Manhattan College, Riverdale, New York, USA
- George Hripcsak: Department of Biomedical Informatics, Columbia University, New York, New York, USA
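As a rough illustration of the tribe and tribal-unit definitions in the abstract above, the sketch below groups the concepts of a toy hierarchy by the exact set of tribes they belong to. The function name `tribal_units`, the `toy_parents` dictionary, and the concept labels are assumptions made for this example; they are not taken from the paper or from SNOMED CT, and refined tribal units are not modeled.

```python
from collections import defaultdict


def tribal_units(parents, root):
    """Group concepts by the exact set of tribes they belong to.

    parents: dict mapping each concept to a list of its parents (an IS-A DAG).
    root: the hierarchy root; each child of the root heads one tribe.
    """
    memo = {}

    def tribes_of(concept):
        # A tribe is identified here by its root, i.e. a child of the hierarchy root.
        if concept in memo:
            return memo[concept]
        result = set()
        for parent in parents.get(concept, []):
            if parent == root:
                result.add(concept)            # the concept itself heads a tribe
            else:
                result |= tribes_of(parent)    # inherit tribe membership from ancestors
        memo[concept] = frozenset(result)
        return memo[concept]

    units = defaultdict(set)
    for concept in parents:
        if concept != root:
            units[tribes_of(concept)].add(concept)
    return dict(units)


# Hypothetical toy hierarchy: three tribes rooted at A, B and C.
toy_parents = {
    "Root": [],
    "A": ["Root"], "B": ["Root"], "C": ["Root"],
    "a1": ["A"],
    "ab1": ["A", "B"],      # multiple parents place ab1 in two tribes
    "ab2": ["ab1"],
    "abc": ["ab2", "C"],
}
units = tribal_units(toy_parents, "Root")
```

In this toy hierarchy, `ab1` and `ab2` share the tribes rooted at `A` and `B`, so they land in the same group. Keys containing two or more tribes correspond to the multi-tribe units the abstract focuses on.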
|
29
|
Abstract
Background The Refined Semantic Network (RSN) for the UMLS was previously introduced to complement the UMLS Semantic Network (SN). The RSN partitions the UMLS Metathesaurus (META) into disjoint groups of concepts. Each such group is semantically uniform. However, the RSN was initially an order of magnitude larger than the SN, which is undesirable since to be useful, a semantic network should be compact. Most semantic types in the RSN represent combinations of semantic types in the UMLS SN. Such a “combination semantic type” is called Intersection Semantic Type (IST). Many ISTs are assigned to very few concepts. Moreover, when reviewing those concepts, many semantic type assignment inconsistencies were found. After correcting those inconsistencies many ISTs, among them some that contradicted UMLS rules, disappeared, which made the RSN smaller. Objective The authors performed a longitudinal study with the goal of reducing the size of the RSN to become compact. This goal was achieved by correcting inconsistencies and errors in the IST assignments in the UMLS, which additionally helped identify and correct ambiguities, inconsistencies, and errors in source terminologies widely used in the realm of public health. Methods In this paper, we discuss the process and steps employed in this longitudinal study and the intermediate results for different stages. The sculpting process includes removing redundant semantic type assignments, expanding semantic type assignments, and removing illegitimate ISTs by auditing ISTs of small extents. However, the emphasis of this paper is not on the auditing methodologies employed during the process, since they were introduced in earlier publications, but on the strategy of employing them in order to transform the RSN into a compact network. For this paper we also performed a comprehensive audit of 168 “small ISTs” in the 2013AA version of the UMLS to finalize the longitudinal study. Results Over the years it was found that the editors of the UMLS introduced some new inconsistencies that resulted in the reintroduction of unwarranted ISTs that had already been eliminated as a result of their previous corrections. Because of that, the transformation of the RSN into a compact network covering all necessary categories for the UMLS was slowed down. The corrections suggested by an audit of the 2013AA version of the UMLS achieve a compact RSN of equal magnitude as the UMLS SN. The number of ISTs has been reduced to 336. We also demonstrate how auditing the semantic type assignments of UMLS concepts can expose other modeling errors in the UMLS source terminologies, e.g., SNOMED CT, LOINC, and RxNORM, that are important for health informatics. Such errors would otherwise stay hidden. Conclusions It is hoped that the UMLS curators will implement all required corrections and use the RSN along with the SN when maintaining and extending the UMLS. When used correctly, the RSN will support the prevention of the accidental introduction of inconsistent semantic type assignments into the UMLS. Furthermore, this way the RSN will support the exposure of other hidden errors and inconsistencies in health informatics terminologies, which are sources of the UMLS. Notably, the development of the RSN materializes the deeper, more refined Semantic Network for the UMLS that its designers envisioned originally but had not implemented.
|
30
|
He Z, Ochs C, Agrawal A, Perl Y, Zeginis D, Tarabanis K, Elhanan G, Halper M, Noy N, Geller J. A family-based framework for supporting quality assurance of biomedical ontologies in BioPortal. AMIA ... ANNUAL SYMPOSIUM PROCEEDINGS. AMIA SYMPOSIUM 2013; 2013:581-590. [PMID: 24551360 PMCID: PMC3900201] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/03/2023]
Abstract
BioPortal contains over 300 ontologies, for which quality assurance (QA) is critical. Abstraction networks (ANs), compact summarizations of ontology structure and content, have been used in such QA efforts, typically in a "one-off" manner for a single ontology. Ontologies can be characterized, independently of knowledge-content focus, from a structural standpoint, leading to the formulation of ontology families. A family is defined as a set of ontologies satisfying some overarching condition regarding their structural features. Seven such families, comprising 186 ontologies, are identified. To increase efficiency, a new family-based QA framework is introduced in which an automated, uniform AN derivation technique and accompanying semi-automated, uniform QA regimen are applicable to the ontologies of a given family. Specifically, across an entire family, the QA efforts exploit family-wide AN features in the characterization of sets of classes that are more likely to harbor errors. The approach is demonstrated on the Cancer Chemoprevention BioPortal ontology.
Affiliation(s)
- Zhe He: New Jersey Institute of Technology, Newark, NJ
|
31
|
Contrasting lexical similarity and formal definitions in SNOMED CT: consistency and implications. J Biomed Inform 2013; 47:192-8. [PMID: 24239752 DOI: 10.1016/j.jbi.2013.11.003] [Citation(s) in RCA: 25] [Impact Index Per Article: 2.3] [Received: 05/31/2013] [Revised: 10/04/2013] [Accepted: 11/03/2013] [Indexed: 11/23/2022]
Abstract
OBJECTIVE To quantify the presence of and evaluate an approach for detection of inconsistencies in the formal definitions of SNOMED CT (SCT) concepts utilizing a lexical method. MATERIAL AND METHOD Utilizing SCT's Procedure hierarchy, we algorithmically formulated similarity sets: groups of concepts with similar lexical structure of their fully specified name. We formulated five random samples, each with 50 similarity sets, based on the same parameter: number of parents, attributes, groups, all the former as well as a randomly selected control sample. All samples' sets were reviewed for types of formal definition inconsistencies: hierarchical, attribute assignment, attribute target values, groups, and definitional. RESULTS For the Procedure hierarchy, 2111 similarity sets were formulated, covering 18.1% of eligible concepts. The evaluation revealed that 38% (Control) to 70% (Different relationships) of similarity sets within the samples exhibited significant inconsistencies. The rate of inconsistencies for the sample with different relationships was highly significant compared to Control, as well as the number of attribute assignment and hierarchical inconsistencies within their respective samples. DISCUSSION AND CONCLUSION While, at this time of the HITECH initiative, the formal definitions of SCT are only a minor consideration, in the grand scheme of sophisticated, meaningful use of captured clinical data, they are essential. However, a significant portion of the concepts in the most semantically complex hierarchy of SCT, the Procedure hierarchy, are modeled inconsistently in a manner that affects their computability. Lexical methods can efficiently identify such inconsistencies and possibly allow for their algorithmic resolution.
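The abstract above does not spell out how "similar lexical structure" is decided, so the following is only a minimal sketch under one assumed criterion: two names belong to the same similarity set when they have the same number of words and differ in exactly one word position. The function name `similarity_sets` and the example strings are hypothetical; the strings are illustrative and are not verified SNOMED CT fully specified names.

```python
from collections import defaultdict


def similarity_sets(names):
    """Group names with equal word counts that differ in exactly one word position.

    This one-word-difference rule is an assumption made for illustration,
    not the criterion used in the cited study.
    """
    buckets = defaultdict(set)
    for name in names:
        words = name.lower().split()
        for i in range(len(words)):
            # Mask position i; names sharing the masked pattern differ only there.
            key = (len(words), i, tuple(words[:i] + words[i + 1:]))
            buckets[key].add(name)
    return [sorted(group) for group in buckets.values() if len(group) > 1]


# Illustrative strings only.
toy_names = [
    "Biopsy of left kidney",
    "Biopsy of right kidney",
    "Excision of left kidney",
    "Biopsy of liver",
]
sets_found = similarity_sets(toy_names)
```

A name can appear in more than one set, as "Biopsy of left kidney" does here. Each resulting set could then be checked for whether its members have consistent parents, attributes, and relationship groups.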
|
32
|
Agrawal A, He Z, Perl Y, Wei D, Halper M, Elhanan G, Chen Y. The readiness of SNOMED problem list concepts for meaningful use of electronic health records. Artif Intell Med 2013; 58:73-80. [PMID: 23602702 DOI: 10.1016/j.artmed.2013.03.008] [Citation(s) in RCA: 23] [Impact Index Per Article: 2.1] [Received: 05/04/2012] [Revised: 03/05/2013] [Accepted: 03/17/2013] [Indexed: 11/24/2022]
Abstract
OBJECTIVE By 2015, SNOMED CT (SCT) will become the USA's standard for encoding diagnoses and problem lists in electronic health records (EHRs). To facilitate this effort, the National Library of Medicine has published the "SCT Clinical Observations Recording and Encoding" and the "Veterans Health Administration and Kaiser Permanente" problem lists (collectively, the "PL"). The PL is studied in regard to its readiness to support meaningful use of EHRs. In particular, we wish to determine if inconsistencies appearing in SCT, in general, occur as frequently in the PL, and whether further quality-assurance (QA) efforts on the PL are required. METHODS AND MATERIALS A study is conducted where two random samples of SCT concepts are compared. The first consists of concepts strictly from the PL and the second contains general SCT concepts distributed proportionally to the PL's in terms of their hierarchies. Each sample is analyzed for its percentage of primitive concepts and for frequency of modeling errors of various severity levels as quality measures. A simple structural indicator, namely, the number of parents, is suggested to locate high likelihood inconsistencies in hierarchical relationships. The effectiveness of this indicator is evaluated. RESULTS PL concepts are found to be slightly better than other concepts in the respective SCT hierarchies with regards to the quality measure of the percentage of primitive concepts and the frequency of modeling errors. There were 58% primitive concepts in the PL sample versus 62% in the control sample. The structural indicator of number of parents is shown to be statistically significant in its ability to identify concepts having a higher likelihood of inconsistencies in their hierarchical relationships. The absolute number of errors in the group of concepts having 1-3 parents was shown to be significantly lower than that for concepts with 4-6 parents and those with 7 or more parents based on Chi-squared analyses. 
CONCLUSION PL concepts suffer from the same issues as general SCT concepts, although to a slightly lesser extent, and do require further QA efforts to promote meaningful use of EHRs. To support such efforts, a structural indicator is shown to effectively ferret out potentially problematic concepts where those QA efforts should be focused.
Affiliation(s)
- Ankur Agrawal: Computer Science Department, New Jersey Institute of Technology, Newark, NJ 07102, USA.
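The structural indicator in the abstract above (number of parents, compared across the 1-3, 4-6, and 7-or-more groups with chi-squared analyses) can be sketched as follows. This is a minimal illustration, not the authors' analysis: the helper names `parent_bin`, `tally`, and `chi_squared` are assumptions, and the counts in `toy_table` are invented for the example and are not the study's data.

```python
def parent_bin(n_parents):
    """Assign a concept to one of the three parent-count groups named in the abstract."""
    if n_parents <= 3:
        return "1-3"
    if n_parents <= 6:
        return "4-6"
    return "7+"


def tally(records):
    """Build {bin: [errors, non_errors]} from (n_parents, has_error) pairs."""
    table = {"1-3": [0, 0], "4-6": [0, 0], "7+": [0, 0]}
    for n_parents, has_error in records:
        table[parent_bin(n_parents)][0 if has_error else 1] += 1
    return table


def chi_squared(table):
    """Pearson chi-squared statistic for a {row: [errors, non_errors]} table.

    Assumes every row and column total is non-zero.
    """
    rows = list(table.values())
    col_totals = [sum(row[j] for row in rows) for j in range(2)]
    grand_total = sum(col_totals)
    statistic = 0.0
    for row in rows:
        row_total = sum(row)
        for j in range(2):
            expected = row_total * col_totals[j] / grand_total
            statistic += (row[j] - expected) ** 2 / expected
    return statistic


# Hypothetical counts, invented for illustration only (not the study's data).
toy_table = {"1-3": [10, 90], "4-6": [20, 80], "7+": [30, 70]}
statistic = chi_squared(toy_table)
```

A 3x2 table has 2 degrees of freedom, so a statistic above 5.991 would be significant at the 0.05 level. With the invented counts above the statistic is 12.5.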
|
33
|
Mikroyannidi E, Stevens R, Iannone L, Rector A. Analysing Syntactic Regularities and Irregularities in SNOMED-CT. J Biomed Semantics 2012; 3:8. [PMID: 23244503 PMCID: PMC3637289 DOI: 10.1186/2041-1480-3-8] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.6] [Received: 07/08/2012] [Accepted: 11/13/2012] [Indexed: 11/28/2022]
Abstract
Motivation In this paper we demonstrate the usage of RIO, a framework for detecting syntactic regularities using cluster analysis of the entities in the signature of an ontology. Quality assurance in ontologies is vital for their use in real applications, as well as a complex and difficult task. It is also important to have such methods and tools when the ontology lacks documentation and the user cannot consult the ontology developers to understand its construction. One aspect of quality assurance is checking how well an ontology complies with established ‘coding standards’: is the ontology regular in how descriptions of different types of entities are axiomatised? Is there a similar way to describe them and are there any corner cases that are not covered by a pattern? Detection of regularities and irregularities in axiom patterns should provide ontology authors and quality inspectors with a level of abstraction such that compliance to coding standards can be automated. However, there is a lack of such reverse ontology engineering methods and tools. Results The RIO framework allows regularities to be detected in an OWL ontology, i.e. repetitive structures in the axioms of an ontology. We describe the use of standard machine learning approaches to make clusters of similar entities and generalise over their axioms to find regularities. This abstraction allows matches to, and deviations from, an ontology’s patterns to be shown. We demonstrate its usage with the inspection of three modules from SNOMED-CT, a large medical terminology, that cover “Present” and “Absent” findings, as well as “Chronic” and “Acute” findings. The module sizes are 5 065, 20 688 and 19 812 asserted axioms. They are analysed in terms of their types and number of regularities and irregularities in the asserted axioms of the ontology.
The analysis showed that some modules of the terminology, which were expected to instantiate a pattern described in the SNOMED-CT technical guide, were found to have a high number of regularity deviations. A subset of these were categorised as “design defects” by verifying them with past work on the quality assurance of SNOMED-CT. These were mainly incomplete descriptions. In the worst case, the expected patterns described in the technical guide were followed by only 5% of the axioms in the module. Conclusion It is possible to automatically detect regularities and then inspect irregularities in an ontology. We argue that RIO is a tool to find and report such matches and mismatches, for evaluations by the domain experts. We have demonstrated that standard clustering techniques from machine learning can offer a tool in the drive for quality assurance in ontologies. Availability http://riotool.sourceforge.net/ Contact eleni.mikroyannidi@manchester.ac.uk, robert.stevens@manchester.ac.uk
Affiliation(s)
- Eleni Mikroyannidi: School of Computer Science, The University of Manchester, Oxford Road, Manchester, M13 9PL UK.
|
34
|
Geller J, Ochs C, Perl Y, Xu J. New abstraction networks and a new visualization tool in support of auditing the SNOMED CT content. AMIA ... ANNUAL SYMPOSIUM PROCEEDINGS. AMIA SYMPOSIUM 2012; 2012:237-246. [PMID: 23304293 PMCID: PMC3540556] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/01/2023]
Abstract
Medical terminologies are large and complex. Frequently, errors are hidden in this complexity. Our objective is to find such errors, which can be aided by deriving abstraction networks from a large terminology. Abstraction networks preserve important features but eliminate many minor details, which are often not useful for identifying errors. Providing visualizations for such abstraction networks aids auditors by allowing them to quickly focus on elements of interest within a terminology. Previously we introduced area taxonomies and partial area taxonomies for SNOMED CT. In this paper, two advanced, novel kinds of abstraction networks, the relationship-constrained partial area subtaxonomy and the root-constrained partial area subtaxonomy are defined and their benefits are demonstrated. We also describe BLUSNO, an innovative software tool for quickly generating and visualizing these SNOMED CT abstraction networks. BLUSNO is a dynamic, interactive system that provides quick access to well organized information about SNOMED CT.
Affiliation(s)
- James Geller: New Jersey Institute of Technology, Newark, NJ, USA
|
35
|
Rule-based support system for multiple UMLS semantic type assignments. J Biomed Inform 2012; 46:97-110. [PMID: 23041716 DOI: 10.1016/j.jbi.2012.09.007] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.3] [Received: 05/09/2012] [Revised: 09/14/2012] [Accepted: 09/15/2012] [Indexed: 11/20/2022]
Abstract
BACKGROUND When new concepts are inserted into the UMLS, they are assigned one or several semantic types from the UMLS Semantic Network by the UMLS editors. However, not every combination of semantic types is permissible. It was observed that many concepts with rare combinations of semantic types have erroneous semantic type assignments or prohibited combinations of semantic types. The correction of such errors is resource-intensive. OBJECTIVE We design a computational system to inform UMLS editors as to whether a specific combination of two, three, four, or five semantic types is permissible or prohibited or questionable. METHODS We identify a set of inclusion and exclusion instructions in the UMLS Semantic Network documentation and derive corresponding rule-categories as well as rule-categories from the UMLS concept content. We then design an algorithm adviseEditor based on these rule-categories. The algorithm specifies rules for an editor how to proceed when considering a tuple (pair, triple, quadruple, quintuple) of semantic types to be assigned to a concept. RESULTS Eight rule-categories were identified. A Web-based system was developed to implement the adviseEditor algorithm, which returns for an input combination of semantic types whether it is permitted, prohibited or (in a few cases) requires more research. The numbers of semantic type pairs assigned to each rule-category are reported. Interesting examples for each rule-category are illustrated. Cases of semantic type assignments that contradict rules are listed, including recently introduced ones. CONCLUSION The adviseEditor system implements explicit and implicit knowledge available in the UMLS in a system that informs UMLS editors about the permissibility of a desired combination of semantic types. Using adviseEditor might help accelerate the work of the UMLS editors and prevent erroneous semantic type assignments.
|
36
|
Wang Y, Halper M, Wei D, Gu H, Perl Y, Xu J, Elhanan G, Chen Y, Spackman KA, Case JT, Hripcsak G. Auditing complex concepts of SNOMED using a refined hierarchical abstraction network. J Biomed Inform 2012; 45:1-14. [PMID: 21907827 PMCID: PMC3313651 DOI: 10.1016/j.jbi.2011.08.016] [Citation(s) in RCA: 31] [Impact Index Per Article: 2.6] [Received: 04/18/2011] [Revised: 08/25/2011] [Accepted: 08/26/2011] [Indexed: 10/17/2022]
Abstract
Auditors of a large terminology, such as SNOMED CT, face a daunting challenge. To aid them in their efforts, it is essential to devise techniques that can automatically identify concepts warranting special attention. "Complex" concepts, which by their very nature are more difficult to model, fall neatly into this category. A special kind of grouping, called a partial-area, is utilized in the characterization of complex concepts. In particular, the complex concepts that are the focus of this work are those appearing in intersections of multiple partial-areas and are thus referred to as overlapping concepts. In a companion paper, an automatic methodology for identifying and partitioning the entire collection of overlapping concepts into disjoint, singly-rooted groups, that are more manageable to work with and comprehend, has been presented. The partitioning methodology formed the foundation for the development of an abstraction network for the overlapping concepts called a disjoint partial-area taxonomy. This new disjoint partial-area taxonomy offers a collection of semantically uniform partial-areas and is exploited herein as the basis for a novel auditing methodology. The review of the overlapping concepts is done in a top-down order within semantically uniform groups. These groups are themselves reviewed in a top-down order, which proceeds from the less complex to the more complex overlapping concepts. The results of applying the methodology to SNOMED's Specimen hierarchy are presented. Hypotheses regarding error ratios for overlapping concepts and between different kinds of overlapping concepts are formulated. Two phases of auditing the Specimen hierarchy for two releases of SNOMED are reported on. With the use of the double bootstrap and Fisher's exact test (two-tailed), the auditing of concepts and especially roots of overlapping partial-areas is shown to yield a statistically significant higher proportion of errors.
Affiliation(s)
- Yue Wang: Computer Science Dept., New Jersey Institute of Technology, Newark, NJ 07102 USA
- Michael Halper: Information Technology Dept., New Jersey Institute of Technology, Newark, NJ 07102 USA
- Duo Wei: Computer Science Dept., New Jersey Institute of Technology, Newark, NJ 07102 USA
- Huanying Gu: Computer Science Dept., New York Institute of Technology, New York, NY 10023 USA
- Yehoshua Perl: Computer Science Dept., New Jersey Institute of Technology, Newark, NJ 07102 USA
- Junchuan Xu: Computer Science Dept., New Jersey Institute of Technology, Newark, NJ 07102 USA
- Gai Elhanan: Computer Science Dept., New Jersey Institute of Technology, Newark, NJ 07102 USA; Halfpenny Technologies, Inc., Blue Bell, PA 19422 USA
- Yan Chen: Computer Information Systems Dept., BMCC, CUNY, New York, NY 10007 USA
- George Hripcsak: Dept. of Biomedical Informatics, Columbia University, New York, NY 10032 USA
|