1. Hao X, Abeysinghe R, Roberts K, Cui L. Logical definition-based identification of potential missing concepts in SNOMED CT. BMC Med Inform Decis Mak 2023; 23:87. [PMID: 37161566 PMCID: PMC10169302 DOI: 10.1186/s12911-023-02183-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/30/2022] [Accepted: 04/20/2023] [Indexed: 05/11/2023]
Abstract
BACKGROUND Biomedical ontologies are representations of biomedical knowledge that provide terms with precisely defined meanings. They play a vital role in facilitating biomedical research in a cross-disciplinary manner. Quality issues in biomedical ontologies hinder their effective usage. One such quality issue is missing concepts. In this study, we introduce a logical definition-based approach to identify potential missing concepts in SNOMED CT. A unique contribution of our approach is that it is capable of obtaining both logical definitions and fully specified names for potential missing concepts. METHOD The logical definitions of unrelated pairs of fully defined concepts in non-lattice subgraphs that indicate quality issues are intersected to generate the logical definitions of potential missing concepts. A text summarization model (called PEGASUS) is fine-tuned to predict the fully specified names of the potential missing concepts from their generated logical definitions. Furthermore, the identified potential missing concepts are validated using external resources including the Unified Medical Language System (UMLS), biomedical literature in PubMed, and a newer version of SNOMED CT. RESULTS From the March 2021 US Edition of SNOMED CT, we obtained a total of 30,313 unique logical definitions for potential missing concepts through the intersecting process. We fine-tuned a PEGASUS summarization model with 289,169 training instances and tested it on 36,146 instances. The model achieved ROUGE-1, ROUGE-2, and ROUGE-L scores of 72.83, 51.06, and 71.76, respectively, on the test dataset. The model correctly predicted 11,549 out of 36,146 fully specified names in the test dataset. Applying the fine-tuned model to the 30,313 unique logical definitions, a total of 23,031 potential missing concepts were identified. Of these, 2,312 (10.04%) were automatically validated by at least one of the three resources.
CONCLUSIONS The results showed that our logical definition-based approach for identification of potential missing concepts in SNOMED CT is encouraging. Nevertheless, there is still room for improving the performance of naming concepts based on logical definitions.
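The intersecting step described under METHOD can be illustrated with a toy sketch. It assumes, purely for illustration, that a logical definition is a flat set of (attribute, value) pairs and that intersecting two definitions means keeping the pairs they share. The study itself may treat role groups and subsumption between values differently, which this sketch ignores, and the concept content below is made up.

```python
from typing import FrozenSet, Tuple

# Hypothetical representation: a logical definition as a flat set of
# (attribute, value) pairs. Real SNOMED CT definitions are richer.
Pair = Tuple[str, str]
Definition = FrozenSet[Pair]


def intersect_definitions(a: Definition, b: Definition) -> Definition:
    """Keep only the attribute-value pairs shared by both definitions."""
    return a & b


# Made-up definitions for two hierarchically unrelated concepts.
closed_fracture_of_femur: Definition = frozenset({
    ("is-a", "Fracture of bone"),
    ("finding site", "Bone structure of femur"),
    ("associated morphology", "Fracture, closed"),
})
open_fracture_of_femur: Definition = frozenset({
    ("is-a", "Fracture of bone"),
    ("finding site", "Bone structure of femur"),
    ("associated morphology", "Fracture, open"),
})

# The shared pairs form the candidate definition of a potential missing concept.
candidate = intersect_definitions(closed_fracture_of_femur, open_fracture_of_femur)
```

In this toy case the candidate keeps the shared parent and finding site and drops the two differing morphologies; naming such a candidate is the part the abstract assigns to the fine-tuned summarization model.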
Affiliation(s)
- Xubing Hao
  - School of Biomedical Informatics, University of Texas Health Science Center at Houston, Houston, TX USA
- Rashmie Abeysinghe
  - Department of Neurology, McGovern Medical School, University of Texas Health Science Center at Houston, Houston, TX USA
- Kirk Roberts
  - School of Biomedical Informatics, University of Texas Health Science Center at Houston, Houston, TX USA
- Licong Cui
  - School of Biomedical Informatics, University of Texas Health Science Center at Houston, Houston, TX USA
2. Mohtashamian M, Abeysinghe R, Hao X, Cui L. Identifying Missing IS-A Relations in Orphanet Rare Disease Ontology. Proceedings. IEEE International Conference on Bioinformatics and Biomedicine 2022; 2022:3274-3279. [PMID: 36776767 PMCID: PMC9918376 DOI: 10.1109/bibm55620.2022.9995614] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Indexed: 06/18/2023]
Abstract
The Orphanet Rare Disease Ontology (ORDO) provides a structured vocabulary encapsulating rare diseases. Downstream applications of ORDO depend on its accuracy to effectively perform their tasks. In this paper, we implement an automated quality assurance pipeline to identify missing is-a relations in ORDO. We first obtain lexical features from concept names. Then we generate related and unrelated feature sharing concept-pairs, where a feature sharing concept-pair can further generate derived term-pairs. If an unrelated and related feature sharing concept-pair generate the same derived term-pair, then we suggest a potential missing is-a relation between the unrelated feature sharing concept-pair. Applying this approach on the 2022-06-27 release of ORDO, we obtained 705 potential missing is-a relations. Leveraging external ontological information in the Unified Medical Language System, we validated 164 missing is-a relations. This indicates that our approach is a promising way to audit is-a relations in ORDO, even though further domain expert evaluation is still needed to validate the remaining potential missing is-a relations identified.
Affiliation(s)
- Xubing Hao
  - The University of Texas Health Science Center at Houston, School of Biomedical Informatics, Houston, TX
3. Hao X, Abeysinghe R, Zheng F, Cui L. Leveraging non-lattice subgraphs for suggestion of new concepts for SNOMED CT. Proceedings. IEEE International Conference on Bioinformatics and Biomedicine 2021; 2021:1805-1812. [PMID: 35291311 PMCID: PMC8919474 DOI: 10.1109/bibm52615.2021.9669407] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Indexed: 05/25/2023]
Abstract
Missing hierarchical is-a relations and missing concepts are common quality issues in biomedical ontologies. Non-lattice subgraphs have been extensively studied for automatically identifying missing is-a relations in biomedical ontologies like SNOMED CT. However, little is known about non-lattice subgraphs' capability to uncover new or missing concepts in biomedical ontologies. In this work, we investigate a lexical-based intersection approach based on non-lattice subgraphs to identify potential missing concepts in SNOMED CT. We first construct lexical features of concepts using their fully specified names. Then we generate hierarchically unrelated concept pairs in non-lattice subgraphs as the candidates to derive new concepts. For each candidate pair of concepts, we conduct an order-preserving intersection based on the two concepts' lexical features, with the intersection result serving as the potential new concept name suggested. We further perform automatic validation through terminologies in the Unified Medical Language System (UMLS) and literature in PubMed. Applying this approach to the March 2021 release of SNOMED CT US Edition, we obtained 7,702 potential missing concepts, among which 1,288 were validated through UMLS and 1,309 were validated through PubMed. The results showed that non-lattice subgraphs have the potential to facilitate suggestion of new concepts for SNOMED CT.
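A minimal sketch of what an order-preserving intersection of two concept names might look like. It assumes that lexical features are lowercase whitespace-separated words and that the result keeps the shared words in the order of the first name; the paper's actual feature construction may differ, and the example names are illustrative only.

```python
from typing import List


def order_preserving_intersection(name_a: str, name_b: str) -> List[str]:
    """Return the words of name_a that also occur in name_b, in name_a's order."""
    words_a = name_a.lower().split()
    words_b = set(name_b.lower().split())
    return [word for word in words_a if word in words_b]


# Illustrative (made-up) candidate pair of hierarchically unrelated concepts.
suggested = order_preserving_intersection(
    "Closed fracture of left femur",
    "Open fracture of right femur",
)
suggested_name = " ".join(suggested)
```

Keeping the word order of the source name is what makes the result usable as a readable suggested name rather than an unordered bag of words.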
Affiliation(s)
- Xubing Hao
  - School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, Texas, USA
- Rashmie Abeysinghe
  - Department of Neurology, The University of Texas Health Science Center at Houston, Houston, Texas, USA
- Fengbo Zheng
  - School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, Texas, USA
- Licong Cui
  - School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, Texas, USA
4. Zheng F, Abeysinghe R, Cui L. Identification of missing concepts in biomedical terminologies using sequence-based formal concept analysis. BMC Med Inform Decis Mak 2021; 21:234. [PMID: 34753458 PMCID: PMC8579614 DOI: 10.1186/s12911-021-01592-w] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 07/19/2021] [Accepted: 07/21/2021] [Indexed: 11/15/2022]
Abstract
Background As biomedical knowledge is rapidly evolving, concept enrichment of biomedical terminologies is an active research area involving automatic identification of missing or new concepts. Previously, we prototyped a lexical-based formal concept analysis (FCA) approach in which concepts were derived by intersecting bags of words, to identify potentially missing concepts in the National Cancer Institute (NCI) Thesaurus. However, this prototype did not handle concept naming and positioning. In this paper, we introduce a sequence-based FCA approach to identify potentially missing concepts, supporting concept naming and positioning. Methods We consider the concept name sequences as FCA attributes to construct the formal context. The concept-forming process is performed by computing the longest common substrings of concept name sequences. After new concepts are formalized, we further predict their potential positions in the original hierarchy by identifying their supertypes and subtypes from original concepts. Automated validation via external terminologies in the Unified Medical Language System (UMLS) and biomedical literature in PubMed is performed to evaluate the effectiveness of our approach. Results We applied our sequence-based FCA approach to all the sub-hierarchies under Disease or Disorder in the NCI Thesaurus (19.08d version) and five sub-hierarchies under Clinical Finding and Procedure in the SNOMED CT (US Edition, March 2020 release). In total, 1397 potentially missing concepts were identified in the NCI Thesaurus and 7223 in the SNOMED CT. For NCI Thesaurus, 85 potentially missing concepts were found in external terminologies and 315 of the remaining 1312 appeared in biomedical literature. For SNOMED CT, 576 were found in external terminologies and 1159 out of the remaining 6647 were found in biomedical literature. Conclusion Our sequence-based FCA approach has shown promise for identifying potentially missing concepts in biomedical terminologies.
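The concept-forming step, computing the longest common substring of two concept name sequences, can be sketched with a standard dynamic-programming routine. This sketch assumes a name sequence is a list of words and that "substring" means a contiguous run of words; the function name and example names are illustrative, not taken from the paper.

```python
from typing import List


def longest_common_word_run(a: List[str], b: List[str]) -> List[str]:
    """Longest contiguous run of words shared by sequences a and b."""
    best_len, best_end = 0, 0
    prev = [0] * (len(b) + 1)  # run lengths ending at the previous word of a
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                # Extend the run that ended at the preceding words.
                curr[j] = prev[j - 1] + 1
                if curr[j] > best_len:
                    best_len, best_end = curr[j], i
        prev = curr
    return a[best_end - best_len:best_end]


# Illustrative (made-up) pair of concept names.
run = longest_common_word_run(
    "stage i lung carcinoma".split(),
    "stage ii lung carcinoma".split(),
)
```

Unlike a bag-of-words intersection, the contiguous run carries its own word order, which is one way a derived concept can come with a usable name.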
Affiliation(s)
- Fengbo Zheng
  - Department of Computer Science, University of Kentucky, Lexington, KY, USA; School of Biomedical Informatics, University of Texas Health Science Center at Houston, Houston, TX, USA
- Rashmie Abeysinghe
  - Department of Neurology, McGovern School of Medicine, University of Texas Health Science Center at Houston, Houston, TX, USA
- Licong Cui
  - School of Biomedical Informatics, University of Texas Health Science Center at Houston, Houston, TX, USA
5. Zheng F, Cui L. A Lexical-based Formal Concept Analysis Method to Identify Missing Concepts in the NCI Thesaurus. Proceedings. IEEE International Conference on Bioinformatics and Biomedicine 2021; 2020. [PMID: 34721941 DOI: 10.1109/bibm49941.2020.9313186] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Indexed: 11/10/2022]
Abstract
Biomedical terminologies have been increasingly used in modern biomedical research and applications to facilitate data management and ensure semantic interoperability. As part of the evolution process, new concepts are regularly added to biomedical terminologies in response to the evolving domain knowledge and emerging applications. Most existing concept enrichment methods suggest new concepts via directly importing knowledge from external sources. In this paper, we introduced a lexical method based on formal concept analysis (FCA) to identify potentially missing concepts in a given terminology by leveraging its intrinsic knowledge - concept names. We first construct the FCA formal context based on the lexical features of concepts. Then we perform multistage intersection to formalize new concepts and detect potentially missing concepts. We applied our method to the Disease or Disorder sub-hierarchy in the National Cancer Institute (NCI) Thesaurus (19.08d version) and identified a total of 8,983 potentially missing concepts. As a preliminary evaluation of our method to validate the potentially missing concepts, we further checked whether they were included in any external source terminology in the Unified Medical Language System (UMLS). The result showed that 592 out of 8,937 potentially missing concepts were found in the UMLS.
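One way to picture the multistage intersection is as a closure computation: word sets from concept names are intersected repeatedly until no new non-empty set appears, and sets that do not match an existing name become candidates. This is a sketch under that assumption, not the authors' implementation; the names and helper function are hypothetical.

```python
from typing import FrozenSet, Iterable, Set

WordSet = FrozenSet[str]


def close_under_intersection(word_sets: Iterable[WordSet]) -> Set[WordSet]:
    """Repeatedly intersect word sets until no new non-empty set is produced."""
    closed: Set[WordSet] = set(word_sets)
    frontier: Set[WordSet] = set(closed)
    while frontier:
        new_sets: Set[WordSet] = set()
        for a in frontier:
            for b in closed:
                common = a & b
                if common and common not in closed:
                    new_sets.add(common)
        closed |= new_sets
        frontier = new_sets  # only newly found sets need another stage
    return closed


# Illustrative (made-up) concept names turned into bags of words.
names = [
    "lung carcinoma stage i",
    "lung carcinoma stage ii",
    "breast carcinoma stage i",
]
originals = {frozenset(name.split()) for name in names}
derived = close_under_intersection(originals) - originals
```

Each derived set is only a candidate; the abstract's evaluation step, checking candidates against UMLS source terminologies, is what filters them.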
Affiliation(s)
- Fengbo Zheng
  - Department of Computer Science, University of Kentucky, Lexington, Kentucky, USA; School of Biomedical Informatics, University of Texas Health Science Center at Houston, Houston, Texas, USA
- Licong Cui
  - School of Biomedical Informatics, University of Texas Health Science Center at Houston, Houston, Texas, USA
6. Jing X. The Unified Medical Language System at 30 Years and How It Is Used and Published: Systematic Review and Content Analysis. JMIR Med Inform 2021; 9:e20675. [PMID: 34236337 PMCID: PMC8433943 DOI: 10.2196/20675] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 05/25/2020] [Revised: 11/25/2020] [Accepted: 07/02/2021] [Indexed: 01/22/2023]
Abstract
BACKGROUND The Unified Medical Language System (UMLS) has been a critical tool in biomedical and health informatics, and the year 2021 marks its 30th anniversary. The UMLS brings together many broadly used vocabularies and standards in the biomedical field to facilitate interoperability among different computer systems and applications. OBJECTIVE Despite its longevity, there is no comprehensive publication analysis of the use of the UMLS. Thus, this review and analysis is conducted to provide an overview of the UMLS and its use in English-language peer-reviewed publications, with the objective of providing a comprehensive understanding of how the UMLS has been used in English-language peer-reviewed publications over the last 30 years. METHODS PubMed, ACM Digital Library, and the Nursing & Allied Health Database were used to search for studies. The primary search strategy was as follows: UMLS was used as a Medical Subject Headings term or a keyword or appeared in the title or abstract. Only English-language publications were considered. The publications were screened first, then coded and categorized iteratively, following the grounded theory. The review process followed the PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) guidelines. RESULTS A total of 943 publications were included in the final analysis. Moreover, 32 publications were categorized into 2 categories; hence the total number of publications before duplicates are removed is 975. 
After analysis and categorization of the publications, UMLS was found to be used in the following emerging themes or areas (the number of publications and their respective percentages are given in parentheses): natural language processing (230/975, 23.6%), information retrieval (125/975, 12.8%), terminology study (90/975, 9.2%), ontology and modeling (80/975, 8.2%), medical subdomains (76/975, 7.8%), other language studies (53/975, 5.4%), artificial intelligence tools and applications (46/975, 4.7%), patient care (35/975, 3.6%), data mining and knowledge discovery (25/975, 2.6%), medical education (20/975, 2.1%), degree-related theses (13/975, 1.3%), digital library (5/975, 0.5%), and the UMLS itself (150/975, 15.4%), as well as the UMLS for other purposes (27/975, 2.8%). CONCLUSIONS The UMLS has been used successfully in patient care, medical education, digital libraries, and software development, as originally planned, as well as in degree-related theses, the building of artificial intelligence tools, data mining and knowledge discovery, foundational work in methodology, and middle layers that may lead to advanced products. Natural language processing, the UMLS itself, and information retrieval are the 3 most common themes that emerged among the included publications. The results, although largely related to academia, demonstrate that UMLS achieves its intended uses successfully, in addition to achieving uses broadly beyond its original intentions.
Affiliation(s)
- Xia Jing
  - Department of Public Health Sciences, College of Behavioral, Social and Health Sciences, Clemson University, Clemson, SC, United States
7. Keloth VK, Geller J, Chen Y, Xu J. Extending import detection algorithms for concept import from two to three biomedical terminologies. BMC Med Inform Decis Mak 2020; 20:272. [PMID: 33319702 PMCID: PMC7737255 DOI: 10.1186/s12911-020-01290-z] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 10/08/2020] [Accepted: 10/12/2020] [Indexed: 11/10/2022]
Abstract
BACKGROUND While enrichment of terminologies can be achieved in different ways, filling gaps in the IS-A hierarchy backbone of a terminology appears especially promising. To avoid difficult manual inspection, we started a research program in 2014, investigating terminology densities, where the comparison of terminologies leads to the algorithmic discovery of potentially missing concepts in a target terminology. While candidate concepts have to be approved for import by an expert, the human effort is greatly reduced by algorithmic generation of candidates. In previous studies, a single source terminology was used with one target terminology. METHODS In this paper, we are extending the algorithmic detection of "candidate concepts for import" from one source terminology to two source terminologies used in tandem. We show that the combination of two source terminologies relative to one target terminology leads to the discovery of candidate concepts for import that could not be found with the same "reliability" when comparing one source terminology alone to the target terminology. We investigate which triples of UMLS terminologies can be gainfully used for the described purpose and how many candidate concepts can be found for each individual triple of terminologies. RESULTS The analysis revealed a specific configuration of concepts, overlapping two source and one target terminology, for which we coined the name "fire ladder" pattern. The three terminologies in this pattern are tied together by a kind of "transitivity." We provide a quantitative analysis of the discovered fire ladder patterns and we report on the inter-rater agreement concerning the decision of importing candidate concepts from source terminologies into the target terminology. We algorithmically identified 55 instances of the fire ladder pattern and two domain experts agreed on import for 39 instances. In total, 48 concepts were approved by at least one expert. 
In addition, 105 import candidate concepts from a single source terminology into the target terminology were also detected, as a "beneficial side-effect" of this method, increasing the cardinality of the result. CONCLUSION We showed that pairs of biomedical source terminologies can be transitively chained to suggest possible imports of concepts into a target terminology.
Affiliation(s)
- Vipina K Keloth
  - Department of Computer Science, New Jersey Institute of Technology, Newark, NJ, 07102, USA
- James Geller
  - Department of Computer Science, New Jersey Institute of Technology, Newark, NJ, 07102, USA
- Yan Chen
  - Department of Computer Information Systems, Borough of Manhattan Community College, City University of New York, New York, NY, 10007, USA
- Julia Xu
  - JXU Consulting, 4343 Pine Blossom Tr., Houston, TX, 77059, USA
8. Zheng F, Shi J, Yang Y, Zheng WJ, Cui L. A transformation-based method for auditing the IS-A hierarchy of biomedical terminologies in the Unified Medical Language System. J Am Med Inform Assoc 2020; 27:1568-1575. [PMID: 32918476 PMCID: PMC7566369 DOI: 10.1093/jamia/ocaa123] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.0] [Received: 02/10/2020] [Revised: 05/09/2020] [Accepted: 05/20/2020] [Indexed: 01/06/2023]
Abstract
OBJECTIVE The Unified Medical Language System (UMLS) integrates various source terminologies to support interoperability between biomedical information systems. In this article, we introduce a novel transformation-based auditing method that leverages the UMLS knowledge to systematically identify missing hierarchical IS-A relations in the source terminologies. MATERIALS AND METHODS Given a concept name in the UMLS, we first identify its base and secondary noun chunks. For each identified noun chunk, we generate replacement candidates that are more general than the noun chunk. Then, we replace the noun chunks with their replacement candidates to generate new potential concept names that may serve as supertypes of the original concept. If a newly generated name is an existing concept name in the same source terminology as the original concept, then a potentially missing IS-A relation between the original and the new concept is identified. RESULTS Applying our transformation-based method to English-language concept names in the UMLS (2019AB release), a total of 39 359 potentially missing IS-A relations were detected in 13 source terminologies. Domain experts evaluated a random sample of 200 potentially missing IS-A relations identified in the SNOMED CT (U.S. edition) and 100 in Gene Ontology. A total of 173 of 200 and 63 of 100 potentially missing IS-A relations were confirmed by domain experts, indicating that our method achieved a precision of 86.5% and 63% for the SNOMED CT and Gene Ontology, respectively. CONCLUSIONS Our results showed that our transformation-based method is effective in identifying missing IS-A relations in the UMLS source terminologies.
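The replace-and-look-up idea in the methods can be sketched as follows. This is a simplified illustration, not the published pipeline: noun-chunk detection is replaced by a hypothetical hand-written mapping from a chunk to more general terms, matching is plain substring replacement, and the concept names are made up.

```python
from typing import Dict, List, Set, Tuple


def suggest_missing_is_a(
    concept_name: str,
    generalizations: Dict[str, List[str]],  # hypothetical: chunk -> broader terms
    existing_names: Set[str],               # names in the same source terminology
) -> List[Tuple[str, str]]:
    """Return (child, candidate parent) pairs whose parent name already exists."""
    suggestions: List[Tuple[str, str]] = []
    for chunk, broader_terms in generalizations.items():
        if chunk not in concept_name:
            continue
        for broader in broader_terms:
            candidate = concept_name.replace(chunk, broader)
            if candidate != concept_name and candidate in existing_names:
                suggestions.append((concept_name, candidate))
    return suggestions


# Illustrative (made-up) inputs.
pairs = suggest_missing_is_a(
    "fracture of femur",
    {"femur": ["bone", "limb"]},
    {"fracture of bone", "fracture of femur"},
)
```

A suggested pair still has to be checked against the existing hierarchy, since only pairs not already linked by IS-A would count as potentially missing relations.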
Affiliation(s)
- Fengbo Zheng
  - Department of Computer Science, University of Kentucky, Lexington, Kentucky, USA
- Jay Shi
  - Department of Internal Medicine, University of Kentucky, Lexington, Kentucky, USA
- Yuntao Yang
  - School of Biomedical Informatics, University of Texas Health Science Center at Houston, Houston, Texas, USA
- W Jim Zheng
  - School of Biomedical Informatics, University of Texas Health Science Center at Houston, Houston, Texas, USA
- Licong Cui
  - School of Biomedical Informatics, University of Texas Health Science Center at Houston, Houston, Texas, USA
9. Zheng L, He Z, Wei D, Keloth V, Fan JW, Lindemann L, Zhu X, Cimino JJ, Perl Y. A review of auditing techniques for the Unified Medical Language System. J Am Med Inform Assoc 2020; 27:1625-1638. [PMID: 32766692 PMCID: PMC7566540 DOI: 10.1093/jamia/ocaa108] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Received: 02/10/2020] [Revised: 05/05/2020] [Accepted: 05/13/2020] [Indexed: 11/12/2022]
Abstract
OBJECTIVE The study sought to describe the literature related to the development of methods for auditing the Unified Medical Language System (UMLS), with particular attention to identifying errors and inconsistencies of attributes of the concepts in the UMLS Metathesaurus. MATERIALS AND METHODS We applied the PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) approach by searching the MEDLINE database and Google Scholar for studies referencing the UMLS and any of several terms related to auditing, error detection, and quality assurance. A qualitative analysis and summarization of articles that met inclusion criteria were performed. RESULTS Eighty-three studies were reviewed in detail. We first categorized techniques based on various aspects including concepts, concept names, and synonymy (n = 37), semantic type assignments (n = 36), hierarchical relationships (n = 24), lateral relationships (n = 12), ontology enrichment (n = 8), and ontology alignment (n = 18). We also categorized the methods according to their level of automation (ie, automated systematic, automated heuristic, or manual) and the type of knowledge used (ie, intrinsic or extrinsic knowledge). CONCLUSIONS This study is a comprehensive review of the published methods for auditing the various conceptual aspects of the UMLS. Categorizing the auditing techniques according to the various aspects will give the curators of the UMLS, as well as researchers, comprehensive and easy access to this wealth of knowledge (eg, for auditing lateral relationships in the UMLS). We also reviewed ontology enrichment and alignment techniques due to their critical use of and impact on the UMLS.
Affiliation(s)
- Ling Zheng
  - Department of Computer Science and Software Engineering, Monmouth University, West Long Branch, New Jersey, USA
- Zhe He
  - School of Information, Florida State University, Tallahassee, Florida, USA
- Duo Wei
  - School of Business, Stockton University, Galloway, New Jersey, USA
- Vipina Keloth
  - Department of Computer Science, New Jersey Institute of Technology, Newark, New Jersey, USA
- Jung-Wei Fan
  - Division of Digital Health Sciences, Department of Health Sciences Research, Mayo Clinic, Rochester, Minnesota, USA
- Luke Lindemann
  - Center for Biomedical Data Science, Yale School of Medicine, New Haven, Connecticut, USA
- Xinxin Zhu
  - Center for Biomedical Data Science, Yale School of Medicine, New Haven, Connecticut, USA
- James J Cimino
  - Informatics Institute, University of Alabama at Birmingham, Birmingham, Alabama, USA
- Yehoshua Perl
  - Department of Computer Science, New Jersey Institute of Technology, Newark, New Jersey, USA
10. Keloth VK, He Z, Elhanan G, Geller J. Alternative classification of identical concepts in different terminologies: Different ways to view the world. J Biomed Inform 2019; 94:103193. [PMID: 31048072 DOI: 10.1016/j.jbi.2019.103193] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Received: 01/11/2019] [Revised: 04/17/2019] [Accepted: 04/28/2019] [Indexed: 10/26/2022]
Abstract
In previous research, we have studied concepts that occur in pairs of medical terminologies and are known to be identical, because they have the same ID number in the Unified Medical Language System (UMLS). We observed that such concepts rarely have exactly the same sets of children (=subconcepts) in the two terminologies. The number of common children was found to vary widely. A special situation was identified where the children in one terminology relate to the common parent in a very different way than the children in the other terminology. For example, the children might subdivide a parent concept by anatomical location in one terminology and by disease kind in the other. We coined the term "alternative classification" (of the same parent concept) for such situations. In previous work, only human experts could recognize alternative classifications. In this paper, we present a mathematically expressed criterion for likely cases of alternative classifications. We compare the recommendations of this criterion, expressed by a mathematical quantity called "EFI" becoming zero, with the decisions of a human expert. It is found that the human expert agreed with the criterion in 72% of all cases, which is a big improvement over having no computable criterion at all. Besides alternative classifications, common parent concepts in a pair of terminologies might also indicate a possible import of a child concept missing in one terminology, different granularities, or errors in either one of the two terminologies. In this paper, we further investigate different kinds of alternative classifications.
Affiliation(s)
- Zhe He
  - Florida State University, Tallahassee, FL, USA
- Gai Elhanan
  - Desert Research Institute, Reno, Nevada, USA
- James Geller
  - New Jersey Institute of Technology, Newark, NJ, USA
11. He Z, Keloth VK, Chen Y, Geller J. Extended Analysis of Topological-Pattern-Based Ontology Enrichment. Proceedings. IEEE International Conference on Bioinformatics and Biomedicine 2019; 2018:1641-1648. [PMID: 30854243 DOI: 10.1109/bibm.2018.8621564] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Indexed: 11/08/2022]
Abstract
Maintenance of biomedical ontologies is difficult. We have previously developed a topological-pattern-based method to deal with the problem of identifying concepts in a reference ontology that could be of interest for insertion into a target ontology. Assuming that both ontologies are parts of the Unified Medical Language System (UMLS), the method suggests approximate locations where the target ontology could be extended with new concepts from the reference ontology. However, the final decision about each concept has to be made by a human expert. In this paper, we describe the universe of cross-ontology topological patterns in quantitative terms. We then present a theoretical analysis of the number of potential placements of reference concepts in a path in a target ontology, allowing for new cross-ontology synonyms. This provides a rough estimate of what expert resources need to be allocated for the task. One insight in previous work on this topic was the large percentage of cases where importing concepts was impossible, due to a configuration called "alternative classification." In this paper, we confirm this observation. Our target ontology is the National Cancer Institute thesaurus (NCIt). However, the methods can be applied to other pairs of ontologies with hierarchical relationships from the UMLS.
Affiliation(s)
- Zhe He
  - School of Information, Florida State University, Tallahassee, Florida, USA
- Yan Chen
  - Department of Computer Information Systems, BMCC, CUNY, New York, NY, USA
- James Geller
  - Department of Computer Science, New Jersey Institute of Technology, Newark, NJ, USA
12. Prosperi M, Min JS, Bian J, Modave F. Big data hurdles in precision medicine and precision public health. BMC Med Inform Decis Mak 2018; 18:139. [PMID: 30594159 PMCID: PMC6311005 DOI: 10.1186/s12911-018-0719-2] [Citation(s) in RCA: 87] [Impact Index Per Article: 14.5] [Received: 05/15/2018] [Accepted: 12/04/2018] [Indexed: 12/18/2022]
Abstract
BACKGROUND Nowadays, trendy research in biomedical sciences juxtaposes the term 'precision' to medicine and public health with companion words like big data, data science, and deep learning. Technological advancements permit the collection and merging of large heterogeneous datasets from different sources, from genome sequences to social media posts or from electronic health records to wearables. Additionally, complex algorithms supported by high-performance computing allow one to transform these large datasets into knowledge. Despite such progress, many barriers still exist against achieving precision medicine and precision public health interventions for the benefit of the individual and the population. MAIN BODY The present work focuses on analyzing both the technical and societal hurdles related to the development of prediction models of health risks, diagnoses and outcomes from integrated biomedical databases. Methodological challenges that need to be addressed include improving semantics of study designs: medical record data are inherently biased, and even the most advanced deep learning's denoising autoencoders cannot overcome the bias if not handled a priori by design. Societal challenges to face include evaluation of ethically actionable risk factors at the individual and population level; for instance, usage of gender, race, or ethnicity as risk modifiers, not as biological variables, could be replaced by modifiable environmental proxies such as lifestyle and dietary habits, household income, or access to educational resources. CONCLUSIONS Data science for precision medicine and public health warrants an informatics-oriented formalization of the study design and interoperability throughout all levels of the knowledge inference process, from the research semantics, to model development, and ultimately to implementation.
Affiliation(s)
- Mattia Prosperi
- Department of Epidemiology, College of Medicine & College of Public Health and Health Professions, University of Florida, Gainesville, FL, 32610, USA
- Jae S Min
- Department of Epidemiology, College of Medicine & College of Public Health and Health Professions, University of Florida, Gainesville, FL, 32610, USA
- Jiang Bian
- Department of Health Outcomes and Biomedical Informatics, College of Medicine, University of Florida, Gainesville, FL, 32610, USA
- François Modave
- Center for Health Outcomes and Informatics Research, Loyola University Chicago, Maywood, IL, 60153, USA
13
Keloth VK, He Z, Chen Y, Geller J. Leveraging Horizontal Density Differences between Ontologies to Identify Missing Child Concepts: A Proof of Concept. AMIA ... ANNUAL SYMPOSIUM PROCEEDINGS. AMIA SYMPOSIUM 2018; 2018:644-653. [PMID: 30815106 PMCID: PMC6371323] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/09/2023]
Abstract
Previously, we investigated pairs of ontologies with local similarities where corresponding "is-a" paths are of different lengths. This indicated the possibility of importing concepts from one ontology into the other. We referred to such structures as diamonds of concepts. In this paper, we address the question whether pairs of identical concepts in pairs of ontologies have the same children in both. Separate reviews of SNOMED CT and NCIt relative to eight other ontologies uncovered differences in child sets. We provide quantitative data concerning these differences. In cases where there are many identical children in two ontologies, the questions arise why one has more children and whether these children are "missing" in the other ontology. We performed randomized controlled trials in which a human expert evaluated the "fit for import" of such potentially missing child concepts. In two out of four studies, statistical significance was achieved in support of algorithmic import.
Affiliation(s)
- Zhe He
- Florida State University, Tallahassee, Florida, USA
- Yan Chen
- Borough of Manhattan Community College, New York, NY, USA
- James Geller
- New Jersey Institute of Technology, Newark, New Jersey, USA
14
Quality assurance of biomedical terminologies and ontologies. J Biomed Inform 2018; 86:106-108. [PMID: 30205171 DOI: 10.1016/j.jbi.2018.09.006] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.0] [Received: 09/04/2018] [Accepted: 09/07/2018] [Indexed: 11/22/2022]
15
El-Sappagh S, Franda F, Ali F, Kwak KS. SNOMED CT standard ontology based on the ontology for general medical science. BMC Med Inform Decis Mak 2018; 18:76. [PMID: 30170591 PMCID: PMC6119323 DOI: 10.1186/s12911-018-0651-5] [Citation(s) in RCA: 29] [Impact Index Per Article: 4.8] [Received: 09/20/2017] [Accepted: 07/31/2018] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Systematized Nomenclature of Medicine-Clinical Terms (SNOMED CT, hereafter abbreviated SCT) is a comprehensive medical terminology used for standardizing the storage, retrieval, and exchange of electronic health data. Some efforts have been made to capture the contents of SCT as Web Ontology Language (OWL), but these efforts have been hampered by the size and complexity of SCT. METHOD Our proposal here is to develop an upper-level ontology and to use it as the basis for defining the terms in SCT in a way that will support quality assurance of SCT, for example, by allowing consistency checks of definitions and the identification and elimination of redundancies in the SCT vocabulary. Our proposed upper-level SCT ontology (SCTO) is based on the Ontology for General Medical Science (OGMS). RESULTS The SCTO is implemented in OWL 2, to support automatic inference and consistency checking. The approach will allow integration of SCT data with data annotated using Open Biomedical Ontologies (OBO) Foundry ontologies, since the use of OGMS will ensure consistency with the Basic Formal Ontology, which is the top-level ontology of the OBO Foundry. Currently, the SCTO contains 304 classes, 28 properties, 2400 axioms, and 1555 annotations. It is publicly available through the bioportal at http://bioportal.bioontology.org/ontologies/SCTO/. CONCLUSION The resulting ontology can enhance the semantics of clinical decision support systems and semantic interoperability among distributed electronic health records. In addition, the populated ontology can be used for the automation of mobile health applications.
Affiliation(s)
- Shaker El-Sappagh
- Information Systems Department, Faculty of Computers and Informatics, Benha University, Banha, Egypt
- Department of Information and Communication Engineering, Inha University, Incheon, South Korea
- Francesco Franda
- Department of Philosophy, University at Buffalo, Buffalo, NY, USA
- Farman Ali
- Department of Information and Communication Engineering, Inha University, Incheon, South Korea
- Kyung-Sup Kwak
- Department of Information and Communication Engineering, Inha University, Incheon, South Korea
16
Amith M, He Z, Bian J, Lossio-Ventura JA, Tao C. Assessing the practice of biomedical ontology evaluation: Gaps and opportunities. J Biomed Inform 2018; 80:1-13. [PMID: 29462669 PMCID: PMC5882531 DOI: 10.1016/j.jbi.2018.02.010] [Citation(s) in RCA: 42] [Impact Index Per Article: 7.0] [Received: 08/31/2017] [Revised: 02/12/2018] [Accepted: 02/16/2018] [Indexed: 11/26/2022]
Abstract
With the proliferation of heterogeneous health care data in the last three decades, biomedical ontologies and controlled biomedical terminologies play a more and more important role in knowledge representation and management, data integration, natural language processing, as well as decision support for health information systems and biomedical research. Biomedical ontologies and controlled terminologies are intended to assure interoperability. Nevertheless, the quality of biomedical ontologies has hindered their applicability and subsequent adoption in real-world applications. Ontology evaluation is an integral part of ontology development and maintenance. In the biomedicine domain, ontology evaluation is often conducted by third parties as a quality assurance (or auditing) effort that focuses on identifying modeling errors and inconsistencies. In this work, we first organized four categorical schemes of ontology evaluation methods in the existing literature to create an integrated taxonomy. Further, to understand the ontology evaluation practice in the biomedicine domain, we reviewed a sample of 200 ontologies from the National Center for Biomedical Ontology (NCBO) BioPortal (the largest repository for biomedical ontologies) and observed that only 15 of these ontologies have documented evaluation in their corresponding inception papers. We then surveyed the recent quality assurance approaches for biomedical ontologies and their use. We also mapped these quality assurance approaches to the ontology evaluation criteria. It is our anticipation that ontology evaluation and quality assurance approaches will be more widely adopted in the development life cycle of biomedical ontologies.
Affiliation(s)
- Muhammad Amith
- School of Biomedical Informatics, University of Texas Health Science Center at Houston, Houston, TX, USA
- Zhe He
- School of Information, Florida State University, Tallahassee, FL, USA
- Jiang Bian
- Department of Health Outcomes and Biomedical Informatics, University of Florida, Gainesville, FL, USA
- Cui Tao
- School of Biomedical Informatics, University of Texas Health Science Center at Houston, Houston, TX, USA
17
Evaluating the granularity balance of hierarchical relationships within large biomedical terminologies towards quality improvement. J Biomed Inform 2017; 75:129-137. [PMID: 28987379 DOI: 10.1016/j.jbi.2017.10.001] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.9] [Received: 05/13/2017] [Revised: 09/15/2017] [Accepted: 10/02/2017] [Indexed: 11/21/2022]
Abstract
Organizing the descendants of a concept under a particular semantic relationship may be rather arbitrarily carried out during the manual creation processes of large biomedical terminologies, resulting in imbalances in relationship granularity. This work aims to propose scalable models towards systematically evaluating the granularity balance of semantic relationships. We first utilize "parallel concepts set (PCS)" and two features (the length and the strength) of the paths between PCSs to design the general evaluation models, based on which we propose eight concrete evaluation models generated by two specific types of PCSs: single concept set and symmetric concepts set. We then apply those concrete models to the IS-A relationship in FMA and SNOMED CT's Body Structure subset, as well as to the Part-Of relationship in FMA. Moreover, without loss of generality, we conduct two additional rounds of applications on the Part-Of relationship after removing length redundancies and strength redundancies sequentially. At last, we perform automatic evaluation on the imbalances detected after the final round for identifying missing concepts, misaligned relations and inconsistencies. For the IS-A relationship, 34 missing concepts, 80 misalignments and 18 redundancies in FMA as well as 28 missing concepts, 114 misalignments and 1 redundancy in SNOMED CT were uncovered. In addition, 6,801 instances of imbalances for the Part-Of relationship in FMA were also identified, including 3,246 redundancies. After removing those redundancies from FMA, the total number of Part-Of imbalances was dramatically reduced to 327, including 51 missing concepts, 294 misaligned relations, and 36 inconsistencies. Manual curation performed by the FMA project leader confirmed the effectiveness of our method in identifying curation errors. 
In conclusion, the granularity balance of hierarchical semantic relationship is a valuable property to check for ontology quality assurance, and the scalable evaluation models proposed in this study are effective in fulfilling this task, especially in auditing relationships with sub-hierarchies, such as the seldom evaluated Part-Of relationship.
18
He Z, Chen Y, de Coronado S, Piskorski K, Geller J. Topological-Pattern-Based Recommendation of UMLS Concepts for National Cancer Institute Thesaurus. AMIA ... ANNUAL SYMPOSIUM PROCEEDINGS. AMIA SYMPOSIUM 2017; 2016:618-627. [PMID: 28269858 PMCID: PMC5333219] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/06/2023]
Abstract
The National Cancer Institute Thesaurus (NCIt) is a reference terminology used to support clinical, translational and basic research as well as administrative activities. As medical knowledge evolves, concepts that might be missing from a particular needed subdomain are regularly added to the NCIt. However, terminology development is known to be labor-intensive and error-prone. Therefore, cost-effective semi-automated methods for identifying potentially missing concepts would be useful to terminology curators. Previously, we have developed a structural method leveraging the native term mappings of the Unified Medical Language System to identify potential concepts in several of its source vocabularies to enrich the SNOMED CT. In this paper, we tested an analogous method for NCIt. Concepts from eight UMLS source terminologies were identified as possibilities to enrich NCIt's conceptual content.
Affiliation(s)
- Zhe He
- School of Information, Florida State University, Tallahassee, FL; Institute for Successful Longevity, Florida State University, Tallahassee, FL
- Yan Chen
- Department of Computer Information Systems, Borough of Manhattan Community College, City University of New York, New York, NY
- James Geller
- Department of Computer Science, New Jersey Institute of Technology, Newark, NJ
19
He Z, Chen Y, Geller J. Perceiving the Usefulness of the National Cancer Institute Metathesaurus for Enriching NCIt with Topological Patterns. Stud Health Technol Inform 2017; 245:863-867. [PMID: 29295222 PMCID: PMC5785238] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/19/2022]
Abstract
The National Cancer Institute Thesaurus (NCIt), developed and maintained by the National Cancer Institute, is an important reference terminology in the cancer domain. As a controlled terminology needs to continuously incorporate new concepts to enrich its conceptual content, automated and semi-automated methods for identifying potential new concepts are in high demand. We have previously developed a topological-pattern-based method for identifying new concepts in a controlled terminology to enrich another terminology, using the UMLS Metathesaurus. In this work, we utilize this method with the National Cancer Institute Metathesaurus to identify new concepts for NCIt. While previous work was only oriented towards identifying candidate import concepts for human review, we are now also adding an algorithmic method to evaluate candidate concepts and reject a well defined group of them.
Affiliation(s)
- Zhe He
- School of Information, Florida State University, Tallahassee, FL, USA
- Yan Chen
- Department of Computer Information Systems, BMCC, City University of New York, New York, USA
- James Geller
- Department of Computer Science, New Jersey Institute of Technology, Newark, NJ, USA
20
Park MS, He Z, Chen Z, Oh S, Bian J. Consumers' Use of UMLS Concepts on Social Media: Diabetes-Related Textual Data Analysis in Blog and Social Q&A Sites. JMIR Med Inform 2016; 4:e41. [PMID: 27884812 PMCID: PMC5146325 DOI: 10.2196/medinform.5748] [Citation(s) in RCA: 26] [Impact Index Per Article: 3.3] [Received: 03/10/2016] [Revised: 08/02/2016] [Accepted: 10/22/2016] [Indexed: 11/24/2022] Open
Abstract
Background The widely known terminology gap between health professionals and health consumers hinders effective information seeking for consumers. Objective The aim of this study was to better understand consumers’ usage of medical concepts by evaluating the coverage of concepts and semantic types of the Unified Medical Language System (UMLS) on diabetes-related postings in 2 types of social media: blogs and social question and answer (Q&A). Methods We collected 2 types of social media data: (1) a total of 3711 blogs tagged with “diabetes” on Tumblr posted between February and October 2015; and (2) a total of 58,422 questions and associated answers posted between 2009 and 2014 in the diabetes category of Yahoo! Answers. We analyzed the datasets using a widely adopted biomedical text processing framework Apache cTAKES and its extension YTEX. First, we applied the named entity recognition (NER) method implemented in YTEX to identify UMLS concepts in the datasets. We then analyzed the coverage and the popularity of concepts in the UMLS source vocabularies across the 2 datasets (ie, blogs and social Q&A). Further, we conducted a concept-level comparative coverage analysis between SNOMED Clinical Terms (SNOMED CT) and Open-Access Collaborative Consumer Health Vocabulary (OAC CHV)—the top 2 UMLS source vocabularies that have the most coverage on our datasets. We also analyzed the UMLS semantic types that were frequently observed in our datasets. Results We identified 2415 UMLS concepts from blog postings, 6452 UMLS concepts from social Q&A questions, and 10,378 UMLS concepts from the answers. The medical concepts identified in the blogs can be covered by 56 source vocabularies in the UMLS, while those in questions and answers can be covered by 58 source vocabularies. SNOMED CT was the dominant vocabulary in terms of coverage across all the datasets, ranging from 84.9% to 95.9%. 
It was followed by OAC CHV (between 73.5% and 80.0%) and Metathesaurus Names (MTH) (between 55.7% and 73.5%). All of the social media datasets shared frequent semantic types such as “Amino Acid, Peptide, or Protein,” “Body Part, Organ, or Organ Component,” and “Disease or Syndrome.” Conclusions Although the 3 social media datasets vary greatly in size, they exhibited similar conceptual coverage among UMLS source vocabularies and the identified concepts showed similar semantic type distributions. As such, concepts that are both frequently used by consumers and also found in professional vocabularies such as SNOMED CT can be suggested to OAC CHV to improve its coverage.
Affiliation(s)
- Min Sook Park
- School of Information, Florida State University, Tallahassee, FL, United States
- Zhe He
- School of Information, Florida State University, Tallahassee, FL, United States; Institute for Successful Longevity, Florida State University, Tallahassee, FL, United States
- Zhiwei Chen
- Department of Computer Science, Florida State University, Tallahassee, FL, United States
- Sanghee Oh
- School of Information, Florida State University, Tallahassee, FL, United States
- Jiang Bian
- Department of Health Outcomes and Policy, University of Florida, Gainesville, FL, United States
21
He Z, Geller J. Preliminary Analysis of Difficulty of Importing Pattern-Based Concepts into the National Cancer Institute Thesaurus. Stud Health Technol Inform 2016; 228:389-93. [PMID: 27577410 PMCID: PMC5785234] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/01/2022]
Abstract
Maintenance of biomedical ontologies is difficult. We have developed a pattern-based method for dealing with the problem of identifying missing concepts in the National Cancer Institute thesaurus (NCIt). Specifically, we are mining patterns connecting NCIt concepts with concepts in other ontologies to identify candidate missing concepts. However, the final decision about a concept insertion is always up to a human ontology curator. In this paper, we are estimating the difficulty of this task for a domain expert by counting possible choices for a pattern-based insertion. We conclude that even with support of our mining algorithm, the insertion task is challenging.
Affiliation(s)
- Zhe He
- School of Information, Florida State University
- James Geller
- Department of Computer Science, New Jersey Institute of Technology. Corresponding author: CS Dept., NJIT, Newark NJ 07102, USA
22
Chandar P, Yaman A, Hoxha J, He Z, Weng C. Similarity-Based Recommendation of New Concepts to a Terminology. AMIA ... ANNUAL SYMPOSIUM PROCEEDINGS. AMIA SYMPOSIUM 2015; 2015:386-395. [PMID: 26958170 PMCID: PMC4765685] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/05/2023]
Abstract
Terminologies can suffer from poor concept coverage due to delays in addition of new concepts. This study tests a similarity-based approach to recommending concepts from a text corpus to a terminology. Our approach involves extraction of candidate concepts from a given text corpus, which are represented using a set of features. The model learns the important features to characterize a concept and recommends new concepts to a terminology. Further, we propose a cost-effective evaluation methodology to estimate the effectiveness of terminology enrichment methods. To test our methodology, we use the clinical trial eligibility criteria free-text as an example text corpus to recommend concepts for SNOMED CT. We computed precision at various rank intervals to measure the performance of the methods. Results indicate that our automated algorithm is an effective method for concept recommendation.
Affiliation(s)
- Praveen Chandar
- Department of Biomedical Informatics, Columbia University, New York, NY, USA
- Anil Yaman
- Department of Biomedical Informatics, Columbia University, New York, NY, USA
- Julia Hoxha
- Department of Biomedical Informatics, Columbia University, New York, NY, USA
- Zhe He
- Department of Biomedical Informatics, Columbia University, New York, NY, USA
- Chunhua Weng
- Department of Biomedical Informatics, Columbia University, New York, NY, USA