1
Marcou Q, Berti-Equille L, Novelli N. Creating a computer assisted ICD coding system: Performance metric choice and use of the ICD hierarchy. J Biomed Inform 2024; 152:104617. [PMID: 38432534 DOI: 10.1016/j.jbi.2024.104617] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/11/2023] [Revised: 02/23/2024] [Accepted: 02/24/2024] [Indexed: 03/05/2024]
Abstract
OBJECTIVE Machine learning methods hold the promise of leveraging available data and generating higher-quality data while alleviating the data collection burden on healthcare professionals. International Classification of Diseases (ICD) diagnosis data, collected globally for billing and epidemiological purposes, represent a valuable source of structured information. However, ICD coding is a challenging task. While numerous previous studies reported promising results in automatic ICD classification, they often describe model architectures specific to their input data, which are evaluated heterogeneously with different performance metrics and ICD code subsets. This study aims to explore the evaluation and construction of more effective Computer Assisted Coding (CAC) systems using generic approaches, focusing on the use of the ICD hierarchy, medication data and a feed-forward neural network architecture. METHODS We conduct comprehensive experiments using the MIMIC-III clinical database, mapped to the OMOP data model. Our evaluations encompass various performance metrics, alongside investigations into multitask, hierarchical, and imbalanced learning for neural networks. RESULTS We introduce a novel metric tailored to the ICD coding task, which offers interpretable insights for healthcare informatics practitioners, aiding them in assessing the quality of assisted coding systems. Our findings highlight that selectively cherry-picking ICD codes diminishes retrieval performance without improving performance on the selected subset. We show that optimizing for metrics such as NDCG and AUPRC outperforms traditional F1-based metrics in ranking performance. We observe that neural network training on different ICD levels simultaneously offers minor benefits for ranking and significant runtime gains. However, our models do not derive benefits from hierarchical or class imbalance correction techniques for ICD code retrieval.
CONCLUSION This study offers valuable insights for researchers and healthcare practitioners interested in developing and evaluating CAC systems. Using a straightforward sequential neural network model, we confirm that medical prescriptions are a rich data source for CAC systems, providing competitive retrieval capabilities for a fraction of the computational load compared to text-based models. Our study underscores the importance of metric selection and challenges existing practices related to ICD code sub-setting for model training and evaluation.
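As an illustration of the ranking metrics named in this abstract, the following is a minimal sketch of NDCG@k with binary relevance, assuming a suggested code counts as relevant when it was actually assigned to the stay. The function names and the placeholder code strings are hypothetical, and this is not the authors' implementation or their novel metric.

```python
import math

def dcg_at_k(gains, k):
    """Discounted cumulative gain over the top-k positions (rank 1 is undiscounted)."""
    return sum(g / math.log2(rank + 2) for rank, g in enumerate(gains[:k]))

def ndcg_at_k(ranked_codes, true_codes, k):
    """NDCG@k with binary relevance: a suggested code is relevant if it was assigned."""
    gains = [1.0 if code in true_codes else 0.0 for code in ranked_codes]
    # Ideal ranking: all truly assigned codes placed at the top of the list.
    ideal_dcg = dcg_at_k([1.0] * len(true_codes), k)
    return dcg_at_k(gains, k) / ideal_dcg if ideal_dcg > 0 else 0.0

# Hypothetical example: the code strings are placeholders, not data from the study.
suggested = ["CODE_A", "CODE_B", "CODE_C"]
assigned = {"CODE_A", "CODE_C"}
score = ndcg_at_k(suggested, assigned, k=3)
```

A ranking that places every assigned code first scores 1.0; relevant codes pushed further down the list are discounted logarithmically.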
Affiliation(s)
- Quentin Marcou
- Aix-Marseille Université, Faculté des sciences médicales et paramédicales, Marseille, France; Aix-Marseille Université, UMR7020 CNRS, Laboratoire d'Informatique et Systèmes (LIS), Marseille, France.
- Noël Novelli
- Aix-Marseille Université, UMR7020 CNRS, Laboratoire d'Informatique et Systèmes (LIS), Marseille, France
2
Nath N, Lee SH, Lee I. Application of specialized word embeddings and named entity and attribute recognition to the problem of unsupervised automated clinical coding. Comput Biol Med 2023; 165:107422. [PMID: 37722157 DOI: 10.1016/j.compbiomed.2023.107422] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/01/2023] [Revised: 07/30/2023] [Accepted: 08/28/2023] [Indexed: 09/20/2023]
Abstract
Notes documented by clinicians, such as patient histories, hospital courses, lab reports and others, are often annotated with standardized clinical codes by medical coders to facilitate a variety of secondary processing applications such as billing and statistical analyses. Clinical coding, traditionally manual and labor-intensive, has seen a surge of interest from deep learning researchers seeking to automate it. However, deep learning methods require large volumes of annotated clinical data for training and offer little to explain why codes were assigned to pieces of text. In this paper, we propose an unsupervised method which does not need annotated clinical text and is fully interpretable, by using Named Entity and Attribute Recognition and word embeddings specialized for the clinical domain. These methods successfully glean important information from large volumes of clinical notes and encode it effectively in order to perform automatic clinical coding.
Affiliation(s)
- Namrata Nath
- UniSA STEM, University of South Australia, GPO Box 2471, Adelaide, SA, 5001, Australia.
- Sang-Heon Lee
- UniSA STEM, University of South Australia, Adelaide, Australia
- Ivan Lee
- UniSA STEM, University of South Australia, Adelaide, Australia
3
Niu K, Wu Y, Li Y, Li M. Retrieve and rerank for automated ICD coding via Contrastive Learning. J Biomed Inform 2023:104396. [PMID: 37211195 DOI: 10.1016/j.jbi.2023.104396] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 11/10/2022] [Revised: 04/22/2023] [Accepted: 05/15/2023] [Indexed: 05/23/2023]
Abstract
Automated ICD coding is a multi-label prediction task that aims to assign the most relevant subset of disease codes to a patient's diagnoses. In the deep learning regime, recent works have suffered from a large label set and a heavily imbalanced label distribution. To mitigate the negative effect in such scenarios, we propose a retrieve and rerank framework that introduces Contrastive Learning (CL) for label retrieval, allowing the model to make more accurate predictions from a simplified label space. Given the appealing discriminative power of CL, we adopt it as the training strategy to replace the standard cross-entropy objective and retrieve a small subset by taking the distance between clinical notes and ICD codes into account. After proper training, the retriever can implicitly capture code co-occurrence, which makes up for the deficiency of cross-entropy in assigning each label independently of the others. Further, we develop a powerful model based on a Transformer variant for refining and reranking the candidate set, which can extract semantically meaningful features from long clinical sequences. Experiments applying our method to well-known models show that our framework provides more accurate results by preselecting a small subset of candidates before fine-level reranking. Relying on the framework, our proposed model achieves 0.590 and 0.990 in terms of Micro-F1 and Micro-AUC on the MIMIC-III benchmark.
Affiliation(s)
- Kunying Niu
- School of Computer Science and Engineering, Central South University, Changsha, 410083, China.
- Yifan Wu
- School of Computer Science and Engineering, Central South University, Changsha, 410083, China.
- Yaohang Li
- Department of Computer Science, Old Dominion University, Norfolk, USA.
- Min Li
- School of Computer Science and Engineering, Central South University, Changsha, 410083, China.
4
Deng Y, Denecke K. Classification of user queries according to a hierarchical medical procedure encoding system using an ensemble classifier. Front Artif Intell 2022; 5:1000283. [DOI: 10.3389/frai.2022.1000283] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/22/2022] [Accepted: 10/10/2022] [Indexed: 11/06/2022] Open
Abstract
The Swiss classification of surgical interventions (CHOP) has to be used in daily practice by physicians to classify clinical procedures. Its purpose is to encode the delivered healthcare services for the sake of quality assurance and billing. To encode a procedure, a code of at most 6 digits has to be selected from the classification system, which is currently realized by a rule-based system composed of encoding experts and a manual search in the CHOP catalog. In this paper, we investigate the possibility of automatic CHOP code generation based on a short query to enable automatic support of manual classification. The wide and deep hierarchy of CHOP and the differences between text used in queries and catalog descriptions are two apparent obstacles for training and deploying a learning-based algorithm. Because of these challenges, there is a need for an appropriate classification approach. We evaluate different strategies (multi-class non-terminal and per-node classifications) with different configurations so that a flexible modular solution with high accuracy and efficiency can be provided. The results clearly show that the per-node binary classification outperforms the non-terminal multi-class classification with an F1-micro measure between 92.6 and 94%. The hierarchical prediction based on per-node binary classifiers achieved a high exact match by the single code assignment on the 5-fold cross-validation. In conclusion, the hierarchical context from the CHOP encoding can be employed by both classifier training and representation learning. The hierarchical features have all shown improvement in the classification performances under different configurations: the stacked autoencoder and training examples aggregation using true path rules, as well as the unified vocabulary space, have largely increased the utility of hierarchical features. Additionally, the threshold adaption through Bayesian aggregation has largely increased the vertical reachability of the per-node classification. All the trainable nodes can be triggered after the threshold adaption, while the F1 measures at code levels 3–6 have been increased from 6 to 89% after the threshold adaption.
5
Choi S, Joo HJ, Kim Y, Kim JH, Seok J. Conversion of Automated 12-Lead Electrocardiogram Interpretations to OMOP CDM Vocabulary. Appl Clin Inform 2022; 13:880-890. [PMID: 36130711 PMCID: PMC9492322 DOI: 10.1055/s-0042-1756427] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/11/2022] Open
Abstract
Background
A computerized 12-lead electrocardiogram (ECG) can automatically generate diagnostic statements, which are helpful for clinical purposes. Standardization is required for big data analysis when using ECG data generated by different interpretation algorithms. The common data model (CDM) is a standard schema designed to overcome heterogeneity between medical data. Diagnostic statements usually contain multiple CDM concepts and also include non-essential noise information, which should be removed during CDM conversion. Existing CDM conversion tools have several limitations, such as the requirement for manual validation, inability to extract multiple CDM concepts, and inadequate noise removal.
Objectives
We aim to develop a fully automated text data conversion algorithm that overcomes limitations of existing tools and manual conversion.
Methods
We used interpretations printed by 12-lead resting ECG tests from three different vendors: GE Medical Systems, Philips Medical Systems, and Nihon Kohden. For automatic mapping, we first constructed an ontology-lexicon of ECG interpretations. After clinical coding, an optimized tool for converting ECG interpretation to CDM terminology is developed using term-based text processing.
Results
Using the ontology-lexicon, the cosine similarity-based algorithm and rule-based hierarchical algorithm showed comparable conversion accuracy (97.8 and 99.6%, respectively), while an integrated algorithm based on a heuristic approach, ECG2CDM, demonstrated superior performance (99.9%) for datasets from three major vendors.
Conclusion
We developed user-friendly software that runs the ECG2CDM algorithm and is easy to use even for users who are not familiar with CDM or medical terminology. We propose that automated algorithms can be helpful for further big data analysis with an integrated and standardized ECG dataset.
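The cosine similarity-based mapping mentioned in the Results can be pictured with a minimal sketch, assuming a plain bag-of-words representation and a tiny hypothetical lexicon. This is not the ECG2CDM algorithm or the authors' ontology-lexicon; the function names and lexicon terms are illustrative only.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two strings using lowercase word counts."""
    ta, tb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(ta[t] * tb[t] for t in ta)
    norm_a = math.sqrt(sum(v * v for v in ta.values()))
    norm_b = math.sqrt(sum(v * v for v in tb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def best_match(statement, lexicon):
    """Return the lexicon term most similar to a diagnostic statement."""
    return max(lexicon, key=lambda term: cosine(statement, term))

# Hypothetical lexicon terms, chosen for illustration only.
lexicon = ["sinus rhythm", "atrial fibrillation", "left bundle branch block"]
match = best_match("Atrial fibrillation with rapid response", lexicon)
```

A single best match cannot extract several concepts from one statement, which is one of the limitations the abstract attributes to existing conversion tools.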
Affiliation(s)
- Sunho Choi
- School of Electrical Engineering, Korea University, Seoul, South Korea
- Hyung Joon Joo
- Korea University Research Institute for Medical Bigdata Science, Korea University, Seoul, South Korea; Department of Cardiology, Cardiovascular Center, Korea University College of Medicine, Seoul, South Korea
- Yoojoong Kim
- School of Computer Science and Information Engineering, The Catholic University of Korea, Seoul, South Korea
- Jong-Ho Kim
- Korea University Research Institute for Medical Bigdata Science, Korea University, Seoul, South Korea; Department of Cardiology, Cardiovascular Center, Korea University College of Medicine, Seoul, South Korea
- Junhee Seok
- School of Electrical Engineering, Korea University, Seoul, South Korea
6
All Patient Refined-Diagnosis Related Groups' (APR-DRGs) Severity of Illness and Risk of Mortality as predictors of in-hospital mortality. J Med Syst 2022; 46:37. [PMID: 35524075 DOI: 10.1007/s10916-022-01805-3] [Citation(s) in RCA: 8] [Impact Index Per Article: 4.0] [Received: 11/03/2021] [Accepted: 02/07/2022] [Indexed: 10/18/2022]
Abstract
The aims of this study were to assess All-Patient Refined Diagnosis-Related Groups' (APR-DRG) Severity of Illness (SOI) and Risk of Mortality (ROM) as predictors of in-hospital mortality, compared with Charlson Comorbidity Index (CCI) and Elixhauser Comorbidity Index (ECI) scores. We performed a retrospective observational study using mainland Portuguese public hospitalizations of adult patients from 2011 to 2016. Model discrimination (C-statistic/area under the curve) and goodness-of-fit (R-squared) were calculated. Our results comprised 4,176,142 hospitalizations with 5.9% in-hospital deaths. Compared to the CCI and ECI models, the model considering SOI, age and sex showed a statistically significantly higher discrimination in 49.6% (132 out of 266) of APR-DRGs, while in the model with ROM that happened in 33.5% of APR-DRGs. Between these two models, SOI was the best performer for nearly 20% of APR-DRGs. Some particular APR-DRGs showed good discrimination (e.g. those related to burns, viral meningitis or specific transplants). In conclusion, SOI or ROM, combined with age and sex, perform better than more widely used comorbidity indices. Despite ROM being the only score specifically designed for in-hospital mortality prediction, SOI performed better. These findings can be helpful for benchmarking hospital or organizational models or for epidemiological analysis.
7
Gutton J, Lin F, Billuart O, Lajonchère JP, Crubilié C, Sauvage C, Buronfosse A. [Artificial intelligence for medical information departments: construction and evaluation of a decision-making tool to identify and prioritize stays of which the PMSI coding could be optimized, and to ensure the revenues generated by activity-based pricing]. Rev Epidemiol Sante Publique 2022; 70:1-8. [PMID: 35027236 DOI: 10.1016/j.respe.2021.11.019] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/30/2020] [Revised: 03/11/2021] [Accepted: 11/22/2021] [Indexed: 11/18/2022] Open
Abstract
BACKGROUND Medical Information Departments help to optimize the hospital revenues generated by activity-based pricing. A review of medical files, selected after the targeting of coding summaries, is organized. The aim is to make any corrections to the diagnoses or coded procedures with a potential impact on the pricing of the stay. Targeting is of major importance as a means of concentrating resources on the files for which coding can be effectively improved. The tools available for targeting can be optimized. We have developed a decision-making support tool to make targeting more efficient. The objective of our study was to evaluate the performance of this tool. METHODS The tool combines an artificial intelligence module with a rule-based expert module. A predictive score is assigned to each coding summary that reflects the probability of a revalued stay. Evaluation of the performance of this tool was based on a sample of 400 stays of at least 3 nights of patients hospitalized at the Paris Saint-Joseph Hospital from 1st November to 31st December 2019. Each stay was reviewed by a coding expert, without knowledge of the score assigned and without help from expert queries. Two main assessment criteria were used: area under the ROC curve and positive predictive value (PPV). RESULTS The area under the ROC curve was 0.70 (95% CI [0.64-0.76]). With a revalued coding rate of 32%, PPV was 41% for scores above 5, 65% for scores above 8, and 88% for scores above 9. CONCLUSION The study made it possible to validate the performance of the tool. The implementation of new variables could further increase its performance. This is an area of development to be considered, particularly in view of generalizing individual invoicing in hospitals.
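The PPV reported at each score cut-off is the share of reviewed stays above the cut-off that were actually revalued. A minimal sketch of that calculation follows, assuming one predictive score and one expert outcome per stay; the function name and the toy values are hypothetical and are not data from the study.

```python
def ppv_above(scores, revalued, threshold):
    """Positive predictive value among stays whose score exceeds the threshold.

    Returns None when no stay is selected, since PPV is undefined in that case.
    """
    selected = [outcome for score, outcome in zip(scores, revalued) if score > threshold]
    if not selected:
        return None
    return sum(selected) / len(selected)

# Toy values for illustration only (not taken from the 400 reviewed stays).
scores = [2, 6, 9, 9.5, 7]
revalued = [False, True, True, True, False]
ppv_5 = ppv_above(scores, revalued, threshold=5)
```

Raising the threshold selects fewer stays for review but, as the reported figures suggest, a larger share of them turn out to be revalued.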
Affiliation(s)
- J Gutton
- Direction de l'information médicale du Groupe Hospitalier Paris Saint-Joseph, 185 rue Raymond Losserand, 75014 Paris, France.
- F Lin
- Direction de l'information médicale du Groupe Hospitalier Paris Saint-Joseph, 185 rue Raymond Losserand, 75014 Paris, France
- O Billuart
- Direction de l'information médicale du Groupe Hospitalier Paris Saint-Joseph, 185 rue Raymond Losserand, 75014 Paris, France
- J-P Lajonchère
- Direction du Groupe Hospitalier Paris Saint-Joseph, Paris, France
- C Crubilié
- Direction de l'information médicale du Groupe Hospitalier Paris Saint-Joseph, 185 rue Raymond Losserand, 75014 Paris, France
- C Sauvage
- Direction de l'information médicale du Groupe Hospitalier Paris Saint-Joseph, 185 rue Raymond Losserand, 75014 Paris, France
- A Buronfosse
- Direction de l'information médicale du Groupe Hospitalier Paris Saint-Joseph, 185 rue Raymond Losserand, 75014 Paris, France
8
A Deep Learning Based Approach to Automate Clinical Coding of Electronic Health Records. BIG DATA ANALYTICS 2022. [DOI: 10.1007/978-3-031-24094-2_7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/29/2023] Open
9
Millares Martin P. Consultation analysis: use of free text versus coded text. HEALTH AND TECHNOLOGY 2021; 11:349-357. [PMID: 33520588 PMCID: PMC7829039 DOI: 10.1007/s12553-020-00517-3] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 06/17/2020] [Accepted: 12/21/2020] [Indexed: 11/28/2022]
Abstract
General practice in the United Kingdom has been using electronic health records for over two decades, but coding of clinical information remains poor. Lack of interest and training are considerable barriers to improving levels of code use. Tailored training could be the way forward to break barriers in the uptake of coding; to do so, it is paramount to understand how particular clinicians use coding and to recognise their needs. It should be possible to easily assess text quantity and quality in medical consultations. A tool to measure these parameters, which could be used to tailor training needs and assess change, is demonstrated. The tool is presented, and a preliminary study using a randomised sample of five recent consultations from thirteen different clinicians is used as an example. The tool, based on a word processor and a spreadsheet, allowed quantitative analysis among clinicians, while word clouds permitted a qualitative comparison between coded and free text. The average amount of free text per consultation was 68.2 words (ranging from 25.4 to 130.2 among clinicians); an average of 6% of the text was coded (ranging from 0 to 13%). Patterns among clinicians could be identified. Using word clouds, a different text use was demonstrated depending on its purpose. Some free text could be turned into code, but nomenclature probably prevented some of the coding, such as the expression of time. This proof of concept demonstrated that it is possible to calculate what percentage of consultations are coded and what codes are used. This allowed an understanding of clinicians' preferences, training needs and gaps in nomenclature.
10
Suleiman M, Demirhan H, Boyd L, Girosi F, Aksakalli V. A clinical coding recommender system. Knowl Based Syst 2020. [DOI: 10.1016/j.knosys.2020.106455] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/16/2022]
11
Kersloot MG, van Putten FJP, Abu-Hanna A, Cornet R, Arts DL. Natural language processing algorithms for mapping clinical text fragments onto ontology concepts: a systematic review and recommendations for future studies. J Biomed Semantics 2020; 11:14. [PMID: 33198814 PMCID: PMC7670625 DOI: 10.1186/s13326-020-00231-z] [Citation(s) in RCA: 21] [Impact Index Per Article: 5.3] [Received: 07/17/2020] [Accepted: 11/03/2020] [Indexed: 12/23/2022] Open
Abstract
BACKGROUND Free-text descriptions in electronic health records (EHRs) can be of interest for clinical research and care optimization. However, free text cannot be readily interpreted by a computer and, therefore, has limited value. Natural Language Processing (NLP) algorithms can make free text machine-interpretable by attaching ontology concepts to it. However, implementations of NLP algorithms are not evaluated consistently. Therefore, the objective of this study was to review the current methods used for developing and evaluating NLP algorithms that map clinical text fragments onto ontology concepts. To standardize the evaluation of algorithms and reduce heterogeneity between studies, we propose a list of recommendations. METHODS Two reviewers examined publications indexed by Scopus, IEEE, MEDLINE, EMBASE, the ACM Digital Library, and the ACL Anthology. Publications reporting on NLP for mapping clinical text from EHRs to ontology concepts were included. Year, country, setting, objective, evaluation and validation methods, NLP algorithms, terminology systems, dataset size and language, performance measures, reference standard, generalizability, operational use, and source code availability were extracted. The studies' objectives were categorized by way of induction. These results were used to define recommendations. RESULTS Two thousand three hundred fifty five unique studies were identified. Two hundred fifty six studies reported on the development of NLP algorithms for mapping free text to ontology concepts. Seventy-seven described development and evaluation. Twenty-two studies did not perform a validation on unseen data and 68 studies did not perform external validation. Of 23 studies that claimed that their algorithm was generalizable, 5 tested this by external validation. 
A list of sixteen recommendations regarding the usage of NLP systems and algorithms, usage of data, evaluation and validation, presentation of results, and generalizability of results was developed. CONCLUSION We found many heterogeneous approaches to the reporting on the development and evaluation of NLP algorithms that map clinical text to ontology concepts. Over one-fourth of the identified publications did not perform an evaluation. In addition, over one-fourth of the included studies did not perform a validation, and 88% did not perform external validation. We believe that our recommendations, alongside an existing reporting standard, will increase the reproducibility and reusability of future studies and NLP algorithms in medicine.
Affiliation(s)
- Martijn G. Kersloot
- Amsterdam UMC, University of Amsterdam, Department of Medical Informatics, Amsterdam Public Health Research Institute Castor EDC, Room J1B-109, PO Box 22700, 1100 DE Amsterdam, The Netherlands
- Castor EDC, Amsterdam, The Netherlands
- Florentien J. P. van Putten
- Amsterdam UMC, University of Amsterdam, Department of Medical Informatics, Amsterdam Public Health Research Institute Castor EDC, Room J1B-109, PO Box 22700, 1100 DE Amsterdam, The Netherlands
- Ameen Abu-Hanna
- Amsterdam UMC, University of Amsterdam, Department of Medical Informatics, Amsterdam Public Health Research Institute Castor EDC, Room J1B-109, PO Box 22700, 1100 DE Amsterdam, The Netherlands
- Ronald Cornet
- Amsterdam UMC, University of Amsterdam, Department of Medical Informatics, Amsterdam Public Health Research Institute Castor EDC, Room J1B-109, PO Box 22700, 1100 DE Amsterdam, The Netherlands
- Derk L. Arts
- Amsterdam UMC, University of Amsterdam, Department of Medical Informatics, Amsterdam Public Health Research Institute Castor EDC, Room J1B-109, PO Box 22700, 1100 DE Amsterdam, The Netherlands
- Castor EDC, Amsterdam, The Netherlands
12
Sonabend W A, Cai W, Ahuja Y, Ananthakrishnan A, Xia Z, Yu S, Hong C. Automated ICD coding via unsupervised knowledge integration (UNITE). Int J Med Inform 2020; 139:104135. [PMID: 32361145 DOI: 10.1016/j.ijmedinf.2020.104135] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Received: 11/19/2019] [Revised: 02/14/2020] [Accepted: 03/26/2020] [Indexed: 12/30/2022]
Abstract
OBJECTIVE Accurate coding is critical for medical billing and electronic medical record (EMR)-based research. Recent research has focused on developing supervised methods to automatically assign International Classification of Diseases (ICD) codes from clinical notes. However, supervised approaches rely on ICD code data stored in the hospital EMR system and are subject to bias arising from practice and coding behavior. Consequently, portability of trained supervised algorithms to external EMR systems may suffer. METHOD We developed an unsupervised knowledge integration (UNITE) algorithm to automatically assign ICD codes for a specific disease by analyzing clinical narrative notes via semantic relevance assessment. The algorithm was validated using coded ICD data for 6 diseases from Partners HealthCare (PHS) Biobank and Medical Information Mart for Intensive Care (MIMIC-III). We compared the performance of UNITE against penalized logistic regression (LR), topic modeling, and neural network models within each EMR system. We additionally evaluated the portability of UNITE by training at PHS Biobank and validating at MIMIC-III, and vice versa. RESULTS UNITE achieved an averaged AUC of 0.91 at PHS and 0.92 at MIMIC over 6 diseases, comparable to LR and MLP. It had substantially better performance than topic models. With regard to portability, the performance of UNITE was consistent across different EMR systems and superior to LR, topic models and neural network models. CONCLUSION UNITE accurately assigns ICD codes in EMRs without requiring human labor, and has major advantages over commonly used machine learning approaches. In addition, UNITE attained stable performance and high portability across EMRs in different institutions.
Affiliation(s)
- Aaron Sonabend W
- Department of Biostatistics, Harvard T. H. Chan School of Public Health, Boston, MA, USA
- Yuri Ahuja
- Department of Biostatistics, Harvard T. H. Chan School of Public Health, Boston, MA, USA
- Ashwin Ananthakrishnan
- Division of Gastroenterology, Massachusetts General Hospital and Harvard Medical School, USA
- Zongqi Xia
- Department of Neurology and Biomedical Informatics, University of Pittsburgh, Pittsburgh, PA, USA
- Sheng Yu
- Center for Statistical Science, Tsinghua University, Beijing, China; Department of Industrial Engineering, Tsinghua University, Beijing, China; Institute for Data Science, Tsinghua University, Beijing, China
- Chuan Hong
- Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA.
13
Brasil S, Pascoal C, Francisco R, dos Reis Ferreira V, Videira PA, Valadão G. Artificial Intelligence (AI) in Rare Diseases: Is the Future Brighter? Genes (Basel) 2019; 10:genes10120978. [PMID: 31783696 PMCID: PMC6947640 DOI: 10.3390/genes10120978] [Citation(s) in RCA: 44] [Impact Index Per Article: 8.8] [Received: 09/30/2019] [Revised: 11/19/2019] [Accepted: 11/20/2019] [Indexed: 02/06/2023] Open
Abstract
The amount of data collected and managed in (bio)medicine is ever-increasing. Thus, there is a need to rapidly and efficiently collect, analyze, and characterize all this information. Artificial intelligence (AI), with an emphasis on deep learning, holds great promise in this area and is already being successfully applied to basic research, diagnosis, drug discovery, and clinical trials. Rare diseases (RDs), which are severely underrepresented in basic and clinical research, can particularly benefit from AI technologies. Of the more than 7000 RDs described worldwide, only 5% have a treatment. The ability of AI technologies to integrate and analyze data from different sources (e.g., multi-omics, patient registries, and so on) can be used to overcome RDs’ challenges (e.g., low diagnostic rates, reduced number of patients, geographical dispersion, and so on). Ultimately, RDs’ AI-mediated knowledge could significantly boost therapy development. Presently, there are AI approaches being used in RDs and this review aims to collect and summarize these advances. A section dedicated to congenital disorders of glycosylation (CDG), a particular group of orphan RDs that can serve as a potential study model for other common diseases and RDs, has also been included.
Affiliation(s)
- Sandra Brasil
- Portuguese Association for CDG, 2820-381 Lisboa, Portugal
- CDG & Allies—Professionals and Patient Associations International Network (CDG & Allies—PPAIN), Faculdade de Ciências e Tecnologia, Universidade NOVA de Lisboa, 2829-516 Lisboa, Portugal
- Carlota Pascoal
- Portuguese Association for CDG, 2820-381 Lisboa, Portugal
- CDG & Allies—Professionals and Patient Associations International Network (CDG & Allies—PPAIN), Faculdade de Ciências e Tecnologia, Universidade NOVA de Lisboa, 2829-516 Lisboa, Portugal
- UCIBIO, Departamento Ciências da Vida, Faculdade de Ciências e Tecnologia, Universidade NOVA de Lisboa, 2829-516 Lisboa, Portugal
- Rita Francisco
- Portuguese Association for CDG, 2820-381 Lisboa, Portugal
- CDG & Allies—Professionals and Patient Associations International Network (CDG & Allies—PPAIN), Faculdade de Ciências e Tecnologia, Universidade NOVA de Lisboa, 2829-516 Lisboa, Portugal
- UCIBIO, Departamento Ciências da Vida, Faculdade de Ciências e Tecnologia, Universidade NOVA de Lisboa, 2829-516 Lisboa, Portugal
- Vanessa dos Reis Ferreira
- Portuguese Association for CDG, 2820-381 Lisboa, Portugal
- CDG & Allies—Professionals and Patient Associations International Network (CDG & Allies—PPAIN), Faculdade de Ciências e Tecnologia, Universidade NOVA de Lisboa, 2829-516 Lisboa, Portugal
- Paula A. Videira
- Portuguese Association for CDG, 2820-381 Lisboa, Portugal
- CDG & Allies—Professionals and Patient Associations International Network (CDG & Allies—PPAIN), Faculdade de Ciências e Tecnologia, Universidade NOVA de Lisboa, 2829-516 Lisboa, Portugal
- UCIBIO, Departamento Ciências da Vida, Faculdade de Ciências e Tecnologia, Universidade NOVA de Lisboa, 2829-516 Lisboa, Portugal
- Gonçalo Valadão
- Instituto de Telecomunicações, 1049-001 Lisboa, Portugal
- Departamento de Ciências e Tecnologias, Autónoma Techlab–Universidade Autónoma de Lisboa, 1169-023 Lisboa, Portugal
- Electronics, Telecommunications and Computers Engineering Department, Instituto Superior de Engenharia de Lisboa, 1959-007 Lisboa, Portugal
14
Dhombres F, Charlet J. Formal Medical Knowledge Representation Supports Deep Learning Algorithms, Bioinformatics Pipelines, Genomics Data Analysis, and Big Data Processes. Yearb Med Inform 2019; 28:152-155. [PMID: 31419827 PMCID: PMC6697514 DOI: 10.1055/s-0039-1677933] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.6] [Indexed: 12/12/2022] Open
Abstract
OBJECTIVE To select, present, and summarize the best papers published in 2018 in the field of Knowledge Representation and Management (KRM). METHODS A comprehensive and standardized review of the medical informatics literature was performed to select the most interesting papers published in 2018 in KRM, based on PubMed and ISI Web Of Knowledge queries. RESULTS Four best papers were selected among the 962 publications retrieved following the Yearbook review process. The research areas in 2018 were mainly related to the ontology-based data integration for phenotype-genotype association mining, the design of ontologies and their application, and the semantic annotation of clinical texts. CONCLUSION In the KRM selection for 2018, research on semantic representations demonstrated their added value for enhanced deep learning approaches in text mining and for designing novel bioinformatics pipelines based on graph databases. In addition, the ontology structure can enrich the analyses of whole genome expression data. Finally, semantic representations demonstrated promising results to process phenotypic big data.
Affiliation(s)
- Ferdinand Dhombres
- Sorbonne Université, Université Paris 13, Sorbonne Paris Cité, INSERM, UMR_S 1142, LIMICS, Paris, France; Médecine Sorbonne Université, Service de Médecine Fœtale, AP-HP/HUEP, Hôpital Armand Trousseau, Paris, France
- Jean Charlet
- Sorbonne Université, Université Paris 13, Sorbonne Paris Cité, INSERM, UMR_S 1142, LIMICS, Paris, France; AP-HP, Delegation for Clinical Research and Innovation, Paris, France