1
Eyupoglu C, Karakuş O. Novel CAD Diagnosis Method Based on Search, PCA, and AdaBoostM1 Techniques. J Clin Med 2024; 13:2868. [PMID: 38792410 PMCID: PMC11122190 DOI: 10.3390/jcm13102868] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/24/2024] [Revised: 04/26/2024] [Accepted: 05/07/2024] [Indexed: 05/26/2024] Open
Abstract
Background: Cardiovascular diseases (CVDs) are the primary cause of mortality worldwide, resulting in a growing number of annual fatalities. Coronary artery disease (CAD) is one of the basic types of CVDs, and early diagnosis of CAD is crucial for convenient treatment and decreasing mortality rates. In the literature, several studies use many features for CAD diagnosis. However, due to the large number of features used in these studies, the possibility of early diagnosis is reduced. Methods: For this reason, in this study, a new method has been proposed for early and accurate CAD diagnosis. It uses only five features (age, hypertension, typical chest pain, T-wave inversion, and region with regional wall motion abnormality) and combines eight different search techniques, principal component analysis (PCA), and the AdaBoostM1 algorithm. Results: The proposed method is devised and tested on a benchmark dataset called Z-Alizadeh Sani. The performance of the proposed method is tested with a variety of metrics and compared with basic machine-learning techniques and the existing studies in the literature. The experimental results have shown that the proposed method is efficient and achieves the best classification performance, with an accuracy of 91.8%, ever reported on the Z-Alizadeh Sani dataset with so few features. Conclusions: As a result, medical practitioners can utilize the proposed approach for diagnosing CAD early and accurately.
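The abstract names AdaBoostM1 as the classifier but gives no implementation details. As a rough illustration only, the sketch below shows the sample-reweighting step of the textbook AdaBoost.M1 procedure in plain Python. The function name `adaboost_m1_round` and the toy inputs are hypothetical, and the search and PCA stages of the proposed method are not shown.

```python
import math

def adaboost_m1_round(weights, correct):
    """One textbook AdaBoost.M1 reweighting step (illustrative sketch).

    weights: current sample weights, assumed to sum to 1.
    correct: booleans, True where the weak learner classified the sample correctly.
    Returns (new_weights, vote), or None when the weighted error is 0 or >= 0.5,
    the cases in which boosting normally stops.
    """
    # Weighted error of the weak learner on the current distribution.
    error = sum(w for w, ok in zip(weights, correct) if not ok)
    if error <= 0.0 or error >= 0.5:
        return None
    beta = error / (1.0 - error)
    # Shrink the weights of correctly classified samples, then renormalize.
    scaled = [w * beta if ok else w for w, ok in zip(weights, correct)]
    total = sum(scaled)
    new_weights = [w / total for w in scaled]
    # The weak learner's vote in the final weighted majority.
    vote = math.log(1.0 / beta)
    return new_weights, vote

# Toy example (invented data): four samples, the last one misclassified.
weights = [0.25, 0.25, 0.25, 0.25]
correct = [True, True, True, False]
new_weights, vote = adaboost_m1_round(weights, correct)
```

After one round the misclassified sample carries half of the total weight, so the next weak learner is pushed to focus on it.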
Affiliation(s)
- Can Eyupoglu
  - Department of Computer Engineering, Turkish Air Force Academy, National Defence University, Istanbul 34149, Türkiye
- Oktay Karakuş
  - School of Computer Science and Informatics, Cardiff University, Cardiff CF24 4AG, UK
2
Sayadi M, Varadarajan V, Sadoughi F, Chopannejad S, Langarizadeh M. A Machine Learning Model for Detection of Coronary Artery Disease Using Noninvasive Clinical Parameters. LIFE (BASEL, SWITZERLAND) 2022; 12:life12111933. [PMID: 36431068 PMCID: PMC9698583 DOI: 10.3390/life12111933] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/23/2022] [Revised: 11/16/2022] [Accepted: 11/17/2022] [Indexed: 11/22/2022]
Abstract
Background and Objective: Coronary artery disease (CAD) is one of the most prevalent causes of death worldwide. The early diagnosis and timely medical care of cardiovascular patients can greatly prevent death and reduce the cost of treatments associated with CAD. In this study, we attempt to prepare a new model for early CAD diagnosis. The proposed model can diagnose CAD based on clinical data and without the use of an invasive procedure. Methods: In this paper, machine-learning (ML) techniques were used for the early detection of CAD, which were applied to a CAD dataset known as Z-Alizadeh Sani. Since this dataset has 54 features, the Pearson correlation feature selection method was conducted to identify the most effective features. Then, six machine learning techniques including decision tree, deep learning, logistic regression, random forest, support vector machine (SVM), and Xgboost were employed based on a semi-random-partitioning framework. Result: Applying Pearson feature selection to the dataset demonstrated that only eight features were the most effective for CAD diagnosis. The results of running the six machine-learning models on the selected features showed that logistic regression and SVM had the same performance with 95.45% accuracy, 95.91% sensitivity, 91.66% specificity, and a 96.90% F1 score. In addition, the ROC curve indicates a similar result regarding the AUC (0.98). Conclusions: Prediction is an important component of medical decision support systems. The results of the present study showed that feature selection has a high impact on machine-learning performance and, regardless of the evaluation metrics of the machine-learning models, determining the effective features is very important. However, SVM and Logistic Regression were designated as the best models according to our selected features.
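The abstract does not say how the Pearson-based selection was thresholded or implemented. A minimal sketch of one common variant, ranking features by the absolute Pearson correlation with the class label and keeping the top k, might look like the following. The helper names, the feature names, and the toy values are all invented for illustration and are not taken from the Z-Alizadeh Sani dataset.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no linear signal
    return cov / math.sqrt(var_x * var_y)

def top_k_features(columns, labels, k):
    """Return the k feature names with the largest |correlation| to the labels."""
    scores = {name: abs(pearson(values, labels)) for name, values in columns.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]

# Toy data (invented): three candidate features, six patients.
columns = {
    "age": [45, 50, 61, 67, 70, 72],
    "noise": [3, 1, 2, 3, 1, 2],
    "typical_chest_pain": [0, 0, 1, 1, 1, 1],
}
labels = [0, 0, 1, 1, 1, 1]  # 1 = CAD, 0 = normal
selected = top_k_features(columns, labels, k=2)
```

A filter of this kind scores each feature independently of the classifier, which is why the same reduced feature set can be passed to all six models.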
Affiliation(s)
- Mohammadjavad Sayadi
  - Department of Health Information Management, School of Health Management and Information Sciences, Iran University of Medical Sciences, Tehran 14496-14535, Iran
  - Department of Computer Engineering, Technical and Vocational University (TVU), Tehran 14357-61137, Iran
- Vijayakumar Varadarajan
  - School of Computer Science and Engineering, The University of New South Wales, Sydney 2052, Australia
  - Dean International, Ajeenkya D Y Patil University, Pune 412105, India
  - Swiss School of Business and Management, 1213 Geneva, Switzerland
  - Correspondence: (V.V.); (M.L.)
- Farahnaz Sadoughi
  - Department of Health Information Management, School of Health Management and Information Sciences, Iran University of Medical Sciences, Tehran 14496-14535, Iran
- Sara Chopannejad
  - Department of Health Information Management, School of Health Management and Information Sciences, Iran University of Medical Sciences, Tehran 14496-14535, Iran
- Mostafa Langarizadeh
  - Department of Health Information Management, School of Health Management and Information Sciences, Iran University of Medical Sciences, Tehran 14496-14535, Iran
  - Correspondence: (V.V.); (M.L.)
3
Javid I, Ghazali R, Zulqarnain M, Hassan N. Data pre-processing for cardiovascular disease classification: A systematic literature review. JOURNAL OF INTELLIGENT & FUZZY SYSTEMS 2022. [DOI: 10.3233/jifs-220061] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/15/2022]
Abstract
An important task in the medical field is the early detection of disease. Heart disease is among the most challenging of all diseases, with 17.3 million deaths each year attributed to it. Even a small error in heart disease diagnosis puts an individual's life at risk, so precise diagnosis is critical. Different approaches, including data mining, have been used for the prediction of heart disease. However, there are serious concerns about data quality, for example inconsistencies, missing values, noise, high dimensionality, and imbalanced data. To improve the accuracy of data-mining-based prediction systems, data preparation techniques have been applied to increase the quality of the data. The main objective of this paper is to highlight and summarize research on (i) the data preparation techniques most commonly used, (ii) the impact of pre-processing procedures on the accuracy of a heart disease prediction system, (iii) classifier performance with data pre-processing techniques, and (iv) a comparison of the accuracy of the different pre-processing models. A systematic literature review on the use of data pre-processing in heart disease diagnosis is carried out, covering material published from January 2001 to July 2021. Almost 30 studies were selected and examined against the above-mentioned criteria. The review concludes that data reduction and data cleaning are the pre-processing techniques most used in heart disease prediction systems. Overall, this study concludes that data pre-processing has improved the accuracy of models used for heart disease prediction. Some hybrid models, including (ANN+CHI), (ANN+PCA), (DNN+CHI) and (SVM+PCA), have shown improved accuracy. However, due to the lack of clarification, there are a number of limitations and challenges in implementing these models in the real world.
Affiliation(s)
- Irfan Javid
  - Faculty of Computer Science and Information Technology, Universiti Tun Hussein Onn, Malaysia
  - Department of Computer Science & IT, University of Poonch Rawalakot, AJK, Pakistan
- Rozaida Ghazali
  - Faculty of Computer Science and Information Technology, Universiti Tun Hussein Onn, Malaysia
- Muhammad Zulqarnain
  - Riphah College of Computing, Riphah International University Faisalabad Campus, Pakistan
- Norlida Hassan
  - Faculty of Computer Science and Information Technology, Universiti Tun Hussein Onn, Malaysia
4
Hassannataj Joloudari J, Azizi F, Nematollahi MA, Alizadehsani R, Hassannatajjeloudari E, Nodehi I, Mosavi A. GSVMA: A Genetic Support Vector Machine ANOVA Method for CAD Diagnosis. Front Cardiovasc Med 2022; 8:760178. [PMID: 35187099 PMCID: PMC8855497 DOI: 10.3389/fcvm.2021.760178] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 08/17/2021] [Accepted: 12/22/2021] [Indexed: 11/13/2022] Open
Abstract
Background: Coronary artery disease (CAD) is one of the crucial reasons for cardiovascular mortality in middle-aged people worldwide. Angiography is the most typical tool for diagnosing CAD, but it is costly and has side effects. One alternative is the use of machine learning-based patterns for CAD diagnosis. Methods: Hence, this paper provides a new hybrid machine learning model called genetic support vector machine and analysis of variance (GSVMA). The analysis of variance (ANOVA) is used as the kernel function for the SVM algorithm. The proposed model is evaluated on the Z-Alizadeh Sani dataset, with a genetic optimization algorithm used to select crucial features. In addition, SVM with ANOVA, linear SVM (LSVM), and library for support vector machine (LIBSVM) with radial basis function (RBF) methods were applied to classify the dataset. Results: As a result, the GSVMA hybrid method performs better than other methods. This proposed method has the highest accuracy of 89.45% through a 10-fold cross-validation technique with 31 selected features on the Z-Alizadeh Sani dataset. Conclusion: We demonstrated that SVM combined with a genetic optimization algorithm could lead to higher accuracy. Therefore, our study confirms that the GSVMA method outperforms other methods so that it can facilitate CAD diagnosis.
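The reported accuracy comes from 10-fold cross-validation, whose exact setup the abstract does not describe. As a generic sketch only, the following shows how sample indices can be shuffled and divided into k folds so that each sample is used for testing exactly once. The function name `k_fold_indices`, the seed, and the sample count of 103 are hypothetical, and the split is not stratified by class.

```python
import random

def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_indices, test_indices) pairs for plain k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # fixed seed keeps the split reproducible
    # Deal the shuffled indices round-robin so fold sizes differ by at most one.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

# Toy usage: 103 samples (an arbitrary count chosen to show uneven folds).
splits = list(k_fold_indices(103, k=10))
fold_sizes = [len(test) for _, test in splits]
```

A model would be trained on each `train` list and scored on the matching `test` list, with the reported accuracy being the mean over the ten folds.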
Affiliation(s)
- Faezeh Azizi
  - Department of Computer Engineering, Faculty of Engineering, University of Birjand, Birjand, Iran
- Roohallah Alizadehsani
  - Institute for Intelligent Systems Research and Innovation, Deakin University, Geelong, VIC, Australia
- Edris Hassannatajjeloudari
  - Department of Nursing, School of Nursing and Allied Medical Sciences, Maragheh Faculty of Medical Sciences, Maragheh, Iran
- Issa Nodehi
  - Department of Computer Engineering, University of Qom, Qom, Iran
- Amir Mosavi
  - Faculty of Informatics, Technische Universität Dresden, Dresden, Germany
  - Faculty of Civil Engineering, TU-Dresden, Dresden, Germany
  - John von Neumann Faculty of Informatics, Óbuda University, Budapest, Hungary
  - Institute of Information Society, University of Public Service, Budapest, Hungary
  - Institute of Information Engineering, Automation and Mathematics, Slovak University of Technology in Bratislava, Bratislava, Slovakia
5
Abdollahi J, Nouri-Moghaddam B. A hybrid method for heart disease diagnosis utilizing feature selection based ensemble classifier model generation. IRAN JOURNAL OF COMPUTER SCIENCE 2022; 5:229-246. [PMCID: PMC9081959 DOI: 10.1007/s42044-022-00104-x] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 04/03/2021] [Accepted: 04/19/2022] [Indexed: 09/29/2023]
Abstract
Heart disease is one of the most complicated diseases, and it affects a large number of individuals throughout the world. In healthcare, particularly cardiology, early and accurate detection of cardiac disease is critical. The Heart Disease Data Set-UCI repository collects data on heart disease. The search space and complexity of the classification models are increased by this raw dataset, which contains redundant and inconsistent data. We need to eliminate the redundant and unnecessary elements from the data to improve classification accuracy. As a consequence, feature selection approaches might be useful for reducing the cost of diagnosis by identifying the most important qualities. This research developed an ensemble classification model based on a feature selection approach in which selected features play a role in classification. Accordingly, a classification approach was introduced using ensemble learning with a genetic algorithm, feature selection, and biomedical test values to diagnose heart disease. Based on the results, it is deduced that the benefits of using the feature selection method vary depending on the utilized machine learning technique. However, the best-proposed model based on the combination of genetic algorithm and the ensemble learning model has achieved an accuracy of 97.57% on the considered datasets. The suggested diagnosis system achieved better accuracy than previously proposed methods and can easily be implemented in healthcare to identify heart disease.
Affiliation(s)
- Jafar Abdollahi
  - Department of Computer Engineering, Ardabil Branch, Islamic Azad University, Ardabil, Iran
- Babak Nouri-Moghaddam
  - Department of Computer Engineering, Ardabil Branch, Islamic Azad University, Ardabil, Iran
6
C-CADZ: computational intelligence system for coronary artery disease detection using Z-Alizadeh Sani dataset. APPL INTELL 2021. [DOI: 10.1007/s10489-021-02467-3] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Indexed: 12/29/2022]
7
Apostolopoulos ID, Groumpos PP, Apostolopoulos DJ. Advanced fuzzy cognitive maps: state-space and rule-based methodology for coronary artery disease detection. Biomed Phys Eng Express 2021; 7. [PMID: 33930876 DOI: 10.1088/2057-1976/abfd83] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.8] [Received: 02/03/2021] [Accepted: 04/30/2021] [Indexed: 11/11/2022]
Abstract
According to the World Health Organization, 50% of deaths in the European Union are caused by cardiovascular diseases (CVD), while 80% of premature heart diseases and strokes can be prevented. In this study, a Computer-Aided Diagnostic model for a precise diagnosis of Coronary Artery Disease (CAD) is proposed. The methodology is based on State Space Advanced Fuzzy Cognitive Maps (AFCMs), an evolution of the traditional Fuzzy Cognitive Maps. Also, a rule-based mechanism is incorporated to further increase the knowledge of the proposed system and the interpretability of the decision mechanism. The proposed method is evaluated utilizing a CAD dataset from the Department of Nuclear Medicine of the University Hospital of Patras, in Greece. Several experiments are conducted to define the optimal parameters of the proposed AFCM. Furthermore, the proposed AFCM is compared with the traditional FCM approach and the literature. The experiments highlight the effectiveness of the AFCM approach, obtaining 85.47% accuracy in CAD diagnosis, showing an improvement of +7% over the traditional approach. It is demonstrated that the AFCM approach in developing Fuzzy Cognitive Maps outperforms the conventional approach, while it constitutes a reliable method for the diagnosis of Coronary Artery Disease.
Affiliation(s)
- Ioannis D Apostolopoulos
  - University of Patras, Medical School, Department of Medical Physics, Rio, Achaia, PC 26504, Greece
- Peter P Groumpos
  - University of Patras, Department Electrical and Computer Engineering, Rio, Achaia, PC 26504, Greece
8
Coronary Artery Disease Detection by Machine Learning with Coronary Bifurcation Features. APPLIED SCIENCES-BASEL 2020. [DOI: 10.3390/app10217656] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.2] [Indexed: 12/14/2022]
Abstract
Background: Early and accurate detection of coronary artery disease (CAD) is one of the most important medical research areas. Researchers are motivated to utilize machine learning techniques for quick and accurate detection of CAD. Methods: To obtain high-quality features for machine learning, we extracted the coronary bifurcation features from coronary computed tomography angiography (CCTA) images by using the morphometric method. Machine learning classifier algorithms, namely logistic regression (LR), decision tree (DT), linear discriminant analysis (LDA), k-nearest neighbors (k-NN), artificial neural network (ANN), and support vector machine (SVM), were applied to the measured features and their performance was estimated. Results: The results showed that, in comparison with the other machine learning methods, the polynomial-SVM with the grid search optimization method had the best performance for the detection of CAD and yielded a classification accuracy of 100.00%. Among the six examined coronary bifurcation features, the exponent of vessel diameter (n) and the area expansion ratio (AER) were the two key features in the detection of CAD. Conclusions: This study could help clinicians detect CAD accurately and may provide an alternative method for non-invasive diagnosis in clinical practice.
9
Benhar H, Idri A, Fernández-Alemán JL. Data preprocessing for heart disease classification: A systematic literature review. COMPUTER METHODS AND PROGRAMS IN BIOMEDICINE 2020; 195:105635. [PMID: 32652383 DOI: 10.1016/j.cmpb.2020.105635] [Citation(s) in RCA: 18] [Impact Index Per Article: 3.6] [Received: 04/14/2019] [Accepted: 06/24/2020] [Indexed: 06/11/2023]
Abstract
CONTEXT: Early detection of heart disease is an important challenge since 17.3 million people yearly lose their lives due to heart diseases. Besides, any error in diagnosis of cardiac disease can be dangerous and risks an individual's life. Accurate diagnosis is therefore critical in cardiology. Data Mining (DM) classification techniques have been used to diagnose heart diseases but are still limited by some challenges of data quality such as inconsistencies, noise, missing data, outliers, high dimensionality and imbalanced data. Data preprocessing (DP) techniques were therefore used to prepare data with the goal of improving the performance of heart disease DM based prediction systems. OBJECTIVE: The purpose of this study is to review and summarize the current evidence on the use of preprocessing techniques in heart disease classification as regards: (1) the DP tasks and techniques most frequently used, (2) the impact of DP tasks and techniques on the performance of classification in cardiology, (3) the overall performance of classifiers when using DP techniques, and (4) comparisons of different classifier-preprocessing combinations in terms of accuracy rate. METHOD: A systematic literature review is carried out, by identifying and analyzing empirical studies on the application of data preprocessing in heart disease classification published in the period between January 2000 and June 2019. A total of 49 studies were therefore selected and analyzed according to the aforementioned criteria. RESULTS: The review results show that data reduction is the most used preprocessing task in cardiology, followed by data cleaning. In general, preprocessing either maintained or improved the performance of heart disease classifiers. Some combinations such as (ANN + PCA), (ANN + CHI) and (SVM + PCA) are promising in terms of accuracy. However, the deployment of these models in real-world diagnosis decision support systems is subject to several risks and limitations due to the lack of interpretation.
Affiliation(s)
- H Benhar
  - Software Project Management Research Team, ENSIAS, University Mohammed V in Rabat, Morocco.
- A Idri
  - Software Project Management Research Team, ENSIAS, University Mohammed V in Rabat, Morocco; CSEHS-MSDA, Mohammed VI Polytechnic University, Benguerir, Morocco.
- J L Fernández-Alemán
  - Department of Informatics and Systems, Faculty of Computer Science, University of Murcia, Spain.
10
Tama BA, Im S, Lee S. Improving an Intelligent Detection System for Coronary Heart Disease Using a Two-Tier Classifier Ensemble. BIOMED RESEARCH INTERNATIONAL 2020; 2020:9816142. [PMID: 32420387 PMCID: PMC7201579 DOI: 10.1155/2020/9816142] [Citation(s) in RCA: 44] [Impact Index Per Article: 8.8] [Received: 11/25/2019] [Revised: 03/13/2020] [Accepted: 04/06/2020] [Indexed: 11/17/2022]
Abstract
Coronary heart disease (CHD) is one of the severe health issues and is one of the most common types of heart diseases. It is the most frequent cause of mortality across the globe due to the lack of a healthy lifestyle. Owing to the fact that a heart attack occurs without any apparent symptoms, an intelligent detection method is inescapable. In this article, a new CHD detection method based on a machine learning technique, namely classifier ensembles, is presented. A two-tier ensemble is built, where some ensemble classifiers are exploited as base classifiers of another ensemble. A stacked architecture is designed to blend the class label prediction of three ensemble learners, i.e., random forest, gradient boosting machine, and extreme gradient boosting. The detection model is evaluated on multiple heart disease datasets, i.e., Z-Alizadeh Sani, Statlog, Cleveland, and Hungarian, corroborating the generalisability of the proposed model. A particle swarm optimization-based feature selection is carried out to choose the most significant feature set for each dataset. Finally, a two-fold statistical test is adopted to justify the hypothesis, demonstrating that the performance differences of classifiers do not rely upon an assumption. Our proposed method outperforms any base classifiers in the ensemble with respect to 10-fold cross validation. Our detection model has performed better than current existing models based on traditional classifier ensembles and individual classifiers in terms of accuracy, F1, and AUC. This study demonstrates that our proposed model adds a considerable contribution compared to the prior published studies in the current literature.
Affiliation(s)
- Bayu Adhi Tama
  - Department of Mechanical Engineering, Pohang University of Science and Technology, Republic of Korea
- Sun Im
  - Department of Rehabilitation Medicine, Bucheon St. Mary's Hospital, College of Medicine, The Catholic University of Korea, Republic of Korea
- Seungchul Lee
  - Department of Mechanical Engineering, Pohang University of Science and Technology, Republic of Korea
11
A Systematic Mapping Study of Data Preparation in Heart Disease Knowledge Discovery. J Med Syst 2018; 43:17. [PMID: 30542772 DOI: 10.1007/s10916-018-1134-z] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.6] [Received: 06/26/2018] [Accepted: 12/03/2018] [Indexed: 01/25/2023]
Abstract
The increasing amount of data produced by various biomedical and healthcare systems has led to a need for methodologies related to knowledge data discovery. Data mining (DM) offers a set of powerful techniques that allow the identification and extraction of relevant information from medical datasets, thus enabling doctors and patients to greatly benefit from DM, particularly in the case of diseases with high mortality and morbidity rates, such as heart disease (HD). Nonetheless, the use of raw medical data implies several challenges, such as missing data, noise, redundancy and high dimensionality, which make the extraction of useful and relevant information difficult. Intensive research has, therefore, recently begun in order to prepare raw healthcare data before knowledge extraction. In any knowledge data discovery (KDD) process, data preparation is the step prior to DM that deals with data imperfections in order to improve data quality so as to satisfy the requirements and improve the performance of DM techniques. The objective of this paper is to perform a systematic mapping study (SMS) on data preparation for KDD in cardiology so as to provide an overview of the quantity and type of research carried out in this respect. The SMS consisted of a set of 58 selected papers published between January 2000 and December 2017. The selected studies were analyzed according to six criteria: year and channel of publication, preparation task, medical task, DM objective, research type and empirical type. The results show that a large amount of data preparation research was carried out in order to improve the performance of DM-based decision support systems in cardiology. Researchers were mainly interested in the data reduction preparation task and particularly in feature selection. Moreover, the majority of the selected studies focused on classification for the diagnosis of HD. Two main research types were identified in the selected studies: solution proposal and evaluation research, and the most frequently used empirical type was that of historical-based evaluation.
12
Idri A, Benhar H, Fernández-Alemán JL, Kadi I. A systematic map of medical data preprocessing in knowledge discovery. COMPUTER METHODS AND PROGRAMS IN BIOMEDICINE 2018; 162:69-85. [PMID: 29903496 DOI: 10.1016/j.cmpb.2018.05.007] [Citation(s) in RCA: 17] [Impact Index Per Article: 2.4] [Received: 11/08/2017] [Revised: 04/25/2018] [Accepted: 05/03/2018] [Indexed: 06/08/2023]
Abstract
BACKGROUND AND OBJECTIVE: Data mining (DM) has, over the last decade, received increased attention in the medical domain and has been widely used to analyze medical datasets in order to extract useful knowledge and previously unknown patterns. However, historical medical data can often comprise inconsistent, noisy, imbalanced, missing and high dimensional data. These challenges lead to a serious bias in predictive modeling and reduce the performance of DM techniques. Data preprocessing is, therefore, an essential step in knowledge discovery as regards improving the quality of data and making it appropriate and suitable for DM techniques. The objective of this paper is to review the use of preprocessing techniques in clinical datasets. METHODS: We performed a systematic map of studies regarding the application of data preprocessing to healthcare and published between January 2000 and December 2017. A search string was determined on the basis of the mapping questions and the PICO categories. The search string was then applied in digital databases covering the fields of computer science and medical informatics in order to identify relevant studies. The studies were initially selected by reading their titles, abstracts and keywords. Those that were selected at that stage were then reviewed using a set of inclusion and exclusion criteria in order to eliminate any that were not relevant. This process resulted in 126 primary studies. RESULTS: Selected studies were analyzed and classified according to their publication years and channels, research type, empirical type and contribution type. The findings of this mapping study revealed that researchers have paid a considerable amount of attention to preprocessing in medical DM in the last decade. A significant number of the selected studies used data reduction and cleaning preprocessing tasks. Moreover, the disciplines in which preprocessing has received most attention are cardiology, endocrinology and oncology. CONCLUSIONS: Researchers should develop and implement standards for an effective integration of multiple medical data types. Moreover, we identified the need to perform literature reviews.
Affiliation(s)
- A Idri
  - Software Project Management Research Team, ENSIAS, University Mohammed V of Rabat, Morocco.
- H Benhar
  - Software Project Management Research Team, ENSIAS, University Mohammed V of Rabat, Morocco.
- J L Fernández-Alemán
  - Department of Informatics and Systems, Faculty of Computer Science, University of Murcia, Spain.
- I Kadi
  - Software Project Management Research Team, ENSIAS, University Mohammed V of Rabat, Morocco.