1
|
Huang MW, Tsai CF, Tsui SC, Lin WC. Combining data discretization and missing value imputation for incomplete medical datasets. PLoS One 2023; 18:e0295032. [PMID: 38033140 PMCID: PMC10688879 DOI: 10.1371/journal.pone.0295032] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/20/2023] [Accepted: 11/14/2023] [Indexed: 12/02/2023]
Abstract
Data discretization aims to transform a set of continuous features into discrete features, thus simplifying the representation of information and making it easier to understand, use, and explain. In practice, users can take advantage of the discretization process to improve knowledge discovery and data analysis on medical domain problem datasets containing continuous features. However, certain feature values are frequently missing, and many data-mining algorithms cannot handle incomplete datasets. In this study, we considered the use of both discretization and missing-value imputation to process incomplete medical datasets, examining how the order in which discretization and missing-value imputation are combined influences performance. The experiments used seven different medical domain problem datasets; two discretizers, namely the minimum description length principle (MDLP) and ChiMerge; three imputation methods, namely the mean/mode, classification and regression tree (CART), and k-nearest neighbor (KNN) methods; and two classifiers, namely support vector machines (SVM) and the C4.5 decision tree. The results show that better performance can be obtained by performing discretization first and imputation second, rather than vice versa. Furthermore, the highest classification accuracy was achieved by combining ChiMerge and KNN with SVM.
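The "discretize first, then impute" ordering described in this abstract can be illustrated with a minimal sketch. This is not the authors' pipeline: equal-width binning stands in for MDLP or ChiMerge, mode imputation stands in for the mean/mode method, and the function names `equal_width_bins`, `impute_mode`, and `discretize_then_impute` are hypothetical names chosen for this example.

```python
from collections import Counter
from typing import List, Optional


def equal_width_bins(values: List[Optional[float]], n_bins: int = 3) -> List[Optional[int]]:
    """Map each observed value to a bin index; missing values (None) stay None.

    Assumption: equal-width binning is only a stand-in for MDLP/ChiMerge.
    """
    observed = [v for v in values if v is not None]
    lo, hi = min(observed), max(observed)
    width = (hi - lo) / n_bins
    if width == 0:  # all observed values identical: avoid division by zero
        width = 1.0
    binned: List[Optional[int]] = []
    for v in values:
        if v is None:
            binned.append(None)
        else:
            # Clamp so the maximum value falls into the last bin.
            binned.append(min(int((v - lo) / width), n_bins - 1))
    return binned


def impute_mode(values: List[Optional[int]]) -> List[int]:
    """Replace each missing entry with the most frequent observed category."""
    observed = [v for v in values if v is not None]
    mode = Counter(observed).most_common(1)[0][0]
    return [mode if v is None else v for v in values]


def discretize_then_impute(values: List[Optional[float]], n_bins: int = 3) -> List[int]:
    """Discretize the observed values first, then fill gaps in the discrete feature."""
    return impute_mode(equal_width_bins(values, n_bins))


feature = [1.0, 2.0, None, 9.0, 1.5, None, 10.0]
print(discretize_then_impute(feature))  # -> [0, 0, 0, 2, 0, 0, 2]
```

In this ordering the imputer only has to choose among a few bin labels, which is one plausible reading of why the order matters; the abstract itself reports the result without giving a mechanism.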
Affiliation(s)
- Min-Wei Huang
  - Kaohsiung Municipal Kai-Syuan Psychiatric Hospital, Kaohsiung, Taiwan
  - Department of Physical Therapy and Graduate Institute of Rehabilitation Science, China Medical University, Taichung, Taiwan
- Chih-Fong Tsai
  - Department of Information Management, National Central University, Taoyuan, Taiwan
- Shu-Ching Tsui
  - Department of Information Management, National Central University, Taoyuan, Taiwan
- Wei-Chao Lin
  - Department of Digital Financial Technology, Chang Gung University, Taoyuan, Taiwan
  - Department of Information Management, Chang Gung University, Taoyuan, Taiwan
  - Division of Thoracic Surgery, Chang Gung Memorial Hospital at Linkou, Taoyuan, Taiwan
|
2
|
Shen Y, Li H, Zhang B, Cao Y, Guo Z, Gao X, Chen Y. An artificial neural network-based data filling approach for smart operation of digital wastewater treatment plants. Environmental Research 2023; 224:115549. [PMID: 36822533 DOI: 10.1016/j.envres.2023.115549] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/30/2022] [Revised: 02/13/2023] [Accepted: 02/21/2023] [Indexed: 06/18/2023]
Abstract
With the prevalence of digitization, smart operation is becoming mainstream for future wastewater treatment plants. This requires substantial and complete historical data for model construction. However, the data collected from the front-end sensor contained numerous missing dissolved oxygen (DO) values. Therefore, this study proposed a framework that adaptively adjusts the structure of embedded filling models according to the pattern of missing data. Long short-term memory and gated recurrent units (GRU) were embedded for experiments, and some standard filling methods were selected as benchmarks. Experiments on the dataset indicated that the K-nearest neighbor method could achieve good filling results by traversing its parameter values. The effect obtained by the method proposed in this study was slightly better, and GRU was better among the three embedded models. Analysis of the filling results for each DO column revealed that the effect was highly correlated with the dispersion of the DO data. The experimental results for the entire dataset demonstrated that the filling effect of the proposed method was significantly better and more stable than that of the others. The proposed model suffered from insufficient interpretability and long training time. This study provides an efficient and practical method for handling the intricate problem of missing DO values and lays the foundation for the smart operation of wastewater treatment plants.
Affiliation(s)
- Yu Shen
  - National Research Base of Intelligent Manufacturing Service, Chongqing Technology and Business University, Chongqing, 400067, China; Chongqing South-to-Thais Environmental Protection Technology Research Institute Co., Ltd., Chongqing, 400069, China
- Huimin Li
  - National Research Base of Intelligent Manufacturing Service, Chongqing Technology and Business University, Chongqing, 400067, China
- Bing Zhang
  - National Research Base of Intelligent Manufacturing Service, Chongqing Technology and Business University, Chongqing, 400067, China; Chongqing Yujiang Intelligent Technology Co., Ltd., Chongqing, 409003, China
- Yang Cao
  - School of Environmental and Ecology, Chongqing University, Chongqing, 400044, China
- Zhiwei Guo
  - National Research Base of Intelligent Manufacturing Service, Chongqing Technology and Business University, Chongqing, 400067, China
- Xu Gao
  - National Research Base of Intelligent Manufacturing Service, Chongqing Technology and Business University, Chongqing, 400067, China; Chongqing Water Group Co., Ltd, Chongqing, China
- Youpeng Chen
  - School of Environmental and Ecology, Chongqing University, Chongqing, 400044, China
|
3
|
Yu L, Li M. A case-based reasoning driven ensemble learning paradigm for financial distress prediction with missing data. Appl Soft Comput 2023. [DOI: 10.1016/j.asoc.2023.110163] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 03/04/2023]
|
4
|
Zhang Q, Wen J, Zhou J, Zhang B. Missing-view completion for fatty liver disease detection. Comput Biol Med 2022; 150:106097. [PMID: 36244304 DOI: 10.1016/j.compbiomed.2022.106097] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/21/2022] [Revised: 08/22/2022] [Accepted: 09/10/2022] [Indexed: 11/15/2022]
Abstract
Fatty liver disease is a common disease that causes extra fat storage in an individual's liver. Patients with fatty liver disease may progress to cirrhosis and liver failure, further leading to liver cancer. The prevalence of fatty liver disease ranges from 10% to 30% in many countries. In general, detecting fatty liver requires professional neuroimaging modalities or methods such as computed tomography, ultrasound, and medical experts' practical experience. Considering this point, intelligent electronic noninvasive diagnostic approaches are desired at present. Currently, most existing works in the area of computerized noninvasive disease detection apply one view (modality) or perform multi-view (several modalities) analysis, e.g., face, tongue, and/or sublingual, for disease detection. The multi-view data of patients provide more complementary information for diagnosis. However, due to the conditions of data acquisition, interference by human factors, etc., many multi-view data are defective, with some missing-view information, making these multi-view data difficult to evaluate. This factor largely affects the performance of disease classification and the development of fully computerized noninvasive methods. Thus, the purpose of this study is to address the missing-view issue in noninvasive disease detection. In this work, a multi-view dataset containing facial, sublingual vein, and tongue images is initially processed to produce corresponding features for incomplete multi-view disease diagnostic evaluation. We then propose a novel method, i.e., multi-view completion, to process the incomplete multi-view data in order to complete the missing-view information for distinguishing fatty liver disease patients from healthy candidates. In particular, this method can explore the intra-view and inter-view information to produce the missing-view data effectively. Extensive experiments on a collected dataset with 220 fatty liver patients and 220 healthy samples show that our proposed approach achieves better diagnostic results with missing-view completion compared to the original incomplete multi-view data under various classifiers. Related results prove that our method can effectively process the missing-view issue and improve the noninvasive disease detection performance.
Affiliation(s)
- Qi Zhang
  - PAMI Research Group, Dept. of Computer and Information Science, University of Macau, Macau, China
- Jie Wen
  - School of Computer Science and Technology, Harbin Institute of Technology (Shenzhen), Shenzhen, China
- Jianhang Zhou
  - PAMI Research Group, Dept. of Computer and Information Science, University of Macau, Macau, China
- Bob Zhang
  - PAMI Research Group, Dept. of Computer and Information Science, University of Macau, Macau, China; Beijing Key Laboratory of Big Data Technology for Food Safety, Beijing Technology and Business University, Beijing, China
|
5
|
Multiple imputation method of missing credit risk assessment data based on generative adversarial networks. Appl Soft Comput 2022. [DOI: 10.1016/j.asoc.2022.109273] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/18/2022]
|
6
|
|
7
|
Pan H, Ye Z, He Q, Yan C, Yuan J, Lai X, Su J, Li R. Discrete Missing Data Imputation Using Multilayer Perceptron and Momentum Gradient Descent. Sensors (Basel, Switzerland) 2022; 22:5645. [PMID: 35957197 PMCID: PMC9371018 DOI: 10.3390/s22155645] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/20/2022] [Revised: 07/21/2022] [Accepted: 07/26/2022] [Indexed: 06/15/2023]
Abstract
Data are a strategic resource for industrial production, and an efficient data-mining process will increase productivity. However, data collected in real life contain many missing values due to various problems. Because missing data may reduce productivity, missing value imputation is an important research topic in data mining. At present, most studies focus on imputation methods for continuous missing data, while few concentrate on discrete missing data. In this paper, a discrete missing value imputation method based on a multilayer perceptron (MLP) is proposed, which employs a momentum gradient descent algorithm, and some prefilling strategies are utilized to improve the convergence speed of the MLP. To verify the effectiveness of the method, experiments are conducted to compare its classification accuracy with that of eight common imputation methods, such as the mode, random, hot-deck, KNN, autoencoder, and MLP, under different missing mechanisms and missing proportions. Experimental results verify that the improved MLP model (IMLP) can effectively impute discrete missing values in most situations under three missing patterns.
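For readers unfamiliar with the momentum gradient descent mentioned in this abstract, the following is a minimal sketch of the classical momentum update on a one-dimensional toy objective. It is not the authors' IMLP training code; the function name `momentum_descent` and the hyperparameter values (`lr`, `beta`, `steps`) are assumptions made for illustration.

```python
from typing import Callable


def momentum_descent(
    grad: Callable[[float], float],
    w0: float,
    lr: float = 0.1,
    beta: float = 0.9,
    steps: int = 500,
) -> float:
    """Minimize a 1-D function given its gradient, using classical momentum.

    Update rule:
        v <- beta * v - lr * grad(w)
        w <- w + v
    """
    w, v = w0, 0.0
    for _ in range(steps):
        v = beta * v - lr * grad(w)  # velocity accumulates past gradients
        w += v                       # step along the accumulated velocity
    return w


# Toy objective f(w) = (w - 3)^2 with gradient 2 * (w - 3); its minimizer is w = 3.
w_star = momentum_descent(lambda w: 2.0 * (w - 3.0), w0=0.0)
print(w_star)
```

The velocity term lets consecutive gradients pointing in the same direction reinforce each other, which is the usual motivation for using momentum to speed up convergence.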
Affiliation(s)
- Hu Pan
  - School of Computer Science, Hubei University of Technology, Wuhan 430068, China
- Zhiwei Ye
  - School of Computer Science, Hubei University of Technology, Wuhan 430068, China
  - Fujian Provincial Key Laboratory of Data Intensive Computing, Quanzhou 362000, China
  - Key Laboratory of Intelligent Computing and Information Processing, Fujian Province, Quanzhou 362000, China
- Qiyi He
  - School of Computer Science, Hubei University of Technology, Wuhan 430068, China
- Chunyan Yan
  - School of Computer Science, Hubei University of Technology, Wuhan 430068, China
- Jianyu Yuan
  - School of Computer Science, Hubei University of Technology, Wuhan 430068, China
- Xudong Lai
  - School of Remote Sensing and Information Engineering, Wuhan University, Wuhan 430072, China
- Jun Su
  - School of Computer Science, Hubei University of Technology, Wuhan 430068, China
- Ruihan Li
  - School of Computer Science, Hubei University of Technology, Wuhan 430068, China
|
8
|
Yuan F, Che J. An ensemble multi-step M-RMLSSVR model based on VMD and two-group strategy for day-ahead short-term load forecasting. Knowl Based Syst 2022. [DOI: 10.1016/j.knosys.2022.109440] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/17/2022]
|
9
|
Hesami M, Alizadeh M, Jones AMP, Torkamaneh D. Machine learning: its challenges and opportunities in plant system biology. Appl Microbiol Biotechnol 2022; 106:3507-3530. [PMID: 35575915 DOI: 10.1007/s00253-022-11963-6] [Citation(s) in RCA: 18] [Impact Index Per Article: 9.0] [Received: 03/14/2022] [Revised: 03/14/2022] [Accepted: 05/07/2022] [Indexed: 12/25/2022]
Abstract
Sequencing technologies are evolving at a rapid pace, enabling the generation of massive amounts of data in multiple dimensions (e.g., genomics, epigenomics, transcriptomics, metabolomics, proteomics, and single-cell omics) in plants. To provide comprehensive insights into the complexity of plant biological systems, it is important to integrate different omics datasets. Although recent advances in computational analytical pipelines have enabled efficient and high-quality exploration and exploitation of single omics data, the integration of multidimensional, heterogeneous, and large datasets (i.e., multi-omics) remains a challenge. In this regard, machine learning (ML) offers promising approaches to integrate large datasets and to recognize fine-grained patterns and relationships. Nevertheless, these approaches require rigorous optimization to process multi-omics-derived datasets. In this review, we discuss the main concepts of machine learning as well as the key challenges and solutions related to the big data derived from plant system biology. We also provide in-depth insight into the principles of data integration using ML, as well as challenges and opportunities in different contexts including multi-omics, single-cell omics, protein function, and protein-protein interaction. KEY POINTS: • The key challenges and solutions related to the big data derived from plant system biology have been highlighted. • Different methods of data integration have been discussed. • Challenges and opportunities of the application of machine learning in plant system biology have been highlighted and discussed.
Affiliation(s)
- Mohsen Hesami
  - Department of Plant Agriculture, University of Guelph, Guelph, ON, N1G 2W1, Canada
- Milad Alizadeh
  - Department of Botany, University of British Columbia, Vancouver, BC, V6T 1Z4, Canada
- Davoud Torkamaneh
  - Département de Phytologie, Université Laval, Québec City, QC, G1V 0A6, Canada
  - Institut de Biologie Intégrative Et Des Systèmes (IBIS), Université Laval, Québec City, QC, G1V 0A6, Canada
|