1
Zhu QX, Zhang HT, Tian Y, Zhang N, Xu Y, He YL. Co-training based virtual sample generation for solving the small sample size problem in process industry. ISA Transactions 2023; 134:290-301. [PMID: 36064497 DOI: 10.1016/j.isatra.2022.08.021] [Received: 01/31/2022] [Revised: 08/20/2022] [Accepted: 08/22/2022]
Abstract
With the development of industrialization, the production scale and complexity of process industries keep growing. However, because samples in the process industry are scarce and unevenly distributed, it is difficult to establish accurate and efficient data-driven soft sensor models to predict some variables. To broaden the application of soft sensor models, generating new virtual samples based on the original sample distribution to extend the sample set is an attractive way to address this problem. In this paper, a novel virtual sample generation method based on the co-training of two K-Nearest Neighbor (KNN) models is proposed. First, according to the sparse parameter, sparse regions in each dimension of the feature space are identified. Second, the input features of virtual samples are generated in these sparse regions by performing interpolation operations. Third, the outputs of virtual samples are predicted by double KNN regressors based on co-training. The qualified virtual samples are screened and the model is updated using these virtual samples to improve the prediction accuracy of the double KNN models. To verify the effectiveness and superiority of the proposed co-training based virtual sample generation method (CTVSG), case studies are conducted using two standard functions and a Purified Terephthalic Acid (PTA) industrial dataset; the results confirm the effectiveness of CTVSG.
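The three steps in the abstract can be sketched in plain Python. This is only an illustrative sketch, not the authors' CTVSG implementation: the abstract does not give the sparsity criterion, the interpolation rule, the KNN settings, or the screening rule, so every one of those is an assumption here. Specifically, a gap is treated as sparse when it exceeds `sparse_param` times the mean gap in that dimension, new inputs are placed at gap midpoints, the two regressors differ in `k` and Minkowski order, and a virtual sample is kept when the two predictions differ by at most `tol`. All function and parameter names are hypothetical.

```python
def knn_predict(X, y, q, k, p):
    """Average the outputs of the k training points nearest to q (Minkowski order p)."""
    def dist(a):
        return sum(abs(ai - qi) ** p for ai, qi in zip(a, q)) ** (1.0 / p)
    nearest = sorted(range(len(X)), key=lambda i: dist(X[i]))[:k]
    return sum(y[i] for i in nearest) / len(nearest)


def sparse_midpoints(X, sparse_param):
    """Steps 1 and 2 (assumed rules): per dimension, find gaps between sorted
    neighbouring values larger than sparse_param * mean gap, then interpolate
    a new input at the midpoint of the two samples bounding each such gap."""
    virtual = []
    for d in range(len(X[0])):
        order = sorted(range(len(X)), key=lambda i: X[i][d])
        gaps = [X[order[j + 1]][d] - X[order[j]][d] for j in range(len(order) - 1)]
        mean_gap = sum(gaps) / len(gaps)
        for j, gap in enumerate(gaps):
            if gap > sparse_param * mean_gap:
                a, b = X[order[j]], X[order[j + 1]]
                virtual.append([(ai + bi) / 2.0 for ai, bi in zip(a, b)])
    return virtual


def ctvsg_sketch(X, y, sparse_param=1.5, tol=0.5):
    """Step 3 (assumed rules): label virtual inputs with two KNN regressors,
    keep those on which the regressors agree, and let each regressor learn
    from the other's prediction. Returns (virtual_inputs, virtual_outputs)."""
    X1, y1 = list(X), list(y)  # training set of regressor 1 (k=2, Manhattan)
    X2, y2 = list(X), list(y)  # training set of regressor 2 (k=3, Euclidean)
    kept_X, kept_y = [], []
    for v in sparse_midpoints(X, sparse_param):
        p1 = knn_predict(X1, y1, v, k=2, p=1)
        p2 = knn_predict(X2, y2, v, k=3, p=2)
        if abs(p1 - p2) <= tol:  # screening: accept only when predictions agree
            X1.append(v)
            y1.append(p2)  # regressor 1 is updated with regressor 2's label
            X2.append(v)
            y2.append(p1)  # and vice versa
            kept_X.append(v)
            kept_y.append((p1 + p2) / 2.0)
    return kept_X, kept_y
```

For example, `ctvsg_sketch([[0.0], [1.0], [2.0], [6.0]], [0.0, 1.0, 2.0, 6.0], tol=1.0)` proposes one virtual input in the wide gap between 2 and 6. Giving the two regressors different settings is one common way to make co-trained learners disagree usefully; whether the paper does the same is not stated in the abstract.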
Affiliation(s)
- Qun-Xiong Zhu
  - College of Information Science & Technology, Beijing University of Chemical Technology, Beijing, 100029, China; Engineering Research Center of Intelligent PSE, Ministry of Education of China, Beijing 100029, China
- Hong-Tao Zhang
  - College of Information Science & Technology, Beijing University of Chemical Technology, Beijing, 100029, China; Engineering Research Center of Intelligent PSE, Ministry of Education of China, Beijing 100029, China
- Ye Tian
  - College of Information Science & Technology, Beijing University of Chemical Technology, Beijing, 100029, China; Engineering Research Center of Intelligent PSE, Ministry of Education of China, Beijing 100029, China
- Ning Zhang
  - College of Information Science & Technology, Beijing University of Chemical Technology, Beijing, 100029, China; Engineering Research Center of Intelligent PSE, Ministry of Education of China, Beijing 100029, China
- Yuan Xu
  - College of Information Science & Technology, Beijing University of Chemical Technology, Beijing, 100029, China; Engineering Research Center of Intelligent PSE, Ministry of Education of China, Beijing 100029, China
- Yan-Lin He
  - College of Information Science & Technology, Beijing University of Chemical Technology, Beijing, 100029, China; Engineering Research Center of Intelligent PSE, Ministry of Education of China, Beijing 100029, China
2
Merits of Bayesian networks in overcoming small data challenges: a meta-model for handling missing data. Int J Mach Learn Cyb 2022. [DOI: 10.1007/s13042-022-01577-9]
3
Howey R, Clark AD, Naamane N, Reynard LN, Pratt AG, Cordell HJ. A Bayesian network approach incorporating imputation of missing data enables exploratory analysis of complex causal biological relationships. PLoS Genet 2021; 17:e1009811. [PMID: 34587167 PMCID: PMC8504979 DOI: 10.1371/journal.pgen.1009811] [Received: 02/09/2021] [Revised: 10/11/2021] [Accepted: 09/07/2021]
Abstract
Bayesian networks can be used to identify possible causal relationships between variables based on their conditional dependencies and independencies, which can be particularly useful in complex biological scenarios with many measured variables. Here we propose two improvements to an existing method for Bayesian network analysis, designed to increase the power to detect potential causal relationships between variables (including potentially a mixture of both discrete and continuous variables). Our first improvement relates to the treatment of missing data. When there is missing data, the standard approach is to remove every individual with any missing data before performing analysis. This can be wasteful and undesirable when there are many individuals with missing data, perhaps with only one or a few variables missing. This motivates the use of imputation. We present a new imputation method that uses a version of nearest neighbour imputation, whereby missing data from one individual is replaced with data from another individual, their nearest neighbour. For each individual with missing data, the subsets of variables to be used to select the nearest neighbour are chosen by sampling without replacement the complete data and estimating a best fit Bayesian network. We show that this approach leads to marked improvements in the recall and precision of directed edges in the final network identified, and we illustrate the approach through application to data from a recent study investigating the causal relationship between methylation and gene expression in early inflammatory arthritis patients. We also describe a second improvement in the form of a pseudo-Bayesian approach for upweighting certain network edges, which can be useful when there is prior evidence concerning their directions.
Data analysis using Bayesian networks can help identify possible causal relationships between measured biological variables. Here we propose two improvements to an existing method for Bayesian network analysis. Our first improvement relates to the treatment of missing data. When there is missing data, the standard approach is to remove every individual with any missing data before performing analysis, even if only one or a few variables are missing. This is undesirable as it can reduce the ability of the approach to infer correct relationships. We propose a new method to instead fill in (impute) the missing data prior to analysis. We show through computer simulations that our method improves the reliability of the results obtained, and we illustrate the proposed approach by applying it to data from a recent study in early inflammatory arthritis. We also describe a second improvement involving the upweighting of certain network edges, which can be useful when there is prior evidence concerning their directions.
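The core idea of nearest neighbour imputation, replacing an individual's missing values with those of the most similar individual, can be sketched as follows. This is a simplified illustration and not the method of the paper: the abstract selects the variables used for the neighbour search by fitting Bayesian networks to resampled complete data, whereas this sketch assumes that all of the individual's observed variables are used, that donors must be complete rows, and that similarity is squared Euclidean distance. The function name `nn_impute` and the use of `None` for a missing value are hypothetical choices.

```python
def nn_impute(rows):
    """Fill None entries in each row with the values of its nearest complete row.

    Assumptions (not from the paper): donors are rows without missing values,
    and distance is squared Euclidean over the variables observed in the row
    being imputed.
    """
    complete = [r for r in rows if None not in r]
    if not complete:
        raise ValueError("need at least one complete row to borrow values from")
    imputed = []
    for r in rows:
        if None not in r:
            imputed.append(list(r))  # nothing to fill in
            continue
        observed = [j for j, v in enumerate(r) if v is not None]
        # Pick the complete row closest to r on r's observed variables.
        donor = min(complete, key=lambda c: sum((c[j] - r[j]) ** 2 for j in observed))
        # Keep r's observed values and copy the donor's values into the gaps.
        imputed.append([donor[j] if v is None else v for j, v in enumerate(r)])
    return imputed
```

With `rows = [[1.0, 10.0], [5.0, 50.0], [4.5, None]]`, the third individual is closest to the second on its one observed variable, so the missing entry is copied from that row. Unlike deleting incomplete individuals, this keeps every row available for the later network analysis.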
Affiliation(s)
- Richard Howey
  - Population Health Sciences Institute, Newcastle University, Newcastle upon Tyne, United Kingdom
- Alexander D. Clark
  - Translational and Clinical Research Institute, Newcastle University, Newcastle upon Tyne, United Kingdom
- Najib Naamane
  - Translational and Clinical Research Institute, Newcastle University, Newcastle upon Tyne, United Kingdom
- Louise N. Reynard
  - Biosciences Institute, Newcastle University, Newcastle upon Tyne, United Kingdom
- Arthur G. Pratt
  - Translational and Clinical Research Institute, Newcastle University, Newcastle upon Tyne, United Kingdom
  - Musculoskeletal Services Directorate, Newcastle upon Tyne Hospitals NHS Foundation Trust, Newcastle upon Tyne, United Kingdom
- Heather J. Cordell
  - Population Health Sciences Institute, Newcastle University, Newcastle upon Tyne, United Kingdom
4
Chen L, Wang L, Zhao J, Wang W. Relevance Vector Machines-Based Time Series Prediction for Incomplete Training Dataset: Two Comparative Approaches. IEEE Transactions on Cybernetics 2021; 51:4298-4311. [PMID: 31329570 DOI: 10.1109/tcyb.2019.2923434]
Abstract
Considering that real-life time series containing missing points cannot be modeled directly by most supervised machine learning methods, this paper proposes a novel time series prediction method based on relevance vector machines for incomplete training datasets. Given the regularity between the missing inputs and outputs constructed by the phase space reconstruction, this paper imputes the missing inputs during the learning process by the values of their corresponding missing outputs such that the elements in the kernel matrix related to the missing inputs can be updated. This paper designs two strategies to estimate the missing outputs. The first is based on the expectation maximization formulation, in which a joint posterior distribution over the missing outputs and the weight vector is derived in multivariate Gaussian form; the other maximizes the marginal likelihood function with respect to the missing outputs and other hyperparameters. To verify the performance of the two proposed computing strategies, two synthetic time series and a real-life dataset are employed. The results indicate that the proposed methods are robust and outperform the other methods when dealing with incomplete time series training datasets.
5
Vilardell M, Buxó M, Clèries R, Martínez JM, Garcia G, Ameijide A, Font R, Civit S, Marcos-Gragera R, Vilardell ML, Carulla M, Espinàs JA, Galceran J, Izquierdo A, Borràs JM. Missing data imputation and synthetic data simulation through modeling graphical probabilistic dependencies between variables (ModGraProDep): An application to breast cancer survival. Artif Intell Med 2020; 107:101875. [DOI: 10.1016/j.artmed.2020.101875] [Received: 10/31/2019] [Revised: 02/12/2020] [Accepted: 05/02/2020]
6
Affiliation(s)
- Marco Scutari
  - Istituto Dalle Molle di Studi sull'Intelligenza Artificiale (IDSIA), Manno, Switzerland
7
Nekouie A, Moattar MH. Missing value imputation for breast cancer diagnosis data using tensor factorization improved by enhanced reduced adaptive particle swarm optimization. Journal of King Saud University - Computer and Information Sciences 2019. [DOI: 10.1016/j.jksuci.2018.01.006]
8
Vazifehdan M, Moattar MH, Jalali M. A hybrid Bayesian network and tensor factorization approach for missing value imputation to improve breast cancer recurrence prediction. Journal of King Saud University - Computer and Information Sciences 2019. [DOI: 10.1016/j.jksuci.2018.01.002]
9
Datta S, del Carmen Pardo M, Scheike T, Yuen KC. Special issue on advances in survival analysis. Comput Stat Data Anal 2016. [DOI: 10.1016/j.csda.2015.08.015]