401
|
Zhou J, Jiang Z, Wang S. Laplacian least learning machine with dynamic updating for imbalanced classification. Appl Soft Comput 2020. [DOI: 10.1016/j.asoc.2019.106028] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Indexed: 10/25/2022]
|
402
|
Bao Y, Ke B, Li B, Yu YJ, Zhang J. Detecting Accounting Fraud in Publicly Traded U.S. Firms Using a Machine Learning Approach. JOURNAL OF ACCOUNTING RESEARCH 2020; 58:199-235. [DOI: 10.1111/1475-679x.12292] [Citation(s) in RCA: 14] [Impact Index Per Article: 2.8] [Indexed: 09/01/2023]
Affiliation(s)
- Yang Bao: Antai College of Economics and Management, Shanghai Jiao Tong University
- Bin Ke: Department of Accounting, NUS Business School, National University of Singapore
- Bin Li: Department of Finance, Economics and Management School, Wuhan University
- Y. Julia Yu: McIntire School of Commerce, University of Virginia
- Jie Zhang: School of Computer Engineering, Nanyang Technological University
|
403
|
Abduh Z, Nehary EA, Abdel Wahed M, Kadah YM. Classification of heart sounds using fractional Fourier transform based mel-frequency spectral coefficients and traditional classifiers. Biomed Signal Process Control 2020. [DOI: 10.1016/j.bspc.2019.101788] [Citation(s) in RCA: 28] [Impact Index Per Article: 5.6] [Indexed: 11/26/2022]
|
404
|
Krasanakis E, Schinas E, Papadopoulos S, Kompatsiaris Y, Symeonidis A. Boosted seed oversampling for local community ranking. Inf Process Manag 2020. [DOI: 10.1016/j.ipm.2019.06.002] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/26/2022]
|
405
|
Joint maximization of accuracy and information for learning the structure of a Bayesian network classifier. Mach Learn 2020. [DOI: 10.1007/s10994-020-05869-5] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Indexed: 12/14/2022]
|
406
|
Abstract
The development and integration of information technology and industrial control networks have increased the volume of new data; detecting anomalies or discovering other valid information in these data is vital to the stable operation of industrial control systems. This paper proposes an incremental unsupervised anomaly detection method that can quickly analyze and process large-scale real-time data. Our evaluation on the Secure Water Treatment dataset shows that the method converges to its offline counterpart for infinitely growing data streams.
|
407
|
Data Sampling Methods to Deal With the Big Data Multi-Class Imbalance Problem. APPLIED SCIENCES-BASEL 2020. [DOI: 10.3390/app10041276] [Citation(s) in RCA: 23] [Impact Index Per Article: 4.6] [Indexed: 12/14/2022]
Abstract
The class imbalance problem has been a hot topic in the machine learning community in recent years. In the era of big data and deep learning, the problem remains relevant. Much work has been done to deal with the class imbalance problem, with random sampling methods (over- and under-sampling) being the most widely employed approaches. More sophisticated sampling methods have also been developed, including the Synthetic Minority Over-sampling Technique (SMOTE), and these have been combined with cleaning techniques such as Editing Nearest Neighbor or Tomek’s Links (SMOTE+ENN and SMOTE+TL, respectively). In the big data context, the class imbalance problem has mostly been addressed by adapting traditional techniques, with intelligent approaches relatively ignored. This work therefore analyzes the capabilities and possibilities of heuristic sampling methods for deep learning neural networks in the big data domain, with particular attention to the cleaning strategies. The study uses big data, multi-class imbalanced datasets obtained from hyper-spectral remote sensing images. The effectiveness of a hybrid approach on these datasets is analyzed: the dataset is first balanced with SMOTE and used to train an Artificial Neural Network (ANN); the neural network output is then processed with ENN to eliminate output noise; finally, the ANN is trained again on the resulting dataset. The results suggest that the best classification outcome is achieved when the cleaning strategies are applied to the ANN output rather than to the input feature space only. Consequently, the classifier’s nature needs to be considered when classical class imbalance approaches are adapted to deep learning and big data scenarios.
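As a rough illustration of the SMOTE step named in this abstract, the sketch below creates synthetic minority points by interpolating between a minority sample and one of its nearest minority neighbours. It is a minimal pure-Python sketch, not the authors' implementation: the function name `smote`, the parameters `n_new`, `k` and `seed`, and the toy data are all assumptions made for illustration, and the ENN cleaning step is not shown.

```python
import math
import random

def smote(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbours.

    Hypothetical helper for illustration; real projects usually rely on a
    library implementation instead.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base point (excluding itself)
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        mate = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> mate
        synthetic.append(tuple(b + gap * (m - b) for b, m in zip(base, mate)))
    return synthetic

# Toy minority class with four 2-D points (made-up values)
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
new_points = smote(minority, n_new=6)
```

Because each synthetic point is a convex combination of two existing minority points, the new points stay inside the region already occupied by the minority class.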
|
408
|
Nnamoko N, Korkontzelos I. Efficient treatment of outliers and class imbalance for diabetes prediction. Artif Intell Med 2020; 104:101815. [PMID: 32498997 DOI: 10.1016/j.artmed.2020.101815] [Citation(s) in RCA: 32] [Impact Index Per Article: 6.4] [Received: 12/03/2018] [Revised: 01/31/2020] [Accepted: 02/04/2020] [Indexed: 12/12/2022]
Abstract
Learning from outliers and imbalanced data remains one of the major difficulties for machine learning classifiers. Among the numerous techniques dedicated to tackling this problem, data preprocessing solutions are known to be efficient and easy to implement. In this paper, we propose a selective data preprocessing approach that embeds knowledge of the outlier instances into an artificially generated subset to achieve an even distribution. The outliers were first identified and oversampled (irrespective of class); the Synthetic Minority Oversampling TEchnique (SMOTE) was then used to balance the training data by introducing artificial minority instances. The aim is to balance the training dataset while controlling the effect of outliers. The experiments show that such selective oversampling empowers SMOTE, ultimately leading to improved classification performance.
Affiliation(s)
- Nonso Nnamoko: Department of Computer Science, Edge Hill University, Ormskirk, United Kingdom
|
409
|
Fang Y, Liu Y, Huang C, Liu L. FastEmbed: Predicting vulnerability exploitation possibility based on ensemble machine learning algorithm. PLoS One 2020; 15:e0228439. [PMID: 32027693 PMCID: PMC7004314 DOI: 10.1371/journal.pone.0228439] [Citation(s) in RCA: 20] [Impact Index Per Article: 4.0] [Received: 05/09/2019] [Accepted: 01/16/2020] [Indexed: 11/18/2022] Open
Abstract
In recent years, the number of vulnerabilities discovered and publicly disclosed has shown a sharp upward trend. However, the value of exploiting vulnerabilities varies for attackers, and only a small fraction of vulnerabilities are exploited. Therefore, quickly excluding non-exploitable vulnerabilities and prioritizing patches optimally with limited resources has become imperative for organizations. Recent works using machine learning techniques predict exploited vulnerabilities by extracting features from open-source intelligence (OSINT). However, in the face of explosive growth of vulnerability information, there is room for improvement in the application of past methods to multiple threat intelligence. A more general method is needed to deal with various threat intelligence sources. Moreover, previous methods used traditional text processing to handle vulnerability-related descriptions, which captured only static statistical characteristics and ignored the context and meaning of the words in the text. To address these challenges, we propose an exploit prediction model, called fastEmbed, which is based on a combination of the fastText and LightGBM algorithms. We replicate key portions of the state-of-the-art work of exploit prediction and use them as benchmark models. Our model outperforms the baseline model in terms of both generalization ability and prediction ability without temporal intermixing, with an average overall improvement of 6.283%, by learning the embedding of vulnerability-related text on extremely imbalanced data sets. Besides, in terms of predicting the exploits in the wild, our model also outperforms the baseline model with an F1 measure of 0.586 on the minority class (33.577% improvement over the work using features from darkweb/deepweb). The results demonstrate that the model can improve the ability to describe the exploitability of vulnerabilities and predict exploits in the wild effectively.
Affiliation(s)
- Yong Fang: College of Cybersecurity, Sichuan University, Chengdu, Sichuan, P.R. China
- Yongcheng Liu: College of Cybersecurity, Sichuan University, Chengdu, Sichuan, P.R. China
- Cheng Huang (corresponding author): College of Cybersecurity, Sichuan University, Chengdu, Sichuan, P.R. China
- Liang Liu: College of Cybersecurity, Sichuan University, Chengdu, Sichuan, P.R. China
|
410
|
Boosting Minority Class Prediction on Imbalanced Point Cloud Data. APPLIED SCIENCES-BASEL 2020. [DOI: 10.3390/app10030973] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.4] [Indexed: 02/02/2023]
Abstract
Data imbalance during the training of deep networks can cause the network to skip learning minority classes. This paper presents a novel framework by which to train segmentation networks using imbalanced point cloud data. PointNet, an early deep network used for the segmentation of point cloud data, proved effective in the point-wise classification of balanced data; however, performance degraded when imbalanced data was used. The proposed approach involves removing between-class data point imbalances and guiding the network to pay more attention to minority classes. Data imbalance is alleviated using a hybrid-sampling method involving undersampling and oversampling, respectively, to decrease the amount of data in majority classes and increase the amount of data in minority classes. A balanced focus loss function is also used to emphasize the minority classes through the automated assignment of costs to the various classes based on their density in the point cloud. Experiments demonstrate the effectiveness of the proposed training framework on a point cloud dataset pertaining to six objects. The mean intersection over union (mIoU) test accuracy results obtained using PointNet training were as follows: XYZRGB data (91%) and XYZ data (86%). The mIoU test accuracy results obtained using the proposed scheme were as follows: XYZRGB data (98%) and XYZ data (93%).
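The abstract does not give the exact form of its loss function, so the following is only a sketch of the general idea of assigning per-class costs automatically and down-weighting easy examples. It combines inverse-frequency class weights with the standard focal-loss term; the function names, the `gamma` value and the class labels `"floor"` and `"chair"` are assumptions for illustration, not details taken from the paper.

```python
import math
from collections import Counter

def class_weights(labels):
    """Inverse-frequency weights, normalised so they average to 1 across classes."""
    counts = Counter(labels)
    raw = {c: len(labels) / n for c, n in counts.items()}
    mean = sum(raw.values()) / len(raw)
    return {c: w / mean for c, w in raw.items()}

def weighted_focal_loss(p_true, weight, gamma=2.0):
    """Focal loss for one point; p_true is the predicted probability of its true class."""
    # (1 - p_true) ** gamma shrinks the loss of points that are already well classified
    return -weight * (1.0 - p_true) ** gamma * math.log(p_true)

# Toy label distribution: 90 majority points, 10 minority points (made-up)
labels = ["floor"] * 90 + ["chair"] * 10
w = class_weights(labels)
loss_minority = weighted_focal_loss(0.6, w["chair"])
loss_majority = weighted_focal_loss(0.6, w["floor"])
```

With these toy counts, a minority-class point contributes nine times the loss of a majority-class point predicted with the same confidence.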
|
411
|
Kundu S, Ari S. MsCNN: A Deep Learning Framework for P300-Based Brain–Computer Interface Speller. IEEE Trans Med Robot Bionics 2020. [DOI: 10.1109/tmrb.2019.2959559] [Citation(s) in RCA: 17] [Impact Index Per Article: 3.4] [Indexed: 11/08/2022]
|
412
|
Zheng M, Li T, Zhu R, Tang Y, Tang M, Lin L, Ma Z. Conditional Wasserstein generative adversarial network-gradient penalty-based approach to alleviating imbalanced data classification. Inf Sci (N Y) 2020. [DOI: 10.1016/j.ins.2019.10.014] [Citation(s) in RCA: 34] [Impact Index Per Article: 6.8] [Indexed: 10/25/2022]
|
413
|
Xiao J, Zhou X, Zhong Y, Xie L, Gu X, Liu D. Cost-sensitive semi-supervised selective ensemble model for customer credit scoring. Knowl Based Syst 2020. [DOI: 10.1016/j.knosys.2019.105118] [Citation(s) in RCA: 17] [Impact Index Per Article: 3.4] [Indexed: 10/25/2022]
|
414
|
A gravitational density-based mass sharing method for imbalanced data classification. SN APPLIED SCIENCES 2020. [DOI: 10.1007/s42452-020-2039-2] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Indexed: 11/27/2022] Open
|
415
|
Zhu Z, Wang Z, Li D, Du W. NearCount: Selecting critical instances based on the cited counts of nearest neighbors. Knowl Based Syst 2020. [DOI: 10.1016/j.knosys.2019.105196] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.8] [Indexed: 11/29/2022]
|
416
|
Fountain-Jones NM, Clark NJ, Kinsley AC, Carstensen M, Forester J, Johnson TJ, Miller EA, Moore S, Wolf TM, Craft ME. Microbial associations and spatial proximity predict North American moose (Alces alces) gastrointestinal community composition. J Anim Ecol 2020; 89:817-828. [PMID: 31782152 DOI: 10.1111/1365-2656.13154] [Citation(s) in RCA: 14] [Impact Index Per Article: 2.8] [Received: 01/06/2019] [Accepted: 11/04/2019] [Indexed: 01/04/2023]
Abstract
Microbial communities are increasingly recognized as crucial for animal health. However, our understanding of how microbial communities are structured across wildlife populations is poor. Mechanisms such as interspecific associations are important in structuring free-living communities, but we still lack an understanding of how important interspecific associations are in structuring gut microbial communities in comparison with other factors such as host characteristics or spatial proximity of hosts. Here, we ask how gut microbial communities are structured in a population of North American moose Alces alces. We identify key microbial interspecific associations within the moose gut and quantify how important they are relative to key host characteristics, such as body condition, for predicting microbial community composition. We sampled gut microbial communities from 55 moose in a population experiencing decline due to a myriad of factors, including pathogens and malnutrition. We examined microbial community dynamics in this population utilizing novel graphical network models that can explicitly incorporate spatial information. We found that interspecific associations were the most important mechanism structuring gut microbial communities in moose and detected both positive and negative associations. Models only accounting for associations between microbes had higher predictive value compared to models including moose sex, evidence of previous pathogen exposure or body condition. Adding spatial information on moose location further strengthened our model and allowed us to predict microbe occurrences with ~90% accuracy. Collectively, our results suggest that microbial interspecific associations coupled with host spatial proximity are vital in shaping gut microbial communities in a large herbivore. In this case, previous pathogen exposure and moose body condition were not as important in predicting gut microbial community composition. The approach applied here can be used to quantify interspecific associations and gain a more nuanced understanding of the spatial and host factors shaping microbial communities in non-model hosts.
Affiliation(s)
- Nicholas J Clark: UQ Spatial Epidemiology Laboratory, School of Veterinary Science, The University of Queensland, Gatton, Qld, Australia
- Amy C Kinsley: Department of Veterinary Population Medicine, University of Minnesota, St Paul, MN, USA; Center for Animal Health and Food Safety, University of Minnesota, St Paul, MN, USA
- Michelle Carstensen: Minnesota Department of Natural Resources, Wildlife Health Program, Forest Lake, MN, USA
- James Forester: Department of Fisheries, Wildlife and Conservation Biology, University of Minnesota, St Paul, MN, USA
- Timothy J Johnson: Center for Animal Health and Food Safety, University of Minnesota, St Paul, MN, USA
- Elizabeth A Miller: Center for Animal Health and Food Safety, University of Minnesota, St Paul, MN, USA
- Seth Moore: Department of Biology and Environment, Grand Portage Band of Chippewa, Grand Portage, MN, USA
- Tiffany M Wolf: Department of Veterinary Population Medicine, University of Minnesota, St Paul, MN, USA
- Meggan E Craft: Department of Veterinary Population Medicine, University of Minnesota, St Paul, MN, USA
|
418
|
Recent Advances in Evapotranspiration Estimation Using Artificial Intelligence Approaches with a Focus on Hybridization Techniques—A Review. AGRONOMY-BASEL 2020. [DOI: 10.3390/agronomy10010101] [Citation(s) in RCA: 20] [Impact Index Per Article: 4.0] [Indexed: 11/17/2022]
Abstract
Formulating hydrological processes, including evapotranspiration (ET), is difficult, and conventional empirical methods for doing so have shortcomings. The artificial intelligence approach emerges as the best possible solution to map the relationships between climatic parameters and ET, even with limited knowledge of the interactions between variables. This review presents the state-of-the-art application of artificial intelligence models in ET estimation, along with different types and sources of data. It identifies the most significant climatic parameters for different climate patterns and explores the characteristics of the basic artificial intelligence models. To overcome the pitfalls of the individual models, hybrid models are introduced that use techniques such as data fusion and ensemble modeling, data decomposition, and remote sensing-based hybridization. In particular, the principles and applications of the hybridization techniques, as well as their combinations with basic models, are explained. The review covers most of the related and excellent papers published from 2011 to 2019 to remain relevant in terms of time frame and field of study. Guidelines for future research on ET estimation are proposed. It is anticipated that such work could contribute to the development of agriculture-based economies.
|
419
|
Bashir K, Li T, Yohannese CW, Yahaya M. SMOTEFRIS-INFFC: Handling the challenge of borderline and noisy examples in imbalanced learning for software defect prediction. JOURNAL OF INTELLIGENT & FUZZY SYSTEMS 2020. [DOI: 10.3233/jifs-179459] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.2] [Indexed: 11/15/2022]
Affiliation(s)
- Kamal Bashir: School of Information Science and Technology, Southwest Jiaotong University, Chengdu, China; Department of Information Technology, College of Computer Science and Information Technology, Karary University, Omdurman, Sudan
- Tianrui Li: School of Information Science and Technology, Southwest Jiaotong University, Chengdu, China
- Mahama Yahaya: School of Transport and Logistics Engineering, Southwest Jiaotong University, Chengdu, China
|
420
|
Thanh LT, Dao NTA, Dung NV, Trung NL, Abed-Meraim K. Multi-channel EEG epileptic spike detection by a new method of tensor decomposition. J Neural Eng 2020; 17:016023. [PMID: 31905174 DOI: 10.1088/1741-2552/ab5247] [Citation(s) in RCA: 16] [Impact Index Per Article: 3.2] [Indexed: 11/11/2022]
Abstract
OBJECTIVE: Epilepsy is one of the most common brain disorders. For epilepsy diagnosis or treatment, the neurologist needs to observe epileptic spikes from electroencephalography (EEG) data. Since multi-channel EEG records can be naturally represented by multi-way tensors, it is of interest to see whether tensor decomposition is able to analyze EEG epileptic spikes. APPROACH: In this paper, we first proposed the problem of simultaneous multilinear low-rank approximation of tensors (SMLRAT) and proved that SMLRAT can obtain local optimum solutions by using two well-known tensor decomposition algorithms (HOSVD and Tucker-ALS). Second, we presented a new system for automatic epileptic spike detection based on SMLRAT. MAIN RESULTS: We propose to formulate the problem of feature extraction from a set of EEG segments, represented by tensors, as the SMLRAT problem. Efficient EEG features were obtained, based on estimating the 'eigenspikes' derived from nonnegative GSMLRAT. We compared the proposed tensor analysis method with other common tensor methods in analyzing EEG signals and compared the proposed feature extraction method with the state-of-the-art methods. Experimental results indicated that our proposed method is able to detect epileptic spikes with high accuracy. SIGNIFICANCE: Our method, for the first time, makes a step forward for automatic detection of EEG epileptic spikes based on tensor decomposition. The method can provide a practical solution to distinguish epileptic spikes from artifacts in real-life EEG datasets.
Affiliation(s)
- Le Trung Thanh: Advanced Institute of Engineering and Technology (AVITECH), VNU University of Engineering and Technology, Vietnam National University, Hanoi, Vietnam
|
421
|
Margin-Based Pareto Ensemble Pruning: An Ensemble Pruning Algorithm That Learns to Search Optimized Ensembles. COMPUTATIONAL INTELLIGENCE AND NEUROSCIENCE 2020; 2019:7560872. [PMID: 31281338 PMCID: PMC6589231 DOI: 10.1155/2019/7560872] [Citation(s) in RCA: 15] [Impact Index Per Article: 3.0] [Received: 12/07/2018] [Revised: 03/06/2019] [Accepted: 04/09/2019] [Indexed: 11/22/2022]
Abstract
The ensemble pruning system is an effective machine learning framework that combines several learners as experts to classify a test set. Generally, ensemble pruning systems aim to define a region of competence based on the validation set to select the most competent ensembles from the ensemble pool with respect to the test set. However, the size of the ensemble pool is usually fixed, and the performance of an ensemble pool heavily depends on the definition of the region of competence. In this paper, a dynamic pruning framework called margin-based Pareto ensemble pruning is proposed for ensemble pruning systems. The framework explores the optimized ensemble pool size during the overproduction stage and finetunes the experts during the pruning stage. The Pareto optimization algorithm is used to explore the size of the overproduction ensemble pool that can result in better performance. Considering the information entropy of the learners in the indecision region, the marginal criterion for each learner in the ensemble pool is calculated using margin criterion pruning, which prunes the experts with respect to the test set. The effectiveness of the proposed method for classification tasks is assessed using datasets. The results show that margin-based Pareto ensemble pruning can achieve smaller ensemble sizes and better classification performance in most datasets when compared with state-of-the-art models.
|
423
|
A Correction Method of a Base Classifier Applied to Imbalanced Data Classification. LECTURE NOTES IN COMPUTER SCIENCE 2020. [PMCID: PMC7303714 DOI: 10.1007/978-3-030-50423-6_7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/05/2022]
Abstract
In this paper, the issue of tailoring the soft confusion matrix classifier to deal with imbalanced data is addressed. This is done by changing the definition of the soft neighbourhood of the classified object. The first approach is to change the neighbourhood to be more local by changing the Gaussian potential function approach to the nearest neighbour rule. The second one is to weight the instances that are included in the neighbourhood. The instances are weighted inversely proportional to the a priori class probability. The experimental results show that for one of the investigated base classifiers, the usage of the KNN neighbourhood significantly improves the classification results. What is more, the application of the weighting schema also offers a significant improvement.
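The weighting idea in this abstract, in which neighbours count in inverse proportion to their class's a priori probability, can be shown with a plain nearest-neighbour vote. This is a simplified sketch of the weighting alone, not the soft confusion matrix classifier itself; the function name `weighted_knn_predict`, the labels `"maj"` and `"min"`, and the toy points are assumptions for illustration.

```python
import math
from collections import Counter, defaultdict

def weighted_knn_predict(train, query, k=5):
    """train: list of (point, label). Each neighbour votes with weight 1 / prior(label)."""
    counts = Counter(label for _, label in train)
    n = len(train)
    neighbours = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = defaultdict(float)
    for _, label in neighbours:
        votes[label] += n / counts[label]  # equals 1 / (counts[label] / n)
    return max(votes, key=votes.get)

# Toy data: 8 majority points and 2 minority points (made-up)
train = [((0, 0), "maj"), ((0, 1), "maj"), ((1, 0), "maj"), ((1, 1), "maj"),
         ((5, 5), "maj"), ((5, 6), "maj"), ((6, 5), "maj"), ((6, 6), "maj"),
         ((2, 2), "min"), ((2.2, 2.1), "min")]
prediction = weighted_knn_predict(train, (1.8, 1.8), k=5)
```

In this toy case the five nearest neighbours contain three majority and two minority points, so an unweighted vote would pick the majority class, while the inverse-prior weights (1.25 per majority vote, 5 per minority vote) pick the minority class.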
|
424
|
Yoosefzadeh-Najafabadi M, Earl HJ, Tulpan D, Sulik J, Eskandari M. Application of Machine Learning Algorithms in Plant Breeding: Predicting Yield From Hyperspectral Reflectance in Soybean. FRONTIERS IN PLANT SCIENCE 2020; 11:624273. [PMID: 33510761 PMCID: PMC7835636 DOI: 10.3389/fpls.2020.624273] [Citation(s) in RCA: 56] [Impact Index Per Article: 11.2] [Received: 10/31/2020] [Accepted: 12/10/2020] [Indexed: 05/20/2023]
Abstract
Recent substantial advances in high-throughput field phenotyping have provided plant breeders with affordable and efficient tools for evaluating a large number of genotypes for important agronomic traits at early growth stages. Nevertheless, using the large datasets generated by high-throughput phenotyping tools such as hyperspectral reflectance in cultivar development programs is still challenging because it requires intensive knowledge of computational and statistical analyses. In this study, the robustness of three common machine learning (ML) algorithms, multilayer perceptron (MLP), support vector machine (SVM), and random forest (RF), was evaluated for predicting soybean (Glycine max) seed yield using hyperspectral reflectance. To this end, hyperspectral reflectance data covering the full spectrum from 395 to 1005 nm were collected at the R4 and R5 growth stages on 250 soybean genotypes grown in four environments. The recursive feature elimination (RFE) approach was used to reduce the dimensionality of the hyperspectral reflectance data and select variables with the largest importance values. The results indicated that R5 is the more informative stage for measuring hyperspectral reflectance to predict seed yields. The 395 nm reflectance band was also identified as a high-ranked band for predicting soybean seed yield. Using either the full or the selected variables as inputs, the ML algorithms were evaluated individually and in combination, using the ensemble-stacking (E-S) method, to predict soybean yield. The RF algorithm had the highest performance among the individually tested algorithms, with a yield classification accuracy of 84%. Therefore, with RF selected as the metaClassifier for the E-S method, the prediction accuracy increased to 0.93 using all variables and 0.87 using selected variables, showing the success of E-S as an ensemble technique. This study demonstrated that soybean breeders could apply the E-S algorithm, using either the full or the selected spectral reflectance, to select high-yielding soybean genotypes from among a large number of genotypes at early growth stages.
Affiliation(s)
- Hugh J. Earl: Department of Plant Agriculture, University of Guelph, Guelph, ON, Canada
- Dan Tulpan: Department of Animal Biosciences, University of Guelph, Guelph, ON, Canada
- John Sulik: Department of Plant Agriculture, University of Guelph, Guelph, ON, Canada
- Milad Eskandari (correspondence): Department of Plant Agriculture, University of Guelph, Guelph, ON, Canada
|
425
|
Krzhizhanovskaya VV, Závodszky G, Lees MH, Dongarra JJ, Sloot PMA, Brissos S, Teixeira J. Clustering and Weighted Scoring in Geometric Space Support Vector Machine Ensemble for Highly Imbalanced Data Classification. LECTURE NOTES IN COMPUTER SCIENCE 2020. [PMCID: PMC7303710 DOI: 10.1007/978-3-030-50423-6_10] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Indexed: 11/30/2022]
Abstract
Learning from imbalanced datasets is a challenging task for standard classification algorithms. In general, there are two main approaches to solve the problem of imbalanced data: algorithm-level and data-level solutions. This paper deals with the second approach. In particular, this paper shows a new proposition for calculating the weighted score function to use in the integration phase of the multiple classification system. The presented research includes experimental evaluation over multiple, open-source, highly imbalanced datasets, presenting the results of comparing the proposed algorithm with three other approaches in the context of six performance measures. Comprehensive experimental results show that the proposed algorithm has better performance measures than the other ensemble methods for highly imbalanced datasets.
|
426
|
Sánchez-Medina AJ, Galván-Sánchez I, Fernández-Monroy M. Applying artificial intelligence to explore sexual cyberbullying behaviour. Heliyon 2020; 6:e03218. [PMID: 32042968 PMCID: PMC7002833 DOI: 10.1016/j.heliyon.2020.e03218] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.2] [Received: 10/16/2018] [Revised: 09/02/2019] [Accepted: 01/10/2020] [Indexed: 11/29/2022] Open
Abstract
Sexual cyberbullying is becoming a serious problem in today's society. In the workplace, this issue is more complex because of the power imbalance between potential perpetrators and victims. Preventing sexual cyberbullying in organizations is very important for a safe and respectful workplace. Occupational Safety and Health (OSH) standards establish certain policies to be considered to create an organizational culture based on zero tolerance of sexual cyberbullying. The research aims to broaden knowledge about personality and sexual cyberbullying. To that end, this paper proposes a tool to explore potential sexual cyberbullying behaviour. This study analysed how personality traits, particularly those related to the Dark Triad (psychopathy, Machiavellianism and narcissism), might influence this behaviour. Participants (N = 374) were Spanish young adults recruited through convenience sampling. The methodology focused on structural equation modelling and ensembles of classification trees. First, we tested the proposed hypotheses with covariance-based structural equation modelling using the lavaan R package. Second, for the ensembles of classification trees, we applied the randomForest and adabag (bagging and boosting) packages in R. Results suggest that high levels of psychopathy and Machiavellianism are more likely to be related to sexual cyberbullying behaviours. Organizations could use the tool proposed in this research to develop internal policies and procedures for detection and deterrence of potential cyberbullying behaviours. By raising awareness about cyberbullying behaviour, including its conceptualisation and measurement, in training courses, organizations might build an organizational culture based on a respectful workplace without sexual cyberbullying behaviours.
Affiliation(s)
| | | | - Margarita Fernández-Monroy
- Instituto Universitario de Ciencias y Tecnologías Cibernéticas, Universidad de Las Palmas de Gran Canaria, Spain
| |
|
427
|
KNN-Based Overlapping Samples Filter Approach for Classification of Imbalanced Data. SOFTWARE ENGINEERING RESEARCH, MANAGEMENT AND APPLICATIONS 2020. [DOI: 10.1007/978-3-030-24344-9_4] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Indexed: 01/13/2023]
|
428
|
Maji RK, Khatua S, Ghosh Z. A Supervised Ensemble Approach for Sensitive microRNA Target Prediction. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2020; 17:37-46. [PMID: 30040648 DOI: 10.1109/tcbb.2018.2858252] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Indexed: 06/08/2023]
Abstract
MicroRNAs, a class of small non-coding RNAs, regulate important biological functions via post-transcriptional regulation of messenger RNAs (mRNAs). Despite rapid development in miRNA research, precise experimental methods to determine miRNA target interactions are still lacking. This motivated us to explore the in silico target interaction features and incorporate them in predictive modeling. We propose a systematic approach towards developing a sensitive miRNA target prediction model to explore the interplay of target recognition features. In the first step, we have employed a supervised ensemble under-sampling approach to address the problem of imbalance in the training dataset due to a larger number of negative instances. Various feature selection techniques were evaluated to obtain the optimal feature subset that best recognizes the true miRNA-mRNA targets. In the second step, we have built our optimal model, miRTPred, a novel blending ensemble-based approach that combines the predictions of the best performing traditional and classical ensemble models, through a weighted voting classifier, achieving a sensitivity of 87 percent and F1-score of 0.88 for 3'UTR region of the mRNA transcript. miRTPred outperforms popular machine learning (ML) and non-ML approaches to target prediction algorithms. miRTPred is freely available at http://bicresources.jcbose.ac.in/zhumur/mirtpred.
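The weighted voting step mentioned in the abstract can be illustrated with a minimal sketch. This is not the miRTPred implementation: the function name `weighted_soft_vote`, the model probabilities, and the weights below are all hypothetical, and soft voting over positive-class probabilities is only one common way to realize a weighted voting classifier.

```python
from typing import List, Sequence


def weighted_soft_vote(probabilities: Sequence[Sequence[float]],
                       weights: Sequence[float],
                       threshold: float = 0.5) -> List[int]:
    """Combine per-model positive-class probabilities with a weighted average.

    probabilities[m][i] is model m's probability that sample i is a true target.
    """
    if len(probabilities) != len(weights):
        raise ValueError("one weight per model is required")
    total = sum(weights)
    n_samples = len(probabilities[0])
    labels = []
    for i in range(n_samples):
        # Weighted mean of the base models' probabilities for sample i.
        score = sum(w * p[i] for w, p in zip(weights, probabilities)) / total
        labels.append(1 if score >= threshold else 0)
    return labels


# Hypothetical outputs of three base models on four candidate target sites.
model_probs = [
    [0.9, 0.4, 0.2, 0.6],
    [0.8, 0.6, 0.1, 0.4],
    [0.7, 0.3, 0.4, 0.7],
]
model_weights = [0.5, 0.3, 0.2]  # illustrative weights, not taken from the paper
print(weighted_soft_vote(model_probs, model_weights))  # [1, 0, 0, 1]
```

Giving better-performing base models larger weights lets them dominate borderline cases, while the weaker models still break ties.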
|
429
|
Li Y, Umbach DM, Bingham A, Li QJ, Zhuang Y, Li L. Putative biomarkers for predicting tumor sample purity based on gene expression data. BMC Genomics 2019; 20:1021. [PMID: 31881847 PMCID: PMC6933652 DOI: 10.1186/s12864-019-6412-8] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.7] [Received: 06/28/2019] [Accepted: 12/18/2019] [Indexed: 12/29/2022] Open
Abstract
Background Tumor purity is the percent of cancer cells present in a sample of tumor tissue. The non-cancerous cells (immune cells, fibroblasts, etc.) have an important role in tumor biology. The ability to determine tumor purity is important to understand the roles of cancerous and non-cancerous cells in a tumor. Methods We applied a supervised machine learning method, XGBoost, to data from 33 TCGA tumor types to predict tumor purity using RNA-seq gene expression data. Results Across the 33 tumor types, the median correlation between observed and predicted tumor-purity ranged from 0.75 to 0.87 with small root mean square errors, suggesting that tumor purity can be accurately predicted using expression data. We further confirmed that expression levels of a ten-gene set (CSF2RB, RHOH, C1S, CCDC69, CCL22, CYTIP, POU2AF1, FGR, CCL21, and IL7R) were predictive of tumor purity regardless of tumor type. We tested whether our set of ten genes could accurately predict tumor purity of a TCGA-independent data set. We showed that expression levels from our set of ten genes were highly correlated (ρ = 0.88) with the actual observed tumor purity. Conclusions Our analyses suggested that the ten-gene set may serve as a biomarker for tumor purity prediction using gene expression data.
Affiliation(s)
- Yuanyuan Li
- Biostatistics and Computational Biology Branch, National Institute of Environmental Health Sciences, Research Triangle Park, North Carolina 27709, USA MD A3-03, Durham, NC, 27709, USA.
| | - David M Umbach
- Biostatistics and Computational Biology Branch, National Institute of Environmental Health Sciences, Research Triangle Park, North Carolina 27709, USA MD A3-03, Durham, NC, 27709, USA
| | - Adrienna Bingham
- Biostatistics and Computational Biology Branch, National Institute of Environmental Health Sciences, Research Triangle Park, North Carolina 27709, USA MD A3-03, Durham, NC, 27709, USA
| | - Qi-Jing Li
- Department of Immunology, Duke University, Durham, North Carolina, 27710, USA
| | - Yuan Zhuang
- Department of Immunology, Duke University, Durham, North Carolina, 27710, USA
| | - Leping Li
- Biostatistics and Computational Biology Branch, National Institute of Environmental Health Sciences, Research Triangle Park, North Carolina 27709, USA MD A3-03, Durham, NC, 27709, USA
| |
|
430
|
Nalluri MR, Kannan K, Gao XZ, Roy DS. Multiobjective hybrid monarch butterfly optimization for imbalanced disease classification problem. INT J MACH LEARN CYB 2019. [DOI: 10.1007/s13042-019-01047-9] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.7] [Indexed: 12/26/2022]
|
431
|
Czarnowski I, Jędrzejowicz P. Data reduction and stacking for imbalanced data classification. JOURNAL OF INTELLIGENT & FUZZY SYSTEMS 2019. [DOI: 10.3233/jifs-179335] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.8] [Indexed: 11/15/2022]
Affiliation(s)
- Ireneusz Czarnowski
- Department of Information Systems, Gdynia Maritime University, Morska, Gdynia, Poland
| | - Piotr Jędrzejowicz
- Department of Information Systems, Gdynia Maritime University, Morska, Gdynia, Poland
| |
|
432
|
GIS Based Novel Hybrid Computational Intelligence Models for Mapping Landslide Susceptibility: A Case Study at Da Lat City, Vietnam. SUSTAINABILITY 2019. [DOI: 10.3390/su11247118] [Citation(s) in RCA: 29] [Impact Index Per Article: 4.8] [Indexed: 11/17/2022]
Abstract
Landslides affect properties and the lives of a large number of people in many hilly parts of Vietnam and in the world. Damages caused by landslides can be reduced by understanding the distribution, nature, mechanisms and causes of landslides with the help of model studies for better planning and risk management of the area. Development of landslide susceptibility maps is one of the main steps in landslide management. In this study, the main objective is to develop GIS based hybrid computational intelligence models to generate landslide susceptibility maps of the Da Lat province, which is one of the landslide prone regions of Vietnam. Novel hybrid models of alternating decision trees (ADT) with various ensemble methods, namely bagging, dagging, MultiBoostAB, and RealAdaBoost, were developed and named B-ADT, D-ADT, MBAB-ADT, and RAB-ADT, respectively. Data of 72 past landslide events were used in conjunction with 11 landslide conditioning factors (curvature, distance from geological boundaries, elevation, land use, Normalized Difference Vegetation Index (NDVI), relief amplitude, stream density, slope, lithology, weathering crust and soil) in the development and validation of the models. Area under the receiver operating characteristic (ROC) curve (AUC) and several statistical measures were applied to validate these models. Results indicated that the performance of all the models was good (AUC value greater than 0.8) but the B-ADT model performed the best (AUC = 0.856). Landslide susceptibility maps generated using the proposed models would be helpful to decision makers in the risk management for land use planning and infrastructure development.
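The AUC used above to validate the models can be computed directly from scores and labels. The sketch below is a generic illustration, not the study's code: the function name `roc_auc` and the sample scores are hypothetical, and it uses the pairwise (Mann-Whitney) formulation, in which AUC is the probability that a randomly chosen positive case scores higher than a randomly chosen negative one.

```python
from typing import Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical susceptibility scores for 4 landslide and 4 non-landslide cells.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_score = [0.9, 0.8, 0.6, 0.3, 0.7, 0.4, 0.2, 0.1]
print(roc_auc(y_true, y_score))  # 0.8125
```

The double loop is quadratic in the number of samples, which is acceptable for a dataset of this size; a rank-based computation would be preferable for large rasters.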
|
433
|
Predictive Modeling of ICU Healthcare-Associated Infections from Imbalanced Data. Using Ensembles and a Clustering-Based Undersampling Approach. APPLIED SCIENCES-BASEL 2019. [DOI: 10.3390/app9245287] [Citation(s) in RCA: 12] [Impact Index Per Article: 2.0] [Indexed: 11/17/2022]
Abstract
Early detection of patients vulnerable to infections acquired in the hospital environment is a challenge in current health systems given the impact that such infections have on patient mortality and healthcare costs. This work is focused on both the identification of risk factors and the prediction of healthcare-associated infections in intensive-care units by means of machine-learning methods. The aim is to support decision making addressed at reducing the incidence rate of infections. In this field, it is necessary to deal with the problem of building reliable classifiers from imbalanced datasets. We propose a clustering-based undersampling strategy to be used in combination with ensemble classifiers. A comparative study with data from 4616 patients was conducted in order to validate our proposal. We applied several single and ensemble classifiers both to the original dataset and to data preprocessed by means of different resampling methods. The results were analyzed by means of classic and recent metrics specifically designed for imbalanced data classification. They revealed that the proposal is more efficient in comparison with other approaches.
|
434
|
Fault Diagnosis of Rotating Electrical Machines Using Multi-Label Classification. APPLIED SCIENCES-BASEL 2019. [DOI: 10.3390/app9235086] [Citation(s) in RCA: 28] [Impact Index Per Article: 4.7] [Indexed: 11/16/2022]
Abstract
Fault Detection and Diagnosis of electrical machine and drive systems are of utmost importance in modern industrial automation. The widespread use of Machine Learning techniques has made it possible to replace traditional motor fault detection techniques with more efficient solutions that are capable of early fault recognition by using large amounts of sensory data. However, the detection of concurrent failures is still a challenge in the presence of disturbing noises or when the multiple faults cause overlapping features. Multi-label classification has recently gained popularity in various application domains as an efficient method for fault detection and monitoring of systems with promising results. The contribution of this work is to propose a novel methodology for multi-label classification for simultaneously diagnosing multiple faults and evaluating the fault severity under noisy conditions. In this research, the Electrical Signature Analysis as well as traditional vibration data have been considered for modeling. Furthermore, the performance of various multi-label classification models is compared. Current and vibration signals are acquired under normal and fault conditions. The applicability of the proposed method is experimentally validated under diverse fault conditions such as unbalance and misalignment.
|
435
|
|
436
|
Prediction of Breast Cancer from Imbalance Respect Using Cluster-Based Undersampling Method. JOURNAL OF HEALTHCARE ENGINEERING 2019; 2019:7294582. [PMID: 31737241 PMCID: PMC6817921 DOI: 10.1155/2019/7294582] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.0] [Received: 11/17/2018] [Revised: 04/03/2019] [Accepted: 06/10/2019] [Indexed: 11/18/2022]
Abstract
To overcome the two-class imbalanced problem existing in the diagnosis of breast cancer, a hybrid of K-means and Boosted C5.0 (K-Boosted C5.0) is proposed which is based on undersampling. K-means is utilized to select the informative samples near the boundary. During the training phase, the K-means algorithm clusters the majority and minority instances and selects a similar number of instances from each cluster. Boosted C5.0 is then used as the classifier. As there is one different instance selection factor via clustering that encourages the diversity of the training subspace in K-Boosted C5.0, it would be a great advantage to get better performance. To test the performance of the new hybrid classifier, it is implemented on 12 small-scale and 2 large-scale datasets, which are the often used datasets in class imbalanced learning. The extensive experimental results show that our proposed hybrid method outperforms most of the competitive algorithms in terms of Matthews' correlation coefficient (MCC) and accuracy indices. It can be a good alternative to the well-known machine learning methods.
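The idea of cluster-based undersampling described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the K-Boosted C5.0 procedure: it clusters only the majority class with a plain Lloyd's k-means and keeps an equal number of points from each cluster, and the names `kmeans`, `cluster_undersample`, and the toy data are hypothetical.

```python
import random
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def squared_distance(a: Point, b: Point) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points: Sequence[Point], k: int, iterations: int = 20,
           seed: int = 0) -> List[int]:
    """Plain Lloyd's k-means; returns the cluster index of every point."""
    rng = random.Random(seed)
    centroids = rng.sample(list(points), k)
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assign each point to its nearest centroid.
        assignment = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old centroid if a cluster empties
                centroids[c] = tuple(sum(dim) / len(members)
                                     for dim in zip(*members))
    return assignment


def cluster_undersample(majority: Sequence[Point], n_keep: int, k: int,
                        seed: int = 0) -> List[Point]:
    """Keep about n_keep majority points, drawn evenly from k clusters."""
    rng = random.Random(seed)
    assignment = kmeans(majority, k, seed=seed)
    per_cluster = max(1, n_keep // k)
    kept: List[Point] = []
    for c in range(k):
        members = [p for p, a in zip(majority, assignment) if a == c]
        kept.extend(rng.sample(members, min(per_cluster, len(members))))
    return kept


# Toy data: the majority class forms two well-separated groups.
majority = [(0.0, 0.1), (0.1, 0.0), (0.2, 0.2), (0.1, 0.3),
            (5.0, 5.1), (5.2, 4.9), (4.9, 5.0), (5.1, 5.2)]
minority = [(2.5, 2.5), (2.6, 2.4)]
reduced_majority = cluster_undersample(majority, n_keep=len(minority), k=2)
print(reduced_majority)
```

Sampling per cluster, rather than uniformly at random, keeps every region of the majority class represented after the reduction, which is the diversity argument the abstract makes.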
|
437
|
Wang Z, Wang B, Zhou Y, Li D, Yin Y. Weight-based multiple empirical kernel learning with neighbor discriminant constraint for heart failure mortality prediction. J Biomed Inform 2019; 101:103340. [PMID: 31756495 DOI: 10.1016/j.jbi.2019.103340] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.5] [Received: 11/21/2018] [Revised: 06/14/2019] [Accepted: 11/10/2019] [Indexed: 11/16/2022]
Abstract
Heart Failure (HF) is one of the most common causes of hospitalization and is burdened by short-term (in-hospital) and long-term (6-12 month) mortality. Accurate prediction of HF mortality plays a critical role in evaluating early treatment effects. However, due to the lack of a simple and effective prediction model, mortality prediction of HF is difficult, resulting in a low rate of control. To handle this issue, we propose a Weight-based Multiple Empirical Kernel Learning with Neighbor Discriminant Constraint (WMEKL-NDC) method for HF mortality prediction. In our method, feature selection by calculating the F-value of each feature is first performed to identify the crucial clinical features. Then, different weights are assigned to each empirical kernel space according to the centered kernel alignment criterion. To make use of the discriminant information of samples, a neighbor discriminant constraint is finally integrated into the multiple empirical kernel learning framework. Extensive experiments were performed on a real clinical dataset containing 10,198 in-patient records collected from Shanghai Shuguang Hospital in March 2009 and April 2016. Experimental results demonstrate that our proposed WMEKL-NDC method achieves a highly competitive performance for in-hospital, 30-day and 1-year HF mortality prediction. Compared with the state-of-the-art multiple kernel learning and baseline algorithms, our proposed WMEKL-NDC is more accurate on mortality prediction. Moreover, the top 10 crucial clinical features are identified together with their meanings, which are very useful to assist clinicians in the treatment of HF disease.
Affiliation(s)
- Zhe Wang
- Key Laboratory of Advanced Control and Optimization for Chemical Processes, Ministry of Education, East China University of Science and Technology, Shanghai 200237, China; Department of Computer Science and Engineering, East China University of Science and Technology, Shanghai 200237, China.
| | - Bolu Wang
- Department of Computer Science and Engineering, East China University of Science and Technology, Shanghai 200237, China
| | - Yangming Zhou
- Department of Computer Science and Engineering, East China University of Science and Technology, Shanghai 200237, China.
| | - Dongdong Li
- Department of Computer Science and Engineering, East China University of Science and Technology, Shanghai 200237, China
| | - Yichao Yin
- Shanghai Shuguang Hospital, Shanghai 200021, China
| |
|
438
|
Comparative Analysis of Rainfall Prediction Models Using Machine Learning in Islands with Complex Orography: Tenerife Island. APPLIED SCIENCES-BASEL 2019. [DOI: 10.3390/app9224931] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.5] [Indexed: 11/16/2022]
Abstract
We present a comparative study between predictive monthly rainfall models for islands of complex orography using machine learning techniques. The models have been developed for the island of Tenerife (Canary Islands). Weather forecasting is influenced both by the local geographic characteristics as well as by the time horizon comprised. Accuracy of mid-term rainfall prediction on islands with complex orography is generally low when carried out with atmospheric models. Predictive models based on algorithms such as Random Forest or Extreme Gradient Boosting among others were analyzed. The predictors used in the models include weather predictors measured in two main meteorological stations, reanalysis predictors from the National Oceanic and Atmospheric Administration, and the global predictor North Atlantic Oscillation, all of them obtained over a period of time of more than four decades. When comparing the proposed models, we evaluated accuracy, kappa and interpretability of the model obtained, as well as the relevance of the predictors used. The results show that global predictors such as the North Atlantic Oscillation Index (NAO) have a very low influence, while the local Geopotential Height (GPH) predictor is relatively more important. Machine learning prediction models are a relevant proposition for predicting medium-term precipitation in similar geographical regions.
|
439
|
Ensemble learning via constraint projection and undersampling technique for class-imbalance problem. Soft comput 2019. [DOI: 10.1007/s00500-019-04501-6] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.7] [Indexed: 10/25/2022]
|
440
|
|
441
|
Affinity and class probability-based fuzzy support vector machine for imbalanced data sets. Neural Netw 2019; 122:289-307. [PMID: 31739268 DOI: 10.1016/j.neunet.2019.10.016] [Citation(s) in RCA: 27] [Impact Index Per Article: 4.5] [Received: 11/15/2018] [Revised: 09/13/2019] [Accepted: 10/28/2019] [Indexed: 11/21/2022]
Abstract
The learning problem from imbalanced data sets poses a major challenge in the data mining community. Although the conventional support vector machine can generally show relatively robust performance in dealing with the classification problems of imbalanced data sets, it treats all training samples with the same contribution for learning, which results in the final decision boundary biasing toward the majority class, especially in the presence of outliers or noise. In this paper, we propose a new affinity and class probability-based fuzzy support vector machine technique (ACFSVM). The affinity of a majority class sample is calculated according to a support vector description domain (SVDD) model trained only by the given majority class training samples in a kernel space similar to that used for FSVM learning. The obtained affinity can be used for identifying possible outliers and some border samples existing in the majority class training samples. In order to eliminate the effect of noise, we employ the kernel k-nearest neighbor method to determine the class probability of the majority class samples in the same kernel space as before. The samples with lower class probabilities are more likely to be noise, and their contribution to learning is reduced by the low memberships constructed by combining the affinities and the class probabilities. Thus, ACFSVM can pay more attention to the majority class samples with higher affinities and class probabilities while reducing the effects of those with lower affinities and class probabilities, eventually skewing the final classification boundary toward the majority class. In addition, the minority class samples are assigned relatively high memberships to guarantee their importance for the model learning. The extensive experimental results on different imbalanced datasets from the UCI repository demonstrate that the proposed approach can achieve better generalization performance in terms of G-Mean, F-Measure, and AUC as compared to the other existing imbalanced dataset classification techniques.
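For readers unfamiliar with the G-Mean and F-Measure reported above, the following sketch computes both from confusion-matrix counts using their standard definitions. The function name `gmean_and_f1` and the counts are hypothetical and are not results from the paper.

```python
import math
from typing import Tuple


def gmean_and_f1(tp: int, fn: int, fp: int, tn: int) -> Tuple[float, float]:
    """Return (G-mean, F-measure) for the positive (minority) class.

    Assumes each class has at least one sample and at least one
    positive prediction was made, so no denominator is zero.
    """
    sensitivity = tp / (tp + fn)   # recall on the minority class
    specificity = tn / (tn + fp)   # recall on the majority class
    precision = tp / (tp + fp)
    g_mean = math.sqrt(sensitivity * specificity)
    f_measure = 2 * precision * sensitivity / (precision + sensitivity)
    return g_mean, f_measure


# Hypothetical test set with 10 minority and 990 majority samples.
g, f = gmean_and_f1(tp=8, fn=2, fp=40, tn=950)
accuracy = (8 + 950) / 1000
print(round(accuracy, 3), round(g, 3), round(f, 3))  # 0.958 0.876 0.276
```

In this example accuracy looks high because the majority class dominates, whereas the F-measure exposes the low precision on the minority class. This is why imbalanced-learning studies tend to report these metrics rather than accuracy alone.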
|
442
|
Shukla S, Raghuwanshi BS. Online sequential class-specific extreme learning machine for binary imbalanced learning. Neural Netw 2019; 119:235-248. [DOI: 10.1016/j.neunet.2019.08.018] [Citation(s) in RCA: 15] [Impact Index Per Article: 2.5] [Received: 01/09/2019] [Revised: 07/03/2019] [Accepted: 08/15/2019] [Indexed: 12/25/2022]
|
443
|
Zhu Z, Wang Z, Li D, Du W. Tree-based space partition and merging ensemble learning framework for imbalanced problems. Inf Sci (N Y) 2019. [DOI: 10.1016/j.ins.2019.06.033] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.7] [Indexed: 10/26/2022]
|
444
|
Wang Z, Wang B, Cheng Y, Li D, Zhang J. Cost-sensitive Fuzzy Multiple Kernel Learning for imbalanced problem. Neurocomputing 2019. [DOI: 10.1016/j.neucom.2019.06.065] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.8] [Indexed: 10/26/2022]
|
445
|
High-Resolution Remote Sensing Imagery Classification of Imbalanced Data Using Multistage Sampling Method and Deep Neural Networks. REMOTE SENSING 2019. [DOI: 10.3390/rs11212523] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.7] [Indexed: 11/16/2022]
Abstract
Class imbalance is a key issue for the application of deep learning to remote sensing image classification because a model trained on imbalanced samples has low classification accuracy for minority classes. In this study, an accurate classification approach using the multistage sampling method and deep neural networks was proposed to classify imbalanced data. We first balance samples by multistage sampling to obtain the training sets. Then, a state-of-the-art model is adopted by combining the advantages of atrous spatial pyramid pooling (ASPP) and Encoder-Decoder for pixel-wise classification, which are two different types of fully convolutional networks (FCNs) that can obtain contextual information of multiple levels in the Encoder stage. The details and spatial dimensions of targets are restored using such information during the Decoder stage. We employ four deep learning-based classification algorithms (basic FCN, FCN-8S, ASPP, and Encoder-Decoder with ASPP of our approach) on multistage training sets (original, MUS1, and MUS2) of WorldView-3 images in southeastern Qinghai-Tibet Plateau and GF-2 images in northeastern Beijing for comparison. The experiments show that, compared with existing sets (original, MUS1, and identical) and the existing method (cost weighting), the MUS2 training set of multistage sampling significantly enhances the classification performance for minority classes. Our approach shows distinct advantages for imbalanced data.
|
446
|
Kwon S, Bae H, Jo J, Yoon S. Comprehensive ensemble in QSAR prediction for drug discovery. BMC Bioinformatics 2019; 20:521. [PMID: 31655545 PMCID: PMC6815455 DOI: 10.1186/s12859-019-3135-4] [Citation(s) in RCA: 89] [Impact Index Per Article: 14.8] [Received: 05/02/2019] [Accepted: 10/09/2019] [Indexed: 12/04/2022] Open
Abstract
Background Quantitative structure-activity relationship (QSAR) is a computational modeling method for revealing relationships between structural properties of chemical compounds and biological activities. QSAR modeling is essential for drug discovery, but it has many constraints. Ensemble-based machine learning approaches have been used to overcome constraints and obtain reliable predictions. Ensemble learning builds a set of diversified models and combines them. However, the most prevalent approach, random forest, and other ensemble approaches in QSAR prediction limit their model diversity to a single subject. Results The proposed ensemble method consistently outperformed thirteen individual models on 19 bioassay datasets and demonstrated superiority over other ensemble approaches that are limited to a single subject. The comprehensive ensemble method is publicly available at http://data.snu.ac.kr/QSAR/. Conclusions We propose a comprehensive ensemble method that builds multi-subject diversified models and combines them through second-level meta-learning. In addition, we propose an end-to-end neural network-based individual classifier that can automatically extract sequential features from a simplified molecular-input line-entry system (SMILES). The proposed individual model did not show impressive results on its own, but it was considered the most important predictor when combined, according to the interpretation of the meta-learning.
Affiliation(s)
- Sunyoung Kwon
- Department of Electrical and Computer Engineering, Seoul National University, Seoul, 08826, South Korea; Clova AI Research, NAVER Corp., Seongnam, 13561, South Korea
| | - Ho Bae
- Interdisciplinary Program in Bioinformatics, Seoul National University, Seoul, 08826, South Korea
| | - Jeonghee Jo
- Interdisciplinary Program in Bioinformatics, Seoul National University, Seoul, 08826, South Korea
| | - Sungroh Yoon
- Department of Electrical and Computer Engineering, Seoul National University, Seoul, 08826, South Korea; Interdisciplinary Program in Bioinformatics, Seoul National University, Seoul, 08826, South Korea; Biological Sciences, Seoul National University, Seoul, 08826, South Korea; ASRI and INMC, Seoul National University, Seoul, 08826, South Korea; Institute of Engineering Research, Seoul National University, Seoul, 08826, South Korea
| |
|
447
|
Liu T, Fan W, Wu C. A hybrid machine learning approach to cerebral stroke prediction based on imbalanced medical dataset. Artif Intell Med 2019; 101:101723. [PMID: 31813482 DOI: 10.1016/j.artmed.2019.101723] [Citation(s) in RCA: 47] [Impact Index Per Article: 7.8] [Received: 03/28/2019] [Revised: 08/12/2019] [Accepted: 09/06/2019] [Indexed: 10/25/2022]
Abstract
BACKGROUND AND OBJECTIVE Cerebral stroke has become a significant global public health issue in recent years. The ideal solution to this concern is to prevent it in advance by controlling related metabolic factors. However, it is difficult for medical staff to decide whether special precautions are needed for a potential patient based only on the monitoring of physiological indicators unless they are obviously abnormal. This paper develops a hybrid machine learning approach to predict cerebral stroke for clinical diagnosis based on physiological data with incompleteness and class imbalance. METHODS Two steps are involved in the whole process. Firstly, random forest regression is adopted to impute missing values before classification. Secondly, an automated hyperparameter optimization (AutoHPO) based on a deep neural network (DNN) is applied to stroke prediction on an imbalanced dataset. RESULTS The medical dataset contains 43,400 records of potential patients, which include 783 occurrences of stroke. The false negative rate from our prediction approach is only 19.1%, which has reduced by an average of 51.5% in comparison to other traditional approaches. The false positive rate, accuracy and sensitivity predicted by the proposed approach are respectively 33.1, 71.6, and 67.4%. CONCLUSION The approach proposed in this paper has effectively reduced the false negative rate with a relatively high overall accuracy, which means a successful decrease in the misdiagnosis rate for stroke prediction. The results are more reliable and valid as the reference in stroke prognosis, and can also be acquired conveniently at a low cost.
Affiliation(s)
- Tianyu Liu
- Department of Automation, Tsinghua University, Beijing, China
| | - Wenhui Fan
- Department of Automation, Tsinghua University, Beijing, China.
| | - Cheng Wu
- Department of Automation, Tsinghua University, Beijing, China
| |
|
448
|
Cruz RMO, Souza MA, Sabourin R, Cavalcanti GDC. Dynamic Ensemble Selection and Data Preprocessing for Multi-Class Imbalance Learning. INT J PATTERN RECOGN 2019. [DOI: 10.1142/s0218001419400093] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.2] [Indexed: 11/18/2022]
Abstract
Class imbalance refers to classification problems in which many more instances are available for certain classes than for others. Such imbalanced datasets require special attention because traditional classifiers generally favor the majority class which has a large number of instances. Ensemble of classifiers has been reported to yield promising results. However, the majority of ensemble methods applied to imbalance learning are static ones. Moreover, they only deal with binary imbalanced problems. Hence, this paper presents an empirical analysis of Dynamic Selection techniques and data preprocessing methods for dealing with multi-class imbalanced problems. We considered five variations of preprocessing methods and 14 Dynamic Selection schemes. Our experiments conducted on 26 multi-class imbalanced problems show that the dynamic ensemble improves the AUC and the [Formula: see text]-mean as compared to the static ensemble. Moreover, data preprocessing plays an important role in such cases.
Affiliation(s)
- Rafael M. O. Cruz
- Laboratoire d’Imagerie, de Vision et d’Intelligence Artificielle, École de Technologie Supérieure, Université du Québec, Montreal, QC, Canada H3C 1K3, Canada
| | - Mariana A. Souza
- Centro de Informática, Universidade Federal de Pernambuco, Recife, PE 50.670-420, Brazil
| | - Robert Sabourin
- Laboratoire d’Imagerie, de Vision et d’Intelligence Artificielle, École de Technologie Supérieure, Université du Québec, Montreal, QC, Canada H3C 1K3, Canada
| | | |
|
449
|
|
450
|
Borowska K, Stepaniuk J. A rough-granular approach to the imbalanced data classification problem. Appl Soft Comput 2019. [DOI: 10.1016/j.asoc.2019.105607] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.8] [Indexed: 10/26/2022]
|