551
|
Zhang ZL, Luo XG, González S, García S, Herrera F. DRCW-ASEG: One-versus-One distance-based relative competence weighting with adaptive synthetic example generation for multi-class imbalanced datasets. Neurocomputing 2018. [DOI: 10.1016/j.neucom.2018.01.039] [Citation(s) in RCA: 24] [Impact Index Per Article: 3.4] [Indexed: 02/05/2023]
|
552
|
Brieuc MSO, Waters CD, Drinan DP, Naish KA. A practical introduction to Random Forest for genetic association studies in ecology and evolution. Mol Ecol Resour 2018; 18:755-766. [PMID: 29504715 DOI: 10.1111/1755-0998.12773] [Citation(s) in RCA: 63] [Impact Index Per Article: 9.0] [Received: 10/30/2017] [Revised: 02/08/2018] [Accepted: 02/17/2018] [Indexed: 12/25/2022]
Abstract
Large genomic studies are becoming increasingly common with advances in sequencing technology, and our ability to understand how genomic variation influences phenotypic variation between individuals has never been greater. The exploration of such relationships first requires the identification of associations between molecular markers and phenotypes. Here, we explore the use of Random Forest (RF), a powerful machine-learning algorithm, in genomic studies to discern loci underlying both discrete and quantitative traits, particularly when studying wild or nonmodel organisms. RF is becoming increasingly used in ecological and population genetics because, unlike traditional methods, it can efficiently analyse thousands of loci simultaneously and account for nonadditive interactions. However, understanding both the power and limitations of Random Forest is important for its proper implementation and the interpretation of results. We therefore provide a practical introduction to the algorithm and its use for identifying associations between molecular markers and phenotypes, discussing such topics as data limitations, algorithm initiation and optimization, as well as interpretation. We also provide short R tutorials as examples, with the aim of providing a guide to the implementation of the algorithm. Topics discussed here are intended to serve as an entry point for molecular ecologists interested in employing Random Forest to identify trait associations in genomic data sets.
Affiliation(s)
- Marine S O Brieuc
- School of Aquatic and Fishery Sciences, University of Washington, Seattle, WA, USA; Center for Ecological and Evolutionary Synthesis (CEES), Department of Biosciences, University of Oslo, Oslo, Norway
- Charles D Waters
- School of Aquatic and Fishery Sciences, University of Washington, Seattle, WA, USA
- Daniel P Drinan
- School of Aquatic and Fishery Sciences, University of Washington, Seattle, WA, USA
- Kerry A Naish
- School of Aquatic and Fishery Sciences, University of Washington, Seattle, WA, USA
|
553
|
An Ensemble Based Evolutionary Approach to the Class Imbalance Problem with Applications in CBIR. APPLIED SCIENCES-BASEL 2018. [DOI: 10.3390/app8040495] [Citation(s) in RCA: 16] [Impact Index Per Article: 2.3] [Indexed: 11/16/2022]
|
554
|
A bi-objective hybrid algorithm for the classification of imbalanced noisy and borderline data sets. Pattern Anal Appl 2018. [DOI: 10.1007/s10044-018-0693-4] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.4] [Indexed: 10/17/2022]
|
555
|
Ensemble Genetic Fuzzy Neuro Model Applied for the Emergency Medical Service via Unbalanced Data Evaluation. Symmetry (Basel) 2018. [DOI: 10.3390/sym10030071] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.4] [Indexed: 11/16/2022]
|
556
|
Huang C, Xu H, Xie L, Zhu J, Xu C, Tang Y. Large-scale semantic web image retrieval using bimodal deep learning techniques. Inf Sci (N Y) 2018. [DOI: 10.1016/j.ins.2017.11.043] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.6] [Indexed: 11/30/2022]
|
557
|
Rajesh KN, Dhuli R. Classification of imbalanced ECG beats using re-sampling techniques and AdaBoost ensemble classifier. Biomed Signal Process Control 2018. [DOI: 10.1016/j.bspc.2017.12.004] [Citation(s) in RCA: 43] [Impact Index Per Article: 6.1] [Indexed: 10/18/2022]
|
558
|
Zhai J, Zhang S, Zhang M, Liu X. Fuzzy integral-based ELM ensemble for imbalanced big data classification. Soft comput 2018. [DOI: 10.1007/s00500-018-3085-1] [Citation(s) in RCA: 18] [Impact Index Per Article: 2.6] [Indexed: 10/18/2022]
|
559
|
Jalilvand A, Akbari B, Zare Mirakabad F. S-FLN: A sequence-based hierarchical approach for functional linkage network construction. J Theor Biol 2018; 437:149-162. [PMID: 29080781 DOI: 10.1016/j.jtbi.2017.10.021] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/31/2016] [Revised: 07/27/2017] [Accepted: 10/18/2017] [Indexed: 11/24/2022]
Abstract
Construction of the functional linkage network (FLN) is a primary and important step in drug discovery and disease gene prioritization methods. Several methods have been introduced to construct an FLN by integrating various biological data. Although there are impressive ideas behind these methods, they suffer from the low quality of the biological data. In this paper, a hierarchical sequence-based approach is proposed to construct the FLN. The proposed approach, denoted S-FLN (Sequence-based Functional Linkage Network), uses the sequence of proteins as the primary data in three main steps. Firstly, the physicochemical properties of amino acids are employed to describe the functionality of proteins. As the sequence of proteins is a more comprehensive and accurate primary data source, more reliable relations are obtained. Secondly, seven different descriptor methods are used to extract feature vectors from the protein sequences. Using different descriptor methods yields diverse ensemble learners in the next step. Finally, a two-layer ensemble learning structure is proposed to calculate the score of protein pairs. The proposed approach has been evaluated using two biological datasets, S. cerevisiae and H. pylori, and resulted in 93.9% and 91.15% precision rates, respectively. The results of various experiments indicate the efficiency and validity of the proposed approach.
Affiliation(s)
- A Jalilvand
- Department of Electronic and Computer Engineering, Tarbiat Modares University, Tehran, Iran
- B Akbari
- Department of Electronic and Computer Engineering, Tarbiat Modares University, Tehran, Iran
- F Zare Mirakabad
- Department of Mathematics and Computer Science, Amirkabir University of Technology, Tehran, Iran
|
560
|
Tang X, Chen L. A self-adaptive evolutionary weighted extreme learning machine for binary imbalance learning. PROGRESS IN ARTIFICIAL INTELLIGENCE 2018. [DOI: 10.1007/s13748-017-0136-2] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.6] [Indexed: 01/12/2023]
|
561
|
Seyednasrollah F, Mäkelä J, Pitkänen N, Juonala M, Hutri-Kähönen N, Lehtimäki T, Viikari J, Kelly T, Li C, Bazzano L, Elo LL, Raitakari OT. Prediction of Adulthood Obesity Using Genetic and Childhood Clinical Risk Factors in the Cardiovascular Risk in Young Finns Study. Circ Cardiovasc Genet 2018; 10:CIRCGENETICS.116.001554. [PMID: 28620069 DOI: 10.1161/circgenetics.116.001554] [Citation(s) in RCA: 30] [Impact Index Per Article: 4.3] [Received: 09/14/2016] [Accepted: 12/06/2016] [Indexed: 12/21/2022]
Abstract
BACKGROUND Obesity is a known risk factor for cardiovascular disease. Early prediction of obesity is essential for prevention. The aim of this study is to assess the use of childhood clinical factors and genetic risk factors in predicting adulthood obesity using machine learning methods. METHODS AND RESULTS A total of 2262 participants from the Cardiovascular Risk in YFS (Young Finns Study) were followed up from childhood (age 3-18 years) to adulthood for 31 years. The data were divided into training (n=1625) and validation (n=637) sets. The effect of known genetic risk factors (97 single-nucleotide polymorphisms) was investigated as a weighted genetic risk score of all 97 single-nucleotide polymorphisms (WGRS97) or a subset of the 19 most significant single-nucleotide polymorphisms (WGRS19) using a boosting machine learning technique. WGRS97 and WGRS19 were validated using external data (n=369) from BHS (Bogalusa Heart Study). WGRS19 improved the accuracy of predicting adulthood obesity in training (area under the curve [AUC]=0.787 versus AUC=0.744, P<0.0001) and validation data (AUC=0.769 versus AUC=0.747, P=0.026). WGRS97 improved the accuracy in training (AUC=0.782 versus AUC=0.744, P<0.0001) but not in validation data (AUC=0.749 versus AUC=0.747, P=0.785). Higher WGRS19 was associated with higher body mass index at 9 years and WGRS97 at 6 years. Replication in BHS confirmed our findings that WGRS19 and WGRS97 are associated with body mass index. CONCLUSIONS WGRS19 improves prediction of adulthood obesity. Predictive accuracy is highest among young children (3-6 years), whereas among older children (9-18 years) the risk can be identified using childhood clinical factors. The model is helpful in screening children with high risk of developing obesity.
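The abstract above does not spell out how the weighted genetic risk score is computed. A common convention, assumed here, is a weighted sum of risk-allele counts (0, 1 or 2 per SNP) with per-SNP effect sizes as weights. The sketch below illustrates that convention together with a rank-based AUC. The function names `weighted_grs` and `auc` and the toy numbers are hypothetical and are not taken from the study.

```python
def weighted_grs(allele_counts, weights):
    """Weighted genetic risk score: sum of risk-allele counts times weights.

    Assumption: counts are 0, 1 or 2 per SNP and weights are per-SNP
    effect sizes; the study's exact weighting is not given in the abstract.
    """
    if len(allele_counts) != len(weights):
        raise ValueError("one weight per SNP is required")
    return sum(c * w for c, w in zip(allele_counts, weights))


def auc(y_true, scores):
    """AUC as the probability that a positive outscores a negative.

    Ties between a positive and a negative count as half a win.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy data: three SNPs, four individuals (illustrative values only).
weights = [0.5, 0.25, 1.0]
genotypes = [[2, 0, 1], [0, 1, 0], [1, 2, 2], [0, 0, 0]]
obese = [1, 0, 1, 0]
scores = [weighted_grs(g, weights) for g in genotypes]
print(scores, auc(obese, scores))
```

In the study the score is one input to a boosted model alongside clinical factors; the sketch only shows the score and the evaluation metric, not the model.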
Affiliation(s)
- Fatemeh Seyednasrollah
- Johanna Mäkelä
- Niina Pitkänen
- Markus Juonala
- Nina Hutri-Kähönen
- Terho Lehtimäki
- Jorma Viikari
- Tanika Kelly
- Changwei Li
- Lydia Bazzano
- Laura L Elo
- Olli T Raitakari
- From the Turku Centre for Biotechnology, University of Turku and Åbo Akademi University, Finland (F.S., J.M., L.L.E.); Department of Mathematics and Statistics (F.S.), Research Centre of Applied and Preventive Cardiovascular Medicine (N.P., O.T.R.), and Department of Medicine (M.J., J.V.), University of Turku, Finland; Division of Medicine (M.J., J.V.) and Clinical Physiology and Nuclear Medicine (O.T.R.), Turku University Hospital, Finland; Department of Pediatrics (N.H.-K.) and School of Medicine (T.L.), University of Tampere, Finland; Tampere University Hospital, Finland (N.H.-K.); Department of Clinical Chemistry, Fimlab Laboratories, Tampere, Finland (T.L.); Tulane University Health Sciences Center, New Orleans, LA (T.K., L.B.); and Department of Epidemiology and Biostatistics, College of Public Health, University of Georgia, Athens (C.L.)
|
562
|
Zhang X, Zhuang Y, Wang W, Pedrycz W. Transfer Boosting With Synthetic Instances for Class Imbalanced Object Recognition. IEEE TRANSACTIONS ON CYBERNETICS 2018; 48:357-370. [PMID: 28026795 DOI: 10.1109/tcyb.2016.2636370] [Citation(s) in RCA: 29] [Impact Index Per Article: 4.1] [Indexed: 06/06/2023]
Abstract
A challenging problem in object recognition is to train a robust classifier with a small and imbalanced data set. In such cases, the learned classifier tends to overfit the training data and has low prediction accuracy on the minority class. In this paper, we address the problem of class imbalanced object recognition by combining the synthetic minority over-sampling technique (SMOTE) and instance-based transfer boosting to rebalance the skewed class distribution. We present ways of generating synthetic instances under the learning framework of transfer Adaboost. A novel weighted SMOTE technique (WSMOTE) is proposed to generate weighted synthetic instances with weighted source and target instances at each boosting round. Based on WSMOTE, we propose a novel class imbalanced transfer boosting algorithm called WSMOTE-TrAdaboost and experimentally demonstrate its effectiveness on four datasets (Office, Caltech256, SUN2012, and VOC2012) for object recognition applications. A bag-of-words model with SURF features and histogram of oriented gradient features are separately used to represent an image. We experimentally demonstrate the effectiveness and robustness of our approaches by comparing them with several baseline algorithms in the boosting family for class imbalanced learning.
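For readers unfamiliar with the building block named in this abstract, the following is a minimal sketch of plain SMOTE interpolation: a synthetic minority point is placed at a random position on the segment between a minority example and one of its nearest minority neighbours. It is not the weighted WSMOTE variant or the transfer-boosting loop proposed in the paper. The function name `smote` and its parameters `n_new`, `k` and `seed` are illustrative assumptions.

```python
import random


def smote(minority, n_new, k=2, seed=0):
    """Generate n_new synthetic points by interpolating minority examples.

    Plain SMOTE sketch: pick a base point, pick one of its k nearest
    minority neighbours, and sample a point on the segment between them.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority examples")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest neighbours of base among the other minority points
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        mate = rng.choice(others[:k])
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(base, mate)))
    return synthetic


# Toy minority class on the corners of the unit square.
minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
new_points = smote(minority, n_new=5)
print(new_points)
```

Because every synthetic point is a convex combination of two minority examples, it always stays inside the region spanned by the minority class.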
|
563
|
Vuttipittayamongkol P, Elyan E, Petrovski A, Jayne C. Overlap-Based Undersampling for Improving Imbalanced Data Classification. INTELLIGENT DATA ENGINEERING AND AUTOMATED LEARNING – IDEAL 2018 2018. [DOI: 10.1007/978-3-030-03493-1_72] [Citation(s) in RCA: 18] [Impact Index Per Article: 2.6] [Indexed: 12/03/2022]
|
564
|
|
565
|
Oversample Based Large Scale Support Vector Machine for Online Class Imbalance Problem. BIG DATA ANALYTICS 2018. [DOI: 10.1007/978-3-030-04780-1_24] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/27/2022]
|
566
|
Collell G, Prelec D, Patil KR. A simple plug-in bagging ensemble based on threshold-moving for classifying binary and multiclass imbalanced data. Neurocomputing 2018; 275:330-340. [PMID: 29398782 PMCID: PMC5750819 DOI: 10.1016/j.neucom.2017.08.035] [Citation(s) in RCA: 74] [Impact Index Per Article: 10.6] [Indexed: 11/24/2022]
Abstract
Class imbalance presents a major hurdle in the application of classification methods. A commonly taken approach is to learn ensembles of classifiers using rebalanced data. Examples include bootstrap averaging (bagging) combined with either undersampling or oversampling of the minority class examples. However, rebalancing methods entail asymmetric changes to the examples of different classes, which in turn can introduce their own biases. Furthermore, these methods often require specifying the performance measure of interest a priori, i.e., before learning. An alternative is to employ the threshold-moving technique, which applies a threshold to the continuous output of a model, offering the possibility to adapt to a performance measure a posteriori, i.e., a plug-in method. Surprisingly, little attention has been paid to this combination of a bagging ensemble and threshold-moving. In this paper, we study this combination and demonstrate its competitiveness. Contrary to the other resampling methods, we preserve the natural class distribution of the data, resulting in well-calibrated posterior probabilities. Additionally, we extend the proposed method to handle multiclass data. We validated our method on binary and multiclass benchmark data sets using both decision trees and neural networks as base classifiers. We perform analyses that provide insights into the proposed method.
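The plug-in idea described in this abstract can be sketched in a few lines: average the continuous outputs of the bagged members, then choose the decision threshold afterwards to suit a chosen metric. The sketch below is a simplified binary illustration under stated assumptions, not the authors' implementation: member scores are given directly rather than produced by trained base classifiers, F1 is used as the example metric, and the names `bagged_scores`, `f1` and `best_threshold` are hypothetical.

```python
from statistics import mean


def bagged_scores(member_scores):
    """Average per-example scores over ensemble members (one row per member)."""
    return [mean(column) for column in zip(*member_scores)]


def f1(y_true, y_pred):
    """F1 score for binary labels, with 1 as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return 0.0 if tp == 0 else 2 * tp / (2 * tp + fp + fn)


def best_threshold(y_true, scores, metric=f1):
    """Try each distinct score as a cut-off and keep the best metric value."""
    best_thr, best_val = None, -1.0
    for thr in sorted(set(scores)):
        pred = [1 if s >= thr else 0 for s in scores]
        value = metric(y_true, pred)
        if value > best_val:
            best_thr, best_val = thr, value
    return best_thr, best_val


# Toy outputs of two ensemble members on six examples (illustrative only).
members = [[0.1, 0.2, 0.3, 0.4, 0.2, 0.6],
           [0.3, 0.2, 0.1, 0.4, 0.4, 0.8]]
labels = [0, 0, 0, 1, 1, 1]
scores = bagged_scores(members)
thr, value = best_threshold(labels, scores)
default = f1(labels, [1 if s >= 0.5 else 0 for s in scores])
print(thr, value, default)
```

On this toy data the default cut-off of 0.5 misses two of the three positives, while the moved threshold separates the classes. In practice the threshold would be chosen on held-out data rather than on the data used for evaluation.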
Affiliation(s)
- Guillem Collell
- MIT Sloan Neuroeconomics Lab, Massachusetts Institute of Technology, Cambridge, MA 02139, USA; Computer Science Department, KU Leuven, Heverlee 3001, Belgium
- Drazen Prelec
- MIT Sloan Neuroeconomics Lab, Massachusetts Institute of Technology, Cambridge, MA 02139, USA; Department of Economics, Massachusetts Institute of Technology, Cambridge, MA 02139, USA; Brain & Cognitive Sciences, Massachusetts Institute of Technology, Cambridge, MA 02139, USA
- Kaustubh R Patil
- MIT Sloan Neuroeconomics Lab, Massachusetts Institute of Technology, Cambridge, MA 02139, USA; Institute of Neuroscience and Medicine, Brain & Behaviour (INM-7), Research Centre Jülich, Jülich 52425, Germany
|
567
|
Tuyisenge V, Trebaul L, Bhattacharjee M, Chanteloup-Forêt B, Saubat-Guigui C, Mîndruţă I, Rheims S, Maillard L, Kahane P, Taussig D, David O. Automatic bad channel detection in intracranial electroencephalographic recordings using ensemble machine learning. Clin Neurophysiol 2017; 129:548-554. [PMID: 29353183 PMCID: PMC5819872 DOI: 10.1016/j.clinph.2017.12.013] [Citation(s) in RCA: 21] [Impact Index Per Article: 2.6] [Received: 07/10/2017] [Revised: 11/02/2017] [Accepted: 12/01/2017] [Indexed: 11/15/2022]
Abstract
OBJECTIVE Intracranial electroencephalographic (iEEG) recordings contain "bad channels", which show non-neuronal signals. Here, we developed a new method that automatically detects iEEG bad channels using machine learning of seven signal features. METHODS The features quantified signals' variance, spatial-temporal correlation and nonlinear properties. Because the number of bad channels is usually much lower than the number of good channels, we implemented an ensemble bagging classifier known to be optimal in terms of stability and predictive accuracy for datasets with imbalanced class distributions. This method was applied to stereo-electroencephalographic (SEEG) signals recorded during low-frequency stimulations performed in 206 patients from 5 clinical centers. RESULTS We found that the classification accuracy was extremely good: it increased with the number of subjects used to train the classifier and reached a plateau at 99.77% for 110 subjects. The classification performance was thus not impacted by the multicentric nature of the data. CONCLUSIONS The proposed method to automatically detect bad channels demonstrated convincing results and can be envisaged for use on larger datasets for automatic quality control of iEEG data. SIGNIFICANCE This is the first method proposed to classify bad channels in iEEG and should improve data selection when reviewing iEEG signals.
Affiliation(s)
- Viateur Tuyisenge
- Univ. Grenoble Alpes, Grenoble Institut des Neurosciences, GIN, F-38000 Grenoble, France; Inserm, U1216, F-38000 Grenoble, France
- Lena Trebaul
- Univ. Grenoble Alpes, Grenoble Institut des Neurosciences, GIN, F-38000 Grenoble, France; Inserm, U1216, F-38000 Grenoble, France
- Manik Bhattacharjee
- Univ. Grenoble Alpes, Grenoble Institut des Neurosciences, GIN, F-38000 Grenoble, France; Inserm, U1216, F-38000 Grenoble, France
- Blandine Chanteloup-Forêt
- Univ. Grenoble Alpes, Grenoble Institut des Neurosciences, GIN, F-38000 Grenoble, France; Inserm, U1216, F-38000 Grenoble, France
- Carole Saubat-Guigui
- Univ. Grenoble Alpes, Grenoble Institut des Neurosciences, GIN, F-38000 Grenoble, France; Inserm, U1216, F-38000 Grenoble, France
- Ioana Mîndruţă
- Neurology Department, University Emergency Hospital, Bucharest, Romania; Neurology Department, Carol Davila University of Medicine and Pharmacy, Bucharest, Romania
- Sylvain Rheims
- Department of Functional Neurology and Epileptology, Hospices Civils de Lyon, Lyon, France; Lyon Neuroscience Research Center, INSERM U1028, CNRS UMR 5292, Lyon, France; Epilepsy Institute (IDEE), Lyon, France
- Louis Maillard
- Research Center for Automatic Control (CRAN), University of Lorraine, CNRS, UMR 7039, Vandoeuvre, France; Department of Neurology, Central University Hospital, CHU de Nancy, Nancy, France; Medical Faculty, University of Lorraine, Nancy, France
- Philippe Kahane
- Univ. Grenoble Alpes, Grenoble Institut des Neurosciences, GIN, F-38000 Grenoble, France; Inserm, U1216, F-38000 Grenoble, France; Laboratory of Neurophysiopathology of Epilepsy, Centre Hospitalier Universitaire Grenoble-Alpes, Grenoble, France
- Delphine Taussig
- Department of Pediatric Neurosurgery, Fondation Rothschild, F-75940 Paris, France
- Olivier David
- Univ. Grenoble Alpes, Grenoble Institut des Neurosciences, GIN, F-38000 Grenoble, France; Inserm, U1216, F-38000 Grenoble, France
|
568
|
A synthetic neighborhood generation based ensemble learning for the imbalanced data classification. APPL INTELL 2017. [DOI: 10.1007/s10489-017-1088-8] [Citation(s) in RCA: 23] [Impact Index Per Article: 2.9] [Indexed: 10/18/2022]
|
569
|
Kang Q, Chen X, Li S, Zhou M. A Noise-Filtered Under-Sampling Scheme for Imbalanced Classification. IEEE TRANSACTIONS ON CYBERNETICS 2017; 47:4263-4274. [PMID: 28113413 DOI: 10.1109/tcyb.2016.2606104] [Citation(s) in RCA: 59] [Impact Index Per Article: 7.4] [Indexed: 06/06/2023]
Abstract
Under-sampling is a popular data preprocessing method in dealing with class imbalance problems, with the purposes of balancing datasets to achieve a high classification rate and avoiding the bias toward majority class examples. It always uses full minority data in a training dataset. However, some noisy minority examples may reduce the performance of classifiers. In this paper, a new under-sampling scheme is proposed by incorporating a noise filter before executing resampling. In order to verify the efficiency, this scheme is implemented based on four popular under-sampling methods, i.e., Undersampling + Adaboost, RUSBoost, UnderBagging, and EasyEnsemble, through benchmarks and significance analysis. Furthermore, this paper also summarizes the relationship between algorithm performance and imbalanced ratio. Experimental results indicate that the proposed scheme can improve the original undersampling-based methods with significance in terms of three popular metrics for imbalanced classification, i.e., the area under the curve, F-measure, and G-mean.
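The two-stage scheme described in this abstract (filter noisy minority examples, then under-sample the majority class) can be sketched as follows. The abstract does not say which noise filter is used, so the sketch assumes a simple nearest-neighbour rule: a minority example is treated as noise when none of its k nearest neighbours is a minority example. The function names and the parameters `k` and `seed` are hypothetical.

```python
import random


def knn_labels(i, X, y, k):
    """Labels of the k nearest neighbours of X[i] (squared Euclidean distance)."""
    dists = sorted(
        (sum((a - b) ** 2 for a, b in zip(X[i], X[j])), j)
        for j in range(len(X)) if j != i
    )
    return [y[j] for _, j in dists[:k]]


def filter_noisy_minority(X, y, minority=1, k=3):
    """Drop minority examples with no minority example among their k neighbours.

    Assumed filter for illustration; the paper's filter may differ.
    """
    keep = [
        i for i in range(len(X))
        if y[i] != minority or minority in knn_labels(i, X, y, k)
    ]
    return [X[i] for i in keep], [y[i] for i in keep]


def random_undersample(X, y, minority=1, seed=0):
    """Keep all minority examples and an equal-sized random majority subset."""
    rng = random.Random(seed)
    min_idx = [i for i, label in enumerate(y) if label == minority]
    maj_idx = [i for i, label in enumerate(y) if label != minority]
    chosen = rng.sample(maj_idx, min(len(min_idx), len(maj_idx)))
    idx = sorted(min_idx + chosen)
    return [X[i] for i in idx], [y[i] for i in idx]


# Toy data: a majority cluster near the origin, a minority cluster near (5, 5),
# and one minority point placed inside the majority cluster as noise.
X = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5), (0.2, 0.8),
     (0.4, 0.4), (5, 5), (5, 6), (6, 5)]
y = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
Xf, yf = filter_noisy_minority(X, y)
Xb, yb = random_undersample(Xf, yf)
print(len(Xf), yb)
```

The balanced set produced this way would then feed one of the under-sampling ensembles named in the abstract, such as RUSBoost or UnderBagging, which the sketch does not implement.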
|
570
|
Li L, Ngan CK. A weight-adjusted-voting framework on an ensemble of classifiers for improving sensitivity. INTELL DATA ANAL 2017. [DOI: 10.3233/ida-163184] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Indexed: 11/15/2022]
Affiliation(s)
- Lin Li
- Department of Computer Science and Software Engineering, Seattle University, Seattle, WA 98122, USA
| | - Chun-Kit Ngan
- Division of Engineering and Information Science, The Pennsylvania State University, Malvern, PA 19355, USA
| |
|
571
|
|
572
|
Liu C, Wang W, Tu G, Xiang Y, Wang S, Lv F. A new Centroid-Based Classification model for text categorization. Knowl Based Syst 2017. [DOI: 10.1016/j.knosys.2017.08.020] [Citation(s) in RCA: 21] [Impact Index Per Article: 2.6] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/30/2022]
|
573
|
Stegmayer G, Yones C, Kamenetzky L, Milone DH. High Class-Imbalance in pre-miRNA Prediction: A Novel Approach Based on deepSOM. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2017; 14:1316-1326. [PMID: 27295687 DOI: 10.1109/tcbb.2016.2576459] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.9] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/06/2023]
Abstract
The computational prediction of novel microRNA within a full genome involves identifying sequences having the highest chance of being a miRNA precursor (pre-miRNA). These sequences are usually named candidates to miRNA. The well-known pre-miRNAs are usually only a few in comparison to the hundreds of thousands of potential candidates to miRNA that have to be analyzed, which makes this task a high class-imbalance classification problem. The classical way of approaching it has been training a binary classifier in a supervised manner, using well-known pre-miRNAs as positive class and artificially defining the negative class. However, although the selection of positive labeled examples is straightforward, it is very difficult to build a set of negative examples in order to obtain a good set of training samples for a supervised method. In this work, we propose a novel and effective way of approaching this problem using machine learning, without the definition of negative examples. The proposal is based on clustering unlabeled sequences of a genome together with well-known miRNA precursors for the organism under study, which allows for the quick identification of the best candidates to miRNA as those sequences clustered with known precursors. Furthermore, we propose a deep model to overcome the problem of having very few positive class labels. They are always maintained in the deep levels as positive class while less likely pre-miRNA sequences are filtered level after level. Our approach has been compared with other methods for pre-miRNAs prediction in several species, showing effective predictivity of novel miRNAs. Additionally, we will show that our approach has a lower training time and allows for a better graphical navigability and interpretation of the results. A web-demo interface to try deepSOM is available at http://fich.unl.edu.ar/sinc/web-demo/deepsom/.
|
574
|
|
575
|
Wankhade KK, Jondhale KC, Thool VR. A hybrid approach for classification of rare class data. Knowl Inf Syst 2017. [DOI: 10.1007/s10115-017-1114-5] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/30/2022]
|
576
|
Ortigosa-Hernández J, Inza I, Lozano JA. Measuring the class-imbalance extent of multi-class problems. Pattern Recognit Lett 2017. [DOI: 10.1016/j.patrec.2017.08.002] [Citation(s) in RCA: 38] [Impact Index Per Article: 4.8] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/26/2022]
|
577
|
|
578
|
|
579
|
Hyperspectral Image Classification Based on Semi-Supervised Rotation Forest. REMOTE SENSING 2017. [DOI: 10.3390/rs9090924] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 11/17/2022]
Abstract
Ensemble learning is widely used to combine varieties of weak learners in order to generate a relatively stronger learner by reducing either the bias or the variance of the individual learners. Rotation forest (RoF), combining feature extraction and classifier ensembles, has been successfully applied to hyperspectral (HS) image classification by promoting the diversity of base classifiers since last decade. Generally, RoF uses principal component analysis (PCA) as the rotation tool, which is commonly acknowledged as an unsupervised feature extraction method, and does not consider the discriminative information about classes. Sometimes, however, it turns out to be sub-optimal for classification tasks. Therefore, in this paper, we propose an improved RoF algorithm, in which semi-supervised local discriminant analysis is used as the feature rotation tool. The proposed algorithm, named semi-supervised rotation forest (SSRoF), aims to take advantage of both the discriminative information and local structural information provided by the limited labeled and massive unlabeled samples, thus providing better class separability for subsequent classifications. In order to promote the diversity of features, we also adjust the semi-supervised local discriminant analysis into a weighted form, which can balance the contributions of labeled and unlabeled samples. Experiments on several hyperspectral images demonstrate the effectiveness of our proposed algorithm compared with several state-of-the-art ensemble learning approaches.
|
580
|
|
581
|
Lim P, Goh CK, Tan KC. Evolutionary Cluster-Based Synthetic Oversampling Ensemble (ECO-Ensemble) for Imbalance Learning. IEEE TRANSACTIONS ON CYBERNETICS 2017; 47:2850-2861. [PMID: 27337735 DOI: 10.1109/tcyb.2016.2579658] [Citation(s) in RCA: 44] [Impact Index Per Article: 5.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/06/2023]
Abstract
Class imbalance problems, where the number of samples in each class is unequal, are prevalent in numerous real world machine learning applications. Traditional methods which are biased toward the majority class are ineffective due to the relative severity of misclassifying rare events. This paper proposes a novel evolutionary cluster-based oversampling ensemble framework, which combines a novel cluster-based synthetic data generation method with an evolutionary algorithm (EA) to create an ensemble. The proposed synthetic data generation method is based on contemporary ideas of identifying oversampling regions using clusters. The novel use of EA serves a twofold purpose of optimizing the parameters of the data generation method while generating diverse examples leveraging on the characteristics of EAs, reducing overall computational cost. The proposed method is evaluated on a set of 40 imbalance datasets obtained from the University of California, Irvine, database, and outperforms current state-of-the-art ensemble algorithms tackling class imbalance problems.
|
582
|
Segev N, Harel M, Mannor S, Crammer K, El-Yaniv R. Learn on Source, Refine on Target: A Model Transfer Learning Framework with Random Forests. IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE 2017; 39:1811-1824. [PMID: 28113392 DOI: 10.1109/tpami.2016.2618118] [Citation(s) in RCA: 37] [Impact Index Per Article: 4.6] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 05/02/2023]
Abstract
We propose novel model transfer-learning methods that refine a decision forest model M learned within a "source" domain using a training set sampled from a "target" domain, assumed to be a variation of the source. We present two random forest transfer algorithms. The first algorithm searches greedily for locally optimal modifications of each tree structure by trying to locally expand or reduce the tree around individual nodes. The second algorithm does not modify structure, but only the parameter (thresholds) associated with decision nodes. We also propose to combine both methods by considering an ensemble that contains the union of the two forests. The proposed methods exhibit impressive experimental results over a range of problems.
|
583
|
Tian Y, Zhang H, Xu W, Zhang H, Yang L, Zheng S, Shi Y. Spectral Entropy Can Predict Changes of Working Memory Performance Reduced by Short-Time Training in the Delayed-Match-to-Sample Task. Front Hum Neurosci 2017; 11:437. [PMID: 28912701 PMCID: PMC5583228 DOI: 10.3389/fnhum.2017.00437] [Citation(s) in RCA: 27] [Impact Index Per Article: 3.4] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/30/2017] [Accepted: 08/15/2017] [Indexed: 11/13/2022] Open
Abstract
Spectral entropy, which was generated by applying the Shannon entropy concept to the power distribution of the Fourier-transformed electroencephalograph (EEG), was utilized to measure the uniformity of power spectral density underlying EEG when subjects performed the working memory tasks twice, i.e., before and after training. According to Signed Residual Time (SRT) scores based on response speed and accuracy trade-off, 20 subjects were divided into two groups, namely high-performance and low-performance groups, to undertake working memory (WM) tasks. We found that spectral entropy derived from the retention period of WM on channel FC4 exhibited a high correlation with SRT scores. To this end, spectral entropy was used in support vector machine classifier with linear kernel to differentiate these two groups. Receiver operating characteristics analysis and leave-one out cross-validation (LOOCV) demonstrated that the averaged classification accuracy (CA) was 90.0 and 92.5% for intra-session and inter-session, respectively, indicating that spectral entropy could be used to distinguish these two different WM performance groups successfully. Furthermore, the support vector regression prediction model with radial basis function kernel and the root-mean-square error of prediction revealed that spectral entropy could be utilized to predict SRT scores on individual WM performance. After testing the changes in SRT scores and spectral entropy for each subject by short-time training, we found that 16 in 20 subjects’ SRT scores were clearly promoted after training and 15 in 20 subjects’ SRT scores showed consistent changes with spectral entropy before and after training. The findings revealed that spectral entropy could be a promising indicator to predict individual’s WM changes by training and further provide a novel application about WM for brain–computer interfaces.
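Illustrative sketch (not the authors' code): spectral entropy as described above can be computed by normalising a signal's power spectrum into a probability distribution and taking its Shannon entropy. The Python below assumes a naive discrete Fourier transform, a one-sided spectrum with the DC bin dropped, base-2 logarithms, and optional normalisation by the maximum possible entropy; none of these choices is stated in the abstract, and the function names are hypothetical.

```python
import cmath
import math


def power_spectrum(signal):
    """One-sided power spectrum from a naive DFT, excluding the DC bin."""
    n = len(signal)
    spectrum = []
    for k in range(1, n // 2 + 1):
        coeff = sum(
            x * cmath.exp(-2j * math.pi * k * t / n)
            for t, x in enumerate(signal)
        )
        spectrum.append(abs(coeff) ** 2)
    return spectrum


def spectral_entropy(signal, normalize=True):
    """Shannon entropy (bits) of the normalised power spectrum."""
    power = power_spectrum(signal)
    total = sum(power)
    if total == 0:
        return 0.0  # a flat-zero signal carries no spectral power
    probs = [p / total for p in power]
    entropy = -sum(p * math.log2(p) for p in probs if p > 0)
    if normalize and len(probs) > 1:
        entropy /= math.log2(len(probs))  # scale into [0, 1]
    return entropy


# A pure sinusoid concentrates power in one bin (entropy near 0);
# a unit impulse spreads power evenly over all bins (entropy near 1).
n = 64
sinusoid = [math.sin(2 * math.pi * 4 * t / n) for t in range(n)]
impulse = [1.0] + [0.0] * (n - 1)
low = spectral_entropy(sinusoid)
high = spectral_entropy(impulse)
```

Under these assumptions, a higher value indicates a more uniform power spectral density, which matches the abstract's use of spectral entropy as a measure of spectral uniformity.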
Affiliation(s)
- Yin Tian
- Bio-information College, Chongqing University of Posts and Telecommunications, Chongqing, China
| | - Huiling Zhang
- Bio-information College, Chongqing University of Posts and Telecommunications, Chongqing, China
| | - Wei Xu
- Bio-information College, Chongqing University of Posts and Telecommunications, Chongqing, China
| | - Haiyong Zhang
- Bio-information College, Chongqing University of Posts and Telecommunications, Chongqing, China
| | - Li Yang
- Bio-information College, Chongqing University of Posts and Telecommunications, Chongqing, China
| | - Shuxing Zheng
- Bio-information College, Chongqing University of Posts and Telecommunications, Chongqing, China
| | - Yupan Shi
- Bio-information College, Chongqing University of Posts and Telecommunications, Chongqing, China
| |
|
584
|
KNN-based maximum margin and minimum volume hyper-sphere machine for imbalanced data classification. INT J MACH LEARN CYB 2017. [DOI: 10.1007/s13042-017-0720-6] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.4] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/27/2022]
|
585
|
A unified methodology based on sparse field level sets and boosting algorithms for false positives reduction in lung nodules detection. Int J Comput Assist Radiol Surg 2017; 13:397-409. [DOI: 10.1007/s11548-017-1656-8] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/17/2017] [Accepted: 07/31/2017] [Indexed: 01/15/2023]
|
586
|
Aguayo L, Barreto GA. Novelty Detection in Time Series Using Self-Organizing Neural Networks: A Comprehensive Evaluation. Neural Process Lett 2017. [DOI: 10.1007/s11063-017-9679-2] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/30/2022]
|
587
|
A Novel Neutrosophic Weighted Extreme Learning Machine for Imbalanced Data Set. Symmetry (Basel) 2017. [DOI: 10.3390/sym9080142] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.1] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/16/2022] Open
|
588
|
Improving virtual screening predictive accuracy of Human kallikrein 5 inhibitors using machine learning models. Comput Biol Chem 2017; 69:110-119. [DOI: 10.1016/j.compbiolchem.2017.05.007] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/14/2016] [Revised: 12/18/2016] [Accepted: 05/26/2017] [Indexed: 12/23/2022]
|
589
|
Yom-Tov E. Predicting Drug Recalls From Internet Search Engine Queries. IEEE JOURNAL OF TRANSLATIONAL ENGINEERING IN HEALTH AND MEDICINE-JTEHM 2017; 5:4400106. [PMID: 28845371 PMCID: PMC5568020 DOI: 10.1109/jtehm.2017.2732945] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.1] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 11/30/2016] [Revised: 05/21/2017] [Accepted: 07/23/2017] [Indexed: 01/01/2023]
Abstract
Batches of pharmaceuticals are sometimes recalled from the market when a safety issue or a defect is detected in specific production runs of a drug. Such problems are usually detected when patients or healthcare providers report abnormalities to medical authorities. Here, we test the hypothesis that defective production lots can be detected earlier by monitoring queries to Internet search engines. We extracted queries from the USA to the Bing search engine, which mentioned one of the 5195 pharmaceutical drugs during 2015 and all recall notifications issued by the Food and Drug Administration (FDA) during that year. By using attributes that quantify the change in query volume at the state level, we attempted to predict if a recall of a specific drug will be ordered by FDA in a time horizon ranging from 1 to 40 days in future. Our results show that future drug recalls can indeed be identified with an AUC of 0.791 and a lift at 5% of approximately 6 when predicting a recall occurring one day ahead. This performance degrades as prediction is made for longer periods ahead. The most indicative attributes for prediction are sudden spikes in query volume about a specific medicine in each state. Recalls of prescription drugs and those estimated to be of medium-risk are more likely to be identified using search query data. These findings suggest that aggregated Internet search engine data can be used to facilitate in early warning of faulty batches of medicines.
|
590
|
Korfiatis VC, Tassani S, Matsopoulos GK. A New Ensemble Classification System For Fracture Zone Prediction Using Imbalanced Micro-CT Bone Morphometrical Data. IEEE J Biomed Health Inform 2017; 22:1189-1196. [PMID: 28692998 DOI: 10.1109/jbhi.2017.2723463] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/08/2022]
Abstract
Trabecular bone fractures constitute a major health issue for the modern societies, with the currently established prediction methods of fracture risk, such as bone mineral density (BMD), resulting in errors up to 40%. Fracture-zone prediction based on bone's microstructure has been recently proposed as an alternative prediction method of fracture risk. In this paper, a classification system (CS) for the automatic fracture-zone prediction based on an Ensemble of Imbalanced Learning methods is proposed, following the observation that the percentage of the actual fractured bone area is significantly smaller than the intact bone in the case of a fracture event. The sample is divided into Volumes of Interest (VOIs) of specific size and 29 morphometrical parameters are calculated from each VOI, which serve as input features for the CS in order for it to separate the input patterns into two classes: fractured and nonfractured. To this end, two well-established Imbalanced Learning methods, namely Random Undersampling and Synthetic Minority Oversampling, and two popular classification algorithms, namely Multilayer Perceptrons and Support Vector Machines, are tested and combined accordingly, to provide the best possible performance on a dataset that contains 45 specimens' pre- and postfailure scans. The best combination is then compared with three well-established Ensembles of Imbalanced Learning methods, namely RUSBoost, UnderBagging and SMOTEBagging. The experimental results clearly show that the proposed CS outperforms the competition, scoring in some occasions more than 90% in G-Mean and Area under Curve metrics. Finally, an investigation on the significance of the various trabecular bone's biomechanical parameters is made using the sequential forward floating selection technique, in order to identify possible biomarkers for fracture-zone prediction.
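Illustrative sketch (not from the paper): the G-Mean metric mentioned above is commonly defined for binary problems as the geometric mean of sensitivity and specificity, which is the definition assumed here. The Python below computes it from label lists; the function names and the toy labels are hypothetical.

```python
import math


def sensitivity_specificity(y_true, y_pred, positive=1):
    """Return (sensitivity, specificity) for a binary labelling."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    sens = tp / (tp + fn) if (tp + fn) else 0.0
    spec = tn / (tn + fp) if (tn + fp) else 0.0
    return sens, spec


def g_mean(y_true, y_pred, positive=1):
    """Geometric mean of sensitivity and specificity."""
    sens, spec = sensitivity_specificity(y_true, y_pred, positive)
    return math.sqrt(sens * spec)


# Toy imbalanced labels: 1 positive, 9 negatives. A classifier that
# always predicts the majority class is 90% accurate but has G-mean 0.
y_true = [1] + [0] * 9
always_majority = [0] * 10
accuracy = sum(t == p for t, p in zip(y_true, always_majority)) / len(y_true)
score = g_mean(y_true, always_majority)
```

Because the product collapses to zero when either class is entirely misclassified, G-mean penalises majority-only predictions that plain accuracy rewards, which is why it is often reported for imbalanced data.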
|
591
|
|
592
|
Large Earthquake Magnitude Prediction in Chile with Imbalanced Classifiers and Ensemble Learning. APPLIED SCIENCES-BASEL 2017. [DOI: 10.3390/app7060625] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.6] [Reference Citation Analysis] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 12/20/2022]
|
593
|
Imbalance-Aware Machine Learning for Predicting Rare and Common Disease-Associated Non-Coding Variants. Sci Rep 2017; 7:2959. [PMID: 28592878 PMCID: PMC5462751 DOI: 10.1038/s41598-017-03011-5] [Citation(s) in RCA: 43] [Impact Index Per Article: 5.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/17/2016] [Accepted: 04/21/2017] [Indexed: 12/15/2022] Open
Abstract
Disease and trait-associated variants represent a tiny minority of all known genetic variation, and therefore there is necessarily an imbalance between the small set of available disease-associated and the much larger set of non-deleterious genomic variation, especially in non-coding regulatory regions of human genome. Machine Learning (ML) methods for predicting disease-associated non-coding variants are faced with a chicken and egg problem - such variants cannot be easily found without ML, but ML cannot begin to be effective until a sufficient number of instances have been found. Most of state-of-the-art ML-based methods do not adopt specific imbalance-aware learning techniques to deal with imbalanced data that naturally arise in several genome-wide variant scoring problems, thus resulting in a significant reduction of sensitivity and precision. We present a novel method that adopts imbalance-aware learning strategies based on resampling techniques and a hyper-ensemble approach that outperforms state-of-the-art methods in two different contexts: the prediction of non-coding variants associated with Mendelian and with complex diseases. We show that imbalance-aware ML is a key issue for the design of robust and accurate prediction algorithms and we provide a method and an easy-to-use software tool that can be effectively applied to this challenging prediction task.
|
594
|
Fast-CBUS: A fast clustering-based undersampling method for addressing the class imbalance problem. Neurocomputing 2017. [DOI: 10.1016/j.neucom.2017.03.011] [Citation(s) in RCA: 73] [Impact Index Per Article: 9.1] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/20/2022]
|
595
|
Mauša G, Galinac Grbac T. Co-evolutionary multi-population genetic programming for classification in software defect prediction: An empirical case study. Appl Soft Comput 2017. [DOI: 10.1016/j.asoc.2017.01.050] [Citation(s) in RCA: 35] [Impact Index Per Article: 4.4] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/25/2022]
|
596
|
Synthetic semi-supervised learning in imbalanced domains: Constructing a model for donor-recipient matching in liver transplantation. Knowl Based Syst 2017. [DOI: 10.1016/j.knosys.2017.02.020] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/19/2022]
|
597
|
Akkasi A, Varoğlu E, Dimililer N. Balanced undersampling: a novel sentence-based undersampling method to improve recognition of named entities in chemical and biomedical text. APPL INTELL 2017. [DOI: 10.1007/s10489-017-0920-5] [Citation(s) in RCA: 18] [Impact Index Per Article: 2.3] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/28/2022]
|
598
|
Applying the Temporal Abstraction Technique to the Prediction of Chronic Kidney Disease Progression. J Med Syst 2017; 41:85. [PMID: 28401396 DOI: 10.1007/s10916-017-0732-5] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/24/2017] [Accepted: 04/03/2017] [Indexed: 01/05/2023]
Abstract
Chronic kidney disease (CKD) has attracted considerable attention in the public health domain in recent years. Researchers have exerted considerable effort in attempting to identify critical factors that may affect the deterioration of CKD. In clinical practice, the physical conditions of CKD patients are regularly recorded. The data of CKD patients are recorded as a high-dimensional time-series. Therefore, how to analyze these time-series data for identifying the factors affecting CKD deterioration becomes an interesting topic. This study aims at developing prediction models for stage 4 CKD patients to determine whether their eGFR level decreased to less than 15 ml/min/1.73 m² (end-stage renal disease, ESRD) 6 months after collecting their final laboratory test information by evaluating time-related features. A total of 463 CKD patients collected from January 2004 to December 2013 at one of the biggest dialysis centers in southern Taiwan were included in the experimental evaluation. We integrated the temporal abstraction (TA) technique with data mining methods to develop CKD progression prediction models. Specifically, the TA technique was used to extract vital features (TA-related features) from high-dimensional time-series data, after which several data mining techniques, including C4.5, classification and regression tree (CART), support vector machine, and adaptive boosting (AdaBoost), were applied to develop CKD progression prediction models. The results revealed that incorporating temporal information into the prediction models increased the efficiency of the models. The AdaBoost+CART model exhibited the most accurate prediction among the constructed models (Accuracy: 0.662, Sensitivity: 0.620, Specificity: 0.704, and AUC: 0.715). A number of TA-related features were found to be associated with the deterioration of renal function. These features can provide further clinical information to explain the progression of CKD.
TA-related features extracted by long-term tracking of changes in laboratory test values can enable early diagnosis of ESRD. The developed models using these features can facilitate medical personnel in making clinical decisions to provide appropriate diagnoses and improved care quality to patients with CKD.
|
599
|
Fernández A, Carmona CJ, José del Jesus M, Herrera F. A Pareto-based Ensemble with Feature and Instance Selection for Learning from Multi-Class Imbalanced Datasets. Int J Neural Syst 2017. [DOI: 10.1142/s0129065717500289] [Citation(s) in RCA: 36] [Impact Index Per Article: 4.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/18/2022]
Abstract
Imbalanced classification is related to those problems that have an uneven distribution among classes. In addition to the former, when instances are located into the overlapped areas, the correct modeling of the problem becomes harder. Current solutions for both issues are often focused on the binary case study, as multi-class datasets require an additional effort to be addressed. In this research, we overcome these problems by carrying out a combination between feature and instance selections. Feature selection will allow simplifying the overlapping areas easing the generation of rules to distinguish among the classes. Selection of instances from all classes will address the imbalance itself by finding the most appropriate class distribution for the learning task, as well as possibly removing noise and difficult borderline examples. For the sake of obtaining an optimal joint set of features and instances, we embedded the searching for both parameters in a Multi-Objective Evolutionary Algorithm, using the C4.5 decision tree as baseline classifier in this wrapper approach. The multi-objective scheme allows taking a double advantage: the search space becomes broader, and we may provide a set of different solutions in order to build an ensemble of classifiers. This proposal has been contrasted versus several state-of-the-art solutions on imbalanced classification showing excellent results in both binary and multi-class problems.
Affiliation(s)
- Alberto Fernández
- Department of Computer Science and Artificial Intelligence, University of Granada, Granada 18071, Spain
| | - Cristobal José Carmona
- Department of Civil Engineering, University of Burgos, Burgos 09006, Spain
- Leicester School of Pharmacy, De Montfort University, Leicester, LE1 9BH, UK
| | | | - Francisco Herrera
- Department of Computer Science and Artificial Intelligence, University of Granada, Granada 18071, Spain
- Faculty of Computing and Information Technology — North Jeddah, King Abdulaziz University (KAU), Jeddah 80200, Saudi Arabia
| |
|
600
|
Mathews LM, Seetha H. On Improving the Classification of Imbalanced Data. CYBERNETICS AND INFORMATION TECHNOLOGIES 2017. [DOI: 10.1515/cait-2017-0004] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 11/15/2022]
Abstract
Mining of imbalanced data is a challenging task due to its complex inherent characteristics. Conventional classifiers such as the nearest neighbor classifier are severely biased towards the majority class, as minority class data are under-represented and outnumbered. This paper focuses on building an improved Nearest Neighbor Classifier for two-class imbalanced data. Three oversampling techniques are presented for generating artificial minority-class instances to balance the distribution among the classes. Experimental results showed that the proposed methods outperformed the conventional classifier.
Affiliation(s)
- Lincy Meera Mathews
- School of Information Technology and Engineering, VIT University, Vellore, Tamil Nadu, India
| | - Hari Seetha
- School of Computing Science & Engineering, VIT University, Vellore, Tamil Nadu, India
| |
|