151
|
Emotion recognition from single-trial EEG based on kernel Fisher's emotion pattern and imbalanced quasiconformal kernel support vector machine. SENSORS 2014; 14:13361-88. [PMID: 25061837 PMCID: PMC4179000 DOI: 10.3390/s140813361] [Citation(s) in RCA: 39] [Impact Index Per Article: 3.5] [Received: 04/29/2014] [Revised: 07/11/2014] [Accepted: 07/18/2014] [Indexed: 11/17/2022]
Abstract
Electroencephalogram-based emotion recognition (EEG-ER) has received increasing attention in the fields of health care, affective computing, and brain-computer interface (BCI). However, satisfactory ER performance within a bi-dimensional and non-discrete emotional space using single-trial EEG data remains a challenging task. To address this issue, we propose a three-layer scheme for single-trial EEG-ER. In the first layer, a set of spectral powers of different EEG frequency bands are extracted from multi-channel single-trial EEG signals. In the second layer, the kernel Fisher's discriminant analysis method is applied to further extract features with better discrimination ability from the EEG spectral powers. The feature vector produced by layer 2 is called a kernel Fisher's emotion pattern (KFEP), and is sent into layer 3 for further classification where the proposed imbalanced quasiconformal kernel support vector machine (IQK-SVM) serves as the emotion classifier. The outputs of the three layer EEG-ER system include labels of emotional valence and arousal. Furthermore, to collect effective training and testing datasets for the current EEG-ER system, we also use an emotion-induction paradigm in which a set of pictures selected from the International Affective Picture System (IAPS) are employed as emotion induction stimuli. The performance of the proposed three-layer solution is compared with that of other EEG spectral power-based features and emotion classifiers. Results on 10 healthy participants indicate that the proposed KFEP feature performs better than other spectral power features, and IQK-SVM outperforms traditional SVM in terms of the EEG-ER accuracy. Our findings also show that the proposed EEG-ER scheme achieves the highest classification accuracies of valence (82.68%) and arousal (84.79%) among all testing methods.
|
152
|
|
153
|
Anwar N, Jones G, Ganesh S. Measurement of data complexity for classification problems with unbalanced data. Stat Anal Data Min 2014. [DOI: 10.1002/sam.11228] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.5] [Indexed: 11/11/2022]
Affiliation(s)
- Nafees Anwar
  - Institute of Fundamental Sciences (Statistics); Massey University; Private Bag 11222, Palmerston North Manawatu 4412 New Zealand
- Geoff Jones
  - Institute of Fundamental Sciences (Statistics); Massey University; Private Bag 11222, Palmerston North Manawatu 4412 New Zealand
- Siva Ganesh
  - Institute of Fundamental Sciences (Statistics); Massey University; Private Bag 11222, Palmerston North Manawatu 4412 New Zealand
|
154
|
Cao P, Yang J, Li W, Zhao D, Zaiane O. Ensemble-based hybrid probabilistic sampling for imbalanced data learning in lung nodule CAD. Comput Med Imaging Graph 2014; 38:137-50. [DOI: 10.1016/j.compmedimag.2013.12.003] [Citation(s) in RCA: 27] [Impact Index Per Article: 2.5] [Received: 05/13/2013] [Revised: 10/19/2013] [Accepted: 12/02/2013] [Indexed: 01/15/2023]
|
155
|
Predicting pupylation sites in prokaryotic proteins using pseudo-amino acid composition and extreme learning machine. Neurocomputing 2014. [DOI: 10.1016/j.neucom.2012.11.058] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.2] [Indexed: 11/20/2022]
|
156
|
Lu C, Mandal M. Toward Automatic Mitotic Cell Detection and Segmentation in Multispectral Histopathological Images. IEEE J Biomed Health Inform 2014; 18:594-605. [DOI: 10.1109/jbhi.2013.2277837] [Citation(s) in RCA: 36] [Impact Index Per Article: 3.3] [Indexed: 11/07/2022]
|
157
|
Dubey R, Zhou J, Wang Y, Thompson PM, Ye J. Analysis of sampling techniques for imbalanced data: An n = 648 ADNI study. Neuroimage 2014; 87:220-41. [PMID: 24176869 PMCID: PMC3946903 DOI: 10.1016/j.neuroimage.2013.10.005] [Citation(s) in RCA: 84] [Impact Index Per Article: 7.6] [Received: 05/17/2013] [Revised: 09/10/2013] [Accepted: 10/07/2013] [Indexed: 02/07/2023] Open
Abstract
Many neuroimaging applications deal with imbalanced imaging data. For example, in Alzheimer's Disease Neuroimaging Initiative (ADNI) dataset, the mild cognitive impairment (MCI) cases eligible for the study are nearly two times the Alzheimer's disease (AD) patients for structural magnetic resonance imaging (MRI) modality and six times the control cases for proteomics modality. Constructing an accurate classifier from imbalanced data is a challenging task. Traditional classifiers that aim to maximize the overall prediction accuracy tend to classify all data into the majority class. In this paper, we study an ensemble system of feature selection and data sampling for the class imbalance problem. We systematically analyze various sampling techniques by examining the efficacy of different rates and types of undersampling, oversampling, and a combination of over and undersampling approaches. We thoroughly examine six widely used feature selection algorithms to identify significant biomarkers and thereby reduce the complexity of the data. The efficacy of the ensemble techniques is evaluated using two different classifiers including Random Forest and Support Vector Machines based on classification accuracy, area under the receiver operating characteristic curve (AUC), sensitivity, and specificity measures. Our extensive experimental results show that for various problem settings in ADNI, (1) a balanced training set obtained with K-Medoids technique based undersampling gives the best overall performance among different data sampling techniques and no sampling approach; and (2) sparse logistic regression with stability selection achieves competitive performance among various feature selection algorithms. Comprehensive experiments with various settings show that our proposed ensemble model of multiple undersampled datasets yields stable and promising results.
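The idea of building an ensemble from multiple undersampled datasets can be illustrated with a minimal sketch. It uses plain random undersampling rather than the K-Medoids-based undersampling the abstract reports as performing best, and it assumes binary labels; the function name `balanced_subsets` and the toy label list are hypothetical.

```python
import random


def balanced_subsets(labels, n_subsets, seed=0):
    """Return index lists, each pairing all minority samples with an
    equally sized random draw from the majority class.

    Assumption: labels is a list with exactly two distinct values.
    """
    rng = random.Random(seed)
    # The minority class is the label that occurs least often.
    minority_label = min(set(labels), key=labels.count)
    minority = [i for i, y in enumerate(labels) if y == minority_label]
    majority = [i for i, y in enumerate(labels) if y != minority_label]

    subsets = []
    for _ in range(n_subsets):
        # Sample without replacement so each subset is exactly balanced.
        sampled = rng.sample(majority, len(minority))
        subsets.append(sorted(minority + sampled))
    return subsets


if __name__ == "__main__":
    toy_labels = [1] * 5 + [0] * 20  # hypothetical 1:4 imbalance
    for indices in balanced_subsets(toy_labels, n_subsets=3):
        print(indices)
```

One classifier would then be trained per subset and their predictions combined; the choice of base classifier and combination rule is left open here.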
Affiliation(s)
- Rashmi Dubey
  - School of Computing, Informatics, and Decision Systems Engineering, Arizona State University, Tempe, AZ, USA; Center for Evolutionary Medicine and Informatics, The Biodesign Institute, Arizona State University, Tempe, AZ, USA
- Jiayu Zhou
  - School of Computing, Informatics, and Decision Systems Engineering, Arizona State University, Tempe, AZ, USA; Center for Evolutionary Medicine and Informatics, The Biodesign Institute, Arizona State University, Tempe, AZ, USA
- Yalin Wang
  - School of Computing, Informatics, and Decision Systems Engineering, Arizona State University, Tempe, AZ, USA
- Paul M Thompson
  - Imaging Genetics Center, Laboratory of Neuro Imaging, UCLA School of Medicine, Los Angeles, CA, USA
- Jieping Ye
  - School of Computing, Informatics, and Decision Systems Engineering, Arizona State University, Tempe, AZ, USA; Center for Evolutionary Medicine and Informatics, The Biodesign Institute, Arizona State University, Tempe, AZ, USA.
|
158
|
An empirical study of the classification performance of learners on imbalanced and noisy software quality data. Inf Sci (N Y) 2014. [DOI: 10.1016/j.ins.2010.12.016] [Citation(s) in RCA: 88] [Impact Index Per Article: 8.0] [Indexed: 11/18/2022]
|
159
|
Design of an evolutionary approach for intrusion detection. ScientificWorldJournal 2013; 2013:962185. [PMID: 24376390 PMCID: PMC3858966 DOI: 10.1155/2013/962185] [Citation(s) in RCA: 31] [Impact Index Per Article: 2.6] [Received: 08/21/2013] [Accepted: 09/16/2013] [Indexed: 11/21/2022] Open
Abstract
A novel evolutionary approach is proposed for effective intrusion detection based on benchmark datasets. The approach generates a pool of noninferior individual solutions and ensembles thereof, which can be used to detect intrusions accurately. For the intrusion detection problem, it can consider conflicting objectives simultaneously, such as the detection rate of each attack class, error rate, accuracy, and diversity, so that the generated solutions and ensembles offer optimized trade-offs among these objectives. In this paper, a three-phase approach is proposed to generate solutions with a simple chromosome design. In the first phase, a Pareto front of noninferior individual solutions is approximated. In the second phase, the entire solution set is further refined to determine effective ensemble solutions, taking solution interaction into account; this approximates another Pareto front of ensemble solutions that improves on that of the individual solutions. The ensemble solutions on the improved Pareto front yielded improved detection results on benchmark datasets for intrusion detection. In the third phase, a combination method such as majority voting is used to fuse the predictions of the individual solutions into the prediction of the ensemble solution. Two benchmark datasets, KDD Cup 1999 and ISCX 2012, are used to demonstrate and validate the performance of the proposed approach for intrusion detection. The approach can discover individual solutions and ensembles thereof with good support and detection rate on these datasets, in comparison with well-known ensemble methods such as bagging and boosting. In addition, the proposed approach is a generalized classification approach that is applicable to any field that has multiple conflicting objectives and whose dataset can be represented as labelled instances described by their features.
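The majority-voting step of the third phase can be sketched as follows. This is only an illustration of the general voting rule, not the authors' implementation; the function name `majority_vote` and the attack-class labels in the example are hypothetical.

```python
from collections import Counter


def majority_vote(predictions):
    """Fuse per-classifier predictions into one label per instance.

    predictions: list of equal-length label lists, one per classifier.
    Ties go to the label that appears first among the tied votes,
    because Counter.most_common keeps first-seen order for equal counts.
    """
    fused = []
    # zip(*predictions) groups the votes cast for the same instance.
    for votes in zip(*predictions):
        fused.append(Counter(votes).most_common(1)[0][0])
    return fused


if __name__ == "__main__":
    # Three hypothetical classifiers voting on three connections.
    preds = [
        ["normal", "dos", "probe"],
        ["normal", "dos", "dos"],
        ["dos", "dos", "probe"],
    ]
    print(majority_vote(preds))
```

With an odd number of classifiers and two classes, ties cannot occur; with more classes a tie-breaking rule such as the one noted in the docstring has to be chosen explicitly.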
|
160
|
López V, Fernández A, García S, Palade V, Herrera F. An insight into classification with imbalanced data: Empirical results and current trends on using data intrinsic characteristics. Inf Sci (N Y) 2013. [DOI: 10.1016/j.ins.2013.07.007] [Citation(s) in RCA: 932] [Impact Index Per Article: 77.7] [Indexed: 11/15/2022]
|
161
|
PAKDD’12 best paper: generating balanced classifier-independent training samples from unlabeled data. Knowl Inf Syst 2013. [DOI: 10.1007/s10115-013-0683-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/26/2022]
|
162
|
Xue W, Zhang J. Dealing with Imbalanced Dataset: A Re-sampling Method Based on the Improved SMOTE Algorithm. COMMUN STAT-SIMUL C 2013. [DOI: 10.1080/03610918.2012.728274] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.5] [Indexed: 10/26/2022]
|
163
|
Wei Q, Dunbrack RL. The role of balanced training and testing data sets for binary classifiers in bioinformatics. PLoS One 2013; 8:e67863. [PMID: 23874456 PMCID: PMC3706434 DOI: 10.1371/journal.pone.0067863] [Citation(s) in RCA: 149] [Impact Index Per Article: 12.4] [Received: 11/10/2012] [Accepted: 05/23/2013] [Indexed: 12/03/2022] Open
Abstract
Training and testing of conventional machine learning models on binary classification problems depend on the proportions of the two outcomes in the relevant data sets. This may be especially important in practical terms when real-world applications of the classifier are either highly imbalanced or occur in unknown proportions. Intuitively, it may seem sensible to train machine learning models on data similar to the target data in terms of proportions of the two binary outcomes. However, we show that this is not the case using the example of prediction of deleterious and neutral phenotypes of human missense mutations in human genome data, for which the proportion of the binary outcome is unknown. Our results indicate that using balanced training data (50% neutral and 50% deleterious) results in the highest balanced accuracy (the average of True Positive Rate and True Negative Rate), Matthews correlation coefficient, and area under ROC curves, no matter what the proportions of the two phenotypes are in the testing data. Besides balancing the data by undersampling the majority class, other techniques in machine learning include oversampling the minority class, interpolating minority-class data points and various penalties for misclassifying the minority class. However, these techniques are not commonly used in either the missense phenotype prediction problem or in the prediction of disordered residues in proteins, where the imbalance problem is substantial. The appropriate approach depends on the amount of available data and the specific problem at hand.
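The evaluation measures named in this abstract can be written out from confusion-matrix counts. The sketch below follows the abstract's definition of balanced accuracy (the average of True Positive Rate and True Negative Rate) and the standard formula for the Matthews correlation coefficient; the counts in the example are hypothetical and describe a classifier that always predicts the majority class.

```python
import math


def accuracy(tp, fn, tn, fp):
    """Fraction of all samples classified correctly."""
    return (tp + tn) / (tp + fn + tn + fp)


def balanced_accuracy(tp, fn, tn, fp):
    """Average of True Positive Rate and True Negative Rate."""
    tpr = tp / (tp + fn)
    tnr = tn / (tn + fp)
    return (tpr + tnr) / 2


def matthews_corrcoef(tp, fn, tn, fp):
    """Matthews correlation coefficient; 0.0 when the denominator is 0."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom


if __name__ == "__main__":
    # Hypothetical test set: 10 positives, 90 negatives, and a classifier
    # that labels everything negative.
    counts = dict(tp=0, fn=10, tn=90, fp=0)
    print(accuracy(**counts))           # high despite missing every positive
    print(balanced_accuracy(**counts))  # chance level
    print(matthews_corrcoef(**counts))  # no correlation
```

The example shows why plain accuracy is a poor guide on imbalanced test data: the majority-only classifier scores well on accuracy while both balanced accuracy and the Matthews coefficient stay at their uninformative values.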
Affiliation(s)
- Qiong Wei
  - Institute for Cancer Research, Fox Chase Cancer Center, Philadelphia, Pennsylvania, United States of America
- Roland L. Dunbrack
  - Institute for Cancer Research, Fox Chase Cancer Center, Philadelphia, Pennsylvania, United States of America
|
164
|
Kundu K, Costa F, Huber M, Reth M, Backofen R. Semi-supervised prediction of SH2-peptide interactions from imbalanced high-throughput data. PLoS One 2013; 8:e62732. [PMID: 23690949 PMCID: PMC3656881 DOI: 10.1371/journal.pone.0062732] [Citation(s) in RCA: 24] [Impact Index Per Article: 2.0] [Received: 01/11/2013] [Accepted: 03/22/2013] [Indexed: 01/08/2023] Open
Abstract
Src homology 2 (SH2) domains are the largest family of the peptide-recognition modules (PRMs) that bind to phosphotyrosine-containing peptides. Knowledge about binding partners of SH2 domains is key for a deeper understanding of different cellular processes. Given the high binding specificity of SH2, in-silico ligand peptide prediction is of great interest. Currently, however, only a few approaches have been published for the prediction of SH2-peptide interactions. Their main shortcomings range from limited coverage to restrictive modeling assumptions (they are mainly based on position-specific scoring matrices and do not take into consideration complex amino acid inter-dependencies) and high computational complexity. We propose a simple yet effective machine learning approach for a large set of known human SH2 domains. We used comprehensive data from micro-array and peptide-array experiments on 51 human SH2 domains. To deal with the high data imbalance and the high signal-to-noise ratio, we cast the problem in a semi-supervised setting. We report competitive predictive performance with respect to the state of the art. Specifically, we obtain 0.83 AUC ROC and 0.93 AUC PR, compared with 0.71 AUC ROC and 0.87 AUC PR previously achieved by the SMALI approach, which is based on position-specific scoring matrices (PSSMs). Our work provides three main contributions. First, we showed that better models can be obtained when the information on the non-interacting peptides (negative examples) is also used. Second, we improve performance by considering high-order correlations between the ligand positions, employing regularization techniques to avoid overfitting. Third, we developed an approach to tackle the data imbalance problem using a semi-supervised strategy. Finally, we performed a genome-wide prediction of human SH2-peptide binding, uncovering several findings of biological relevance.
We make our models and genome-wide predictions for all 51 SH2 domains freely available to the scientific community at the following URLs: http://www.bioinf.uni-freiburg.de/Software/SH2PepInt/SH2PepInt.tar.gz and http://www.bioinf.uni-freiburg.de/Software/SH2PepInt/Genome-wide-predictions.tar.gz, respectively.
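An AUC ROC value such as those quoted above can be read as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The sketch below computes it directly from that pairwise definition; it is quadratic in the number of samples and meant only as an illustration, and the scores and labels in the example are hypothetical.

```python
def roc_auc(scores, labels):
    """Area under the ROC curve via pairwise comparison.

    labels use 1 for positive and 0 for negative; ties between a positive
    and a negative score count as half a win.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    # Hypothetical scores for two binders (1) and two non-binders (0).
    print(roc_auc([0.9, 0.8, 0.7, 0.6], [1, 0, 1, 0]))  # 0.75
```

Because the measure depends only on how positives rank against negatives, it does not change when the class ratio in the test set changes, which is one reason it is commonly reported for imbalanced problems.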
Affiliation(s)
- Kousik Kundu
  - Bioinformatics Group, Department of Computer Science, University of Freiburg, Freiburg, Germany
  - Centre for Biological Signalling Studies (BIOSS), University of Freiburg, Freiburg, Germany
- Fabrizio Costa
  - Bioinformatics Group, Department of Computer Science, University of Freiburg, Freiburg, Germany
- Michael Huber
  - Institute of Biochemistry and Molecular Immunology, University Clinic, RWTH Aachen University, Aachen, Germany
- Michael Reth
  - Centre for Biological Signalling Studies (BIOSS), University of Freiburg, Freiburg, Germany
  - Department of Molecular Immunology, Max Planck Institute of Immunology, Freiburg, Germany
- Rolf Backofen
  - Bioinformatics Group, Department of Computer Science, University of Freiburg, Freiburg, Germany
  - Centre for Biological Signalling Studies (BIOSS), University of Freiburg, Freiburg, Germany
  - Centre for Biological Systems Analysis (ZBSA), University of Freiburg, Freiburg, Germany
  - Center for non-coding RNA in Technology and Health, University of Copenhagen, Frederiksberg, Denmark
|
165
|
Yu DJ, Hu J, Tang ZM, Shen HB, Yang J, Yang JY. Improving protein-ATP binding residues prediction by boosting SVMs with random under-sampling. Neurocomputing 2013. [DOI: 10.1016/j.neucom.2012.10.012] [Citation(s) in RCA: 53] [Impact Index Per Article: 4.4] [Indexed: 01/07/2023]
|
166
|
Jacques J, Taillard J, Delerue D, Jourdan L, Dhaenens C. MOCA-I: Discovering Rules and Guiding Decision Maker in the Context of Partial Classification in Large and Imbalanced Datasets. LECTURE NOTES IN COMPUTER SCIENCE 2013. [DOI: 10.1007/978-3-642-44973-4_5] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Indexed: 02/11/2023]
|
167
|
Overlapping, Rare Examples and Class Decomposition in Learning Classifiers from Imbalanced Data. EMERGING PARADIGMS IN MACHINE LEARNING 2013. [DOI: 10.1007/978-3-642-28699-5_11] [Citation(s) in RCA: 43] [Impact Index Per Article: 3.6] [Indexed: 12/07/2022]
|
168
|
Class Imbalance in the Prediction of Dementia from Neuropsychological Data. PROGRESS IN ARTIFICIAL INTELLIGENCE 2013. [DOI: 10.1007/978-3-642-40669-0_13] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.3] [Indexed: 11/25/2022]
|
169
|
Parvin H, Ansari S, Parvin S. Proposing a New Method for Non-relative Imbalanced Dataset. ADVANCES IN INTELLIGENT SYSTEMS AND COMPUTING 2013. [DOI: 10.1007/978-3-642-32922-7_31] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 01/21/2023]
|
170
|
|
171
|
A novel framework for class imbalance learning using intelligent under-sampling. PROGRESS IN ARTIFICIAL INTELLIGENCE 2012. [DOI: 10.1007/s13748-012-0038-2] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.5] [Indexed: 11/26/2022]
|
172
|
DBFS: An effective Density Based Feature Selection scheme for small sample size and high dimensional imbalanced data sets. DATA KNOWL ENG 2012. [DOI: 10.1016/j.datak.2012.08.001] [Citation(s) in RCA: 49] [Impact Index Per Article: 3.8] [Indexed: 11/24/2022]
|
173
|
Menardi G, Torelli N. Training and assessing classification rules with imbalanced data. Data Min Knowl Discov 2012. [DOI: 10.1007/s10618-012-0295-5] [Citation(s) in RCA: 162] [Impact Index Per Article: 12.5] [Indexed: 10/27/2022]
|
174
|
Cheng CH. Discovering knowledge of medical quality in total hip arthroplasty (THA). Arch Gerontol Geriatr 2012; 55:323-30. [DOI: 10.1016/j.archger.2011.09.003] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 07/22/2011] [Revised: 08/31/2011] [Accepted: 09/01/2011] [Indexed: 11/26/2022]
|
175
|
Chen YS, Cheng CH. Identifying the medical practice after total hip arthroplasty using an integrated hybrid approach. Comput Biol Med 2012; 42:826-40. [PMID: 22795228 DOI: 10.1016/j.compbiomed.2012.06.006] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.2] [Received: 02/13/2012] [Revised: 06/13/2012] [Accepted: 06/20/2012] [Indexed: 10/28/2022]
Abstract
Total hip arthroplasty (THA) is considered a critical option only when patients have tried more conservative treatments but continue to have pain, stiffness, or problems with the function of the hip. THA has become a major concern amid the rapid growth of the aging population and constrained health care resources in Taiwan. Moreover, prior studies indicated that imbalanced class distribution problems exist in constructed classification models and seriously degrade model performance in the health care industry. Therefore, this study proposes an integrated hybrid approach as an alternative method for classifying the quality (e.g., the length of hospital stay) of medical practice, in the presence of an imbalanced class problem, after a THA procedure, for hip replacement patients and their doctors. The proposed approach comprises seven components: expert knowledge, global discretization, imbalanced bootstrap technique, reduct and core methods, rough sets, rule induction, and rule filter. The approach is illustrated in practice on an experimental dataset from the National Health Insurance Research Database (NHIRD) in Taiwan. The experimental results reveal that the proposed approach performs better than the listed methods under the evaluation criteria. The output created by the rough set LEM2 algorithm is a comprehensible decision rule set that can be applied in knowledge-based health care services as desired. The analytical results provide useful THA information for both academics and practitioners, and these results could be applicable to other diseases or to other countries with similar social and cultural practices.
Affiliation(s)
- You-Shyang Chen
  - Department of Information Management, Hwa Hsia Institute of Technology, Chung Ho District, New Taipei City, Taiwan.
|
176
|
Zhang YN, Yu DJ, Li SS, Fan YX, Huang Y, Shen HB. Predicting protein-ATP binding sites from primary sequence through fusing bi-profile sampling of multi-view features. BMC Bioinformatics 2012; 13:118. [PMID: 22651691 PMCID: PMC3424114 DOI: 10.1186/1471-2105-13-118] [Citation(s) in RCA: 36] [Impact Index Per Article: 2.8] [Received: 12/05/2011] [Accepted: 05/31/2012] [Indexed: 12/23/2022] Open
Abstract
Background: Adenosine-5′-triphosphate (ATP) is a multifunctional nucleotide and plays an important role in cell biology as a coenzyme interacting with proteins. Revealing the binding sites between proteins and ATP is important for understanding the functionality of the proteins and the mechanisms of protein-ATP complexes. Results: In this paper, we propose a novel framework for predicting the functional residues through which proteins bind ATP molecules. The new prediction protocol combines sequence evolutionary information with bi-profile sampling of multi-view sequential features and sequence-derived structural features. The hypothesis behind this strategy is that a single-view feature can represent only part of the knowledge about the target, whereas multiple sources of descriptors can be complementary. Conclusions: Prediction performance evaluated by both 5-fold and leave-one-out jackknife cross-validation tests on two benchmark datasets, consisting of 168 and 227 non-homologous ATP-binding proteins respectively, demonstrates the efficacy of the proposed protocol. Our experimental results also reveal that the residue structural characteristics of real protein-ATP binding sites are significantly different from those of normal residues; for example, the binding residues do not show high solvent accessibility propensities, and binding tends to occur at the junctions between different secondary structure segments. Furthermore, testing multiple ratios between positive and negative samples shows that performance is affected by imbalanced training datasets. Increasing the dataset scale is also shown to improve prediction performance.
Affiliation(s)
- Ya-Nan Zhang
  - Department of Automation, Shanghai Jiao Tong University, and Key Laboratory of System Control and Information Processing, Ministry of Education of China, Shanghai 200240, China
|
177
|
Klement W, Wilk S, Michalowski W, Farion KJ, Osmond MH, Verter V. Predicting the need for CT imaging in children with minor head injury using an ensemble of Naive Bayes classifiers. Artif Intell Med 2012; 54:163-70. [DOI: 10.1016/j.artmed.2011.11.005] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.2] [Received: 05/01/2011] [Revised: 10/18/2011] [Accepted: 11/24/2011] [Indexed: 10/14/2022]
|
178
|
García V, Sánchez J, Mollineda R. On the effectiveness of preprocessing methods when dealing with different levels of class imbalance. Knowl Based Syst 2012. [DOI: 10.1016/j.knosys.2011.06.013] [Citation(s) in RCA: 113] [Impact Index Per Article: 8.7] [Indexed: 10/18/2022]
|
179
|
Zhang Y, Zhang D, Mi G, Ma D, Li G, Guo Y, Li M, Zhu M. Using ensemble methods to deal with imbalanced data in predicting protein-protein interactions. Comput Biol Chem 2012; 36:36-41. [PMID: 22286086 DOI: 10.1016/j.compbiolchem.2011.12.003] [Citation(s) in RCA: 27] [Impact Index Per Article: 2.1] [Received: 10/03/2011] [Revised: 12/02/2011] [Accepted: 12/21/2011] [Indexed: 11/30/2022]
Abstract
In proteins, the number of interacting pairs is usually much smaller than the number of non-interacting ones, so an imbalanced data problem arises in protein-protein interaction (PPI) prediction. In this article, we introduce two ensemble methods to address this problem. These ensemble methods combine a cluster-based under-sampling technique with fusion classifiers. We evaluate the ensemble methods on a dataset from the Database of Interacting Proteins (DIP) with 10-fold cross-validation. All the prediction models achieve an area under the receiver operating characteristic curve (AUC) of about 95%. Our results show that the ensemble classifiers are quite effective in predicting PPIs; we also draw some valuable conclusions on the performance of ensemble methods for PPIs with imbalanced data. The prediction software and all datasets employed in the work can be obtained for free at http://cic.scu.edu.cn/bioinformatics/Ensemble_PPIs/index.html.
Affiliation(s)
- Yongqing Zhang
  - College of Computer Science, Sichuan University, Chengdu 610065, PR China
|
180
|
Identification of Different Types of Minority Class Examples in Imbalanced Data. LECTURE NOTES IN COMPUTER SCIENCE 2012. [DOI: 10.1007/978-3-642-28931-6_14] [Citation(s) in RCA: 36] [Impact Index Per Article: 2.8] [Indexed: 12/29/2022]
|
181
|
|
182
|
Mena L, Gonzalez JA. Symbolic one-class learning from imbalanced datasets: application in medical diagnosis. INT J ARTIF INTELL T 2011. [DOI: 10.1142/s0218213009000135] [Citation(s) in RCA: 31] [Impact Index Per Article: 2.2] [Indexed: 11/18/2022]
Abstract
When working with real-world applications we often find imbalanced datasets, those for which there exists a majority class with normal data and a minority class with abnormal or important data. In this work, we make an overview of the class imbalance problem; we review consequences, possible causes and existing strategies to cope with the inconveniences associated to this problem. As an effort to contribute to the solution of this problem, we propose a new rule induction algorithm named Rule Extraction for MEdical Diagnosis (REMED), as a symbolic one-class learning approach. For the evaluation of the proposed method, we use different medical diagnosis datasets taking into account quantitative metrics, comprehensibility, and reliability. We performed a comparison of REMED versus C4.5 and RIPPER combined with over-sampling and cost-sensitive strategies. This empirical analysis of the REMED algorithm showed it to be quantitatively competitive with C4.5 and RIPPER in terms of the area under the Receiver Operating Characteristic curve (AUC) and the geometric mean, but overcame them in terms of comprehensibility and reliability. Results of our experiments show that REMED generated rules systems with a larger degree of abstraction and patterns closer to well-known abnormal values associated to each considered medical dataset.
Affiliation(s)
- Luis Mena
  - Department of Computer Science, Faculty of Engineering, University of Zulia, Maracaibo, Venezuela
  - National Institute of Astrophysics, Optics and Electronics, Puebla, Mexico
- Jesus A. Gonzalez
  - Department of Computer Science, National Institute of Astrophysics, Optics and Electronics, Puebla, Mexico
|
183
|
Discovering medical quality of total hip arthroplasty by rough set classifier with imbalanced class. Qual Quant 2011. [DOI: 10.1007/s11135-011-9624-9] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.4] [Indexed: 10/17/2022]
|
184
|
Automatic defect detection for TFT-LCD array process using quasiconformal kernel support vector data description. Int J Mol Sci 2011; 12:5762-81. [PMID: 22016625 PMCID: PMC3189749 DOI: 10.3390/ijms12095762] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.4] [Received: 07/05/2011] [Revised: 08/08/2011] [Accepted: 08/16/2011] [Indexed: 11/16/2022] Open
Abstract
Defect detection has been considered an efficient way to increase the yield rate of panels in thin film transistor liquid crystal display (TFT-LCD) manufacturing. In this study we focus on the array process since it is the first and key process in TFT-LCD manufacturing. Various defects occur in the array process, and some of them could cause great damage to the LCD panels. Designing a method that can robustly detect defects in images captured from the surface of LCD panels has therefore become crucial. Previously, support vector data description (SVDD) has been successfully applied to LCD defect detection. However, its generalization performance is limited. In this paper, we propose a novel one-class machine learning method, called quasiconformal kernel SVDD (QK-SVDD), to address this issue. The QK-SVDD can significantly improve the generalization performance of the traditional SVDD by introducing the quasiconformal transformation into a predefined kernel. Experimental results, carried out on real LCD images provided by an LCD manufacturer in Taiwan, indicate that the proposed QK-SVDD not only obtains a high defect detection rate of 96%, but also greatly improves the generalization performance of SVDD. The improvement has been shown to be over 30%. In addition, results also show that the QK-SVDD defect detector is able to complete defect detection on an LCD image within 60 ms.
|
185
|
Learning SVM with weighted maximum margin criterion for classification of imbalanced data. Mathematical and Computer Modelling 2011. [DOI: 10.1016/j.mcm.2010.11.040] [Citation(s) in RCA: 18] [Impact Index Per Article: 1.3] [Indexed: 11/24/2022]
|
186
|
|
187
|
Addressing the Classification with Imbalanced Data: Open Problems and New Challenges on Class Distribution. Lecture Notes in Computer Science 2011. [DOI: 10.1007/978-3-642-21219-2_1] [Citation(s) in RCA: 29] [Impact Index Per Article: 2.1] [Indexed: 12/31/2022]
|
188
|
Abstract
In recent years, learning from imbalanced data has attracted growing attention from both academia and industry due to the explosive growth of applications that use and produce imbalanced data. However, because of the complex characteristics of imbalanced data, many real-world solutions struggle to provide robust efficiency in learning-based applications. In an effort to address this problem, this paper presents Ranked Minority Oversampling in Boosting (RAMOBoost), which is a RAMO technique based on the idea of adaptive synthetic data generation in an ensemble learning system. Briefly, RAMOBoost adaptively ranks minority class instances at each learning iteration according to a sampling probability distribution that is based on the underlying data distribution, and can adaptively shift the decision boundary toward difficult-to-learn minority and majority class instances by using a hypothesis assessment procedure. Simulation analysis on 19 real-world datasets, assessed over various metrics (overall accuracy, precision, recall, F-measure, G-mean, and receiver operating characteristic analysis), is used to illustrate the effectiveness of this method.
Affiliation(s)
- Sheng Chen
  - Department of Electrical and Computer Engineering, Stevens Institute of Technology, Hoboken, NJ 07030, USA.
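The RAMOBoost abstract above describes ranking minority instances by a sampling probability distribution and generating synthetic data from them. The sketch below illustrates only that sampling idea, under stated assumptions: each minority point is weighted by a logistic function of the number of majority points among its `k` nearest neighbours, and synthetic points are interpolated between minority neighbours. The names `sampling_distribution` and `oversample`, the parameters `k` and `alpha`, and the toy data are all hypothetical. The boosting loop and hypothesis assessment step are omitted, so this is not the authors' implementation.

```python
import math
import random

def sampling_distribution(minority, majority, k=3, alpha=0.3):
    """Assumed weighting: more majority neighbours means higher sampling probability."""
    points = [(p, False) for p in minority] + [(p, True) for p in majority]
    weights = []
    for i, x in enumerate(minority):
        # Neighbours of x among all other points, nearest first.
        others = [pt for j, pt in enumerate(points) if j != i]
        others.sort(key=lambda pt: math.dist(x, pt[0]))
        delta = sum(1 for _, is_majority in others[:k] if is_majority)
        weights.append(1.0 / (1.0 + math.exp(-alpha * delta)))
    total = sum(weights)
    return [w / total for w in weights]

def oversample(minority, majority, n_new, k=3, alpha=0.3, seed=0):
    """Draw seeds by the distribution above, then interpolate toward a minority neighbour.

    Assumes at least two minority points.
    """
    rng = random.Random(seed)
    probs = sampling_distribution(minority, majority, k, alpha)
    synthetic = []
    for _ in range(n_new):
        i = rng.choices(range(len(minority)), weights=probs)[0]
        base = minority[i]
        neighbours = sorted(
            (p for j, p in enumerate(minority) if j != i),
            key=lambda p: math.dist(base, p),
        )[:k]
        mate = rng.choice(neighbours)
        t = rng.random()  # position along the segment from base to mate
        synthetic.append(tuple(a + t * (b - a) for a, b in zip(base, mate)))
    return synthetic

# Toy data (hypothetical): the third minority point sits inside a majority cluster.
minority = [(0.0, 0.0), (0.1, 0.0), (1.0, 1.0)]
majority = [(1.1, 1.0), (1.0, 1.1), (0.9, 1.0), (5.0, 5.0)]
probs = sampling_distribution(minority, majority)
new_points = oversample(minority, majority, n_new=5)
```

In this toy setup the minority point surrounded by majority points receives the largest sampling probability, so synthetic examples are generated more often near the hard-to-learn region.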
|
189
|
Qi M, Lu Y, Wang J, Kong J. Prediction of Microporous Aluminophosphate AlPO4-5 Based on Resampling Using Partial Least Squares and Logistic Discrimination. Mol Inform 2010; 29:203-10. [DOI: 10.1002/minf.200900052] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.5] [Received: 10/19/2009] [Accepted: 01/08/2010] [Indexed: 11/11/2022]
|
190
|
Classification Potential vs. Classification Accuracy: A Comprehensive Study of Evolutionary Algorithms with Biomedical Datasets. Lecture Notes in Computer Science 2010. [DOI: 10.1007/978-3-642-17508-4_9] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.7] [Indexed: 12/12/2022]
|
191
|
Napierała K, Stefanowski J, Wilk S. Learning from Imbalanced Data in Presence of Noisy and Borderline Examples. Rough Sets and Current Trends in Computing 2010. [DOI: 10.1007/978-3-642-13529-3_18] [Citation(s) in RCA: 108] [Impact Index Per Article: 7.2] [Indexed: 12/12/2022]
|
192
|
|
193
|
|
194
|
|
195
|
Cheng TH, Hu PH. A Data-Driven Approach to Manage the Length of Stay for Appendectomy Patients. IEEE Transactions on Systems, Man, and Cybernetics - Part A: Systems and Humans 2009. [DOI: 10.1109/tsmca.2009.2025510] [Citation(s) in RCA: 12] [Impact Index Per Article: 0.8] [Indexed: 11/08/2022]
|
196
|
Drown D, Khoshgoftaar T, Seliya N. Evolutionary Sampling and Software Quality Modeling of High-Assurance Systems. IEEE Transactions on Systems, Man, and Cybernetics - Part A: Systems and Humans 2009. [DOI: 10.1109/tsmca.2009.2020804] [Citation(s) in RCA: 48] [Impact Index Per Article: 3.0] [Indexed: 11/10/2022]
|
197
|
He H, Garcia E. Learning from Imbalanced Data. IEEE Transactions on Knowledge and Data Engineering 2009; 21:1263-1284. [DOI: 10.1109/tkde.2008.239] [Citation(s) in RCA: 2078] [Impact Index Per Article: 129.9] [Indexed: 05/23/2023]
|
198
|
A joint investigation of misclassification treatments and imbalanced datasets on neural network performance. Neural Comput Appl 2009. [DOI: 10.1007/s00521-009-0239-1] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.2] [Indexed: 10/21/2022]
|
199
|
Chen MC, Chen LS, Hsu CC, Zeng WR. An information granulation based data mining approach for classifying imbalanced data. Inf Sci (N Y) 2008. [DOI: 10.1016/j.ins.2008.03.018] [Citation(s) in RCA: 57] [Impact Index Per Article: 3.4] [Indexed: 11/16/2022]
|
200
|
Orriols-Puig A, Bernadó-Mansilla E. Evolutionary rule-based systems for imbalanced data sets. Soft Comput 2008. [DOI: 10.1007/s00500-008-0319-7] [Citation(s) in RCA: 75] [Impact Index Per Article: 4.4] [Indexed: 10/22/2022]
|