1901
|
|
1902
|
Xia SX, Meng FR, Liu B, Zhou Y. A Kernel Clustering-Based Possibilistic Fuzzy Extreme Learning Machine for Class Imbalance Learning. Cognit Comput 2014. [DOI: 10.1007/s12559-014-9256-1] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.5] [Indexed: 11/29/2022]
|
1903
|
Zieba M. Service-oriented medical system for supporting decisions with missing and imbalanced data. IEEE J Biomed Health Inform 2014; 18:1533-40. [PMID: 24816614 DOI: 10.1109/jbhi.2014.2322281] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.3] [Indexed: 11/10/2022]
Abstract
In this paper, we propose a service-oriented support decision system (SOSDS) for diagnostic problems that is insensitive to the problems of the imbalanced data and missing values of the attributes, which are widely observed in the medical domain. The system is composed of distributed Web services, which implement machine-learning solutions dedicated to constructing the decision models directly from the datasets impaired by the high percentage of missing values of the attributes and imbalanced class distribution. The issue of the imbalanced data is solved by the application of a cost-sensitive support vector machine and the problem of missing values of attributes is handled by proposing the novel ensemble-based approach that splits the incomplete data space into complete subspaces that are further used to construct base learners. We evaluate the quality of the SOSDS components using three ontological datasets.
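The two ideas named in this abstract (class-dependent costs and splitting incomplete data into complete subspaces) can be illustrated with a minimal sketch. Everything here is an assumption for illustration: the function names `class_weights` and `complete_subspaces`, the use of `None` for a missing value, and the inverse-frequency weighting heuristic are not taken from the paper, and no SVM is trained.

```python
from collections import Counter


def class_weights(labels):
    """Hypothetical helper: weight each class inversely to its frequency,
    a common heuristic for setting misclassification costs."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {c: n / (k * m) for c, m in counts.items()}


def complete_subspaces(rows):
    """Hypothetical helper: for every pattern of observed attributes that
    occurs in the data, list the rows that are complete on that pattern.
    A base learner could then be fitted on each (pattern, rows) pair."""
    patterns = {
        tuple(i for i, v in enumerate(row) if v is not None) for row in rows
    }
    subspaces = {}
    for pattern in patterns:
        if not pattern:  # skip rows with no observed attribute at all
            continue
        subspaces[pattern] = [
            idx
            for idx, row in enumerate(rows)
            if all(row[i] is not None for i in pattern)
        ]
    return subspaces


# Toy data: None marks a missing attribute value.
rows = [(1.0, 2.0, None), (0.5, None, 3.0), (0.2, 1.1, 0.7), (0.9, 2.2, None)]
labels = [0, 0, 0, 1]
weights = class_weights(labels)
subspaces = complete_subspaces(rows)
```

Each subspace keeps only fully observed columns, so a standard learner can be trained on it without imputation; the class weights would be passed to that learner as per-class costs.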
|
1904
|
Abdi L, Hashemi S. To combat multi-class imbalanced problems by means of over-sampling and boosting techniques. Soft Comput 2014. [DOI: 10.1007/s00500-014-1291-z] [Citation(s) in RCA: 20] [Impact Index Per Article: 1.8] [Indexed: 11/27/2022]
|
1905
|
Sweeney EM, Vogelstein JT, Cuzzocreo JL, Calabresi PA, Reich DS, Crainiceanu CM, Shinohara RT. A comparison of supervised machine learning algorithms and feature vectors for MS lesion segmentation using multimodal structural MRI. PLoS One 2014; 9:e95753. [PMID: 24781953 PMCID: PMC4004572 DOI: 10.1371/journal.pone.0095753] [Citation(s) in RCA: 33] [Impact Index Per Article: 3.0] [Received: 01/03/2014] [Accepted: 03/28/2014] [Indexed: 11/18/2022] Open
Abstract
Machine learning is a popular method for mining and analyzing large collections of medical data. We focus on a particular problem from medical research, supervised multiple sclerosis (MS) lesion segmentation in structural magnetic resonance imaging (MRI). We examine the extent to which the choice of machine learning or classification algorithm and feature extraction function impacts the performance of lesion segmentation methods. As quantitative measures derived from structural MRI are important clinical tools for research into the pathophysiology and natural history of MS, the development of automated lesion segmentation methods is an active research field. Yet, little is known about what drives performance of these methods. We evaluate the performance of automated MS lesion segmentation methods, which consist of a supervised classification algorithm composed with a feature extraction function. These feature extraction functions act on the observed T1-weighted (T1-w), T2-weighted (T2-w) and fluid-attenuated inversion recovery (FLAIR) MRI voxel intensities. Each MRI study has a manual lesion segmentation that we use to train and validate the supervised classification algorithms. Our main finding is that the differences in predictive performance are due more to differences in the feature vectors, rather than the machine learning or classification algorithms. Features that incorporate information from neighboring voxels in the brain were found to increase performance substantially. For lesion segmentation, we conclude that it is better to use simple, interpretable, and fast algorithms, such as logistic regression, linear discriminant analysis, and quadratic discriminant analysis, and to develop the features to improve performance.
Affiliation(s)
- Elizabeth M. Sweeney
  - Department of Biostatistics, The Johns Hopkins University, Baltimore, Maryland, United States of America
  - Translational Neuroradiology Unit, Neuroimmunology Branch, National Institute of Neurological Disease and Stroke, National Institute of Health, Bethesda, Maryland, United States of America
- Joshua T. Vogelstein
  - Department of Statistical Science, Duke University, Durham, North Carolina, United States of America
  - Center for the Developing Brain, Child Mind Institute, New York, New York, United States of America
- Jennifer L. Cuzzocreo
  - Department of Radiology, The Johns Hopkins University School of Medicine, Baltimore, Maryland, United States of America
- Peter A. Calabresi
  - Department of Neurology, The Johns Hopkins University School of Medicine, Baltimore, Maryland, United States of America
- Daniel S. Reich
  - Department of Biostatistics, The Johns Hopkins University, Baltimore, Maryland, United States of America
  - Translational Neuroradiology Unit, Neuroimmunology Branch, National Institute of Neurological Disease and Stroke, National Institute of Health, Bethesda, Maryland, United States of America
  - Department of Radiology, The Johns Hopkins University School of Medicine, Baltimore, Maryland, United States of America
  - Department of Neurology, The Johns Hopkins University School of Medicine, Baltimore, Maryland, United States of America
- Ciprian M. Crainiceanu
  - Department of Biostatistics, The Johns Hopkins University, Baltimore, Maryland, United States of America
- Russell T. Shinohara
  - Department of Biostatistics and Epidemiology, Center for Clinical Epidemiology and Biostatistics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, Pennsylvania, United States of America
|
1906
|
An improved systematic approach to predicting transcription factor target genes using support vector machine. PLoS One 2014; 9:e94519. [PMID: 24743548 PMCID: PMC3990533 DOI: 10.1371/journal.pone.0094519] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.2] [Received: 10/11/2012] [Accepted: 03/17/2014] [Indexed: 11/21/2022] Open
Abstract
Biological prediction of transcription factor binding sites and their corresponding transcription factor target genes (TFTGs) makes a great contribution to understanding the gene regulatory networks. However, these approaches are based on laborious and time-consuming biological experiments. Numerous computational approaches have shown great potential to circumvent laborious biological methods. However, the majority of these algorithms provide limited performance and fail to consider the structural property of the datasets. We proposed a refined systematic computational approach for predicting TFTGs. Based on previous work done on identifying auxin response factor target genes from Arabidopsis thaliana co-expression data, we adopted a novel reverse-complementary distance-sensitive n-gram profile algorithm. This algorithm converts each upstream sub-sequence into a high-dimensional vector data point and transforms the prediction task into a classification problem using a support vector machine-based classifier. Our approach showed significant improvement compared to other computational methods based on the area under curve value of the receiver operating characteristic curve using 10-fold cross validation. In addition, in the light of the highly skewed structure of the dataset, we also evaluated other metrics and their associated curves, such as precision-recall curves and cost curves, which provided highly satisfactory results.
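As a rough illustration of turning a DNA sub-sequence into a feature vector, the sketch below counts n-grams while treating each n-gram and its reverse complement as the same feature. This is a deliberate simplification: the distance-sensitive part of the algorithm named in the abstract is not reproduced, and the function names are hypothetical.

```python
from collections import Counter

# Translation table for complementing the four DNA bases.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def canonical(ngram):
    """Map an n-gram and its reverse complement to one shared key."""
    return min(ngram, reverse_complement(ngram))


def ngram_profile(seq, n=3):
    """Count canonical n-grams over all overlapping positions in seq."""
    return Counter(canonical(seq[i:i + n]) for i in range(len(seq) - n + 1))


profile = ngram_profile("ACGT", n=2)
```

Because both strands map to the same keys, a sequence and its reverse complement produce identical profiles, which is the property a strand-independent classifier input needs.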
|
1907
|
Window size impact in human activity recognition. SENSORS 2014; 14:6474-99. [PMID: 24721766 PMCID: PMC4029702 DOI: 10.3390/s140406474] [Citation(s) in RCA: 171] [Impact Index Per Article: 15.5] [Received: 12/10/2013] [Revised: 03/19/2014] [Accepted: 03/26/2014] [Indexed: 11/17/2022]
Abstract
Signal segmentation is a crucial stage in the activity recognition process; however, this has been rarely and vaguely characterized so far. Windowing approaches are normally used for segmentation, but no clear consensus exists on which window size should be preferably employed. In fact, most designs normally rely on figures used in previous works, but with no strict studies that support them. Intuitively, decreasing the window size allows for a faster activity detection, as well as reduced resources and energy needs. On the contrary, large data windows are normally considered for the recognition of complex activities. In this work, we present an extensive study to fairly characterize the windowing procedure, to determine its impact within the activity recognition process and to help clarify some of the habitual assumptions made during the recognition system design. To that end, some of the most widely used activity recognition procedures are evaluated for a wide range of window sizes and activities. From the evaluation, the interval 1-2 s proves to provide the best trade-off between recognition speed and accuracy. The study, specifically intended for on-body activity recognition systems, further provides designers with a set of guidelines devised to facilitate the system definition and configuration according to the particular application requirements and target activities.
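The windowing step discussed in this abstract can be sketched as follows. The function name `segment`, the `overlap` parameter, and the choice to drop a trailing partial window are assumptions made for illustration, not details from the study.

```python
def segment(signal, sampling_rate_hz, window_s, overlap=0.0):
    """Split a 1-D signal into fixed-length windows.

    window_s is the window length in seconds; overlap is the fraction
    (0 <= overlap < 1) shared by consecutive windows. A trailing chunk
    shorter than one window is dropped.
    """
    size = int(round(window_s * sampling_rate_hz))
    if size <= 0:
        raise ValueError("window must contain at least one sample")
    if not 0.0 <= overlap < 1.0:
        raise ValueError("overlap must be in [0, 1)")
    step = max(1, int(round(size * (1.0 - overlap))))
    return [signal[i:i + size] for i in range(0, len(signal) - size + 1, step)]


samples = list(range(10))  # toy signal sampled at 2 Hz
windows = segment(samples, sampling_rate_hz=2, window_s=2)
half_overlapping = segment(samples, sampling_rate_hz=2, window_s=2, overlap=0.5)
```

Shorter windows yield more, earlier decisions per second of signal, while longer windows give each decision more context, which is the trade-off the abstract describes.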
|
1908
|
Huang A, Martin ER, Vance JM, Cai X. Detecting genetic interactions in pathway-based genome-wide association studies. Genet Epidemiol 2014; 38:300-9. [PMID: 24719383 DOI: 10.1002/gepi.21803] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.3] [Received: 10/16/2013] [Revised: 01/06/2014] [Accepted: 02/28/2014] [Indexed: 12/13/2022]
Abstract
Pathway-based genome-wide association studies (GWAS) can exploit collective effects of causal variants in a pathway to increase power of detection. However, current methods for pathway-based GWAS do not consider epistatic effects of genetic variants, although interactions between genetic variants may play an important role in influencing complex traits. In this paper, we employed a Bayesian Lasso logistic regression model for pathway-based GWAS to include all possible main effects and a large number of pairwise interactions of single nucleotide polymorphisms (SNPs) in a pathway, and then inferred the model with an efficient group empirical Bayesian Lasso (EBLasso) method. Using the inferred model, the statistical significance of a pathway was tested with the Wald statistics. Reliable effects in a significant pathway were selected using the stability selection technique. Extensive computer simulations demonstrated that our group EBLasso method significantly outperformed two competitive methods in most simulation setups and offered similar performance in other simulation setups. When applied to a GWAS dataset for Parkinson disease, EBLasso identified three significant pathways including the primary bile acid biosynthesis pathway, the neuroactive ligand-receptor interaction, and the MAPK signaling pathway. All effects identified in the primary bile acid biosynthesis pathway and many of the effects in the other two pathways were epistatic effects. The group EBLasso method provides a valuable tool for pathway-based GWAS to identify main and epistatic effects of genetic variants.
Affiliation(s)
- Anhui Huang
  - Department of Electrical and Computer Engineering, University of Miami, Coral Gables, Florida, United States of America
|
1909
|
Kumar NS, Rao KN, Govardhan A, Reddy KS, Mahmood AM. Undersampled K-means approach for handling imbalanced distributed data. PROGRESS IN ARTIFICIAL INTELLIGENCE 2014. [DOI: 10.1007/s13748-014-0045-6] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.9] [Indexed: 10/25/2022]
|
1910
|
Zhang Y, Fu P, Liu W, Chen G. Imbalanced data classification based on scaling kernel-based support vector machine. Neural Comput Appl 2014. [DOI: 10.1007/s00521-014-1584-2] [Citation(s) in RCA: 46] [Impact Index Per Article: 4.2] [Indexed: 11/28/2022]
|
1911
|
Cao P, Yang J, Li W, Zhao D, Zaiane O. Ensemble-based hybrid probabilistic sampling for imbalanced data learning in lung nodule CAD. Comput Med Imaging Graph 2014; 38:137-50. [DOI: 10.1016/j.compmedimag.2013.12.003] [Citation(s) in RCA: 27] [Impact Index Per Article: 2.5] [Received: 05/13/2013] [Revised: 10/19/2013] [Accepted: 12/02/2013] [Indexed: 01/15/2023]
|
1912
|
Galar M, Fernández A, Barrenechea E, Herrera F. Empowering difficult classes with a similarity-based aggregation in multi-class classification problems. Inf Sci (N Y) 2014. [DOI: 10.1016/j.ins.2013.12.053] [Citation(s) in RCA: 29] [Impact Index Per Article: 2.6] [Indexed: 11/17/2022]
|
1913
|
Gonzalez-Abril L, Nuñez H, Angulo C, Velasco F. GSVM: An SVM for handling imbalanced accuracy between classes in bi-classification problems. Appl Soft Comput 2014. [DOI: 10.1016/j.asoc.2013.12.013] [Citation(s) in RCA: 32] [Impact Index Per Article: 2.9] [Indexed: 11/25/2022]
|
1914
|
Frick A, Tanneberger F, Bellebaum J. Model-based selection of areas for the restoration of Acrocephalus paludicola habitats in NE Germany. ENVIRONMENTAL MANAGEMENT 2014; 53:728-738. [PMID: 24446053 DOI: 10.1007/s00267-014-0234-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/07/2013] [Accepted: 01/08/2014] [Indexed: 06/03/2023]
Abstract
The global Aquatic Warbler (Acrocephalus paludicola, Vieillot, 1817) population has suffered a major decline due to the large-scale destruction of its natural habitat (fen mires). The species is at risk of extinction, especially in NE Germany/NW Poland. In this study, we developed habitat suitability models based on satellite and environmental data to identify potential areas for habitat restoration on which further surveys and planning should be focused. To create a reliable model, we used all Aquatic Warbler presences in the study area since 1990 as well as additional potentially suitable habitats identified in the field. We combined the presence/absence regression tree algorithm Cubist with the presence-only algorithm Maxent since both commonly outperform other algorithms. To integrate the separate model results, we present a new way to create a metamodel using the initial model results as variables. Additionally, a histogram approach was applied to further reduce the final search area to the most promising sites. Accuracy increased when using both remote sensing and environmental data. It was highest for the integrated metamodel (Cohen's Kappa of 0.4, P < 0.001). The final result of this study supports the selection of the most promising sites for Aquatic Warbler habitat restoration.
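The accuracy figure quoted in this abstract is Cohen's Kappa, which corrects observed agreement for the agreement expected by chance. A minimal sketch of the standard formula follows; the function name and the toy presence/absence labels are illustrative assumptions, not data from the study.

```python
from collections import Counter


def cohens_kappa(y_true, y_pred):
    """Cohen's kappa: (p_o - p_e) / (1 - p_e), where p_o is the observed
    agreement and p_e the agreement expected from the label marginals."""
    n = len(y_true)
    if n == 0 or n != len(y_pred):
        raise ValueError("inputs must be non-empty and of equal length")
    observed = sum(t == p for t, p in zip(y_true, y_pred)) / n
    true_counts, pred_counts = Counter(y_true), Counter(y_pred)
    expected = sum(true_counts[c] * pred_counts[c] for c in true_counts) / (n * n)
    if expected == 1.0:  # degenerate case: only one label on both sides
        return float("nan")
    return (observed - expected) / (1.0 - expected)


# Toy presence (1) / absence (0) labels.
kappa = cohens_kappa([1, 1, 0, 0], [1, 0, 0, 0])
```

A kappa of 0 means agreement no better than chance and 1 means perfect agreement, so a value around 0.4 indicates moderate agreement beyond chance.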
Affiliation(s)
- Annett Frick
  - LUP, Große Weinmeisterstraße 3a, 14469, Potsdam, Germany
|
1915
|
Bendell CJ, Liu S, Aumentado-Armstrong T, Istrate B, Cernek PT, Khan S, Picioreanu S, Zhao M, Murgita RA. Transient protein-protein interface prediction: datasets, features, algorithms, and the RAD-T predictor. BMC Bioinformatics 2014; 15:82. [PMID: 24661439 PMCID: PMC4021185 DOI: 10.1186/1471-2105-15-82] [Citation(s) in RCA: 37] [Impact Index Per Article: 3.4] [Received: 09/30/2013] [Accepted: 02/14/2014] [Indexed: 11/14/2022] Open
Abstract
Background: Transient protein-protein interactions (PPIs), which underlie most biological processes, are a prime target for therapeutic development. Immense progress has been made towards computational prediction of PPIs using methods such as protein docking and sequence analysis. However, docking generally requires high resolution structures of both of the binding partners and sequence analysis requires that a significant number of recurrent patterns exist for the identification of a potential binding site. Researchers have turned to machine learning to overcome some of the other methods’ restrictions by generalising interface sites with sets of descriptive features. Best practices for dataset generation, features, and learning algorithms have not yet been identified or agreed upon, and an analysis of the overall efficacy of machine learning based PPI predictors is due, in order to highlight potential areas for improvement. Results: The presence of unknown interaction sites as a result of limited knowledge about protein interactions in the testing set dramatically reduces prediction accuracy. Greater accuracy in labelling the data by enforcing higher interface site rates per domain resulted in an average 44% improvement across multiple machine learning algorithms. A set of 10 biologically unrelated proteins that were consistently predicted on with high accuracy emerged through our analysis. We identify seven features with the most predictive power over multiple datasets and machine learning algorithms. Through our analysis, we created a new predictor, RAD-T, that outperforms existing non-structurally specializing machine learning protein interface predictors, with an average 59% increase in MCC score on a dataset with a high number of interactions. Conclusion: Current methods of evaluating machine-learning based PPI predictors tend to undervalue their performance, which may be artificially decreased by the presence of unidentified interaction sites. Changes to predictors’ training sets will be integral to the future progress of interface prediction by machine learning methods. We reveal the need for a larger test set of well studied proteins or domain-specific scoring algorithms to compensate for poor interaction site identification on proteins in general.
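The MCC score mentioned here is the Matthews correlation coefficient, computed from the four cells of a binary confusion matrix. The sketch below uses the standard formula; the confusion-matrix counts are invented purely to illustrate how unannotated true interface sites, scored as false positives, can depress the measured value.

```python
import math


def mcc(tp, fp, tn, fn):
    """Matthews correlation coefficient for a binary confusion matrix.
    Returns 0.0 when any marginal is empty (the usual convention)."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom


# Hypothetical counts: 6 predicted sites are scored as false positives.
as_scored = mcc(tp=8, fp=6, tn=80, fn=6)
# Same predictions if 4 of those 6 were in fact real but unannotated sites.
relabelled = mcc(tp=12, fp=2, tn=80, fn=6)
```

With these made-up counts the relabelled score is higher than the score as originally computed, which mirrors the abstract's point that incomplete annotation can make a predictor look worse than it is.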
Affiliation(s)
- Robert A Murgita
  - Department of Microbiology and Immunology, McGill, Montreal, CA.
|
1916
|
Xu J, Yang G, Yin Y, Man H, He H. Sparse-Representation-Based Classification with Structure-Preserving Dimension Reduction. Cognit Comput 2014. [DOI: 10.1007/s12559-014-9252-5] [Citation(s) in RCA: 29] [Impact Index Per Article: 2.6] [Indexed: 11/29/2022]
|
1917
|
Speaker state classification based on fusion of asymmetric simple partial least squares (SIMPLS) and support vector machines. COMPUT SPEECH LANG 2014. [DOI: 10.1016/j.csl.2013.06.002] [Citation(s) in RCA: 20] [Impact Index Per Article: 1.8] [Indexed: 11/21/2022]
|
1918
|
Yang P, Yoo PD, Fernando J, Zhou BB, Zhang Z, Zomaya AY. Sample Subset Optimization Techniques for Imbalanced and Ensemble Learning Problems in Bioinformatics Applications. IEEE TRANSACTIONS ON CYBERNETICS 2014; 44:445-55. [PMID: 24108722 DOI: 10.1109/tcyb.2013.2257480] [Citation(s) in RCA: 35] [Impact Index Per Article: 3.2] [Indexed: 05/20/2023]
Abstract
Data sampling is a widely used technique in a broad range of machine learning problems. Traditional sampling approaches generally rely on random resampling from a given dataset. However, these approaches do not take into consideration additional information, such as sample quality and usefulness. We recently proposed a data sampling technique, called sample subset optimization (SSO). The SSO technique relies on a cross-validation procedure for identifying and selecting the most useful samples as subsets. In this paper, we describe the application of SSO techniques to imbalanced and ensemble learning problems, respectively. For imbalanced learning, the SSO technique is employed as an under-sampling technique for identifying a subset of highly discriminative samples in the majority class. In ensemble learning, the SSO technique is utilized as a generic ensemble technique where multiple optimized subsets of samples from each class are selected for building an ensemble classifier. We demonstrate the utilities and advantages of the proposed techniques on a variety of bioinformatics applications where class imbalance, small sample size, and noisy data are prevalent.
|
1919
|
|
1920
|
García V, Mollineda RA, Sánchez JS. A bias correction function for classification performance assessment in two-class imbalanced problems. Knowl Based Syst 2014. [DOI: 10.1016/j.knosys.2014.01.021] [Citation(s) in RCA: 19] [Impact Index Per Article: 1.7] [Indexed: 11/28/2022]
|
1921
|
Predicting pupylation sites in prokaryotic proteins using pseudo-amino acid composition and extreme learning machine. Neurocomputing 2014. [DOI: 10.1016/j.neucom.2012.11.058] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.2] [Indexed: 11/20/2022]
|
1922
|
|
1923
|
Song T, Gu H. Discriminative motif discovery via simulated evolution and random under-sampling. PLoS One 2014; 9:e87670. [PMID: 24551063 PMCID: PMC3923751 DOI: 10.1371/journal.pone.0087670] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Received: 07/07/2013] [Accepted: 12/29/2013] [Indexed: 11/22/2022] Open
Abstract
Conserved motifs in biological sequences are closely related to their structure and functions. Recently, discriminative motif discovery methods have attracted more and more attention. However, little attention has been devoted to the data imbalance problem, which is one of the main reasons affecting the performance of the discriminative models. In this article, a simulated evolution method is applied to solve the multi-class imbalance problem at the stage of data preprocessing, and at the stage of Hidden Markov Models (HMMs) training, a random under-sampling method is introduced for the imbalance between the positive and negative datasets. It is shown that, in the task of discovering targeting motifs of nine subcellular compartments, the motifs found by our method are more conserved than those found by methods that do not consider the data imbalance problem and recover the most known targeting motifs from Minimotif Miner and InterPro. Meanwhile, we use the found motifs to predict protein subcellular localization and achieve higher prediction precision and recall for the minority classes.
Affiliation(s)
- Tao Song
  - Faculty of Electronic Information and Electrical Engineering, Dalian University of Technology, Dalian, Liaoning, China
- Hong Gu
  - Faculty of Electronic Information and Electrical Engineering, Dalian University of Technology, Dalian, Liaoning, China
|
1924
|
Dubey R, Zhou J, Wang Y, Thompson PM, Ye J. Analysis of sampling techniques for imbalanced data: An n = 648 ADNI study. Neuroimage 2014; 87:220-41. [PMID: 24176869 PMCID: PMC3946903 DOI: 10.1016/j.neuroimage.2013.10.005] [Citation(s) in RCA: 84] [Impact Index Per Article: 7.6] [Received: 05/17/2013] [Revised: 09/10/2013] [Accepted: 10/07/2013] [Indexed: 02/07/2023] Open
Abstract
Many neuroimaging applications deal with imbalanced imaging data. For example, in Alzheimer's Disease Neuroimaging Initiative (ADNI) dataset, the mild cognitive impairment (MCI) cases eligible for the study are nearly two times the Alzheimer's disease (AD) patients for structural magnetic resonance imaging (MRI) modality and six times the control cases for proteomics modality. Constructing an accurate classifier from imbalanced data is a challenging task. Traditional classifiers that aim to maximize the overall prediction accuracy tend to classify all data into the majority class. In this paper, we study an ensemble system of feature selection and data sampling for the class imbalance problem. We systematically analyze various sampling techniques by examining the efficacy of different rates and types of undersampling, oversampling, and a combination of over and undersampling approaches. We thoroughly examine six widely used feature selection algorithms to identify significant biomarkers and thereby reduce the complexity of the data. The efficacy of the ensemble techniques is evaluated using two different classifiers including Random Forest and Support Vector Machines based on classification accuracy, area under the receiver operating characteristic curve (AUC), sensitivity, and specificity measures. Our extensive experimental results show that for various problem settings in ADNI, (1) a balanced training set obtained with K-Medoids technique based undersampling gives the best overall performance among different data sampling techniques and no sampling approach; and (2) sparse logistic regression with stability selection achieves competitive performance among various feature selection algorithms. Comprehensive experiments with various settings show that our proposed ensemble model of multiple undersampled datasets yields stable and promising results.
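The idea of an ensemble built from multiple undersampled datasets can be sketched as below. Note the substitution: this uses plain random undersampling rather than the K-Medoids based selection the abstract reports as best, and the function name, parameters, and toy labels are all hypothetical.

```python
import random


def balanced_subsets(labels, n_subsets=5, seed=0):
    """Draw n_subsets index sets in which every class is randomly
    undersampled to the size of the smallest class. One classifier
    would be trained per subset and their outputs combined."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    minority_size = min(len(idxs) for idxs in by_class.values())
    subsets = []
    for _ in range(n_subsets):
        chosen = []
        for idxs in by_class.values():
            chosen.extend(rng.sample(idxs, minority_size))
        subsets.append(sorted(chosen))
    return subsets


# Toy labels with a 2:1 imbalance, loosely echoing the MCI vs AD ratio.
labels = ["MCI"] * 6 + ["AD"] * 3
subsets = balanced_subsets(labels, n_subsets=4, seed=1)
```

Drawing several subsets lets the ensemble see most majority-class samples across its members while each member still trains on balanced data.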
Affiliation(s)
- Rashmi Dubey
  - School of Computing, Informatics, and Decision Systems Engineering, Arizona State University, Tempe, AZ, USA; Center for Evolutionary Medicine and Informatics, The Biodesign Institute, Arizona State University, Tempe, AZ, USA
- Jiayu Zhou
  - School of Computing, Informatics, and Decision Systems Engineering, Arizona State University, Tempe, AZ, USA; Center for Evolutionary Medicine and Informatics, The Biodesign Institute, Arizona State University, Tempe, AZ, USA
- Yalin Wang
  - School of Computing, Informatics, and Decision Systems Engineering, Arizona State University, Tempe, AZ, USA
- Paul M Thompson
  - Imaging Genetics Center, Laboratory of Neuro Imaging, UCLA School of Medicine, Los Angeles, CA, USA
- Jieping Ye
  - School of Computing, Informatics, and Decision Systems Engineering, Arizona State University, Tempe, AZ, USA; Center for Evolutionary Medicine and Informatics, The Biodesign Institute, Arizona State University, Tempe, AZ, USA.
|
1925
|
Donner-Banzhoff N, Haasenritter J, Hüllermeier E, Viniol A, Bösner S, Becker A. Response to van den Bruel and Perera: the comprehensive diagnostic study: a new solution to old problems? J Clin Epidemiol 2014; 67:135-6. [DOI: 10.1016/j.jclinepi.2013.09.005] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/10/2013] [Accepted: 09/13/2013] [Indexed: 10/26/2022]
|
1926
|
López V, Fernández A, Herrera F. On the importance of the validation technique for classification with imbalanced datasets: Addressing covariate shift when data is skewed. Inf Sci (N Y) 2014. [DOI: 10.1016/j.ins.2013.09.038] [Citation(s) in RCA: 66] [Impact Index Per Article: 6.0] [Indexed: 10/26/2022]
|
1927
|
Hu BG. What are the differences between Bayesian classifiers and mutual-information classifiers? IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS 2014; 25:249-264. [PMID: 24807026 DOI: 10.1109/tnnls.2013.2274799] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.5] [Indexed: 06/03/2023]
Abstract
In this paper, both Bayesian and mutual-information classifiers are examined for binary classifications with or without a reject option. The general decision rules are derived for Bayesian classifiers with distinctions on error types and reject types. A formal analysis is conducted to reveal the parameter redundancy of cost terms when abstaining classifications are enforced. The redundancy implies an intrinsic problem of nonconsistency for interpreting cost terms. If no data are given to the cost terms, we demonstrate the weakness of Bayesian classifiers in class-imbalanced classifications. On the contrary, mutual-information classifiers are able to provide an objective solution from the given data, which shows a reasonable balance among error types and reject types. Numerical examples of using two types of classifiers are given for confirming the differences, including the extremely class-imbalanced cases. Finally, we briefly summarize the Bayesian and mutual-information classifiers in terms of their application advantages and disadvantages, respectively.
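For readers unfamiliar with a reject option, the sketch below shows a textbook threshold rule for binary classification with abstention (often attributed to Chow). It is not the set of decision rules derived in the paper, and the function name and the threshold value are assumptions chosen for illustration.

```python
def decide(p_positive, reject_threshold=0.2):
    """Predict the more probable class, but abstain when the probability
    of error for that choice exceeds reject_threshold."""
    if not 0.0 <= p_positive <= 1.0:
        raise ValueError("p_positive must be a probability")
    confidence = max(p_positive, 1.0 - p_positive)
    if 1.0 - confidence > reject_threshold:
        return "reject"
    return "positive" if p_positive >= 0.5 else "negative"


decisions = [decide(p) for p in (0.9, 0.6, 0.05)]
```

In a cost-based formulation the threshold is set by the relative costs of errors and rejections, which is where the cost terms discussed in the abstract enter.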
|
1928
|
López V, Triguero I, Carmona CJ, García S, Herrera F. Addressing imbalanced classification with instance generation techniques: IPADE-ID. Neurocomputing 2014. [DOI: 10.1016/j.neucom.2013.01.050] [Citation(s) in RCA: 40] [Impact Index Per Article: 3.6] [Indexed: 11/28/2022]
|
1929
|
Lee BJ, Kim JY. A comparison of the predictive power of anthropometric indices for hypertension and hypotension risk. PLoS One 2014; 9:e84897. [PMID: 24465449 PMCID: PMC3900406 DOI: 10.1371/journal.pone.0084897] [Citation(s) in RCA: 31] [Impact Index Per Article: 2.8] [Received: 08/16/2013] [Accepted: 11/19/2013] [Indexed: 11/30/2022] Open
Abstract
BACKGROUND AND AIMS: It is commonly accepted that body fat distribution is associated with hypertension, but the strongest anthropometric indicator of the risk of hypertension is still controversial. Furthermore, no studies on the association of hypotension with anthropometric indices have been reported. The objectives of the present study were to determine the best predictors of hypertension and hypotension among various anthropometric indices and to assess the use of combined indices as a method of improving the predictive power in adult Korean women and men. METHODS: For 12789 subjects 21-85 years of age, we assessed 41 anthropometric indices using statistical analyses and data mining techniques to determine their ability to discriminate between hypertension and normotension as well as between hypotension and normotension. We evaluated the predictive power of combined indices using two machine learning algorithms and two variable subset selection techniques. RESULTS: The best indicator for predicting hypertension was rib circumference in both women (p < 0.0001; OR = 1.813; AUC = 0.669) and men (p < 0.0001; OR = 1.601; AUC = 0.627); for hypotension, the strongest predictor was chest circumference in women (p < 0.0001; OR = 0.541; AUC = 0.657) and neck circumference in men (p < 0.0001; OR = 0.522; AUC = 0.672). In experiments using combined indices, the areas under the receiver operating characteristic curves (AUC) for the prediction of hypertension risk in women and men were 0.721 and 0.652, respectively, according to the logistic regression with wrapper-based variable selection; for hypotension, the corresponding values were 0.675 in women and 0.737 in men, according to the naïve Bayes with wrapper-based variable selection. CONCLUSIONS: The best indicators of the risk of hypertension and the risk of hypotension may differ. The use of combined indices seems to slightly improve the predictive power for both hypertension and hypotension.
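The AUC values reported in this abstract summarise how well a single index ranks cases above non-cases. A minimal sketch of the rank-based (Mann-Whitney) definition follows; the function name and the toy measurements are illustrative assumptions, not data from the study.

```python
def auc(pos_scores, neg_scores):
    """Area under the ROC curve as the probability that a randomly chosen
    positive scores higher than a randomly chosen negative (ties count 0.5)."""
    if not pos_scores or not neg_scores:
        raise ValueError("both groups must be non-empty")
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


# Toy circumference values (cm) for cases and non-cases.
value = auc(pos_scores=[92, 88, 95], neg_scores=[85, 90, 80])
```

An AUC of 0.5 means the index ranks no better than chance, so values in the 0.63 to 0.74 range indicate modest discrimination.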
Affiliation(s)
- Bum Ju Lee
- Medical Research Division, Korea Institute of Oriental Medicine, Yuseong-gu, Daejeon, Republic of Korea
| | - Jong Yeol Kim
- Medical Research Division, Korea Institute of Oriental Medicine, Yuseong-gu, Daejeon, Republic of Korea
|
1930
|
Fukasawa Y, Leung RKK, Tsui SKW, Horton P. Plus ça change - evolutionary sequence divergence predicts protein subcellular localization signals. BMC Genomics 2014; 15:46. [PMID: 24438075 PMCID: PMC3906766 DOI: 10.1186/1471-2164-15-46] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.9] [Received: 07/31/2013] [Accepted: 01/06/2014] [Indexed: 12/29/2022] Open
Abstract
BACKGROUND Protein subcellular localization is a central problem in understanding cell biology and has been the focus of intense research. In order to predict localization from amino acid sequence a myriad of features have been tried: including amino acid composition, sequence similarity, the presence of certain motifs or domains, and many others. Surprisingly, sequence conservation of sorting motifs has not yet been employed, despite its extensive use for tasks such as the prediction of transcription factor binding sites. RESULTS Here, we flip the problem around, and present a proof of concept for the idea that the lack of sequence conservation can be a novel feature for localization prediction. We show that for yeast, mammal and plant datasets, evolutionary sequence divergence alone has significant power to identify sequences with N-terminal sorting sequences. Moreover sequence divergence is nearly as effective when computed on automatically defined ortholog sets as on hand curated ones. Unfortunately, sequence divergence did not necessarily increase classification performance when combined with some traditional sequence features such as amino acid composition. However a post-hoc analysis of the proteins in which sequence divergence changes the prediction yielded some proteins with atypical (i.e. not MPP-cleaved) matrix targeting signals as well as a few misannotations. CONCLUSION We report the results of the first quantitative study of the effectiveness of evolutionary sequence divergence as a feature for protein subcellular localization prediction. We show that divergence is indeed useful for prediction, but it is not trivial to improve overall accuracy simply by adding this feature to classical sequence features. Nevertheless we argue that sequence divergence is a promising feature and show anecdotal examples in which it succeeds where other features fail.
Affiliation(s)
- Yoshinori Fukasawa
- Department of Computational Biology, Graduate School of Frontier Sciences, University of Tokyo, Kashiwa, Japan
- Japan Society for the Promotion of Science, Tokyo Chiyoda, Japan
| | - Ross KK Leung
- Hong Kong Bioinformatics Centre and School of Biomedical Sciences, Chinese University of Hong Kong, Shatin, China
| | - Stephen KW Tsui
- Hong Kong Bioinformatics Centre and School of Biomedical Sciences, Chinese University of Hong Kong, Shatin, China
| | - Paul Horton
- Department of Computational Biology, Graduate School of Frontier Sciences, University of Tokyo, Kashiwa, Japan
- Computational Biology Research Center, Advanced Industrial Science and Technology, Tokyo, Japan
|
1931
|
Hao M, Wang Y, Bryant SH. An efficient algorithm coupled with synthetic minority over-sampling technique to classify imbalanced PubChem BioAssay data. Anal Chim Acta 2014; 806:117-27. [PMID: 24331047 PMCID: PMC3884825 DOI: 10.1016/j.aca.2013.10.050] [Citation(s) in RCA: 35] [Impact Index Per Article: 3.2] [Received: 08/31/2013] [Revised: 10/25/2013] [Accepted: 10/28/2013] [Indexed: 01/28/2023]
Abstract
High-throughput screening (HTS) often generates imbalanced datasets. When this imbalance is not taken into account, most classification methods tend to produce high predictive accuracy for the majority class but significantly poorer performance for the minority class. In this work, an efficient algorithm, GLMBoost, coupled with the Synthetic Minority Over-sampling TEchnique (SMOTE), is developed and used to overcome this problem for several imbalanced datasets from PubChem BioAssay. With the proposed combined method, the rare samples (active compounds), for which results are usually poor, can be detected with high balanced accuracy (Gmean). For comparison with GLMBoost, Random Forest (RF) combined with SMOTE is also applied to the same datasets. Our results show that the former (GLMBoost+SMOTE) not only exhibits higher performance as measured by the percentage of correct classification for the rare samples (Sensitivity) and Gmean, but also demonstrates greater computational efficiency than the latter (RF+SMOTE). Therefore, we hope that the proposed combined algorithm based on GLMBoost and SMOTE could be widely used to tackle the imbalanced classification problem.
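The two metrics named in this abstract, Sensitivity and Gmean, can be sketched in a few lines of Python. The abstract does not spell out its formula, so the sketch assumes the common definition of G-mean as the geometric mean of sensitivity and specificity; the function names and the toy labels are hypothetical.

```python
import math


def sensitivity_specificity(y_true, y_pred, positive=1):
    """Return (sensitivity, specificity) for binary labels."""
    tp = fn = tn = fp = 0
    for t, p in zip(y_true, y_pred):
        if t == positive:
            if p == positive:
                tp += 1
            else:
                fn += 1
        else:
            if p == positive:
                fp += 1
            else:
                tn += 1
    sens = tp / (tp + fn) if (tp + fn) else 0.0
    spec = tn / (tn + fp) if (tn + fp) else 0.0
    return sens, spec


def g_mean(y_true, y_pred, positive=1):
    """Geometric mean of sensitivity and specificity (assumed definition)."""
    sens, spec = sensitivity_specificity(y_true, y_pred, positive)
    return math.sqrt(sens * spec)


# Toy data: 2 actives (label 1) among 10 compounds.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
always_inactive = [0] * 10
print(g_mean(y_true, always_inactive))
```

A classifier that labels every compound inactive is 80% accurate on this toy set yet has a G-mean of zero, which is why such a metric is preferred over plain accuracy when actives are rare.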
Affiliation(s)
- Ming Hao
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA
| | - Yanli Wang
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA.
| | - Stephen H Bryant
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA.
|
1932
|
|
1933
|
Kocyigit Y, Seker H. Hybrid imbalanced data classifier models for computational discovery of antibiotic drug targets. ANNUAL INTERNATIONAL CONFERENCE OF THE IEEE ENGINEERING IN MEDICINE AND BIOLOGY SOCIETY. IEEE ENGINEERING IN MEDICINE AND BIOLOGY SOCIETY. ANNUAL INTERNATIONAL CONFERENCE 2014; 2014:812-815. [PMID: 25570083 DOI: 10.1109/embc.2014.6943715] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/04/2023]
Abstract
Identification of drug candidates is an important but difficult process. Given the drug-resistant bacteria that we face, it has become more important to identify protein candidates that demonstrate antibacterial activity. The aim of this study is therefore to develop a bioinformatics approach that is more capable of identifying a small but effective set of proteins that are expected to show antibacterial activity, subsequently to be used as antibiotic drug targets. As this is regarded as an imbalanced data classification problem due to the smaller number of antibiotic drugs available, a hybrid classification model was developed and applied to the identification of antibiotic drugs. The model was developed by taking into account various statistical models, leading to the development of six different hybrid models. The best model reached an accuracy as high as 50%, compared with less than 1% in an earlier study, as far as the proportion of actual antibiotics among the identified candidates is concerned.
|
1934
|
Boosted SVM for extracting rules from imbalanced data in application to prediction of the post-operative life expectancy in the lung cancer patients. Appl Soft Comput 2014. [DOI: 10.1016/j.asoc.2013.07.016] [Citation(s) in RCA: 89] [Impact Index Per Article: 8.1] [Indexed: 11/19/2022]
|
1935
|
Kundu K, Costa F, Backofen R. A graph kernel approach for alignment-free domain-peptide interaction prediction with an application to human SH3 domains. Bioinformatics 2013; 29:i335-43. [PMID: 23813002 PMCID: PMC3694653 DOI: 10.1093/bioinformatics/btt220] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.1] [Indexed: 01/14/2023] Open
Abstract
MOTIVATION State-of-the-art experimental data for determining binding specificities of peptide recognition modules (PRMs) is obtained by high-throughput approaches like peptide arrays. Most prediction tools applicable to this kind of data are based on an initial multiple alignment of the peptide ligands. Building an initial alignment can be error-prone, especially in the case of the proline-rich peptides bound by the SH3 domains. RESULTS Here, we present a machine-learning approach based on an efficient graph-kernel technique to predict the specificity of a large set of 70 human SH3 domains, which are an important class of PRMs. The graph-kernel strategy allows us to (i) integrate several types of physico-chemical information for each amino acid, (ii) consider high-order correlations between these features and (iii) eliminate the need for an initial peptide alignment. We build specialized models for each human SH3 domain and achieve competitive predictive performance of 0.73 area under precision-recall curve, compared with 0.27 area under precision-recall curve for state-of-the-art methods based on position weight matrices. We show that better models can be obtained when we use information on the noninteracting peptides (negative examples), which is currently not used by the state-of-the art approaches based on position weight matrices. To this end, we analyze two strategies to identify subsets of high confidence negative data. The techniques introduced here are more general and hence can also be used for any other protein domains, which interact with short peptides (i.e. other PRMs). AVAILABILITY The program with the predictive models can be found at http://www.bioinf.uni-freiburg.de/Software/SH3PepInt/SH3PepInt.tar.gz. We also provide a genome-wide prediction for all 70 human SH3 domains, which can be found under http://www.bioinf.uni-freiburg.de/Software/SH3PepInt/Genome-Wide-Predictions.tar.gz. 
SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Kousik Kundu
- Bioinformatics Group, Department of Computer Science, Georges-Köhler-Allee 106, 79110 Freiburg, Germany
|
1936
|
Rios A, Kavuluru R. Supervised Extraction of Diagnosis Codes from EMRs: Role of Feature Selection, Data Selection, and Probabilistic Thresholding. IEEE INTERNATIONAL CONFERENCE ON HEALTHCARE INFORMATICS. IEEE INTERNATIONAL CONFERENCE ON HEALTHCARE INFORMATICS 2013; 2013:66-73. [PMID: 28748228 DOI: 10.1109/ichi.2013.15] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.9] [Indexed: 11/05/2022]
Abstract
Extracting diagnosis codes from medical records is a complex task carried out by trained coders by reading all the documents associated with a patient's visit. With the popularity of electronic medical records (EMRs), computational approaches to code extraction have been proposed in recent years. Machine learning approaches to multi-label text classification provide an important methodology in this task given each EMR can be associated with multiple codes. In this paper, we study the role of feature selection, training data selection, and probabilistic threshold optimization in improving different multi-label classification approaches. We conduct experiments based on two different datasets: a recent gold standard dataset used for this task and a second larger and more complex EMR dataset we curated from the University of Kentucky Medical Center. While conventional approaches achieve results comparable to the state-of-the-art on the gold standard dataset, on our complex in-house dataset, we show that feature selection, training data selection, and probabilistic thresholding provide significant gains in performance.
Affiliation(s)
- Anthony Rios
- Department of Computer Science, University of Kentucky, Lexington, KY
| | - Ramakanth Kavuluru
- Division of Biomedical Informatics, Dept. of Biostatistics, Department of Computer Science, University of Kentucky, Lexington, KY
|
1937
|
Zhang J, Cao P, Gross DP, Zaiane OR. On the application of multi-class classification in physical therapy recommendation. Health Inf Sci Syst 2013. [DOI: 10.1186/2047-2501-1-15] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Indexed: 11/10/2022] Open
Abstract
Recommending the optimal rehabilitation intervention for injured workers that would lead to successful return-to-work (RTW) is a challenge for clinicians. Currently, clinicians are unable to identify with complete confidence which intervention is best for a patient, and the referral is often made in a trial-and-error fashion. Only 58% of recommendations are successful in our dataset. We aim to develop an interpretable decision support system using machine learning to assist the clinicians. We proposed an alternate ripper (ARIPPER) combined with a hybrid re-sampling technique, and a balanced weighted random forests (BWRF) ensemble method respectively, in order to tackle the multi-class imbalance, class overlap and noise problems in real-world application data. The final models have shown promising potential in classification compared to the human baseline and have been integrated into a web-based decision-support tool that requires additional validation in a clinical sample.
|
1938
|
Sun Q, Muckatira S, Yuan L, Ji S, Newfeld S, Kumar S, Ye J. Image-level and group-level models for Drosophila gene expression pattern annotation. BMC Bioinformatics 2013; 14:350. [PMID: 24299119 PMCID: PMC3924186 DOI: 10.1186/1471-2105-14-350] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.8] [Received: 03/16/2013] [Accepted: 11/06/2013] [Indexed: 12/27/2022] Open
Abstract
Background Drosophila melanogaster has been established as a model organism for investigating the developmental gene interactions. The spatio-temporal gene expression patterns of Drosophila melanogaster can be visualized by in situ hybridization and documented as digital images. Automated and efficient tools for analyzing these expression images will provide biological insights into the gene functions, interactions, and networks. To facilitate pattern recognition and comparison, many web-based resources have been created to conduct comparative analysis based on the body part keywords and the associated images. With the fast accumulation of images from high-throughput techniques, manual inspection of images will impose a serious impediment on the pace of biological discovery. It is thus imperative to design an automated system for efficient image annotation and comparison. Results We present a computational framework to perform anatomical keywords annotation for Drosophila gene expression images. The spatial sparse coding approach is used to represent local patches of images in comparison with the well-known bag-of-words (BoW) method. Three pooling functions including max pooling, average pooling and Sqrt (square root of mean squared statistics) pooling are employed to transform the sparse codes to image features. Based on the constructed features, we develop both an image-level scheme and a group-level scheme to tackle the key challenges in annotating Drosophila gene expression pattern images automatically. To deal with the imbalanced data distribution inherent in image annotation tasks, the undersampling method is applied together with majority vote. Results on Drosophila embryonic expression pattern images verify the efficacy of our approach. Conclusion In our experiment, the three pooling functions perform comparably well in feature dimension reduction. The undersampling with majority vote is shown to be effective in tackling the problem of imbalanced data. Moreover, combining sparse coding and image-level scheme leads to consistent performance improvement in keywords annotation.
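The three pooling functions named in this abstract (max, average, and Sqrt, the square root of the mean of squares) can be sketched as follows. This is an illustration under stated assumptions: the function name `pool`, the plain-list representation of sparse codes, and the toy values are hypothetical, and the paper's actual implementation details (for example whether absolute values are taken before max pooling) are not given in the abstract.

```python
import math


def pool(codes, method="max"):
    """Pool equal-length sparse-code vectors (one per patch) into one feature vector."""
    if not codes:
        raise ValueError("need at least one code vector")
    columns = list(zip(*codes))  # one tuple per dictionary atom
    if method == "max":
        return [max(col) for col in columns]
    if method == "average":
        return [sum(col) / len(col) for col in columns]
    if method == "sqrt":
        # square root of the mean of squared coefficients
        return [math.sqrt(sum(v * v for v in col) / len(col)) for col in columns]
    raise ValueError(f"unknown pooling method: {method}")


# Two patches, each encoded over a two-atom dictionary (toy values).
patch_codes = [[0.0, 3.0], [4.0, 0.0]]
for name in ("max", "average", "sqrt"):
    print(name, pool(patch_codes, name))
```

Each pooling function maps a variable number of patch codes to a vector whose length equals the dictionary size, which is what makes the result usable as a fixed-length image feature.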
Affiliation(s)
- Jieping Ye
- Center for Evolutionary Medicine and Informatics, The Biodesign Institute, Arizona State University, Tempe, AZ, 85287, USA.
|
1939
|
Cano A, Zafra A, Ventura S. Weighted data gravitation classification for standard and imbalanced data. IEEE TRANSACTIONS ON CYBERNETICS 2013; 43:1672-1687. [PMID: 23757568 DOI: 10.1109/tsmcb.2012.2227470] [Citation(s) in RCA: 31] [Impact Index Per Article: 2.6] [Indexed: 06/02/2023]
Abstract
Gravitation is a fundamental interaction whose concept and effects applied to data classification become a novel data classification technique. The simple principle of data gravitation classification (DGC) is to classify data samples by comparing the gravitation between different classes. However, the calculation of gravitation is not a trivial problem due to the different relevance of data attributes for distance computation, the presence of noisy or irrelevant attributes, and the class imbalance problem. This paper presents a gravitation-based classification algorithm which improves previous gravitation models and overcomes some of their issues. The proposed algorithm, called DGC+, employs a matrix of weights to describe the importance of each attribute in the classification of each class, which is used to weight the distance between data samples. It improves the classification performance by considering both global and local data information, especially in decision boundaries. The proposal is evaluated and compared to other well-known instance-based classification techniques, on 35 standard and 44 imbalanced data sets. The results obtained from these experiments show the great performance of the proposed gravitation model, and they are validated using several nonparametric statistical tests.
|
1940
|
Ghazikhani A, Monsefi R, Sadoghi Yazdi H. Ensemble of online neural networks for non-stationary and imbalanced data streams. Neurocomputing 2013. [DOI: 10.1016/j.neucom.2013.05.003] [Citation(s) in RCA: 47] [Impact Index Per Article: 3.9] [Indexed: 11/15/2022]
|
1941
|
|
1942
|
Monastyrskyy B, Kryshtafovych A, Moult J, Tramontano A, Fidelis K. Assessment of protein disorder region predictions in CASP10. Proteins 2013; 82 Suppl 2:127-37. [PMID: 23946100 DOI: 10.1002/prot.24391] [Citation(s) in RCA: 124] [Impact Index Per Article: 10.3] [Received: 05/23/2013] [Revised: 06/14/2013] [Accepted: 06/18/2013] [Indexed: 12/12/2022]
Abstract
The article presents the assessment of disorder region predictions submitted to CASP10. The evaluation is based on the three measures tested in previous CASPs: (i) balanced accuracy, (ii) the Matthews correlation coefficient for the binary predictions, and (iii) the area under the curve in the receiver operating characteristic (ROC) analysis of predictions using probability annotation. We also performed new analyses such as comparison of the submitted predictions with those obtained with a Naïve disorder prediction method and with predictions from the disorder prediction databases D2P2 and MobiDB. On average, the methods participating in CASP10 demonstrated slightly better performance than those in CASP9.
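Two of the three measures listed in this abstract, balanced accuracy and the Matthews correlation coefficient, follow directly from the binary confusion counts. The Python sketch below uses their standard textbook definitions; the function names and the example counts are hypothetical and are not CASP10 data.

```python
import math


def balanced_accuracy(tp, fn, tn, fp):
    """Mean of sensitivity and specificity."""
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("both classes must be present")
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return (sensitivity + specificity) / 2.0


def matthews_corrcoef(tp, fn, tn, fp):
    """MCC in [-1, 1]; returns 0.0 when any marginal count is zero."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom


# Toy counts: 10 disordered residues (6 found), 90 ordered residues (80 correct).
print(balanced_accuracy(tp=6, fn=4, tn=80, fp=10))
print(matthews_corrcoef(tp=6, fn=4, tn=80, fp=10))
```

Both measures weight the minority (disordered) class more fairly than raw accuracy does, which matters because disordered residues are typically a small fraction of a protein.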
|
1943
|
Zhang B, Yang C, Zhu H, Li Y, Gui W. Kinetic Modeling and Parameter Estimation for Competing Reactions in Copper Removal Process from Zinc Sulfate Solution. Ind Eng Chem Res 2013. [DOI: 10.1021/ie401619h] [Citation(s) in RCA: 31] [Impact Index Per Article: 2.6] [Indexed: 01/17/2023]
Affiliation(s)
- Bin Zhang
- School of Information Science and Engineering, Central South University, Changsha 410083, China
| | - Chunhua Yang
- School of Information Science and Engineering, Central South University, Changsha 410083, China
| | - Hongqiu Zhu
- School of Information Science and Engineering, Central South University, Changsha 410083, China
| | - Yonggang Li
- School of Information Science and Engineering, Central South University, Changsha 410083, China
| | - Weihua Gui
- School of Information Science and Engineering, Central South University, Changsha 410083, China
|
1944
|
Wang KJ, Makond B, Wang KM. An improved survivability prognosis of breast cancer by using sampling and feature selection technique to solve imbalanced patient classification data. BMC Med Inform Decis Mak 2013; 13:124. [PMID: 24207108 PMCID: PMC3829096 DOI: 10.1186/1472-6947-13-124] [Citation(s) in RCA: 27] [Impact Index Per Article: 2.3] [Received: 06/01/2013] [Accepted: 10/28/2013] [Indexed: 11/22/2022] Open
Abstract
Background Breast cancer is one of the most critical cancers and is a major cause of cancer death among women. It is essential to know the survivability of the patients in order to ease the decision making process regarding medical treatment and financial preparation. Recently, the breast cancer data sets have been imbalanced (i.e., the number of survival patients outnumbers the number of non-survival patients) whereas the standard classifiers are not applicable for the imbalanced data sets. The methods to improve survivability prognosis of breast cancer need further study. Methods Two well-known five-year prognosis models/classifiers [i.e., logistic regression (LR) and decision tree (DT)] are constructed by combining synthetic minority over-sampling technique (SMOTE), cost-sensitive classifier technique (CSC), under-sampling, bagging, and boosting. The feature selection method is used to select relevant variables, while the pruning technique is applied to obtain low information-burden models. These methods are applied on data obtained from the Surveillance, Epidemiology, and End Results database. The improvements of survivability prognosis of breast cancer are investigated based on the experimental results. Results Experimental results confirm that the DT and LR models combined with SMOTE, CSC, and under-sampling generate higher predictive performance consecutively than the original ones. Most of the time, DT and LR models combined with SMOTE and CSC use less informative burden/features when a feature selection method and a pruning technique are applied. Conclusions LR is found to have better statistical power than DT in predicting five-year survivability. CSC is superior to SMOTE, under-sampling, bagging, and boosting to improve the prognostic performance of DT and LR.
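Of the balancing techniques compared in this abstract, under-sampling is the simplest to illustrate. The Python sketch below shows plain random under-sampling, in which every class is reduced to the size of the smallest one; it is an assumption that the study used this basic variant, and the function name, the seed, and the toy records are hypothetical.

```python
import random
from collections import defaultdict


def random_undersample(samples, labels, seed=0):
    """Randomly drop samples so every class has as many items as the rarest class."""
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    by_class = defaultdict(list)
    for x, y in zip(samples, labels):
        by_class[y].append(x)
    n_min = min(len(items) for items in by_class.values())
    out_x, out_y = [], []
    for y, items in by_class.items():
        for x in rng.sample(items, n_min):  # sample without replacement
            out_x.append(x)
            out_y.append(y)
    return out_x, out_y


# Toy data: 8 survivors ("S") and 2 non-survivors ("N").
records = list(range(10))
outcome = ["S"] * 8 + ["N"] * 2
balanced_x, balanced_y = random_undersample(records, outcome)
print(sorted(balanced_y))
```

The trade-off is that under-sampling discards majority-class records, whereas SMOTE and cost-sensitive classification keep all of the original data.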
Affiliation(s)
- Kung-Jeng Wang
- Department of Industrial Management, National Taiwan University of Science and Technology, Taipei 106, Taiwan.
|
1945
|
Lee S, Kang YM, Park H, Dong MS, Shin JM, No KT. Human Nephrotoxicity Prediction Models for Three Types of Kidney Injury Based on Data Sets of Pharmacological Compounds and Their Metabolites. Chem Res Toxicol 2013; 26:1652-9. [DOI: 10.1021/tx400249t] [Citation(s) in RCA: 18] [Impact Index Per Article: 1.5] [Indexed: 01/16/2023]
Affiliation(s)
- Sehan Lee
- Bioinformatics and Molecular Design Research Center, Seoul 120-749, Korea
| | - Young-Mook Kang
- Department of Biotechnology, Yonsei University, Seoul 120-749, Korea
| | - Hyejin Park
- Bioinformatics and Molecular Design Research Center, Seoul 120-749, Korea
| | - Mi-Sook Dong
- School of Life Sciences and Biotechnology, Korea University, Seoul 136-701, Korea
| | - Jae-Min Shin
- Bioinformatics and Molecular Design Research Center, Seoul 120-749, Korea
| | - Kyoung Tai No
- Bioinformatics and Molecular Design Research Center, Seoul 120-749, Korea
- Department of Biotechnology, Yonsei University, Seoul 120-749, Korea
|
1946
|
López V, Fernández A, García S, Palade V, Herrera F. An insight into classification with imbalanced data: Empirical results and current trends on using data intrinsic characteristics. Inf Sci (N Y) 2013. [DOI: 10.1016/j.ins.2013.07.007] [Citation(s) in RCA: 932] [Impact Index Per Article: 77.7] [Indexed: 11/15/2022]
|
1947
|
|
1948
|
|
1949
|
Abstract
Extreme learning machine (ELM) is an efficient and practical learning algorithm used for training single hidden layer feed-forward neural networks (SLFNs). ELM can provide good generalization performance at extremely fast learning speed. However, ELM suffers from instability and over-fitting, especially on relatively large datasets. Based on probabilistic SLFNs, an approach of fusion of extreme learning machine (F-ELM) with fuzzy integral is proposed in this paper. The proposed algorithm consists of three stages. Firstly, the bootstrap technique is employed to generate several subsets of original dataset. Secondly, probabilistic SLFNs are trained with ELM algorithm on each subset. Finally, the trained probabilistic SLFNs are fused with fuzzy integral. The experimental results show that the proposed approach can alleviate to some extent the problems mentioned above, and can increase the prediction accuracy.
Affiliation(s)
- JUNHAI ZHAI
- College of Mathematics and Computer Science, Hebei University, Baoding, 071002, Hebei, China
- Laboratory of Machine Learning and Computational Intelligence, Hebei University, Baoding 071002, China
| | - HONGYU XU
- College of Mathematics and Computer Science, Hebei University, Baoding, 071002, Hebei, China
| | - YAN LI
- College of Mathematics and Computer Science, Hebei University, Baoding, 071002, Hebei, China
- Laboratory of Machine Learning and Computational Intelligence, Hebei University, Baoding 071002, China
|
1950
|
Long X, Fonseca P, Foussier J, Haakma R, Aarts RM. Sleep and wake classification with actigraphy and respiratory effort using dynamic warping. IEEE J Biomed Health Inform 2013; 18:1272-84. [PMID: 24108754 DOI: 10.1109/jbhi.2013.2284610] [Citation(s) in RCA: 48] [Impact Index Per Article: 4.0] [Indexed: 11/06/2022]
Abstract
This paper proposes the use of dynamic warping (DW) methods for improving automatic sleep and wake classification using actigraphy and respiratory effort. DW is an algorithm that finds an optimal nonlinear alignment between two series allowing scaling and shifting. It is widely used to quantify (dis)similarity between two series. To compare the respiratory effort between sleep and wake states by means of (dis)similarity, we constructed two novel features based on DW. For a given epoch of a respiratory effort recording, the features search for the optimally aligned epoch within the same recording in time and frequency domain. This is expected to yield a high (or low) similarity score when this epoch is sleep (or wake). Since the comparison occurs throughout the entire-night recording of a subject, it may reduce the effects of within- and between-subject variations of the respiratory effort, and thus help discriminate between sleep and wake states. The DW-based features were evaluated using a linear discriminant classifier on a dataset of 15 healthy subjects. Results show that the DW-based features can provide a Cohen's Kappa coefficient of agreement κ = 0.59 which is significantly higher than the existing respiratory-based features and is comparable to actigraphy. After combining the actigraphy and the DW-based features, the classifier achieved a κ of 0.66 and an overall accuracy of 95.7%, outperforming an earlier actigraphy- and respiratory-based feature set (κ = 0.62). The results are also comparable with those obtained using an actigraphy- and cardiorespiratory-based feature set but have the important advantage that they do not require an ECG signal to be recorded.
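The alignment idea behind dynamic warping can be illustrated with the classic dynamic time warping recurrence. The Python sketch below is that textbook algorithm with an absolute-difference cost, not the paper's time- and frequency-domain DW features; the function name `dtw_distance` and the toy series are assumptions made for illustration.

```python
def dtw_distance(a, b):
    """Classic DTW cost between two numeric sequences (absolute-difference cost)."""
    if not a or not b:
        raise ValueError("both series must be non-empty")
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] = best cumulative cost aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(a[i - 1] - b[j - 1])
            cost[i][j] = step + min(
                cost[i - 1][j],      # a[i-1] also matched to an earlier b
                cost[i][j - 1],      # b[j-1] also matched to an earlier a
                cost[i - 1][j - 1],  # advance both series
            )
    return cost[n][m]


# The second series is the first with one sample stretched in time.
print(dtw_distance([1, 2, 3], [1, 2, 2, 3]))
```

A low cost means the two series match after local stretching, which corresponds to the high similarity score the abstract expects when an epoch is sleep.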
|