2001
|
Giraldo-Forero AF, Jaramillo-Garzón JA, Ruiz-Muñoz JF, Castellanos-Domínguez CG. Managing Imbalanced Data Sets in Multi-label Problems: A Case Study with the SMOTE Algorithm. PROGRESS IN PATTERN RECOGNITION, IMAGE ANALYSIS, COMPUTER VISION, AND APPLICATIONS 2013. [DOI: 10.1007/978-3-642-41822-8_42] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.8] [Indexed: 12/12/2022]
|
2002
|
Abstract
Structure-activity relationship (SAR) and quantitative structure-activity relationship (QSAR) models are increasingly used in toxicology, ecotoxicology, and pharmacology for predicting the activity of the molecules from their physicochemical properties and/or their structural characteristics. However, the design of such models has many traps for unwary practitioners. Consequently, the purpose of this chapter is to give a practical guide for the computation of SAR and QSAR models, point out problems that may be encountered, and suggest ways of solving them. Attempts are also made to see how these models can be validated and interpreted.
|
2003
|
|
2004
|
Matsuta Y, Ito M, Tohsato Y. ECOH: an enzyme commission number predictor using mutual information and a support vector machine. Bioinformatics 2012; 29:365-72. [PMID: 23220570 DOI: 10.1093/bioinformatics/bts700] [Citation(s) in RCA: 19] [Impact Index Per Article: 1.5] [Indexed: 11/12/2022]
Abstract
Motivation: The enzyme nomenclature system, commonly known as the enzyme commission (EC) number, plays a key role in classifying and predicting enzymatic reactions. However, numerous reactions have been described in various pathways that do not have an official EC number, and the reactions are not expected to have an EC number assigned because of a lack of articles published on enzyme assays. To predict the EC number of a non-classified enzymatic reaction, we focus on the structural similarity of its substrate and product to the substrate and product of reactions that have been classified.
Results: We propose a new method to assign EC numbers using a maximum common substructure algorithm, mutual information and a support vector machine, termed the Enzyme COmmission numbers Handler (ECOH). A jack-knife test shows that the sensitivity, precision and accuracy of the method in predicting the first three digits of the official EC number (i.e. the EC sub-subclass) are 86.1%, 87.4% and 99.8%, respectively. We furthermore demonstrate that, by examining the ranking in the candidate lists of EC sub-subclasses generated by the algorithm, the method can successfully predict the classification of 85 enzymatic reactions that fall into multiple EC sub-subclasses. The better performance of the ECOH as compared with existing methods and its flexibility in predicting EC numbers make it useful for predicting enzyme function.
Availability: ECOH is freely available via the Internet at http://www.bioinfo.sk.ritsumei.ac.jp/apps/ecoh/. This program only works on 32-bit Windows.
Affiliation(s)
- Yoshihiko Matsuta
- Department of Bioinformatics, College of Life Sciences, Ritsumeikan University, Shiga, Kusatsu 525-8577, Japan
|
2005
|
|
2006
|
The quest for the optimal class distribution: an approach for enhancing the effectiveness of learning via resampling methods for imbalanced data sets. PROGRESS IN ARTIFICIAL INTELLIGENCE 2012. [DOI: 10.1007/s13748-012-0034-6] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.2] [Indexed: 11/26/2022]
|
2007
|
Liang G, Zhu X, Zhang C. The effect of varying levels of class distribution on bagging for different algorithms: An empirical study. INT J MACH LEARN CYB 2012. [DOI: 10.1007/s13042-012-0125-5] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 10/27/2022]
|
2008
|
Mensen A, Khatami R. Advanced EEG analysis using threshold-free cluster-enhancement and non-parametric statistics. Neuroimage 2012; 67:111-8. [PMID: 23123297 DOI: 10.1016/j.neuroimage.2012.10.027] [Citation(s) in RCA: 106] [Impact Index Per Article: 8.2] [Received: 06/04/2012] [Revised: 10/08/2012] [Accepted: 10/18/2012] [Indexed: 01/16/2023] [Open Access]
Abstract
Advances in EEG signal analysis and its combination with other investigative techniques make appropriate statistical analysis of large EEG datasets a crucial issue. With an increasing number of available channels and samples, as well as more exploratory experimental designs, it has become necessary to develop a statistical process with a high level of statistical integrity and signal sensitivity that nonetheless produces results interpretable to the common user. Threshold-free cluster-enhancement has recently been proposed as a useful analysis tool for fMRI datasets. This approach essentially takes into account both a data point's statistical intensity and its neighbourhood to transform the original signal into a more intuitive understanding of 'real' differences between groups or conditions. Here we adapt this approach to deal optimally with EEG datasets and use permutation-based statistics to build an efficient statistical analysis. Furthermore, we compare the results with several other non-parametric and parametric approaches currently available using realistic simulated EEG signals. The proposed method is shown to be generally more sensitive to the variety of signal types common to EEG datasets without the need for any arbitrary adjustment of parameters. Moreover, a unique p-value is produced for each channel-sample pair such that specific questions can still be asked of the dataset while providing general information regarding the large-scale experimental effects.
Affiliation(s)
- Armand Mensen
- University of Zürich, Raemistrasse 71, CH-8006, Zurich, Switzerland; Clinic Barmelweid, CH-5017, Barmelweid, Switzerland.
|
2009
|
Sun Z, Song Q, Zhu X. Using Coding-Based Ensemble Learning to Improve Software Defect Prediction. IEEE Trans Syst Man Cybern C Appl Rev 2012. [DOI: 10.1109/tsmcc.2012.2226152] [Citation(s) in RCA: 122] [Impact Index Per Article: 9.4] [Indexed: 11/07/2022]
|
2010
|
DBFS: An effective Density Based Feature Selection scheme for small sample size and high dimensional imbalanced data sets. DATA KNOWL ENG 2012. [DOI: 10.1016/j.datak.2012.08.001] [Citation(s) in RCA: 49] [Impact Index Per Article: 3.8] [Indexed: 11/24/2022]
|
2011
|
Liu N, Lin Z, Cao J, Koh Z, Zhang T, Huang GB, Ser W, Ong MEH. An Intelligent Scoring System and Its Application to Cardiac Arrest Prediction. IEEE Trans Inf Technol Biomed 2012; 16:1324-31. [DOI: 10.1109/titb.2012.2212448] [Citation(s) in RCA: 29] [Impact Index Per Article: 2.2] [Indexed: 11/09/2022]
|
2012
|
Menardi G, Torelli N. Training and assessing classification rules with imbalanced data. Data Min Knowl Discov 2012. [DOI: 10.1007/s10618-012-0295-5] [Citation(s) in RCA: 162] [Impact Index Per Article: 12.5] [Indexed: 10/27/2022]
|
2013
|
Boosting for class-imbalanced datasets using genetically evolved supervised non-linear projections. PROGRESS IN ARTIFICIAL INTELLIGENCE 2012. [DOI: 10.1007/s13748-012-0028-4] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.4] [Indexed: 10/27/2022]
|
2014
|
Surrounding neighborhood-based SMOTE for learning from imbalanced data sets. PROGRESS IN ARTIFICIAL INTELLIGENCE 2012. [DOI: 10.1007/s13748-012-0027-5] [Citation(s) in RCA: 20] [Impact Index Per Article: 1.5] [Indexed: 10/27/2022]
|
2015
|
Body G, Weladji RB, Holand Ø. The recursive model as a new approach to validate and monitor activity sensors. Behav Ecol Sociobiol 2012. [DOI: 10.1007/s00265-012-1414-4] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.5] [Indexed: 11/29/2022]
|
2016
|
|
2017
|
Cao Y, He H, Man H. SOMKE: kernel density estimation over data streams by sequences of self-organizing maps. IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS 2012; 23:1254-1268. [PMID: 24807522 DOI: 10.1109/tnnls.2012.2201167] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.2] [Indexed: 06/03/2023]
Abstract
In this paper, we propose a novel method, SOMKE, for kernel density estimation (KDE) over data streams based on sequences of self-organizing maps (SOMs). In many stream data mining applications, the traditional KDE methods are infeasible because of the high computational cost, processing time, and memory requirement. To reduce the time and space complexity, we propose a SOM structure in this paper to obtain well-defined data clusters to estimate the underlying probability distributions of incoming data streams. The main idea of this paper is to build a series of SOMs over the data streams via two operations, that is, creating and merging the SOM sequences. The creation phase produces the SOM sequence entries for windows of the data, which captures clustering information of the incoming data streams. The size of the SOM sequences can be further reduced by combining consecutive entries in the sequence based on the Kullback-Leibler divergence. Finally, the probability density functions over arbitrary time periods along the data streams can be estimated using such SOM sequences. We compare SOMKE with two other KDE methods for data streams, the M-kernel approach and the cluster kernel approach, in terms of accuracy and processing time for various stationary data streams. Furthermore, we also investigate the use of SOMKE over nonstationary (evolving) data streams, including a synthetic nonstationary data stream, a real-world financial data stream and a group of network traffic data streams. The simulation results illustrate the effectiveness and efficiency of the proposed approach.
|
2018
|
Shuo Wang, Xin Yao. Multiclass Imbalance Problems: Analysis and Potential Solutions. IEEE Trans Syst Man Cybern B Cybern 2012; 42:1119-30. [DOI: 10.1109/tsmcb.2012.2187280] [Citation(s) in RCA: 319] [Impact Index Per Article: 24.5] [Indexed: 11/06/2022]
|
2019
|
Derrac J, Verbiest N, García S, Cornelis C, Herrera F. On the use of evolutionary feature selection for improving fuzzy rough set based prototype selection. Soft comput 2012. [DOI: 10.1007/s00500-012-0888-3] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.3] [Indexed: 10/28/2022]
|
2020
|
Online cost-sensitive neural network classifiers for non-stationary and imbalanced data streams. Neural Comput Appl 2012. [DOI: 10.1007/s00521-012-1071-6] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.3] [Indexed: 10/28/2022]
|
2021
|
Netzer M, Kugler KG, Müller LAJ, Weinberger KM, Graber A, Baumgartner C, Dehmer M. A network-based feature selection approach to identify metabolic signatures in disease. J Theor Biol 2012; 310:216-22. [PMID: 22771628 DOI: 10.1016/j.jtbi.2012.06.003] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.7] [Received: 08/31/2011] [Revised: 04/16/2012] [Accepted: 06/03/2012] [Indexed: 12/17/2022]
Abstract
The identification and interpretation of metabolic biomarkers is a challenging task. In this context, network-based approaches have increasingly become a key technology in systems biology, allowing complex interactions in biological systems to be captured. In this work, we introduce a novel network-based method to identify highly predictive biomarker candidates for disease. First, we infer two different types of networks: (i) correlation networks, and (ii) a new type of network called ratio networks. Based on these networks, we introduce scores to prioritize features using topological descriptors of the vertices. To evaluate our method, we use an example dataset where quantitative targeted MS/MS analysis was applied to a total of 52 blood samples from 22 persons with obesity (BMI >30) and 30 healthy controls. Using our network-based feature selection approach, we identified highly discriminating metabolites for obesity (F-score >0.85, accuracy >85%), some of which could be verified by the literature.
Affiliation(s)
- Michael Netzer
- Research Group for Clinical Bioinformatics, Institute of Electrical and Biomedical Engineering, University for Health Sciences, Medical Informatics and Technology, 6060 Hall in Tyrol, Austria.
|
2022
|
He H, Cao Y. SSC: a classifier combination method based on signal strength. IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS 2012; 23:1100-1117. [PMID: 24807136 DOI: 10.1109/tnnls.2012.2198227] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.4] [Indexed: 06/03/2023]
Abstract
We propose a new classifier combination method, the signal strength-based combining (SSC) approach, to combine the outputs of multiple classifiers to support the decision-making process in classification tasks. As ensemble learning methods have recently attracted growing attention from both academia and industry, it is critical to understand the fundamental issues of the combining rule. Motivated by the signal strength concept, our proposed SSC algorithm can effectively integrate the individual votes from different classifiers in an ensemble learning system. Comparative studies of our method with nine major existing combining rules, namely, geometric average rule, arithmetic average rule, median value rule, majority voting rule, Borda count, max and min rule, weighted average, and weighted majority voting rules, are presented. Furthermore, we also discuss the relationship of the proposed method to margin-based classifiers, including the boosting method (AdaBoost.M1 and AdaBoost.M2) and support vector machines, by margin analysis. Detailed analyses of margin distribution graphs are presented to discuss the characteristics of the proposed method. Simulation results for various real-world datasets illustrate the effectiveness of the proposed method.
|
2023
|
Yang J, Liu Y, Zhu X, Liu Z, Zhang X. A new feature selection based on comprehensive measurement both in inter-category and intra-category for text categorization. Inf Process Manag 2012. [DOI: 10.1016/j.ipm.2011.12.005] [Citation(s) in RCA: 88] [Impact Index Per Article: 6.8] [Indexed: 11/25/2022]
|
2024
|
|
2025
|
Galar M, Fernandez A, Barrenechea E, Bustince H, Herrera F. A Review on Ensembles for the Class Imbalance Problem: Bagging-, Boosting-, and Hybrid-Based Approaches. IEEE Trans Syst Man Cybern C Appl Rev 2012. [DOI: 10.1109/tsmcc.2011.2161285] [Citation(s) in RCA: 1533] [Impact Index Per Article: 117.9] [Indexed: 11/08/2022]
|
2026
|
Performance evaluation of multilayer perceptrons for discriminating and quantifying multiple kinds of odors with an electronic nose. Neural Netw 2012; 33:204-15. [PMID: 22717447 DOI: 10.1016/j.neunet.2012.05.009] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.6] [Received: 01/02/2012] [Revised: 05/14/2012] [Accepted: 05/23/2012] [Indexed: 11/21/2022]
Abstract
This paper studies several types and arrangements of perceptron modules to discriminate and quantify multiple odors with an electronic nose. We evaluate the following types of multilayer perceptron. (A) A single multi-output (SMO) perceptron both for discrimination and for quantification. (B) An SMO perceptron for discrimination followed by multiple multi-output (MMO) perceptrons for quantification. (C) An SMO perceptron for discrimination followed by multiple single-output (MSO) perceptrons for quantification. (D) MSO perceptrons for discrimination followed by MSO perceptrons for quantification, called the MSO-MSO perceptron model, under the following conditions: (D1) using a simple one-against-all (OAA) decomposition method; (D2) adopting a simple OAA decomposition method and virtual balance step; and (D3) employing a local OAA decomposition method, virtual balance step and local generalization strategy all together. The experimental results for 12 kinds of volatile organic compounds at 85 concentration levels in the training set and 155 concentration levels in the test set show that the MSO-MSO perceptron model with the D3 learning procedure is the most effective of those tested for discrimination and quantification of many kinds of odors.
|
2027
|
|
2028
|
Zhang YN, Yu DJ, Li SS, Fan YX, Huang Y, Shen HB. Predicting protein-ATP binding sites from primary sequence through fusing bi-profile sampling of multi-view features. BMC Bioinformatics 2012; 13:118. [PMID: 22651691 PMCID: PMC3424114 DOI: 10.1186/1471-2105-13-118] [Citation(s) in RCA: 36] [Impact Index Per Article: 2.8] [Received: 12/05/2011] [Accepted: 05/31/2012] [Indexed: 12/23/2022] [Open Access]
Abstract
Background: Adenosine-5′-triphosphate (ATP) is one of the multifunctional nucleotides and plays an important role in cell biology as a coenzyme interacting with proteins. Revealing the binding sites between protein and ATP is important for understanding the functionality of the proteins and the mechanisms of the protein-ATP complex.
Results: In this paper, we propose a novel framework for predicting the proteins’ functional residues, through which they can bind with ATP molecules. The new prediction protocol is achieved by combining sequence evolutionary information with bi-profile sampling of multi-view sequential features and sequence-derived structural features. The hypothesis behind this strategy is that a single-view feature can represent only partial knowledge of the target, whereas multiple sources of descriptors can be complementary.
Conclusions: Prediction performances evaluated by both 5-fold and leave-one-out jackknife cross-validation tests on two benchmark datasets consisting of 168 and 227 non-homologous ATP binding proteins, respectively, demonstrate the efficacy of the proposed protocol. Our experimental results also reveal that the residue structural characteristics of real protein-ATP binding sites are significantly different from those of normal residues; for example, the binding residues do not show high solvent accessibility propensities, and binding tends to occur at the junctions between different secondary structure segments. Furthermore, tests with multiple ratios between positive and negative samples show that performance is affected by imbalanced training datasets. Increasing the dataset scale is also shown to be useful for improving prediction performance.
Affiliation(s)
- Ya-Nan Zhang
- Department of Automation, Shanghai Jiao Tong University, and Key Laboratory of System Control and Information Processing, Ministry of Education of China, Shanghai 200240, China
|
2029
|
Klement W, Wilk S, Michalowski W, Farion KJ, Osmond MH, Verter V. Predicting the need for CT imaging in children with minor head injury using an ensemble of Naive Bayes classifiers. Artif Intell Med 2012; 54:163-70. [DOI: 10.1016/j.artmed.2011.11.005] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.2] [Received: 05/01/2011] [Revised: 10/18/2011] [Accepted: 11/24/2011] [Indexed: 10/14/2022]
|
2030
|
García V, Sánchez J, Mollineda R. On the effectiveness of preprocessing methods when dealing with different levels of class imbalance. Knowl Based Syst 2012. [DOI: 10.1016/j.knosys.2011.06.013] [Citation(s) in RCA: 113] [Impact Index Per Article: 8.7] [Indexed: 10/18/2022]
|
2031
|
|
2032
|
Linguistic Fuzzy Rules in Data Mining: Follow-Up Mamdani Fuzzy Modeling Principle. COMBINING EXPERIMENTATION AND THEORY 2012. [DOI: 10.1007/978-3-642-24666-1_8] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.3] [Indexed: 12/13/2022]
|
2033
|
García-López S, Jaramillo-Garzón JA, Castellanos-Domínguez G. Improving the prediction of sub-cellular locations of proteins with a particle swarm optimization-based boosting strategy. ANNUAL INTERNATIONAL CONFERENCE OF THE IEEE ENGINEERING IN MEDICINE AND BIOLOGY SOCIETY. IEEE ENGINEERING IN MEDICINE AND BIOLOGY SOCIETY. ANNUAL INTERNATIONAL CONFERENCE 2012; 2012:6313-6316. [PMID: 23367372 DOI: 10.1109/embc.2012.6347437] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/01/2023]
Abstract
Learning from imbalanced data sets presents an important challenge to the machine learning community. Traditional classification methods, which seek to minimize the overall error rate on the whole training set, do not perform well on imbalanced data since they assume a relatively balanced class distribution and place too much emphasis on the majority class. This is a common scenario when predicting sub-cellular locations of proteins, since proteins belonging to certain specific locations are naturally more abundant or have been more extensively studied. In this work, a new method to learn from imbalanced data, called SwarmBoost, is proposed in order to reduce the overlap and noise of imbalanced datasets and improve prediction performance. The method combines oversampling, subsampling based on particle swarm optimization, and ensemble methods. Our results show that SwarmBoost equals and in several cases outperforms other common boosting algorithms like DataBoost-Im and AdaBoost, constituting a useful tool for improving sub-cellular location predictions.
Affiliation(s)
- Sebastián García-López
- Grupo de Control y Procesamiento Digital de Señales, Universidad Nacional de Colombia, sede Manizales, Km 7. Vía al Magdalena, Manizales, Colombia.
|
2034
|
|
2035
|
Triguero I, Derrac J, Garcia S, Herrera F. A Taxonomy and Experimental Study on Prototype Generation for Nearest Neighbor Classification. IEEE Trans Syst Man Cybern C Appl Rev 2012. [DOI: 10.1109/tsmcc.2010.2103939] [Citation(s) in RCA: 186] [Impact Index Per Article: 14.3] [Indexed: 11/07/2022]
|
2036
|
Identification of Different Types of Minority Class Examples in Imbalanced Data. LECTURE NOTES IN COMPUTER SCIENCE 2012. [DOI: 10.1007/978-3-642-28931-6_14] [Citation(s) in RCA: 36] [Impact Index Per Article: 2.8] [Indexed: 12/29/2022]
|
2037
|
Palacios A, Sánchez L, Couso I. Equalizing imbalanced imprecise datasets for genetic fuzzy classifiers. INT J COMPUT INT SYS 2012. [DOI: 10.1080/18756891.2012.685292] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.2] [Indexed: 10/26/2022] [Open Access]
|
2038
|
|
2039
|
SMOTE-RSB *: a hybrid preprocessing approach based on oversampling and undersampling for high imbalanced data-sets using SMOTE and rough sets theory. Knowl Inf Syst 2011. [DOI: 10.1007/s10115-011-0465-6] [Citation(s) in RCA: 141] [Impact Index Per Article: 10.1] [Indexed: 01/09/2023]
|
2040
|
Hospedales TM, Li J, Gong S, Xiang T. Identifying Rare and Subtle Behaviors: A Weakly Supervised Joint Topic Model. IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE 2011; 33:2451-2464. [PMID: 21519099 DOI: 10.1109/tpami.2011.81] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.5] [Indexed: 05/30/2023]
Abstract
One of the most interesting and desired capabilities for automated video behavior analysis is the identification of rarely occurring and subtle behaviors. This is of practical value because dangerous or illegal activities often have few or possibly only one prior example to learn from and are often subtle. Rare and subtle behavior learning is challenging for two reasons: (1) contemporary modeling approaches require more data and supervision than may be available, and (2) the most interesting and potentially critical rare behaviors are often visually subtle, occurring among more obvious typical behaviors or being defined by only small spatio-temporal deviations from typical behaviors. In this paper, we introduce a novel weakly supervised joint topic model which addresses these issues. Specifically, we introduce a multiclass topic model with partially shared latent structure and associated learning and inference algorithms. These contributions will permit modeling of behaviors from as few as one example, even without localization by the user and when occurring in clutter, and subsequent classification and localization of such behaviors online and in real time. We extensively validate our approach on two standard public-space data sets, where it clearly outperforms a batch of contemporary alternatives.
|
2041
|
Haibo He, Sheng Chen, Kang Li, Xin Xu. Incremental Learning From Stream Data. IEEE Trans Neural Netw 2011; 22:1901-14. [DOI: 10.1109/tnn.2011.2171713] [Citation(s) in RCA: 92] [Impact Index Per Article: 6.6] [Indexed: 11/10/2022]
|
2042
|
Javadi M, Ebrahimpour R, Sajedin A, Faridi S, Zakernejad S. Improving ECG classification accuracy using an ensemble of neural network modules. PLoS One 2011; 6:e24386. [PMID: 22046232 PMCID: PMC3202523 DOI: 10.1371/journal.pone.0024386] [Citation(s) in RCA: 22] [Impact Index Per Article: 1.6] [Received: 04/18/2011] [Accepted: 08/08/2011] [Indexed: 11/23/2022] [Open Access]
Abstract
This paper illustrates the use of a combined neural network model based on the Stacked Generalization method for classification of electrocardiogram (ECG) beats. In the conventional Stacked Generalization method, the combiner learns to map the base classifiers' outputs to the target data. We claim that adding the input pattern to the base classifiers' outputs helps the combiner to obtain knowledge about the input space and, as a result, to perform better on the same task. Experimental results support our claim that the additional knowledge about the input space improves the performance of the proposed method, which is called Modified Stacked Generalization. In particular, for classification of 14966 ECG beats that were not previously seen during the training phase, the Modified Stacked Generalization method reduced the error rate by 12.41% in comparison with the best of ten popular classifier fusion methods including Max, Min, Average, Product, Majority Voting, Borda Count, Decision Templates, Weighted Averaging based on Particle Swarm Optimization and Stacked Generalization.
Affiliation(s)
- Mehrdad Javadi
- Islamic Azad University, South Tehran Branch, Tehran, Iran.
|
2043
|
Gao M, Hong X, Chen S, Harris CJ. A combined SMOTE and PSO based RBF classifier for two-class imbalanced problems. Neurocomputing 2011. [DOI: 10.1016/j.neucom.2011.06.010] [Citation(s) in RCA: 89] [Impact Index Per Article: 6.4] [Indexed: 11/24/2022]
|
2044
|
Segura-Bedmar I, Martínez P, de Pablo-Sánchez C. Using a shallow linguistic kernel for drug–drug interaction extraction. J Biomed Inform 2011; 44:789-804. [DOI: 10.1016/j.jbi.2011.04.005] [Citation(s) in RCA: 89] [Impact Index Per Article: 6.4] [Received: 10/08/2010] [Revised: 04/14/2011] [Accepted: 04/19/2011] [Indexed: 11/26/2022]
|
2045
|
Automatic defect detection for TFT-LCD array process using quasiconformal kernel support vector data description. Int J Mol Sci 2011; 12:5762-81. [PMID: 22016625 PMCID: PMC3189749 DOI: 10.3390/ijms12095762] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.4] [Received: 07/05/2011] [Revised: 08/08/2011] [Accepted: 08/16/2011] [Indexed: 11/16/2022] [Open Access]
Abstract
Defect detection has been considered an efficient way to increase the yield rate of panels in thin film transistor liquid crystal display (TFT-LCD) manufacturing. In this study we focus on the array process since it is the first and key process in TFT-LCD manufacturing. Various defects occur in the array process, and some of them could cause great damage to the LCD panels. Thus, designing a method that can robustly detect defects from images captured from the surface of LCD panels has become crucial. Previously, support vector data description (SVDD) has been successfully applied to LCD defect detection. However, its generalization performance is limited. In this paper, we propose a novel one-class machine learning method, called quasiconformal kernel SVDD (QK-SVDD), to address this issue. The QK-SVDD can significantly improve the generalization performance of the traditional SVDD by introducing the quasiconformal transformation into a predefined kernel. Experimental results, carried out on real LCD images provided by an LCD manufacturer in Taiwan, indicate that the proposed QK-SVDD not only obtains a high defect detection rate of 96%, but also greatly improves the generalization performance of SVDD. The improvement has been shown to be over 30%. In addition, results also show that the QK-SVDD defect detector is able to accomplish the task of defect detection on an LCD image within 60 ms.
|
2046
|
Sontrop HMJ, Verhaegh WFJ, Reinders MJT, Moerland PD. An evaluation protocol for subtype-specific breast cancer event prediction. PLoS One 2011; 6:e21681. [PMID: 21760900 PMCID: PMC3132736 DOI: 10.1371/journal.pone.0021681] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.5] [Received: 04/14/2011] [Accepted: 06/05/2011] [Indexed: 12/31/2022] [Open Access]
Abstract
In recent years, increasing evidence has appeared that breast cancer may not constitute a single disease at the molecular level, but comprises a heterogeneous set of subtypes. This suggests that instead of building a single monolithic predictor, better predictors might be constructed that solely target samples of a designated subtype, which are believed to represent more homogeneous sets of samples. An unavoidable drawback of developing subtype-specific predictors, however, is that stratification by subtype drastically reduces the number of samples available for their construction. As numerous studies have indicated sample size to be an important factor in predictor construction, it is questionable whether the potential benefit of subtyping can outweigh the drawback of a severe loss in sample size. Factors like unequal class distributions and differences in the number of samples per subtype further complicate comparisons. We present a novel experimental protocol that facilitates a comprehensive comparison between subtype-specific predictors and predictors that do not take subtype information into account. Emphasis lies on careful control of sample size as well as class and subtype distributions. The methodology is applied to a large breast cancer compendium involving over 1500 arrays, using a state-of-the-art subtyping scheme. We show that the resulting subtype-specific predictors outperform those that do not take subtype information into account, especially when sample size considerations are taken into account.
Affiliation(s)
- Wim F. J. Verhaegh
- Molecular Diagnostics Department, Philips Research, Eindhoven, The Netherlands
- Marcel J. T. Reinders
- Delft Bioinformatics Lab, Delft University of Technology, Delft, The Netherlands
- Netherlands Bioinformatics Centre, Nijmegen, The Netherlands
- Perry D. Moerland
- Bioinformatics Laboratory, Department of Clinical Epidemiology, Biostatistics, and Bioinformatics, Academic Medical Center, Amsterdam, The Netherlands
- Netherlands Bioinformatics Centre, Nijmegen, The Netherlands
|
2047
|
Fu SC, Imai K, Horton P. Prediction of leucine-rich nuclear export signal containing proteins with NESsential. Nucleic Acids Res 2011; 39:e111. [PMID: 21705415 PMCID: PMC3167595 DOI: 10.1093/nar/gkr493] [Citation(s) in RCA: 51] [Impact Index Per Article: 3.6] [Indexed: 12/26/2022] Open
Abstract
The classical nuclear export signal (NES), also known as the leucine-rich NES, is a protein localization signal often involved in important processes such as signal transduction and cell cycle regulation. Although 15 years have passed since its discovery, limited structural information and high sequence diversity have hampered understanding of the NES. Several consensus sequences have been proposed to describe it, but they suffer from poor predictive power. On the other hand, the NetNES server provides the only computational method currently available. Although these two methods have been widely used to attempt to find the correct NES position within potential NES-containing proteins, their performance has not yet been evaluated on the basic task of identifying NES-containing proteins. We propose a new predictor, NESsential, which uses sequence-derived meta-features, such as predicted disorder and solvent accessibility, in addition to primary sequence. We demonstrate that it can identify promising NES-containing candidate proteins (albeit at low coverage), whereas other methods cannot. We also quantitatively demonstrate that predicted disorder is a useful feature for prediction and investigate the different features of (predicted) ordered versus disordered NESs. Finally, we list 70 recently discovered NES-containing proteins, doubling the number available to the community.
Affiliation(s)
- Szu-Chin Fu
- Department of Computational Biology, Graduate School of Frontier Science, University of Tokyo, Kashiwa 277-8561, Japan
|
2048
|
Weisman D, Liu H, Redfern J, Zhu L, Colón-Carmona A. Novel computational identification of highly selective biomarkers of pollutant exposure. Environmental Science & Technology 2011; 45:5132-5138. [PMID: 21542576 DOI: 10.1021/es200065f] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.3] [Indexed: 05/30/2023]
Abstract
The use of in vivo biosensors to acquire environmental pollution data is an emerging and promising paradigm. One major challenge is the identification of highly specific biomarkers that selectively report exposure to a target pollutant, while remaining quiescent under a diverse set of other, often unknown, environmental conditions. This study hypothesized that a microarray data mining approach can identify highly specific biomarkers, and that the robustness property can generalize to unforeseen environmental conditions. Starting with Arabidopsis thaliana microarray data measuring responses to a variety of treatments, the study used the top scoring pair (TSP) algorithm to identify mRNA transcripts that respond uniquely to phenanthrene, a model polycyclic aromatic hydrocarbon. Subsequent in silico analysis with a larger set of microarray data indicated that the biomarkers remained robust under new conditions. Finally, in vivo experiments were performed with unforeseen conditions that mimic phenanthrene stress, and the biomarkers were assayed using qRT-PCR. In these experiments, the biomarkers always responded positively to phenanthrene, and never responded to the unforeseen conditions, thereby supporting the hypotheses. This data mining approach requires only microarray or next-generation RNA-seq data and, in principle, can be applied to arbitrary biomonitoring organisms and chemical exposures.
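For readers unfamiliar with the top scoring pair (TSP) idea mentioned above, here is a minimal sketch of its standard two-class form: for every pair of features, compare how often feature i is expressed below feature j in each class, and keep the pair whose ordering differs most between classes. The function name `top_scoring_pair` and the toy expression values are hypothetical, and the sketch is not the study's actual pipeline.

```python
from itertools import combinations


def top_scoring_pair(samples, labels):
    """Return (score, i, j) for the feature pair whose relative ordering
    best separates two classes.

    samples: list of equal-length lists of expression values, one per sample.
    labels:  list of 0/1 class labels, one per sample.
    """
    n_features = len(samples[0])
    by_class = {0: [], 1: []}
    for row, label in zip(samples, labels):
        by_class[label].append(row)

    best = (-1.0, None, None)
    for i, j in combinations(range(n_features), 2):
        # Fraction of samples in each class where feature i is below feature j.
        p = {
            c: sum(1 for r in rows if r[i] < r[j]) / len(rows)
            for c, rows in by_class.items()
        }
        score = abs(p[1] - p[0])
        if score > best[0]:  # ties keep the first pair encountered
            best = (score, i, j)
    return best


# Made-up toy data: three transcripts, three exposed (1) and three control (0) samples.
samples = [
    [1.0, 5.0, 3.0], [2.0, 6.0, 0.1], [0.5, 4.0, 3.5],  # exposed
    [5.0, 1.0, 3.0], [6.0, 2.0, 0.1], [4.0, 0.5, 3.5],  # control
]
labels = [1, 1, 1, 0, 0, 0]
score, i, j = top_scoring_pair(samples, labels)
```

Because the rule depends only on the within-sample ordering of two transcripts, it is unaffected by monotone normalization of a sample, which is one reason rank-based pairs of this kind are attractive for data collected under differing conditions.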
Affiliation(s)
- David Weisman
- Department of Biology, University of Massachusetts Boston, Boston, Massachusetts 02125, USA
|
2049
|
|
2050
|
|