501
|
González S, García S, Li ST, Herrera F. Chain based sampling for monotonic imbalanced classification. Inf Sci (N Y) 2019. [DOI: 10.1016/j.ins.2018.09.062] [Citation(s) in RCA: 16] [Impact Index Per Article: 2.7] [Indexed: 12/28/2022]
|
502
|
Raghuwanshi BS, Shukla S. Class imbalance learning using UnderBagging based kernelized extreme learning machine. Neurocomputing 2019. [DOI: 10.1016/j.neucom.2018.10.056] [Citation(s) in RCA: 50] [Impact Index Per Article: 8.3] [Indexed: 11/17/2022]
|
503
|
Integration of feature vector selection and support vector machine for classification of imbalanced data. Appl Soft Comput 2019. [DOI: 10.1016/j.asoc.2018.11.045] [Citation(s) in RCA: 22] [Impact Index Per Article: 3.7] [Indexed: 11/22/2022]
|
504
|
|
505
|
A Robust Framework for Self-Care Problem Identification for Children with Disability. Symmetry (Basel) 2019. [DOI: 10.3390/sym11010089] [Citation(s) in RCA: 16] [Impact Index Per Article: 2.7] [Indexed: 12/23/2022] Open
Abstract
Recently, a standard dataset, SCADI (Self-Care Activities Dataset), was introduced for identifying self-care problems in children with physical and motor disabilities; it is based on the International Classification of Functioning, Disability, and Health for Children and Youth framework. The topic is important and challenging because of its usefulness in medical diagnosis. This study proposes a robust framework using a sampling technique and extreme gradient boosting (FSX) to improve the prediction performance for the SCADI dataset. The proposed framework first converts the original dataset to a new dataset with a smaller number of dimensions. It then balances this reduced dataset using oversampling techniques with different ratios. Next, extreme gradient boosting is used to diagnose the problems. Experiments on prediction performance and feature importance were conducted to show the effectiveness of FSX and to analyse the results. The experimental results show that FSX with the Synthetic Minority Over-sampling Technique (SMOTE) as the oversampling module outperforms the Artificial Neural Network (ANN)-based approach, Support Vector Machine (SVM), and Random Forest on the SCADI dataset. The overall accuracy of the proposed framework reaches 85.4%, which supports its use for self-care problem classification in medical diagnosis.
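The oversampling step named in this abstract, SMOTE, creates synthetic minority samples by interpolating between a minority sample and one of its nearest minority neighbours. A minimal pure-Python sketch of that interpolation follows. The function name `smote`, its parameters, and the toy data are illustrative assumptions; this is not the FSX implementation, and the dimensionality reduction and extreme gradient boosting stages are omitted.

```python
import math
import random

def smote(minority, n_new, k=3, seed=0):
    """Return n_new synthetic points, each interpolated between a randomly
    chosen minority sample and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Candidate neighbours: every other minority sample, nearest first.
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        nb = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from base to nb
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, nb)))
    return synthetic

# Hypothetical two-dimensional minority class.
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
new_points = smote(minority, n_new=6)
```

Because every synthetic point lies on a segment between two existing minority samples, the new points stay inside the region already occupied by the minority class.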
|
506
|
SocialTERM-Extractor: Identifying and Predicting Social-Problem-Specific Key Noun Terms from a Large Number of Online News Articles Using Text Mining and Machine Learning Techniques. SUSTAINABILITY 2019. [DOI: 10.3390/su11010196] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.2] [Indexed: 11/16/2022]
Abstract
In the digital age, the abundant unstructured data on the Internet, particularly online news articles, provide opportunities for identifying social problems and understanding social systems for sustainability. However, the previous works have not paid attention to the social-problem-specific perspectives of such big data, and it is currently unclear how information technologies can use the big data to identify and manage the ongoing social problems. In this context, this paper introduces and focuses on social-problem-specific key noun terms, namely SocialTERMs, which can be used not only to search the Internet for social-problem-related data, but also to monitor the ongoing and future events of social problems. Moreover, to alleviate time-consuming human efforts in identifying the SocialTERMs, this paper designs and examines the SocialTERM-Extractor, which is an automatic approach for identifying the key noun terms of social-problem-related topics, namely SPRTs, in a large number of online news articles and predicting the SocialTERMs among the identified key noun terms. This paper has its novelty as the first trial to identify and predict the SocialTERMs from a large number of online news articles, and it contributes to literature by proposing three types of text-mining-based features, namely temporal weight, sentiment, and complex network structural features, and by comparing the performances of such features with various machine learning techniques including deep learning. Particularly, when applied to a large number of online news articles that had been published in South Korea over a 12-month period and mostly written in Korean, the experimental results showed that Boosting Decision Tree gave the best performances with the full feature sets. They showed that the SocialTERMs can be predicted with high performances by the proposed SocialTERM-Extractor. 
Ultimately, this paper can benefit individuals or organizations who want to explore and use social-problem-related data systematically to understand and manage social problems, even if they are unfamiliar with ongoing social problems.
|
507
|
Zhang C, Tan KC, Li H, Hong GS. A Cost-Sensitive Deep Belief Network for Imbalanced Classification. IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS 2019; 30:109-122. [PMID: 29993587 DOI: 10.1109/tnnls.2018.2832648] [Citation(s) in RCA: 60] [Impact Index Per Article: 10.0] [Indexed: 06/08/2023]
Abstract
Imbalanced data with a skewed class distribution are common in many real-world applications. Deep Belief Network (DBN) is a machine learning technique that is effective in classification tasks. However, conventional DBN does not work well for imbalanced data classification because it assumes equal costs for each class. To deal with this problem, cost-sensitive approaches assign different misclassification costs to different classes without disrupting the true data sample distributions. However, due to a lack of prior knowledge, the misclassification costs are usually unknown and hard to choose in practice. Moreover, how cost-sensitive learning could improve DBN performance on imbalanced data problems has not been well studied. This paper proposes an evolutionary cost-sensitive deep belief network (ECS-DBN) for imbalanced classification. ECS-DBN uses adaptive differential evolution to optimize the misclassification costs on the training data, which provides an effective way to incorporate the evaluation measure (i.e., G-mean) into the objective function. We first optimize the misclassification costs and then apply them to the DBN. Adaptive differential evolution is used as the optimization algorithm; it automatically updates its own parameters without the need for prior domain knowledge. The experiments show that the proposed approach consistently outperforms the state of the art on both benchmark data sets and a real-world data set for fault diagnosis in tool condition monitoring.
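The idea of choosing misclassification costs by maximizing G-mean can be illustrated with a small pure-Python sketch. Everything here is an assumption made for illustration: the helper names, the toy probabilities, and in particular the coarse grid search, which stands in for the adaptive differential evolution described in the abstract. No DBN is trained; the probabilities are hypothetical classifier outputs, and both classes are assumed to be present.

```python
import math

def g_mean(y_true, y_pred):
    """Geometric mean of the true-positive rate and the true-negative rate."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    pos = sum(1 for t in y_true if t == 1)
    neg = len(y_true) - pos
    return math.sqrt((tp / pos) * (tn / neg))

def predict_with_cost(probs, cost_pos):
    """Predict class 1 when the cost-weighted positive score exceeds the
    negative score. cost_pos is the cost of missing a positive, relative
    to a cost of 1 for a false alarm."""
    return [1 if cost_pos * p > (1.0 - p) else 0 for p in probs]

# Hypothetical P(class = 1) for 8 majority (0) and 2 minority (1) samples.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
probs = [0.05, 0.10, 0.12, 0.15, 0.18, 0.22, 0.28, 0.42, 0.30, 0.45]

# Grid search over candidate costs (a simple stand-in for the optimizer).
candidates = [1.0, 1.5, 2.0, 3.0, 4.0]
best_cost = max(candidates, key=lambda c: g_mean(y_true, predict_with_cost(probs, c)))
best_score = g_mean(y_true, predict_with_cost(probs, best_cost))
```

With equal costs (`cost_pos = 1.0`) no sample in this toy set is labelled positive, so G-mean is zero; raising the minority cost trades some majority accuracy for minority recall.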
|
508
|
Lapp L, Bouamrane MM, Kavanagh K, Roper M, Young D, Schraag S. Evaluation of Random Forest and Ensemble Methods at Predicting Complications Following Cardiac Surgery. Artif Intell Med 2019. [DOI: 10.1007/978-3-030-21642-9_48] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.8] [Indexed: 12/15/2022]
|
509
|
Wang X, Li S, Tang T, Wang X, Xun J. Intelligent operation of heavy haul train with data imbalance: A machine learning method. Knowl Based Syst 2019. [DOI: 10.1016/j.knosys.2018.08.015] [Citation(s) in RCA: 21] [Impact Index Per Article: 3.5] [Indexed: 11/25/2022]
|
510
|
Abstract
The advent of DNA microarray datasets has stimulated a new line of research both in bioinformatics and in machine learning. This type of data is used to collect information from tissue and cell samples regarding gene expression differences that could be useful for disease diagnosis or for distinguishing specific types of tumor. Microarray data classification is a difficult challenge for machine learning researchers because of the high number of features and the small sample sizes. This chapter reviews the microarray databases most frequently used in the literature. We also make the interested reader aware of problematic data characteristics in this domain, such as class imbalance, data complexity, and the so-called dataset shift.
|
511
|
Affiliation(s)
- Hieu Pham
  - Department of Industrial and Manufacturing Systems Engineering, Iowa State University, Ames, Iowa
- Sigurdur Olafsson
  - Department of Industrial and Manufacturing Systems Engineering, Iowa State University, Ames, Iowa
|
512
|
Forbes JD, Chen CY, Knox NC, Marrie RA, El-Gabalawy H, de Kievit T, Alfa M, Bernstein CN, Van Domselaar G. A comparative study of the gut microbiota in immune-mediated inflammatory diseases-does a common dysbiosis exist? MICROBIOME 2018; 6:221. [PMID: 30545401 PMCID: PMC6292067 DOI: 10.1186/s40168-018-0603-4] [Citation(s) in RCA: 285] [Impact Index Per Article: 40.7] [Received: 03/28/2018] [Accepted: 11/25/2018] [Indexed: 05/12/2023]
Abstract
BACKGROUND Immune-mediated inflammatory disease (IMID) represents a substantial health concern. It is widely recognized that IMID patients are at a higher risk for developing secondary inflammation-related conditions. While an ambiguous etiology is common to all IMIDs, in recent years, considerable knowledge has emerged regarding the plausible role of the gut microbiome in IMIDs. This study used 16S rRNA gene amplicon sequencing to compare the gut microbiota of patients with Crohn's disease (CD; N = 20), ulcerative colitis (UC; N = 19), multiple sclerosis (MS; N = 19), and rheumatoid arthritis (RA; N = 21) versus healthy controls (HC; N = 23). Biological replicates were collected from participants within a 2-month interval. This study aimed to identify common (or unique) taxonomic biomarkers of IMIDs using both differential abundance testing and a machine learning approach. RESULTS Significant microbial community differences between cohorts were observed (pseudo F = 4.56; p = 0.01). Richness and diversity were significantly different between cohorts (pFDR < 0.001) and were lowest in CD while highest in HC. Abundances of Actinomyces, Eggerthella, Clostridium III, Faecalicoccus, and Streptococcus (pFDR < 0.001) were significantly higher in all disease cohorts relative to HC, whereas significantly lower abundances were observed for Gemmiger, Lachnospira, and Sporobacter (pFDR < 0.001). Several taxa were found to be differentially abundant in IMIDs versus HC including significantly higher abundances of Intestinibacter in CD, Bifidobacterium in UC, and unclassified Erysipelotrichaceae in MS and significantly lower abundances of Coprococcus in CD, Dialister in MS, and Roseburia in RA. A machine learning approach to classify disease versus HC was highest for CD (AUC = 0.93 and AUC = 0.95 for OTU and genus features, respectively) followed by MS, RA, and UC. Gemmiger and Faecalicoccus were identified as important features for classification of subjects to CD and HC. 
In general, features identified by differential abundance testing were consistent with machine learning feature importance. CONCLUSIONS This study identified several gut microbial taxa with differential abundance patterns common to IMIDs. We also found differentially abundant taxa between IMIDs. These taxa may serve as biomarkers for the detection and diagnosis of IMIDs and suggest there may be a common component to IMID etiology.
Affiliation(s)
- Jessica D. Forbes
  - Department of Internal Medicine, University of Manitoba, Winnipeg, MB Canada
  - University of Manitoba IBD Clinical and Research Centre, Winnipeg, MB Canada
  - National Microbiology Laboratory, Public Health Agency of Canada, 1015 Arlington Street, Winnipeg, MB R3E 3R2 Canada
  - Department of Medical Microbiology and Infectious Diseases, University of Manitoba, Winnipeg, MB Canada
  - Department of Laboratory Medicine and Pathobiology, University of Toronto, Toronto, Canada
- Chih-yu Chen
  - National Microbiology Laboratory, Public Health Agency of Canada, 1015 Arlington Street, Winnipeg, MB R3E 3R2 Canada
- Natalie C. Knox
  - National Microbiology Laboratory, Public Health Agency of Canada, 1015 Arlington Street, Winnipeg, MB R3E 3R2 Canada
- Ruth-Ann Marrie
  - Department of Internal Medicine, University of Manitoba, Winnipeg, MB Canada
  - Department of Community Health Sciences, University of Manitoba, Winnipeg, MB Canada
- Hani El-Gabalawy
  - Department of Internal Medicine, University of Manitoba, Winnipeg, MB Canada
  - Arthritis Centre, University of Manitoba, Winnipeg, MB Canada
- Teresa de Kievit
  - Department of Microbiology, University of Manitoba, Winnipeg, MB Canada
- Michelle Alfa
  - Department of Medical Microbiology and Infectious Diseases, University of Manitoba, Winnipeg, MB Canada
- Charles N. Bernstein
  - Department of Internal Medicine, University of Manitoba, Winnipeg, MB Canada
  - University of Manitoba IBD Clinical and Research Centre, Winnipeg, MB Canada
- Gary Van Domselaar
  - University of Manitoba IBD Clinical and Research Centre, Winnipeg, MB Canada
  - National Microbiology Laboratory, Public Health Agency of Canada, 1015 Arlington Street, Winnipeg, MB R3E 3R2 Canada
  - Department of Medical Microbiology and Infectious Diseases, University of Manitoba, Winnipeg, MB Canada
|
513
|
|
514
|
Embedding Undersampling Rotation Forest for Imbalanced Problem. COMPUTATIONAL INTELLIGENCE AND NEUROSCIENCE 2018; 2018:6798042. [PMID: 30515200 PMCID: PMC6236578 DOI: 10.1155/2018/6798042] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.6] [Received: 07/02/2018] [Revised: 09/16/2018] [Accepted: 09/26/2018] [Indexed: 11/17/2022]
Abstract
Rotation Forest is an ensemble learning approach that achieves better performance than Bagging and Boosting by building accurate and diverse classifiers in rotated feature spaces. However, like other conventional classifiers, Rotation Forest does not work well on imbalanced data, which are characterized by far fewer examples of one class (the minority class) than of the other (the majority class), and in which misclassifying minority class examples is often much more costly than the reverse. This paper proposes a novel method called Embedding Undersampling Rotation Forest (EURF) to handle this problem by (1) sampling subsets from the majority class and learning a projection matrix from each subset and (2) obtaining training sets by projecting re-undersampled subsets of the original data set onto the new spaces defined by the matrices and constructing an individual classifier from each training set. In the first step, undersampling forces the rotation matrix to better capture the features of the minority class without harming the diversity between individual classifiers. In the second step, undersampling aims to improve the performance of individual classifiers on the minority class. The experimental results show that EURF achieves significantly better performance than other state-of-the-art methods.
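The undersampling that both EURF steps rely on can be sketched in a few lines of pure Python. This is only an illustration of drawing several balanced subsets for ensemble members, under the assumption that each subset keeps all minority examples and an equally sized random draw from the majority class. The function name `balanced_subsets` and the toy data are hypothetical, and the projection-matrix learning and classifier training described in the abstract are not shown.

```python
import random

def balanced_subsets(majority, minority, n_subsets, seed=0):
    """Build n_subsets training sets, each holding every minority example
    plus an equally sized random sample (without replacement) of the
    majority class."""
    rng = random.Random(seed)
    subsets = []
    for _ in range(n_subsets):
        sampled_majority = rng.sample(majority, len(minority))
        subsets.append(sampled_majority + list(minority))
    return subsets

# Hypothetical data: 20 majority and 4 minority examples, tagged by class.
majority = [("maj", i) for i in range(20)]
minority = [("min", i) for i in range(4)]
subsets = balanced_subsets(majority, minority, n_subsets=5)
```

Each subset is balanced, while different random draws from the majority class give the ensemble members different views of the data.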
|
515
|
Kumar S, Biswas SK, Devi D. TLUSBoost algorithm: a boosting solution for class imbalance problem. Soft comput 2018. [DOI: 10.1007/s00500-018-3629-4] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.1] [Indexed: 11/28/2022]
|
516
|
Bayisa FL, Liu X, Garpebring A, Yu J. Statistical learning in computed tomography image estimation. Med Phys 2018; 45:5450-5460. [PMID: 30242845 DOI: 10.1002/mp.13204] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.6] [Received: 01/11/2018] [Revised: 08/08/2018] [Accepted: 09/06/2018] [Indexed: 01/25/2023] Open
Abstract
PURPOSE There is increasing interest in computed tomography (CT) image estimations from magnetic resonance (MR) images. The estimated CT images can be utilized for attenuation correction, patient positioning, and dose planning in diagnostic and radiotherapy workflows. This study aims to introduce a novel statistical learning approach for improving CT estimation from MR images and to compare the performance of our method with the existing model-based CT image estimation methods. METHODS The statistical learning approach proposed here consists of two stages. At the training stage, prior knowledge about tissue types from CT images was used together with a Gaussian mixture model (GMM) to explore CT image estimations from MR images. Since the prior knowledge is not available at the prediction stage, a classifier based on the RUSBoost algorithm was trained to estimate the tissue types from MR images. For a new patient, the trained classifier and GMMs were used to predict the CT image from MR images. The classifier and GMMs were validated by using voxel-level tenfold cross-validation and patient-level leave-one-out cross-validation, respectively. RESULTS The proposed approach outperforms the existing model-based methods in CT estimation quality, especially on bone tissues. Our method improved CT image estimation by 5% and 23% on the whole brain and bone tissues, respectively. CONCLUSIONS Evaluation of our method shows that it is a promising method to generate CT image substitutes for the implementation of fully MR-based radiotherapy and PET/MRI applications.
Affiliation(s)
- Fekadu L Bayisa
  - Department of Mathematics and Mathematical Statistics, Umeå University, Umeå, 901 87, Sweden
- Xijia Liu
  - Department of Mathematics and Mathematical Statistics, Umeå University, Umeå, 901 87, Sweden
- Anders Garpebring
  - Department of Radiation Sciences, Umeå University, Umeå, 901 87, Sweden
- Jun Yu
  - Department of Mathematics and Mathematical Statistics, Umeå University, Umeå, 901 87, Sweden
|
517
|
Zhu M, Li Y, Wang Y. Design and experiment verification of a novel analysis framework for recognition of driver injury patterns: From a multi-class classification perspective. ACCIDENT; ANALYSIS AND PREVENTION 2018; 120:152-164. [PMID: 30138770 DOI: 10.1016/j.aap.2018.08.011] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.4] [Received: 01/05/2018] [Revised: 06/22/2018] [Accepted: 08/11/2018] [Indexed: 06/08/2023]
Abstract
Detecting driver injury patterns is a typical classification problem. Crash data sets are highly skewed: fatalities and severe injuries are often underrepresented compared to other events. The severity prediction performance of existing models is poor because the samples of different severity levels within a given dataset are highly imbalanced. This paper proposes a machine learning based analysis framework, from a multi-class classification perspective, for accurate recognition of driver injury patterns. The proposed framework includes preprocessing, classification, evaluation, and application of a given dataset. The framework is verified on three years of single-vehicle run-off-road (ROR) crash records collected in Washington State from 2011 to 2013. First, the thirteen most important safety-related variables are identified through random forests. Then, the four driver injury severity levels (fatal/serious injury, evident injury, possible injury, and no injury) are predicted by integrating the decomposed binary neural network models to achieve better performance. Finally, a sensitivity analysis is carried out to interpret the variables' impacts on the decomposed injury severity levels. The study shows that lack of restraint, female drivers, truck usage, driver impairment, driver distraction, vehicle overturn (rollover), dawn/dusk, and overtaking are the leading factors contributing to driver fatalities or severe injuries in a single-vehicle ROR crash. Most of the findings are consistent with previous studies. The experimental results validate the effectiveness of the proposed framework, which can be further applied for pattern recognition in traffic safety research.
Affiliation(s)
- Mengtao Zhu
  - School of Information and Electronics, Beijing Institute of Technology, Beijing, 100081, PR China
- Yunjie Li
  - School of Information and Electronics, Beijing Institute of Technology, Beijing, 100081, PR China; Department of Civil and Environmental Engineering, University of Washington, Seattle, WA, 98195, USA
- Yinhai Wang
  - Department of Civil and Environmental Engineering, University of Washington, Seattle, WA, 98195, USA
|
518
|
Segura-Bedmar I, Colón-Ruíz C, Tejedor-Alonso MÁ, Moro-Moro M. Predicting of anaphylaxis in big data EMR by exploring machine learning approaches. J Biomed Inform 2018; 87:50-59. [DOI: 10.1016/j.jbi.2018.09.012] [Citation(s) in RCA: 25] [Impact Index Per Article: 3.6] [Received: 01/04/2018] [Revised: 08/31/2018] [Accepted: 09/24/2018] [Indexed: 11/26/2022]
|
519
|
Quality flaw prediction in Spanish Wikipedia: A case of study with verifiability flaws. Inf Process Manag 2018. [DOI: 10.1016/j.ipm.2018.08.003] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.1] [Indexed: 11/22/2022]
|
520
|
Abstract
Class imbalance is very common in the real world. However, conventional methods do not work well on imbalanced data because of the skewed class distribution. This paper proposes a simple but effective Hybrid-based Ensemble (HE) to deal with two-class imbalanced problems. HE learns a hybrid ensemble in two stages: (1) learning several projection matrices from the rebalanced data obtained by under-sampling the original training set, and constructing new training sets by projecting the original training set onto the different spaces defined by the matrices; and (2) under-sampling several subsets from each new training set and training a model on each subset. Here, feature projection aims to improve the diversity between ensemble members, and under-sampling aims to improve the generalization ability of individual members on the minority class. Experimental results show that HE performs significantly better than other state-of-the-art methods on AUC, G-mean, F-measure, and recall.
Affiliation(s)
- Huaping Guo
  - School of Computer and Information Technology, Xinyang Normal University, Xinyang, Henan 464000, P. R. China
- Jun Zhou
  - School of Computer and Information Technology, Xinyang Normal University, Xinyang, Henan 464000, P. R. China
- Chang-an Wu
  - School of Computer and Information Technology, Xinyang Normal University, Xinyang, Henan 464000, P. R. China
- Wei She
  - School of Software Technology, Zhengzhou University, Zhengzhou, Henan 450000, P. R. China
|
521
|
Mousavi R, Eftekhari M, Rahdari F. Omni-Ensemble Learning (OEL): Utilizing Over-Bagging, Static and Dynamic Ensemble Selection Approaches for Software Defect Prediction. INT J ARTIF INTELL T 2018. [DOI: 10.1142/s0218213018500240] [Citation(s) in RCA: 15] [Impact Index Per Article: 2.1] [Indexed: 11/18/2022]
Abstract
Machine learning methods in software engineering are becoming increasingly important as they can improve quality and testing efficiency by constructing models to predict defects in software modules. The existing datasets for software defect prediction suffer from an imbalanced class distribution, which makes the learning problem harder. In this paper, we propose a novel approach that integrates Over-Bagging with static and dynamic ensemble selection strategies. The proposed method, called Omni-Ensemble Learning (OEL), utilizes most ensemble learning approaches. It exploits a new Over-Bagging method for class imbalance learning, in which the effect of three different methods of assigning weights to training samples is investigated. The proposed method first selects the best classifiers, along with their combiner, for all test samples through a Genetic Algorithm as the static ensemble selection approach. Then, a subset of the selected classifiers is chosen for each test sample as the dynamic ensemble selection. Our experiments confirm that the proposed OEL provides better overall performance (in terms of G-mean, balance, and AUC measures) than six other related works and six multiple classifier systems over seven NASA datasets. Based on these experimental results, we generally recommend OEL for improving the performance of software defect prediction and similar problems.
Affiliation(s)
- Reza Mousavi
  - Department of Biochemistry & Molecular Medicine, George Washington University, Washington, DC 20037, USA
- Mahdi Eftekhari
  - Department of Computer Engineering, Shahid Bahonar University of Kerman, Kerman, Iran
- Farhad Rahdari
  - Department of Computer and IT, Institute of Science and High Technology and Environmental Sciences, Graduate University of Advanced Technology, Kerman, Iran
|
522
|
Liu G, Yang Y, Li B. Fuzzy rule-based oversampling technique for imbalanced and incomplete data learning. Knowl Based Syst 2018. [DOI: 10.1016/j.knosys.2018.05.044] [Citation(s) in RCA: 20] [Impact Index Per Article: 2.9] [Indexed: 10/14/2022]
|
523
|
Guermazi R, Chaabane I, Hammami M. AECID: Asymmetric entropy for classifying imbalanced data. Inf Sci (N Y) 2018. [DOI: 10.1016/j.ins.2018.07.076] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.1] [Indexed: 10/28/2022]
|
524
|
Douzas G, Bacao F, Last F. Improving imbalanced learning through a heuristic oversampling method based on k-means and SMOTE. Inf Sci (N Y) 2018. [DOI: 10.1016/j.ins.2018.06.056] [Citation(s) in RCA: 167] [Impact Index Per Article: 23.9] [Indexed: 10/28/2022]
|
525
|
Sharma A, Rani R. BE-DTI': Ensemble framework for drug target interaction prediction using dimensionality reduction and active learning. COMPUTER METHODS AND PROGRAMS IN BIOMEDICINE 2018; 165:151-162. [PMID: 30337070 DOI: 10.1016/j.cmpb.2018.08.011] [Citation(s) in RCA: 21] [Impact Index Per Article: 3.0] [Received: 06/12/2018] [Revised: 08/03/2018] [Accepted: 08/17/2018] [Indexed: 06/08/2023]
Abstract
BACKGROUND AND OBJECTIVE Drug-target interaction prediction plays an intrinsic role in the drug discovery process. Prediction of novel drugs and targets helps in identifying optimal drug therapies for various stringent diseases. Computational prediction of drug-target interactions can help to identify potential drug-target pairs and speed up the process of drug repositioning. In the present work, we focus on machine learning algorithms for predicting drug-target interactions from the pool of existing drug-target data. The key idea is to train the classifier on existing drug-target interactions (DTI) so as to predict new or unknown DTI. However, challenges such as class imbalance and the high-dimensional nature of the data need to be addressed before developing an optimal drug-target interaction model. METHODS In this paper, we propose a bagging-based ensemble framework named BE-DTI' for drug-target interaction prediction, using dimensionality reduction and active learning to deal with class-imbalanced data. Active learning helps to improve under-sampling bagging-based ensembles. Dimensionality reduction is used to deal with high-dimensional data. RESULTS Results show that the proposed technique outperforms five competing methods in 10-fold cross-validation experiments, achieving AUC = 0.927, sensitivity = 0.886, specificity = 0.864, and G-mean = 0.874. CONCLUSION Missing interactions and new interactions are predicted using the proposed framework. Some of the known interactions are removed from the original dataset and then re-predicted to check the accuracy of the proposed framework. Moreover, the proposed approach is validated on an external dataset. All these results show that structurally similar drugs tend to interact with similar targets.
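For readers unfamiliar with G-mean, the reported figures can be related with a short worked calculation. This assumes the usual definition of G-mean as the geometric mean of sensitivity and specificity; the abstract does not state how the metric was aggregated across folds.

```latex
\[
\text{G-mean} = \sqrt{\text{Sensitivity} \times \text{Specificity}}
             = \sqrt{0.886 \times 0.864}
             = \sqrt{0.765504}
             \approx 0.8749
\]
```

This is in line with the reported G-mean of 0.874; a small difference in the last digit is plausible if the metric was computed per fold and then averaged, or truncated rather than rounded.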
Affiliation(s)
- Aman Sharma
  - Computer Science and Engineering Department, Thapar Institute of Engineering & Technology, Punjab, Patiala, India
- Rinkle Rani
  - Computer Science and Engineering Department, Thapar Institute of Engineering & Technology, Punjab, Patiala, India
|
526
|
Wang S, Minku LL, Yao X. A Systematic Study of Online Class Imbalance Learning With Concept Drift. IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS 2018; 29:4802-4821. [PMID: 29993955 DOI: 10.1109/tnnls.2017.2771290] [Citation(s) in RCA: 49] [Impact Index Per Article: 7.0] [Indexed: 06/08/2023]
Abstract
As an emerging research topic, online class imbalance learning often combines the challenges of both class imbalance and concept drift. It deals with data streams having very skewed class distributions, where concept drift may occur. It has recently received increased research attention; however, very little work addresses the combined problem where both class imbalance and concept drift coexist. As the first systematic study of handling concept drift in class-imbalanced data streams, this paper first provides a comprehensive review of current research progress in this field, including current research focuses and open challenges. Then, an in-depth experimental study is performed, with the goal of understanding how to best overcome concept drift in online learning with class imbalance.
|
527
|
Guo H, Zhou J, Wu CA, She W, Xu M. Ensemble based on feature projection and under-sampling for imbalanced learning. INTELL DATA ANAL 2018. [DOI: 10.3233/ida-173505] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.4] [Indexed: 11/15/2022]
Affiliation(s)
- Huaping Guo
- School of Computer and Information Technology, Xinyang Normal University, Xinyang 464000, Henan, China
| | - Jun Zhou
- School of Computer and Information Technology, Xinyang Normal University, Xinyang 464000, Henan, China
| | - Chang-an Wu
- School of Computer and Information Technology, Xinyang Normal University, Xinyang 464000, Henan, China
| | - Wei She
- School of Information Software Technology, Zhengzhou University, Zhengzhou 450001, Henan, China
| | - Mingliang Xu
- School of Information Engineering, Zhengzhou University, Zhengzhou 450001, Henan, China
|
528
|
Binary teaching–learning-based optimization algorithm with a new update mechanism for sample subset optimization in software defect prediction. Soft comput 2018. [DOI: 10.1007/s00500-018-3546-6] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/28/2022]
|
529
|
Orchard P, Agakova A, Pinnock H, Burton CD, Sarran C, Agakov F, McKinstry B. Improving Prediction of Risk of Hospital Admission in Chronic Obstructive Pulmonary Disease: Application of Machine Learning to Telemonitoring Data. J Med Internet Res 2018; 20:e263. [PMID: 30249589 PMCID: PMC6231768 DOI: 10.2196/jmir.9227] [Citation(s) in RCA: 26] [Impact Index Per Article: 3.7] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/30/2017] [Revised: 04/19/2018] [Accepted: 06/18/2018] [Indexed: 11/22/2022] Open
Abstract
BACKGROUND Telemonitoring of symptoms and physiological signs has been suggested as a means of early detection of chronic obstructive pulmonary disease (COPD) exacerbations, with a view to instituting timely treatment. However, algorithms to identify exacerbations result in frequent false-positive results and increased workload. Machine learning, when applied to predictive modelling, can determine patterns of risk factors useful for improving prediction quality. OBJECTIVE Our objectives were to (1) establish whether machine learning techniques applied to telemonitoring datasets improve prediction of hospital admissions and decisions to start corticosteroids, and (2) determine whether the addition of weather data further improves such predictions. METHODS We used daily symptoms, physiological measures, and medication data, with baseline demography, COPD severity, quality of life, and hospital admissions from a pilot and large randomized controlled trial of telemonitoring in COPD. We linked weather data from the United Kingdom meteorological service. We used feature selection and extraction techniques for time series to construct up to 153 predictive patterns (features) from symptom, medication, and physiological measurements. We used the resulting variables to construct predictive models fitted to training sets of patients and compared them with common symptom-counting algorithms. RESULTS We had a mean 363 days of telemonitoring data from 135 patients. The two most practical traditional score-counting algorithms, restricted to cases with complete data, resulted in area under the receiver operating characteristic curve (AUC) estimates of 0.60 (95% CI 0.51-0.69) and 0.58 (95% CI 0.50-0.67) for predicting admissions based on a single day's readings. 
However, in a real-world scenario allowing for missing data, with greater numbers of patient daily data and hospitalizations (N=57,150, N+=55, respectively), the performance of all the traditional algorithms fell, including those based on 2 days' data. One of the most frequently used algorithms performed no better than chance. All considered machine learning models demonstrated significant improvements; the best machine learning algorithm based on 57,150 episodes resulted in an aggregated AUC of 0.74 (95% CI 0.67-0.80). Adding weather data measurements did not improve the predictive performance of the best model (AUC 0.74, 95% CI 0.69-0.79). To achieve an 80% true-positive rate (sensitivity), the traditional algorithms were associated with an 80% false-positive rate: our algorithm halved this rate to approximately 40% (specificity approximately 60%). The machine learning algorithm was moderately superior to the best symptom-counting algorithm (AUC 0.77, 95% CI 0.74-0.79 vs AUC 0.66, 95% CI 0.63-0.68) at predicting the need for corticosteroids. CONCLUSIONS Early detection and management of COPD remains an important goal given its huge personal and economic costs. Machine learning approaches, which can be tailored to an individual's baseline profile and can learn from experience of the individual patient, are superior to existing predictive algorithms and show promise in achieving this goal. TRIAL REGISTRATION International Standard Randomized Controlled Trial Number ISRCTN96634935; http://www.isrctn.com/ISRCTN96634935 (Archived by WebCite at http://www.webcitation.org/722YkuhAz).
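The trade-off quoted in the results (an 80% true-positive rate at an 80% versus roughly 40% false-positive rate) can be made concrete with a small calculation. The sketch below is illustrative only: it assumes that specificity is one minus the false-positive rate, and it applies both rates to the 57,150 patient-days and 55 admissions reported above as a hypothetical workload. The function names are not from the paper.

```python
def specificity(false_positive_rate: float) -> float:
    """Specificity (true-negative rate) is the complement of the false-positive rate."""
    return 1.0 - false_positive_rate


def expected_false_alarms(n_total: int, n_positive: int, false_positive_rate: float) -> int:
    """Expected number of negative patient-days flagged as positive, rounded."""
    n_negative = n_total - n_positive
    return round(n_negative * false_positive_rate)


# Figures quoted in the abstract; applying both rates to one workload is an assumption.
N_TOTAL, N_POSITIVE = 57_150, 55

traditional_alarms = expected_false_alarms(N_TOTAL, N_POSITIVE, 0.80)
ml_alarms = expected_false_alarms(N_TOTAL, N_POSITIVE, 0.40)
ml_specificity = specificity(0.40)
```

Under these assumptions, halving the false-positive rate halves the number of false alarms that staff would need to review at the same sensitivity.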
Affiliation(s)
- Hilary Pinnock
- Usher Institute of Population Health Sciences and Informatics, University of Edinburgh, Edinburgh, United Kingdom
| | - Brian McKinstry
- Usher Institute of Population Health Sciences and Informatics, University of Edinburgh, Edinburgh, United Kingdom
|
530
|
Abstract
Most conventional classification learning methods assume a relatively balanced class distribution and therefore perform poorly on data with an imbalanced class distribution. This paper proposes a novel classification method based on data partition and SMOTE for imbalanced learning. The proposed method differs from conventional ones in both the learning and prediction stages. In the learning stage, the proposed method uses the following three steps to learn a class-imbalance-oriented model: (1) partitioning the majority class into several clusters using data partition methods such as K-Means, (2) constructing a novel training set using SMOTE on each data set obtained by merging each cluster with the minority class, and (3) learning a classification model on each training set using conventional classification learning methods including decision tree, SVM and neural network. In this way, a classifier repository consisting of several classification models is constructed. In the prediction stage, for a given example to be classified, the proposed method uses the partition model constructed in the learning stage to select a model from the classifier repository to predict the example. Comprehensive experiments on KEEL data sets show that the proposed method outperforms some other existing methods on the evaluation measures of recall, g-mean, f-measure and AUC.
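The learning and prediction stages described in this abstract can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the oversampler is a simplified SMOTE-like interpolation between random minority pairs (real SMOTE interpolates towards k nearest neighbours), the per-cluster model is a nearest-class-mean classifier standing in for the decision tree, SVM or neural network, and all function names are hypothetical.

```python
import random
from math import dist


def mean(points):
    """Coordinate-wise mean of a non-empty list of tuples."""
    n = len(points)
    return tuple(sum(c) / n for c in zip(*points))


def nearest(p, centroids):
    """Index of the centroid closest to point p."""
    return min(range(len(centroids)), key=lambda i: dist(p, centroids[i]))


def kmeans(points, k, iters=10, seed=0):
    """Plain Lloyd's k-means; returns the list of centroids."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[nearest(p, centroids)].append(p)
        # Keep the old centroid if a cluster ends up empty.
        centroids = [mean(g) if g else c for g, c in zip(groups, centroids)]
    return centroids


def oversample(minority, target, rng):
    """SMOTE-like stand-in: interpolate between two random minority points."""
    synthetic = list(minority)
    while len(synthetic) < target:
        a, b = rng.sample(minority, 2)
        t = rng.random()
        synthetic.append(tuple(x + t * (y - x) for x, y in zip(a, b)))
    return synthetic


def fit(majority, minority, k=2, seed=0):
    """Steps (1)-(3): partition the majority class, balance each part, fit one model per part."""
    rng = random.Random(seed)
    centroids = kmeans(majority, k, seed=seed)
    models = []
    for i in range(k):
        cluster = [p for p in majority if nearest(p, centroids) == i]
        positives = oversample(minority, max(len(cluster), len(minority)), rng)
        # Stand-in model: one mean per class (negative, positive).
        models.append((mean(cluster) if cluster else centroids[i], mean(positives)))
    return centroids, models


def predict(x, centroids, models):
    """Prediction stage: route x to its partition, then use that partition's model."""
    negative_mean, positive_mean = models[nearest(x, centroids)]
    return 1 if dist(x, positive_mean) < dist(x, negative_mean) else 0
```

A usage example would call `fit(majority, minority)` on lists of coordinate tuples and then `predict(x, centroids, models)` for each test point; the minority class needs at least two points for the interpolation step.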
|
531
|
FCN-based approach for the automatic segmentation of bone surfaces in ultrasound images. Int J Comput Assist Radiol Surg 2018; 13:1707-1716. [DOI: 10.1007/s11548-018-1856-x] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.7] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/10/2018] [Accepted: 09/03/2018] [Indexed: 01/17/2023]
|
532
|
Li Z, Xie W, Liu T. Efficient feature selection and classification for microarray data. PLoS One 2018; 13:e0202167. [PMID: 30125332 PMCID: PMC6101392 DOI: 10.1371/journal.pone.0202167] [Citation(s) in RCA: 50] [Impact Index Per Article: 7.1] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/23/2017] [Accepted: 07/30/2018] [Indexed: 11/19/2022] Open
Abstract
Feature selection and classification are the main topics in microarray data analysis. Although many feature selection methods have been proposed and developed in this field, SVM-RFE (Support Vector Machine based on Recursive Feature Elimination) has proved to be one of the best: it ranks the features (genes) by training a support vector machine classification model and selects key genes using a recursive feature elimination strategy. The principal drawback of SVM-RFE is its huge time consumption. To overcome this limitation, we introduce a more efficient implementation of linear support vector machines, improve the recursive feature elimination strategy, and combine the two to select informative genes. In addition, we propose a simple resampling method to preprocess the datasets, which balances the information distribution of different kinds of samples and makes the classification results more credible. Moreover, the applicability of four common classifiers is also studied in this paper. Extensive experiments are conducted on the six most frequently used microarray datasets in this field, and the results show that the proposed methods not only greatly reduce the time consumption but also obtain comparable classification performance.
Affiliation(s)
- Zifa Li
- Department of Computer Science and Technology, Huaqiao University, Xiamen, Fujian, China
| | - Weibo Xie
- Department of Computer Science and Technology, Huaqiao University, Xiamen, Fujian, China
| | - Tao Liu
- Department of Computer Science and Technology, Huaqiao University, Xiamen, Fujian, China
|
533
|
Wolfslag WJ, Bharatheesha M, Moerland TM, Wisse M. RRT-CoLearn: Towards Kinodynamic Planning Without Numerical Trajectory Optimization. IEEE Robot Autom Lett 2018. [DOI: 10.1109/lra.2018.2801470] [Citation(s) in RCA: 16] [Impact Index Per Article: 2.3] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/10/2022]
|
534
|
Ding X, Bucholc M, Wang H, Glass DH, Wang H, Clarke DH, Bjourson AJ, Dowey LRC, O'Kane M, Prasad G, Maguire L, Wong-Lin K. A hybrid computational approach for efficient Alzheimer's disease classification based on heterogeneous data. Sci Rep 2018; 8:9774. [PMID: 29950585 PMCID: PMC6021389 DOI: 10.1038/s41598-018-27997-8] [Citation(s) in RCA: 26] [Impact Index Per Article: 3.7] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/01/2017] [Accepted: 06/12/2018] [Indexed: 12/20/2022] Open
Abstract
There is currently a lack of an efficient, objective and systemic approach towards the classification of Alzheimer's disease (AD), due to its complex etiology and pathogenesis. As AD is inherently dynamic, it is also not clear how the relationships among AD indicators vary over time. To address these issues, we propose a hybrid computational approach for AD classification and evaluate it on the heterogeneous longitudinal AIBL dataset. Specifically, using clinical dementia rating as an index of AD severity, the most important indicators (mini-mental state examination, logical memory recall, grey matter and cerebrospinal volumes from MRI and active voxels from PiB-PET brain scans, ApoE, and age) can be automatically identified from parallel data mining algorithms. In this work, Bayesian network modelling across different time points is used to identify and visualize time-varying relationships among the significant features, and importantly, in an efficient way using only coarse-grained data. Crucially, our approach suggests key data features and their appropriate combinations that are relevant for AD severity classification with high accuracy. Overall, our study provides insights into AD developments and demonstrates the potential of our approach in supporting efficient AD diagnosis.
Affiliation(s)
- Xuemei Ding
- Intelligent Systems Research Centre, Ulster University, Magee Campus, Derry~Londonderry, Northern Ireland, UK.
- Faculty of Mathematics and Informatics, Fujian Normal University, Fuzhou, China.
| | - Magda Bucholc
- Intelligent Systems Research Centre, Ulster University, Magee Campus, Derry~Londonderry, Northern Ireland, UK
| | - Haiying Wang
- School of Computing and Mathematics, Ulster University, Jordanstown Campus, Northern Ireland, UK
| | - David H Glass
- School of Computing and Mathematics, Ulster University, Jordanstown Campus, Northern Ireland, UK
| | - Hui Wang
- School of Computing and Mathematics, Ulster University, Jordanstown Campus, Northern Ireland, UK
| | - Dave H Clarke
- Clarke Analytics Ltd., 6 Dernville, Annabella Mallow, Cork, Ireland
| | - Anthony John Bjourson
- Northern Ireland Centre for Stratified Medicine, Biomedical Sciences Research Institute, C-TRIC, Ulster University, Altnagelvin Hospital, Derry~Londonderry, Northern Ireland, UK
| | - Le Roy C Dowey
- C-TRIC, Altnagelvin Hospital campus, Derry~Londonderry, Northern Ireland, UK
- School of Biomedical Sciences, Ulster University, Coleraine Campus, Northern Ireland, UK
| | - Maurice O'Kane
- C-TRIC, Altnagelvin Hospital campus, Derry~Londonderry, Northern Ireland, UK
| | - Girijesh Prasad
- Intelligent Systems Research Centre, Ulster University, Magee Campus, Derry~Londonderry, Northern Ireland, UK
| | - Liam Maguire
- Intelligent Systems Research Centre, Ulster University, Magee Campus, Derry~Londonderry, Northern Ireland, UK
| | - KongFatt Wong-Lin
- Intelligent Systems Research Centre, Ulster University, Magee Campus, Derry~Londonderry, Northern Ireland, UK.
|
535
|
Regularized fisher linear discriminant through two threshold variation strategies for imbalanced problems. Knowl Based Syst 2018. [DOI: 10.1016/j.knosys.2018.02.035] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/23/2022]
|
536
|
Ensemble of Rotation Trees for Imbalanced Medical Datasets. JOURNAL OF HEALTHCARE ENGINEERING 2018; 2018:8902981. [PMID: 29850005 PMCID: PMC5914103 DOI: 10.1155/2018/8902981] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 08/22/2017] [Revised: 02/08/2018] [Accepted: 02/11/2018] [Indexed: 11/23/2022]
Abstract
Medical datasets are often predominantly composed of “normal” examples with only a small percentage of “abnormal” ones, and correctly recognizing the abnormal examples is very important. However, conventional classification learning methods pursue high accuracy by assuming that all classes have a similar number of examples, so abnormal class examples are usually ignored and misclassified as normal ones. In this paper, we propose a simple but effective ensemble method called ensemble of rotation trees (ERT) to handle this problem in imbalanced medical datasets. ERT learns an ensemble through the following four stages: (1) undersampling subsets from the normal class, (2) obtaining new balanced training sets by combining each subset with the abnormal class, (3) inducing a rotation matrix on a randomly sampled subset of each new balanced set, and (4) learning a decision tree on each balanced training set in the corresponding rotation matrix space. Here, the rotation matrix mainly serves to improve the diversity between ensemble members, and the undersampling technique aims to improve the performance of the learned models on the abnormal class. Experimental results show that, compared with other state-of-the-art methods, ERT shows significantly better performance on imbalanced medical datasets.
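Stages (1) and (2) of the procedure above, drawing undersampled subsets of the normal class and pairing each with the full abnormal class, can be sketched as follows. This is a hedged illustration only: the rotation matrix and decision tree stages are omitted, sampling without replacement within each subset is an assumption, and the function name `balanced_subsets` is hypothetical.

```python
import random


def balanced_subsets(normal, abnormal, n_subsets, seed=0):
    """Build n_subsets balanced training sets of (example, label) pairs.

    Each set holds every abnormal example (label 1) plus an equally sized
    random sample of normal examples (label 0).
    """
    rng = random.Random(seed)
    size = min(len(abnormal), len(normal))
    subsets = []
    for _ in range(n_subsets):
        sampled = rng.sample(normal, size)  # without replacement within a subset
        subsets.append([(x, 0) for x in sampled] + [(x, 1) for x in abnormal])
    return subsets
```

Because each subset draws a different sample of normal examples, the ensemble members see different majority-class data, which is one source of the diversity the abstract mentions.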
|
537
|
Cao C, Wang Z. IMCStacking: Cost-sensitive stacking learning with feature inverse mapping for imbalanced problems. Knowl Based Syst 2018. [DOI: 10.1016/j.knosys.2018.02.031] [Citation(s) in RCA: 30] [Impact Index Per Article: 4.3] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/29/2022]
|
538
|
Zhu Y, Wang Z, Zha H, Gao D, Wang Z, Gao D, Zhu Y, Zha H. Boundary-Eliminated Pseudoinverse Linear Discriminant for Imbalanced Problems. IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS 2018; 29:2581-2594. [PMID: 28534789 DOI: 10.1109/tnnls.2017.2676239] [Citation(s) in RCA: 18] [Impact Index Per Article: 2.6] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/07/2023]
Abstract
Existing learning models for classification of imbalanced data sets can be grouped as either boundary-based or nonboundary-based depending on whether a decision hyperplane is used in the learning process. The focus of this paper is a new approach that leverages the advantage of both approaches. Specifically, our new model partitions the input space into three parts by creating two additional boundaries in the training process, and then makes the final decision based on a heuristic measurement between the test sample and a subset of selected training samples. Since the original hyperplane used by the underlying original classifier will be eliminated, the proposed model is named the boundary-eliminated (BE) model. Additionally, the pseudoinverse linear discriminant (PILD) is adopted for the BE model so as to obtain a novel classifier abbreviated as BEPILD. Experiments validate both the effectiveness and the efficiency of BEPILD, compared with 13 state-of-the-art classification methods, based on 31 imbalanced and 7 standard data sets.
|
539
|
Zhao Y, Wong ZSY, Tsui KL. A Framework of Rebalancing Imbalanced Healthcare Data for Rare Events' Classification: A Case of Look-Alike Sound-Alike Mix-Up Incident Detection. JOURNAL OF HEALTHCARE ENGINEERING 2018; 2018:6275435. [PMID: 29951182 PMCID: PMC5987310 DOI: 10.1155/2018/6275435] [Citation(s) in RCA: 26] [Impact Index Per Article: 3.7] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 09/11/2017] [Revised: 02/02/2018] [Accepted: 02/22/2018] [Indexed: 11/17/2022]
Abstract
Identifying rare but significant healthcare events in massive unstructured datasets has become a common task in healthcare data analytics. However, imbalanced class distribution in many practical datasets greatly hampers the detection of rare events, as most classification methods implicitly assume an equal occurrence of classes and are designed to maximize the overall classification accuracy. In this study, we develop a framework for learning from healthcare data with imbalanced distribution by incorporating different rebalancing strategies. The evaluation results showed that the developed framework can significantly improve the detection accuracy of medical incidents due to look-alike sound-alike (LASA) mix-ups. Specifically, logistic regression combined with the synthetic minority oversampling technique (SMOTE) produces the best detection results, with a significant 45.3% relative increase in recall (recall = 75.7%) compared with pure logistic regression (recall = 52.1%).
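The 45.3% figure is consistent with a relative increase rather than a difference in percentage points, as a quick check shows. The snippet below uses only the two recall values quoted in the abstract; the function name is illustrative.

```python
def relative_increase(new: float, old: float) -> float:
    """Relative change of new over old, as a fraction of old."""
    return (new - old) / old


recall_smote = 75.7     # logistic regression + SMOTE, in percent
recall_baseline = 52.1  # pure logistic regression, in percent

relative_gain = relative_increase(recall_smote, recall_baseline)  # about 0.453
absolute_gain = recall_smote - recall_baseline                    # about 23.6 points
```

In other words, recall rose by about 23.6 percentage points, which is about 45.3% of the baseline value.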
Affiliation(s)
- Yang Zhao
- Department of Systems Engineering and Engineering Management, City University of Hong Kong, Kowloon, Hong Kong
| | - Zoie Shui-Yee Wong
- Graduate School of Public Health, St. Luke's International University, Tokyo, Japan
| | - Kwok Leung Tsui
- Department of Systems Engineering and Engineering Management, City University of Hong Kong, Kowloon, Hong Kong
|
540
|
Sun L, Xing X, Zhou Y, Hu X. Demand Forecasting for Petrol Products in Gas Stations Using Clustering and Decision Tree. JOURNAL OF ADVANCED COMPUTATIONAL INTELLIGENCE AND INTELLIGENT INFORMATICS 2018. [DOI: 10.20965/jaciii.2018.p0387] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/09/2022]
Abstract
Demand forecasting for petrol products in gas stations is crucial to the planning of initiative distribution of petrol products, especially to the stability of product supply in petroleum companies. In this paper, a novel scheme of demand forecasting based on clustering and a decision tree is proposed, which uses a decision tree and integrates the results of clustering validity indices. First, the proposed scheme uses a k-means algorithm to divide the sales data into multiple disjoint clusters, evaluates the clustering result of the daily sales curve of a product according to seven validity indices and determines the optimal number of clusters. Next, the relationship between the sales pattern and the relevant influence factors is described using a decision tree, which can assign a future day’s sales pattern, given these factors, to the most suitable cluster in order to predict the quantity of demand and the peak demand time windows for each gas station. Finally, three months’ worth of sales data is collected from a gas station in Dalian city, China, to illustrate the proposed forecasting scheme. Experimental results demonstrate that the scheme is an effective alternative for demand forecasting for petrol products because it outperforms three other selected methods.
|
541
|
Class Imbalance Ensemble Learning Based on the Margin Theory. APPLIED SCIENCES-BASEL 2018. [DOI: 10.3390/app8050815] [Citation(s) in RCA: 83] [Impact Index Per Article: 11.9] [Reference Citation Analysis] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 01/23/2023]
|
542
|
Shao M, Ma J, Wang S. DeepBound: accurate identification of transcript boundaries via deep convolutional neural fields. Bioinformatics 2018; 33:i267-i273. [PMID: 28881999 PMCID: PMC5870651 DOI: 10.1093/bioinformatics/btx267] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/30/2022] Open
Abstract
Motivation Reconstructing the full-length expressed transcripts (a.k.a. the transcript assembly problem) from the short sequencing reads produced by the RNA-seq protocol plays a central role in identifying novel genes and transcripts as well as in studying gene expression and gene function. A crucial step in transcript assembly is to accurately determine the splicing junctions and boundaries of the expressed transcripts from the read alignments. In contrast to the splicing junctions, which can be efficiently detected from spliced reads, the problem of identifying boundaries remains open and challenging because the signal related to boundaries is noisy and weak. Results We present DeepBound, an effective approach to identify boundaries of expressed transcripts from RNA-seq read alignments. At its core, DeepBound employs deep convolutional neural fields to learn the hidden distributions and patterns of boundaries. To accurately model the transition probabilities and to solve the label-imbalance problem, we take the novel step of incorporating the AUC (area under the curve) score into the objective function being optimized. To address the issue that deep probabilistic graphical models require a large number of labeled training samples, we propose to use simulated RNA-seq datasets to train our model. Through extensive experimental studies on both simulation datasets of two species and biological datasets, we show that DeepBound consistently and significantly outperforms the two existing methods. Availability and implementation DeepBound is freely available at https://github.com/realbigws/DeepBound.
Affiliation(s)
- Mingfu Shao
- Department of Computational Biology, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, USA
| | - Jianzhu Ma
- School of Medicine, University of California San Diego, La Jolla, CA, USA
| | - Sheng Wang
- Computational Bioscience Research Center (CBRC), Computer, Electrical and Mathematical Sciences and Engineering (CEMSE) Division, King Abdullah University of Science and Technology (KAUST), Thuwal, Saudi Arabia
|
543
|
López-Valenciano A, Ayala F, Puerta JM, De Ste Croix M, Vera-García F, Hernández-Sánchez S, Ruiz-Pérez I, Myer G. A Preventive Model for Muscle Injuries: A Novel Approach based on Learning Algorithms. Med Sci Sports Exerc 2018; 50:915-927. [PMID: 29283933 PMCID: PMC6582363 DOI: 10.1249/mss.0000000000001535] [Citation(s) in RCA: 53] [Impact Index Per Article: 7.6] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/21/2022]
Abstract
INTRODUCTION The application of contemporary statistical approaches coming from Machine Learning and Data Mining environments to build more robust predictive models to identify athletes at high risk for injury might support injury prevention strategies of the future. PURPOSE The purpose was to analyze and compare the behavior of numerous machine learning methods to select the best-performing injury risk factor model to identify athletes at risk for lower extremity muscle injuries (MUSINJ). METHODS A total of 132 male professional soccer and handball players underwent a preseason screening evaluation that included personal, psychological, and neuromuscular measures. Furthermore, injury surveillance was used to capture all the MUSINJ occurring in the 2013/2014 seasons. The predictive ability of several models built by applying a range of learning techniques was analyzed and compared. RESULTS There were 32 MUSINJ over the follow-up period, 21 (65.6%) of which corresponded to the hamstrings, 3 to the quadriceps (9.3%), 4 to the adductors (12.5%), and 4 to the triceps surae (12.5%). A total of 13 injuries occurred during training and 19 during competition. Three players were injured twice during the observation period, so only the first injury was used, leaving 29 MUSINJ that were used to develop the predictive models. The model generated by the SMOTEBoost technique with a cost-sensitive ADTree as the base classifier reported the best evaluation criteria (area under the receiver operating characteristic curve score, 0.747; true positive rate, 65.9%; true negative rate, 79.1%) and hence was considered the best for predicting MUSINJ. CONCLUSIONS The prediction model showed moderate accuracy for identifying professional soccer and handball players at risk for MUSINJ. Therefore, the model developed might help in the decision-making process for injury prevention.
Affiliation(s)
- Francisco Ayala
- Sports Research Centre, Miguel Hernandez University of Elche, Alicante, Spain
| | - José Miguel Puerta
- Department of Computer Systems, University of Castilla-La Mancha, Albacete, Spain
| | - Mark De Ste Croix
- School of Physical Education, Faculty of Sport, Health and Social Care, University of Gloucestershire, Gloucester, United Kingdom
| | - Sergio Hernández-Sánchez
- Department of Pathology and Surgery, Physiotherapy Area, Miguel Hernandez University of Elche, Alicante, Spain
| | - Iñaki Ruiz-Pérez
- Sports Research Centre, Miguel Hernandez University of Elche, Alicante, Spain
| | - Gregory Myer
- Division of Sports Medicine, Cincinnati Children’s Hospital Medical Center, Cincinnati, OH
|
544
|
Piette ER, Moore JH. Improving machine learning reproducibility in genetic association studies with proportional instance cross validation (PICV). BioData Min 2018; 11:6. [PMID: 29713384 PMCID: PMC5907739 DOI: 10.1186/s13040-018-0167-7] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.9] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/09/2017] [Accepted: 04/03/2018] [Indexed: 11/10/2022] Open
Abstract
Background Machine learning methods and conventions are increasingly employed for the analysis of large, complex biomedical data sets, including genome-wide association studies (GWAS). Reproducibility of machine learning analyses of GWAS can be hampered by biological and statistical factors, particularly so for the investigation of non-additive genetic interactions. Application of traditional cross validation to a GWAS data set may result in poor consistency between the training and testing data set splits due to an imbalance of the interaction genotypes relative to the data as a whole. We propose a new cross validation method, proportional instance cross validation (PICV), that preserves the original distribution of an independent variable when splitting the data set into training and testing partitions. Results We apply PICV to simulated GWAS data with epistatic interactions of varying minor allele frequencies and prevalences and compare performance to that of a traditional cross validation procedure in which individuals are randomly allocated to training and testing partitions. Sensitivity and positive predictive value are significantly improved across all tested scenarios for PICV compared to traditional cross validation. We also apply PICV to GWAS data from a study of primary open-angle glaucoma to investigate a previously-reported interaction, which fails to significantly replicate; PICV however improves the consistency of testing and training results. Conclusions Application of traditional machine learning procedures to biomedical data may require modifications to better suit intrinsic characteristics of the data, such as the potential for highly imbalanced genotype distributions in the case of epistasis detection. The reproducibility of genetic interaction findings can be improved by considering this variable imbalance in cross validation implementation, such as with PICV. 
This approach may be extended to problems in other domains in which imbalanced variable distributions are a concern.
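The core idea of PICV, keeping the distribution of one chosen variable the same in every partition, can be illustrated with a small stratified fold assignment. This is a generic sketch under assumptions, not the authors' procedure: it stratifies on a single discrete variable (for example, a hypothetical interaction-genotype code per individual) and deals the indices of each value round-robin across folds. The function name is illustrative.

```python
import random
from collections import defaultdict


def proportional_folds(values, n_folds, seed=0):
    """Assign sample indices to folds so each fold mirrors the distribution of `values`.

    `values[i]` is the discrete variable to preserve for sample i.
    Returns a list of n_folds lists of indices.
    """
    rng = random.Random(seed)
    by_value = defaultdict(list)
    for idx, v in enumerate(values):
        by_value[v].append(idx)

    folds = [[] for _ in range(n_folds)]
    for v in sorted(by_value):
        indices = by_value[v]
        rng.shuffle(indices)
        # Deal this value's samples evenly across the folds.
        for pos, idx in enumerate(indices):
            folds[pos % n_folds].append(idx)
    return folds
```

With purely random allocation, a rare value can by chance be missing from some folds; dealing each value separately prevents that whenever the value occurs at least `n_folds` times.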
Affiliation(s)
- Elizabeth R Piette
- Graduate Group in Genomics and Computational Biology, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA
| | - Jason H Moore
- Institute for Biomedical Informatics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA
|
545
|
Cortical Classification with Rhythm Entropy for Error Processing in Cocktail Party Environment Based on Scalp EEG Recording. Sci Rep 2018; 8:6070. [PMID: 29666460 PMCID: PMC5904132 DOI: 10.1038/s41598-018-24535-4] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.1] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/13/2018] [Accepted: 04/05/2018] [Indexed: 11/17/2022] Open
Abstract
Using single-trial cortical signals calculated by weighted minimum norm solution estimation (WMNE), the present study explored a feature extraction method based on rhythm entropy to distinguish the scalp electroencephalography (EEG) signals of error responses from those of correct responses during auditory-track tasks in a cocktail party environment. The classification rate reached 89.7% on single trials (≈700 ms) when using a support vector machine (SVM) with leave-one-out cross-validation (LOOCV). Highly discriminative regions were mainly distributed in the medial frontal cortex (MFC), the left supplementary motor area (lSMA) and the right supplementary motor area (rSMA). The mean entropy value for error trials was significantly lower than that for correct trials in the discriminative cortices. Time-varying network analysis showed that the information flows among these discriminative regions changed over time: error processing showed a left-biased information flow, and correct processing showed a right-biased information flow. These findings reveal that rhythm information based on single-trial cortical signals can be used to describe the characteristics of error-related EEG signals, and they suggest a novel application of auditory attention for brain-computer interfaces (BCIs).
|
546
|
Jain S, Kotsampasakou E, Ecker GF. Comparing the performance of meta-classifiers-a case study on selected imbalanced data sets relevant for prediction of liver toxicity. J Comput Aided Mol Des 2018; 32:583-590. [PMID: 29626291 PMCID: PMC5919997 DOI: 10.1007/s10822-018-0116-z] [Citation(s) in RCA: 27] [Impact Index Per Article: 3.9] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/23/2017] [Accepted: 03/29/2018] [Indexed: 12/28/2022]
Abstract
Cheminformatics datasets used in classification problems, especially those related to biological or physicochemical properties, are often imbalanced. This presents a major challenge in the development of in silico prediction models, as the traditional machine learning algorithms are known to work best on balanced datasets. The class imbalance introduces a bias in the performance of these algorithms due to their preference towards the majority class. Here, we present a comparison of the performance of seven different meta-classifiers for their ability to handle imbalanced datasets, whereby Random Forest is used as base-classifier. Four different datasets that are directly (cholestasis) or indirectly (via inhibition of organic anion transporting polypeptide 1B1 and 1B3) related to liver toxicity were chosen for this purpose. The imbalance ratio in these datasets ranges between 4:1 and 20:1 for negative and positive classes, respectively. Three different sets of molecular descriptors for model development were used, and their performance was assessed in 10-fold cross-validation and on an independent validation set. Stratified bagging, MetaCost and CostSensitiveClassifier were found to be the best performing among all the methods. While MetaCost and CostSensitiveClassifier provided better sensitivity values, Stratified Bagging resulted in high balanced accuracies.
Electronic supplementary material The online version of this article (10.1007/s10822-018-0116-z) contains supplementary material, which is available to authorized users.
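To make the abstract's metrics concrete, the following minimal sketch shows why plain accuracy is misleading at a 20:1 imbalance ratio and why sensitivity and balanced accuracy are reported instead. The labels are synthetic (100 negatives, 5 positives), and the "classifier" is a hypothetical one that always predicts the majority class; neither comes from the study.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary problem."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    tn = sum(t != positive and p != positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    return tp, tn, fp, fn


def summarize(y_true, y_pred):
    """Return accuracy, sensitivity, specificity and balanced accuracy."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    sensitivity = tp / (tp + fn)   # recall on the positive (minority) class
    specificity = tn / (tn + fp)   # recall on the negative (majority) class
    return {
        "accuracy": (tp + tn) / len(y_true),
        "sensitivity": sensitivity,
        "specificity": specificity,
        "balanced_accuracy": (sensitivity + specificity) / 2,
    }


# Synthetic 20:1 dataset (assumption): 100 negatives, 5 positives.
y_true = [0] * 100 + [1] * 5
# Hypothetical degenerate model that always predicts the majority class.
y_pred = [0] * 105

metrics = summarize(y_true, y_pred)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

The majority-class predictor scores about 0.952 accuracy while detecting no positives at all, whereas its balanced accuracy is 0.5, the same as random guessing.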
Affiliation(s)
- Sankalp Jain
  - Department of Pharmaceutical Chemistry, University of Vienna, Althanstrasse 14, 1090, Vienna, Austria
- Eleni Kotsampasakou
  - Department of Pharmaceutical Chemistry, University of Vienna, Althanstrasse 14, 1090, Vienna, Austria
  - Computational Toxicology Group, CMS, R&D Platform Technology & Science, GSK, Park Road, Ware, Hertfordshire, SG12 0DP, UK
- Gerhard F Ecker
  - Department of Pharmaceutical Chemistry, University of Vienna, Althanstrasse 14, 1090, Vienna, Austria
|
547
|
Wu Z, Guo Y, Lin W, Yu S, Ji Y. A Weighted Deep Representation Learning Model for Imbalanced Fault Diagnosis in Cyber-Physical Systems. SENSORS 2018; 18:s18041096. [PMID: 29621131 PMCID: PMC5948747 DOI: 10.3390/s18041096] [Citation(s) in RCA: 15] [Impact Index Per Article: 2.1] [Received: 01/05/2018] [Revised: 03/17/2018] [Accepted: 03/19/2018] [Indexed: 11/16/2022]
Abstract
Predictive maintenance plays an important role in modern Cyber-Physical Systems (CPSs), and data-driven methods have been a worthwhile direction for Prognostics Health Management (PHM). However, two main challenges significantly affect traditional fault diagnostic models: one is that extracting hand-crafted features from multi-dimensional sensors with internal dependencies depends too much on expert knowledge; the other is that imbalance pervasively exists between faulty and normal samples. As deep learning models have proved to be good methods for automatic feature extraction, the objective of this paper is to study an optimized deep learning model for imbalanced fault diagnosis in CPSs. To that end, this paper proposes a weighted Long Recurrent Convolutional LSTM model with sampling policy (wLRCL-D) to deal with these challenges. The model consists of 2-layer CNNs, 2-layer inner LSTMs and 2-layer outer LSTMs, with an under-sampling policy and a weighted cost-sensitive loss function. Experiments are conducted on PHM 2015 challenge datasets, and the results show that wLRCL-D outperforms other baseline methods.
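The abstract does not give the exact form of the weighted cost-sensitive loss, so the following is only a sketch of one common variant: binary cross-entropy with inverse-frequency class weights. The weighting scheme, the function names and the toy labels are all assumptions for illustration, not the wLRCL-D formulation.

```python
import math


def inverse_frequency_weights(labels):
    """Weight each class by n / (2 * class_count), so rare classes cost more.

    Assumed scheme; the paper's own weighting is not stated in the abstract.
    """
    n = len(labels)
    positives = sum(labels)
    negatives = n - positives
    return {0: n / (2 * negatives), 1: n / (2 * positives)}


def weighted_bce(labels, probs, weights, eps=1e-12):
    """Mean binary cross-entropy, with each sample scaled by its class weight."""
    total = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1 - eps)  # clamp to avoid log(0)
        sample_loss = -(y * math.log(p) + (1 - y) * math.log(1 - p))
        total += weights[y] * sample_loss
    return total / len(labels)


# Toy labels (assumption): 9 normal samples (0) and 1 faulty sample (1).
labels = [0] * 9 + [1]
weights = inverse_frequency_weights(labels)

# The same size of mistake on a faulty vs. a normal sample.
missed_fault = weighted_bce([1], [0.1], weights)
false_alarm = weighted_bce([0], [0.9], weights)
print(f"weights: {weights}")
print(f"missed fault loss: {missed_fault:.3f}, false alarm loss: {false_alarm:.3f}")
```

With a 9:1 imbalance, an equally confident mistake on the faulty sample is penalized nine times more than one on a normal sample, which pushes training away from the trivial "always normal" solution.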
Affiliation(s)
- Zhenyu Wu
  - Engineering Research Center of Information Network, Ministry of Education, Beijing University of Posts and Telecommunications, Beijing 100876, China
- Yang Guo
  - Key Laboratory of Universal Wireless Communications, Ministry of Education, Beijing University of Posts and Telecommunications, Beijing 100876, China
- Wenfang Lin
  - Key Laboratory of Universal Wireless Communications, Ministry of Education, Beijing University of Posts and Telecommunications, Beijing 100876, China
- Shuyang Yu
  - Key Laboratory of Universal Wireless Communications, Ministry of Education, Beijing University of Posts and Telecommunications, Beijing 100876, China
- Yang Ji
  - Engineering Research Center of Information Network, Ministry of Education, Beijing University of Posts and Telecommunications, Beijing 100876, China
  - Key Laboratory of Universal Wireless Communications, Ministry of Education, Beijing University of Posts and Telecommunications, Beijing 100876, China
|
548
|
Aydogan EK, Ozmen M, Delice Y. CBR-PSO: cost-based rough particle swarm optimization approach for high-dimensional imbalanced problems. Neural Comput Appl 2018. [DOI: 10.1007/s00521-018-3469-2] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.6] [Indexed: 11/28/2022]
|
549
|
Rodríguez JP, Corrales DC, Corrales JC. A Process for Increasing the Samples of Coffee Rust Through Machine Learning Methods. INTERNATIONAL JOURNAL OF AGRICULTURAL AND ENVIRONMENTAL INFORMATION SYSTEMS 2018. [DOI: 10.4018/ijaeis.2018040103] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Indexed: 11/08/2022]
Abstract
This article describes how coffee rust has become a serious concern for many coffee farmers and manufacturers. The American Phytopathological Society describes it as “…the most economically important coffee disease in the world…” while “…in monetary value, coffee is the most important agricultural product in international trade…” The need for early detection has motivated researchers to apply supervised learning algorithms to predict the appearance of the disease. However, the main limitation of related work is the small number of samples of the dependent variable, Incidence Percentage of Rust: the datasets do not represent the disease reliably, which leads to inaccurate model predictions. This article presents a process for selecting appropriate machine learning methods to increase the number of coffee rust samples.
Affiliation(s)
- David Camilo Corrales
  - Telematic Engineering Group, University of Cauca, Popayán, Colombia and Department of Computer Science and Engineering, Carlos III University of Madrid, Madrid, Spain
|
550
|
Roy A, Cruz RM, Sabourin R, Cavalcanti GD. A study on combining dynamic selection and data preprocessing for imbalance learning. Neurocomputing 2018. [DOI: 10.1016/j.neucom.2018.01.060] [Citation(s) in RCA: 37] [Impact Index Per Article: 5.3] [Indexed: 10/18/2022]
|