1
Reddy GS, Chittineni S. Entropy based C4.5-SHO algorithm with information gain optimization in data mining. PeerJ Comput Sci 2021; 7:e424. [PMID: 33954229 PMCID: PMC8049126 DOI: 10.7717/peerj-cs.424] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 11/23/2020] [Accepted: 02/11/2021] [Indexed: 06/12/2023]
Abstract
Information efficiency is gaining more importance in the development as well as application sectors of information technology. Data mining is a computer-assisted process of massive data investigation that extracts meaningful information from the datasets. The mined information is used in decision-making to understand the behavior of each attribute. Therefore, a new classification algorithm is introduced in this paper to improve information management. The classical C4.5 decision tree approach is combined with the Selfish Herd Optimization (SHO) algorithm to tune the gain of given datasets. The optimal weights for the information gain will be updated based on SHO. Further, the dataset is partitioned into two classes based on quadratic entropy calculation and information gain. Decision tree gain optimization is the main aim of our proposed C4.5-SHO method. The robustness of the proposed method is evaluated on various datasets and compared with classifiers, such as ID3 and CART. The accuracy and area under the receiver operating characteristic curve parameters are estimated and compared with existing algorithms like ant colony optimization, particle swarm optimization and cuckoo search.
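As a rough illustration of the split criterion this abstract describes, the sketch below scores a candidate split by the reduction in quadratic entropy and scales it by a weight. It assumes quadratic entropy takes the common form of the sum of p(1 - p) over classes, and it treats the SHO-tuned weight as a plain multiplier; neither detail is given in the abstract, and the names `quadratic_entropy`, `weighted_gain`, and `weight` are hypothetical.

```python
from collections import Counter


def quadratic_entropy(labels):
    """Quadratic entropy of a label list: sum over classes of p * (1 - p).

    Assumed form; the abstract does not state the exact definition.
    """
    n = len(labels)
    if n == 0:
        return 0.0
    return sum((c / n) * (1 - c / n) for c in Counter(labels).values())


def weighted_gain(labels, feature_values, weight=1.0):
    """Impurity reduction from splitting on a feature, scaled by `weight`.

    `weight` is a stand-in for a value an optimizer such as SHO might tune.
    """
    n = len(labels)
    # Group the labels by the value the feature takes on each record.
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    # Size-weighted impurity that remains after the split.
    remainder = sum(len(g) / n * quadratic_entropy(g) for g in groups.values())
    return weight * (quadratic_entropy(labels) - remainder)


# Toy data, invented for illustration only.
labels = ["yes", "yes", "no", "no"]
perfect = ["a", "a", "b", "b"]   # separates the classes completely
useless = ["a", "b", "a", "b"]   # leaves each branch evenly mixed

print(weighted_gain(labels, perfect))
print(weighted_gain(labels, useless))
```

A tree builder would evaluate this score for every candidate attribute and split on the highest one; how the weights are searched is left out of the sketch.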
Affiliation(s)
- G Sekhar Reddy
  - Department of Computer Science and Engineering, Acharya Nagarjuna University, Guntur, Andhra Pradesh, India
- Suneetha Chittineni
  - Department of Computer Applications, RVR&JC College of Engineering, Guntur, Andhra Pradesh, India
2
Yan L, He Y, Qin L, Wu C, Zhu D, Ran B. A novel feature extraction model for traffic injury severity and its application to Fatality Analysis Reporting System data analysis. Sci Prog 2020; 103:36850419886471. [PMID: 31829790 PMCID: PMC10358574 DOI: 10.1177/0036850419886471] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Indexed: 11/15/2022]
Abstract
The prevention of severe injuries during crashes has become one of the leading issues in traffic management and transportation safety. Identifying the factors that affect traffic injury severity is critical for reducing the occurrence of severe injuries. In this study, the Fatality Analysis Reporting System data are selected as the dataset for the analysis. An algorithm named improved Markov Blanket was proposed to extract the significant and common factors that affect crash injury severity from 29 variables related to driver characteristics, vehicle characteristics, accident types, road condition, and environment characteristics. The Pearson correlation coefficient test is applied to verify the significant correlation between the selected factors and traffic injury severity. Two widely used classification algorithms (Bayesian networks and C4.5 decision tree) were employed to evaluate the performance of the proposed feature selection algorithm. The correlation coefficients, classification accuracy, and classification error rate indicated that the improved Markov Blanket not only could extract the significant impact factors but could also improve the accuracy of classification. Meanwhile, the relationship between five selected factors (atmospheric condition, time of crash, alcohol test result, crash type, and driver's distraction) and traffic injury severity was also analyzed in this study. The results indicated that crashes occurring in bad weather conditions (e.g. fog or worse), at night, with drunk driving, of the single-driver crash type, or with distracted driving are associated with more severe injuries.
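The Pearson correlation check mentioned in this abstract can be sketched as follows. This is a minimal illustration of the standard sample formula, not the authors' code; the function name `pearson_r` and the toy values are hypothetical and do not come from the Fatality Analysis Reporting System data.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of co-deviations and the two root sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    dev_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    dev_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if dev_x == 0 or dev_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (dev_x * dev_y)


# Invented example: a binary factor (1 = present) against a severity score.
factor = [0, 0, 1, 1]
severity = [1, 2, 3, 4]
print(pearson_r(factor, severity))
```

A value near 1 or -1 suggests a strong linear association between a candidate factor and severity, while a value near 0 suggests none; the significance test itself is outside this sketch.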
Affiliation(s)
- Lixin Yan
  - School of Transportation and Logistics, East China Jiaotong University, Nanchang, P.R. China
- Yi He
  - Intelligent Transport Systems Research Center, Wuhan University of Technology, Wuhan, P.R. China
  - Engineering Research Center for Transportation Safety, Ministry of Education, Wuhan, P.R. China
  - National Engineering Laboratory for Transportation Safety & Emergency Informatics, Beijing, P.R. China
- Lingqiao Qin
  - Wuhan KOTEI Informatics Co. Ltd., Wuhan, P.R. China
- Chaozhong Wu
  - Intelligent Transport Systems Research Center, Wuhan University of Technology, Wuhan, P.R. China
  - Engineering Research Center for Transportation Safety, Ministry of Education, Wuhan, P.R. China
- Dunyao Zhu
  - Intelligent Transport Systems Research Center, Wuhan University of Technology, Wuhan, P.R. China
  - Wuhan KOTEI Informatics Co. Ltd., Wuhan, P.R. China
- Bin Ran
  - Transportation Engineering Laboratory, University of Wisconsin–Madison, Madison, WI, USA
3
Karimi K, Wuitchik DM, Oldach MJ, Vize PD. Distinguishing Species Using GC Contents in Mixed DNA or RNA Sequences. Evol Bioinform Online 2018; 14:1176934318788866. [PMID: 30038485 PMCID: PMC6052495 DOI: 10.1177/1176934318788866] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.8] [Received: 03/27/2018] [Accepted: 06/22/2018] [Indexed: 12/04/2022]
Abstract
With the advent of whole transcriptome and genome analysis methods, classifying samples containing multiple origins has become a significant task. Nucleotide sequences can be allocated to a genome or transcriptome by aligning sequences to multiple target sequence sets, but this approach requires extensive computational resources and also depends on target sequence sets lacking contaminants, which is often not the case. Here, we demonstrate that raw sequences can be rapidly sorted into groups, in practice corresponding to genera, by exploiting differences in nucleotide GC content. To do so, we introduce GCSpeciesSorter, which uses classification, specifically Support Vector Machines (SVM) and the C4.5 decision tree generator, to differentiate sequences. It also implements a secondary BLAST feature to identify known outliers. In the test case presented, a hermatypic coral holobiont, the cnidarian host includes various endosymbionts. The best characterized and most common of these symbionts are zooxanthellae of the genus Symbiodinium. GCSpeciesSorter separates cnidarian from Symbiodinium sequences with a high degree of accuracy. We show that if the GC contents of the species differ enough, this method can be used to accurately distinguish the sequences of different species when using high-throughput sequencing technologies.
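The feature this abstract relies on, nucleotide GC content, is simple to compute. The sketch below is a simplified stand-in rather than GCSpeciesSorter itself: the abstract says the tool uses SVM and C4.5 classifiers, whereas this example uses a single fixed cut-off, which is roughly what a one-level decision tree on one feature would learn. The names `gc_content` and `sort_by_gc` and the threshold value are assumptions made for illustration.

```python
def gc_content(sequence):
    """Fraction of G and C among the unambiguous bases (A, C, G, T, U)."""
    counted = [base for base in sequence.upper() if base in "ACGTU"]
    if not counted:
        return 0.0
    return sum(base in "GC" for base in counted) / len(counted)


def sort_by_gc(sequences, threshold):
    """Split sequences into (low_gc, high_gc) groups around `threshold`."""
    low, high = [], []
    for seq in sequences:
        (high if gc_content(seq) >= threshold else low).append(seq)
    return low, high


# Invented reads, for illustration only.
reads = ["ATATTA", "GGCCGC", "ATGCAT", "GCGCAT"]
low_gc, high_gc = sort_by_gc(reads, threshold=0.5)
print(low_gc)
print(high_gc)
```

In practice the cut-off would be learned from labeled sequences rather than fixed by hand, and ambiguous bases such as N are ignored here as a design choice.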
Affiliation(s)
- Kamran Karimi
  - Department of Biological Sciences, University of Calgary, Calgary, AB, Canada
  - Department of Computer Science, University of Calgary, Calgary, AB, Canada
- Daniel M Wuitchik
  - Department of Biological Sciences, University of Calgary, Calgary, AB, Canada
- Matthew J Oldach
  - Department of Biological Sciences, University of Calgary, Calgary, AB, Canada
- Peter D Vize
  - Department of Biological Sciences, University of Calgary, Calgary, AB, Canada
  - Department of Computer Science, University of Calgary, Calgary, AB, Canada
4
Lee SJ, Xu Z, Li T, Yang Y. A novel bagging C4.5 algorithm based on wrapper feature selection for supporting wise clinical decision making. J Biomed Inform 2017; 78:144-155. [PMID: 29137965 DOI: 10.1016/j.jbi.2017.11.005] [Citation(s) in RCA: 48] [Impact Index Per Article: 6.9] [Received: 03/31/2017] [Revised: 10/11/2017] [Accepted: 11/10/2017] [Indexed: 11/16/2022]
Abstract
From the perspective of clinical decision-making in a Medical IoT-based healthcare system, effective and efficient analysis of long-term health data to support wise clinical decisions is an extremely important objective. Determining how to deal with the multi-dimensionality and high volume of data generated by Medical IoT-based healthcare systems is therefore an issue of increasing importance in IoT healthcare data exploration and management. A novel classifier or predictor equipped with a good feature selection function contributes effectively to classification and prediction performance. This paper proposes a novel bagging C4.5 algorithm based on wrapper feature selection, for the purpose of supporting wise clinical decision-making in the medical and healthcare fields. In particular, the newly proposed sampling method, S-C4.5-SMOTE, not only overcomes the problem of data distortion but also improves overall system performance, because its mechanism aims at effectively reducing the data size without distortion by keeping datasets balanced and technically smooth. This achievement directly supports the Wrapper method of effective feature selection without the need to consider the problem of huge amounts of data; this is a novel innovation in this work.
Affiliation(s)
- Shin-Jye Lee
  - National Pilot School of Software, Yunnan University, No. 2, Cuihu North Rd., Kunming 650091, China
  - Queens' College, University of Cambridge, Cambridge CB3 9ET, UK
- Zhaozhao Xu
  - National Pilot School of Software, Yunnan University, No. 2, Cuihu North Rd., Kunming 650091, China
- Tong Li
  - National Pilot School of Software, Yunnan University, No. 2, Cuihu North Rd., Kunming 650091, China
- Yun Yang
  - National Pilot School of Software, Yunnan University, No. 2, Cuihu North Rd., Kunming 650091, China
5
Rabiee-Ghahfarrokhi B, Rafiei F, Niknafs AA, Zamani B. Prediction of microRNA target genes using an efficient genetic algorithm-based decision tree. FEBS Open Bio 2015; 5:877-84. [PMID: 26649272 PMCID: PMC4643183 DOI: 10.1016/j.fob.2015.10.003] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.8] [Received: 06/27/2015] [Revised: 09/29/2015] [Accepted: 10/05/2015] [Indexed: 11/27/2022]
Abstract
MicroRNAs (miRNAs) are small, non-coding RNA molecules that regulate gene expression in almost all plants and animals. They play an important role in key processes, such as proliferation, apoptosis, and pathogen-host interactions. Nevertheless, the mechanisms by which miRNAs act are not fully understood. The first step toward unraveling the function of a particular miRNA is the identification of its direct targets. This step has proven to be quite challenging in animals, primarily because of incomplete complementarity between miRNAs and target mRNAs. In recent years, the use of machine-learning techniques has greatly improved the prediction of miRNA targets, avoiding the need for costly and time-consuming experiments to identify miRNA targets experimentally. Among the most important machine-learning algorithms are decision trees, which classify data based on extracted rules. In the present work, we used a genetic algorithm in combination with the C4.5 decision tree for prediction of miRNA targets. We applied our proposed method to a validated human dataset. We achieved nearly 93.9% classification accuracy, which could be related to the selection of the best rules.
Affiliation(s)
- Fariba Rafiei
  - Department of Plant Breeding and Biotechnology, Shahrekord University, Shahrekord, Iran
- Ali Akbar Niknafs
  - Department of Computer Engineering, Shahid Bahonar University of Kerman, Kerman, Iran
- Behzad Zamani
  - Department of Computer Engineering, Iran University of Science & Technology, Tehran, Iran
6
Fu J, Jones M, Jan YK. Development of intelligent model for personalized guidance on wheelchair tilt and recline usage for people with spinal cord injury: methodology and preliminary report. J Rehabil Res Dev 2014; 51:775-88. [PMID: 25333817 DOI: 10.1682/jrrd.2013.09.0199] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Received: 09/11/2013] [Revised: 12/30/2013] [Indexed: 11/05/2022]
Abstract
Wheelchair tilt and recline functions are two of the most desirable features for relieving seating pressure to decrease the risk of pressure ulcers. Effective guidance on wheelchair tilt and recline usage is therefore critical to pressure ulcer prevention. The aim of this study was to demonstrate the feasibility of using machine learning techniques to construct an intelligent model that provides personalized guidance to individuals with spinal cord injury (SCI). The motivation stems from the clinical evidence that the requirements of individuals vary greatly and that no universal guidance on tilt and recline usage could possibly satisfy all individuals with SCI. We explored all aspects involved in constructing the intelligent model and proposed approaches tailored to suit the characteristics of this preliminary study, such as the way of modeling research participants, using machine learning techniques to construct the intelligent model, and evaluating the performance of the intelligent model. We further improved the intelligent model's prediction accuracy by developing a two-phase feature selection algorithm to identify important attributes. Experimental results demonstrated that our approaches hold promise: they could effectively construct the intelligent model, evaluate its performance, and refine the participant model so that the intelligent model's prediction accuracy was significantly improved.
Affiliation(s)
- Jicheng Fu
  - Department of Computer Science, University of Central Oklahoma, Edmond, OK