1
Barry KA, Manzali Y, Flouchi R, Balouki Y, Chelhi K, Elfar M. Exploring the use of association rules in random forest for predicting heart disease. Comput Methods Biomech Biomed Engin 2024; 27:338-346. [PMID: 36877167 DOI: 10.1080/10255842.2023.2185477] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/09/2023] [Revised: 02/07/2023] [Accepted: 02/16/2023] [Indexed: 03/07/2023]
Abstract
Heart disease is one of the most dangerous diseases in the world, and most people affected by it eventually die from it. Machine learning algorithms have proven useful for supporting decision-making and prediction from the large amount of data generated by the healthcare sector. In this work, we propose a novel method that improves the performance of the classical random forest technique so that it can be used more effectively for predicting heart disease. For comparison, we also used other classifiers: classical random forest, support vector machine, decision tree, Naïve Bayes, and XGBoost. The experiments were carried out on the Cleveland heart disease dataset. According to the experimental results, the proposed model achieves an accuracy of 83.5%, better than that of the other classifiers. This study contributes to the optimization of the random forest technique and provides solid knowledge of how this technique is constructed.
Affiliation(s)
- Rachid Flouchi
- Laboratory of Microbial Biotechnology and Bioactive Molecules, Science and Technologies Faculty, Sidi Mohamed Ben Abdellah University, Fez, Morocco
- Youssef Balouki
- Laboratory of Mathematics, Computer Science and Engineering Sciences (MISI), Settat, Morocco
- Khadija Chelhi
- The Logistics Center of Excellence, Higher School of Textile and Clothing Industries (ESITH Casablanca), Casablanca, Morocco
- Mohamed Elfar
- LPAIS Laboratory, Faculty of Sciences, USMBA, Fez, Morocco
2
Monteverde-Suárez D, González-Flores P, Santos-Solórzano R, García-Minjares M, Zavala-Sierra I, de la Luz VL, Sánchez-Mendiola M. Predicting students' academic progress and related attributes in first-year medical students: an analysis with artificial neural networks and Naïve Bayes. BMC Med Educ 2024; 24:74. [PMID: 38243257 PMCID: PMC10799512 DOI: 10.1186/s12909-023-04918-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/17/2022] [Accepted: 11/30/2023] [Indexed: 01/21/2024]
Abstract
BACKGROUND Dropout and poor academic performance are persistent problems in medical schools in emerging economies. Identifying at-risk students early and knowing the factors that contribute to their success would be useful for designing educational interventions. Educational Data Mining (EDM) methods can identify students at risk of poor academic progress and dropping out. The main goal of this study was to use machine learning models, Artificial Neural Networks (ANN) and Naïve Bayes (NB), to identify first year medical students that succeed academically, using sociodemographic data and academic history. METHODS Data from seven cohorts (2011 to 2017) of admitted medical students to the National Autonomous University of Mexico (UNAM) Faculty of Medicine in Mexico City were analysed. Data from 7,976 students (2011 to 2017 cohorts) of the program were included. Information from admission diagnostic exam results, academic history, sociodemographic characteristics and family environment was used. The main dataset included 48 variables. The study followed the general knowledge discovery process: pre-processing, data analysis, and validation. Artificial Neural Networks (ANN) and Naïve Bayes (NB) models were used for data mining analysis. RESULTS ANNs models had slightly better performance in accuracy, sensitivity, and specificity. Both models had better sensitivity when classifying regular students and better specificity when classifying irregular students. Of the 25 variables with highest predictive value in the Naïve Bayes model, percentage of correct answers in the diagnostic exam was the best variable. CONCLUSIONS Both ANN and Naïve Bayes methods can be useful for predicting medical students' academic achievement in an undergraduate program, based on information of their prior knowledge and socio-demographic factors. 
Although ANN offered slightly superior results, Naïve Bayes made it possible to obtain an in-depth analysis of how the different variables influenced the model. The use of educational data mining and machine learning classification techniques has potential in medical education.
Affiliation(s)
- Diego Monteverde-Suárez
- Coordination of Open University, Educational Innovation and Distance Education, (CUAIEED), National Autonomous University of Mexico (UNAM), Mexico City, Mexico
- Patricia González-Flores
- Coordination of Open University, Educational Innovation and Distance Education, (CUAIEED), National Autonomous University of Mexico (UNAM), Mexico City, Mexico
- Roberto Santos-Solórzano
- Coordination of Open University, Educational Innovation and Distance Education, (CUAIEED), National Autonomous University of Mexico (UNAM), Mexico City, Mexico
- Manuel García-Minjares
- Coordination of Open University, Educational Innovation and Distance Education, (CUAIEED), National Autonomous University of Mexico (UNAM), Mexico City, Mexico
- Irma Zavala-Sierra
- Coordination of Open University, Educational Innovation and Distance Education, (CUAIEED), National Autonomous University of Mexico (UNAM), Mexico City, Mexico
- Verónica Luna de la Luz
- Coordination of Open University, Educational Innovation and Distance Education, (CUAIEED), National Autonomous University of Mexico (UNAM), Mexico City, Mexico
- Faculty of Medicine, National Autonomous University of Mexico (UNAM), Mexico City, Mexico
- Melchor Sánchez-Mendiola
- Coordination of Open University, Educational Innovation and Distance Education, (CUAIEED), National Autonomous University of Mexico (UNAM), Mexico City, Mexico
- Faculty of Medicine, National Autonomous University of Mexico (UNAM), Mexico City, Mexico
3
Kochetkova T, Hanke MS, Indermaur M, Groetsch A, Remund S, Neuenschwander B, Michler J, Siebenrock KA, Zysset P, Schwiedrzik J. Composition and micromechanical properties of the femoral neck compact bone in relation to patient age, sex and hip fracture occurrence. Bone 2023; 177:116920. [PMID: 37769956 DOI: 10.1016/j.bone.2023.116920] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/04/2023] [Revised: 09/22/2023] [Accepted: 09/25/2023] [Indexed: 10/03/2023]
Abstract
Current clinical methods of bone health assessment depend to a great extent on bone mineral density (BMD) measurements. However, these methods only act as a proxy for bone strength and are often only carried out after the fracture occurs. Besides BMD, composition and tissue-level mechanical properties are expected to affect the whole bone's strength and toughness. While the elastic properties of the bone extracellular matrix (ECM) have been extensively investigated over the past two decades, there is still limited knowledge of the yield properties and their relationship to composition and architecture. In the present study, morphological, compositional and micropillar compression bone data was collected from patients who underwent hip arthroplasty. Femoral neck samples from 42 patients were collected together with anonymous clinical information about age, sex and primary diagnosis (coxarthrosis or hip fracture). The femoral neck cortex from the inferomedial region was analyzed in a site-matched manner using a combination of micromechanical testing (nanoindentation, micropillar compression) together with micro-CT and quantitative polarized Raman spectroscopy for both morphological and compositional characterization. Mechanical properties, as well as the sample-level mineral density, were constant over age. Only compositional properties demonstrate weak dependence on patient age: decreasing mineral to matrix ratio (p = 0.02, R2 = 0.13, 2.6 % per decade) and increasing amide I sub-peak ratio I∼1660/I∼1683 (p = 0.04, R2 = 0.11, 1.5 % per decade). The patient's sex and diagnosis did not seem to influence investigated bone properties. A clear zonal dependence between interstitial and osteonal cortical zones was observed for compositional and elastic bone properties (p < 0.0001). Site-matched microscale analysis confirmed that all investigated mechanical properties except yield strain demonstrate a positive correlation with the mineral fraction of bone. 
The output database is the first to integrate the experimentally assessed microscale yield properties, local tissue composition and morphology with the available patient clinical information. The final dataset was used for bone fracture risk prediction in-silico through the principal component analysis and the Naïve Bayes classification algorithm. The analysis showed that the mineral to matrix ratio, indentation hardness and micropillar yield stress are the most relevant parameters for bone fracture risk prediction at 70 % model accuracy (0.71 AUC). Due to the low number of samples, further studies to build a universal fracture prediction algorithm are anticipated with the higher number of patients (N > 200). The proposed classification algorithm together with the output dataset of bone tissue properties can be used for the future comparison of existing methods to evaluate bone quality as well as to form a better understanding of the mechanisms through which bone tissue is affected by aging or disease.
Affiliation(s)
- Tatiana Kochetkova
- Empa, Swiss Federal Laboratories for Materials Science and Technology, Thun, Switzerland.
- Markus S Hanke
- Department of Orthopedic Surgery, Inselspital, University of Bern, Switzerland
- Michael Indermaur
- ARTORG Center for Biomedical Engineering Research, University of Bern, Switzerland
- Alexander Groetsch
- Empa, Swiss Federal Laboratories for Materials Science and Technology, Thun, Switzerland
- Stefan Remund
- Institute for Applied Laser, Photonics and Surface Technologies (ALPS), Bern University of Applied Sciences, Burgdorf, Switzerland
- Beat Neuenschwander
- Institute for Applied Laser, Photonics and Surface Technologies (ALPS), Bern University of Applied Sciences, Burgdorf, Switzerland
- Johann Michler
- Empa, Swiss Federal Laboratories for Materials Science and Technology, Thun, Switzerland
- Klaus A Siebenrock
- Department of Orthopedic Surgery, Inselspital, University of Bern, Switzerland
- Philippe Zysset
- ARTORG Center for Biomedical Engineering Research, University of Bern, Switzerland
- Jakob Schwiedrzik
- Empa, Swiss Federal Laboratories for Materials Science and Technology, Thun, Switzerland
4
Zhang Q, Zhao HM, Yang K, Chen J, Yang RQ, Wang C. Construction of an Analysis Model of mRNA Markers in Menstrual Blood Based on Naïve Bayes and Multivariate Logistic Regression Methods. Fa Yi Xue Za Zhi 2023; 39:447-451. [PMID: 38006263 DOI: 10.12116/j.issn.1004-5619.2021.511207] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/26/2023]
Abstract
OBJECTIVES To establish a menstrual blood identification model based on Naïve Bayes and multivariate logistic regression by combining detection of menstrual blood-specific mRNA markers with statistical methods, and to quantitatively distinguish menstrual blood from other body fluids. METHODS Body fluid samples were collected, including 86 menstrual blood, 48 peripheral blood, 48 vaginal secretion, 24 semen and 24 saliva samples. RNA was extracted from the samples and cDNA was obtained by reverse transcription. Five menstrual blood-specific markers, the matrix metalloproteinase (MMP) family members MMP3, MMP7 and MMP11, progestagen-associated endometrial protein (PAEP) and stanniocalcin-1 (STC1), were amplified and analyzed by electrophoresis. The results were analyzed by Naïve Bayes and multivariate logistic regression. RESULTS The accuracy of the constructed classification model was 88.37% with Naïve Bayes and 91.86% with multivariate logistic regression. Among non-menstrual blood samples, the distinguishing accuracy for peripheral blood, saliva and semen was generally higher than 90%, while the distinguishing accuracy for vaginal secretions was lower, at 16.67% and 33.33%, respectively. CONCLUSIONS mRNA detection technology combined with statistical methods can be used to establish a classification and discrimination model for menstrual blood that distinguishes menstrual blood from other body fluids and describes the analysis results quantitatively, which has a certain application value in body fluid stain identification.
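As a rough illustration of how a Naïve Bayes model can turn marker detection results into a class probability, the sketch below fits a Bernoulli Naïve Bayes classifier with Laplace smoothing on presence/absence calls for the five markers named above. It is not the authors' implementation: the training rows, the two-class setup, the smoothing constant `alpha` and the function names are all hypothetical and chosen only for demonstration.

```python
import math
from collections import defaultdict

MARKERS = ["MMP3", "MMP7", "MMP11", "PAEP", "STC1"]

def fit_bernoulli_nb(samples, labels, alpha=1.0):
    """Estimate class priors and Laplace-smoothed P(marker detected | class)."""
    by_class = defaultdict(list)
    for x, y in zip(samples, labels):
        by_class[y].append(x)
    model = {}
    for y, rows in by_class.items():
        n = len(rows)
        prior = n / len(samples)
        probs = [(sum(r[j] for r in rows) + alpha) / (n + 2 * alpha)
                 for j in range(len(MARKERS))]
        model[y] = (prior, probs)
    return model

def predict_proba(model, x):
    """Return normalized posterior probabilities for one sample."""
    log_post = {}
    for y, (prior, probs) in model.items():
        lp = math.log(prior)
        for xj, pj in zip(x, probs):
            lp += math.log(pj if xj else 1.0 - pj)
        log_post[y] = lp
    top = max(log_post.values())  # subtract the max for numerical stability
    norm = sum(math.exp(v - top) for v in log_post.values())
    return {y: math.exp(v - top) / norm for y, v in log_post.items()}

# Hypothetical detection calls (1 = band detected, 0 = not detected).
X = [
    [1, 1, 1, 1, 1], [1, 1, 0, 1, 1], [1, 0, 1, 1, 1],  # menstrual blood
    [0, 0, 0, 0, 0], [0, 1, 0, 0, 0], [0, 0, 0, 1, 0],  # other body fluids
]
y = ["menstrual"] * 3 + ["other"] * 3

model = fit_bernoulli_nb(X, y)
posterior = predict_proba(model, [1, 1, 1, 1, 0])
print(posterior)
```

The posterior gives the quantitative description the abstract refers to: a probability per class rather than a bare label.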
Affiliation(s)
- Qi Zhang
- People's Public Security University of China, Beijing 100038, China
- Wafangdian Public Security Bureau, Dalian 116300, Liaoning Province, China
- He-Miao Zhao
- Key Laboratory of Forensic Genetics, Institute of Forensic Science, Ministry of Public Security, Beijing 100038, China
- Kang Yang
- Xi'an Public Security Bureau, Xi'an 710038, China
- Jing Chen
- Key Laboratory of Forensic Genetics, Institute of Forensic Science, Ministry of Public Security, Beijing 100038, China
- Rui-Qin Yang
- People's Public Security University of China, Beijing 100038, China
- Chong Wang
- Key Laboratory of Forensic Genetics, Institute of Forensic Science, Ministry of Public Security, Beijing 100038, China
5
Varga G, Stoicu-Tivadar L, Nicola S. Comparison of Data Classification Results in Serious Gaming for Rehabilitation of Rheumatoid Arthritis. Stud Health Technol Inform 2023; 309:63-67. [PMID: 37869807 DOI: 10.3233/shti230740] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/24/2023]
Abstract
Rheumatoid arthritis is a common disease that affects the joints of the wrists, fingers and feet and, ultimately, daily activities. Gestures and virtual reality are now used in many activities that support recovery, gaming and learning, as technology becomes increasingly present in different fields. This paper presents results on the grip movement detected by a Leap Motion device, using binary classification with machine learning algorithms. We compared two models: Naïve Bayes and a Random Forest classifier. The comparison metrics were accuracy, precision, recall and F1-score. We also created a confusion matrix for a clear visualization of the results. We used 5000 samples to train the algorithms and 1500 samples to test them. Accuracy and precision exceeded 97% in all cases.
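The four comparison metrics named in this abstract all follow from the binary confusion matrix. The sketch below shows one way to compute them; the label vectors are hypothetical and do not come from the study, and the function name `binary_metrics` is an illustrative choice.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Compute confusion-matrix counts and the derived metrics for a binary task."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    tn = len(y_true) - tp - fp - fn
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {
        "confusion": {"tp": tp, "fp": fp, "fn": fn, "tn": tn},
        "accuracy": (tp + tn) / len(y_true),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

# Hypothetical ground truth and predictions (1 = grip detected, 0 = no grip).
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]

metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

Reporting precision and recall alongside accuracy matters when the two classes are imbalanced, since accuracy alone can look high while one class is mostly missed.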
Affiliation(s)
- Gabriela Varga
- Politehnica University Timişoara, Department of Automation and Applied Informatics, 2, Vasile Pârvan Blvd., 300223, Timişoara, România
- Lăcrămioara Stoicu-Tivadar
- Politehnica University Timişoara, Department of Automation and Applied Informatics, 2, Vasile Pârvan Blvd., 300223, Timişoara, România
- Stelian Nicola
- Politehnica University Timişoara, Department of Automation and Applied Informatics, 2, Vasile Pârvan Blvd., 300223, Timişoara, România
6
Costantini G, Cesarini V, Brenna E. High-Level CNN and Machine Learning Methods for Speaker Recognition. Sensors (Basel) 2023; 23:3461. [PMID: 37050521 PMCID: PMC10098737 DOI: 10.3390/s23073461] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/28/2023] [Revised: 03/20/2023] [Accepted: 03/22/2023] [Indexed: 06/19/2023]
Abstract
Speaker Recognition (SR) is a common task in AI-based sound analysis, involving structurally different methodologies such as Deep Learning or "traditional" Machine Learning (ML). In this paper, we compared and explored the two methodologies on the DEMoS dataset consisting of 8869 audio files of 58 speakers in different emotional states. A custom CNN is compared to several pre-trained nets using image inputs of spectrograms and Cepstral-temporal (MFCC) graphs. An ML approach based on acoustic feature extraction, selection and multi-class classification by means of a Naïve Bayes model is also considered. Results show that a custom, less deep CNN obtains the most accurate results: 90.15% on grayscale spectrograms and 83.17% on colored MFCC. AlexNet provides comparable results, reaching 89.28% on spectrograms and 83.43% on MFCC. The Naïve Bayes classifier provides an 87.09% accuracy and a 0.985 average AUC while being faster to train and more interpretable. Feature selection shows that F0, MFCC and voicing-related features are the most characterizing for this SR task. The high number of training samples and the emotional content of the DEMoS dataset better reflect a real case scenario for speaker recognition, and account for the generalization power of the models.
7
Barman U, Pathak C, Mazumder NK. Comparative assessment of Pest damage identification of coconut plant using damage texture and color analysis. Multimed Tools Appl 2023; 82:1-23. [PMID: 36712953 PMCID: PMC9874181 DOI: 10.1007/s11042-023-14369-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/11/2022] [Revised: 06/27/2022] [Accepted: 01/02/2023] [Indexed: 06/18/2023]
Abstract
Coconut cultivation is a promising agricultural activity. To keep coconut plants pest-free, detecting the various kinds of pest damage is of utmost importance for cultivators. Cultivators currently detect pest damage in coconut plants through conventional methods, expert opinion, or laboratory techniques, but these procedures are not adequate for identifying coconut damage. In this study, 16 different color and texture features are reported for 1265 coconut pest damage images; the features are extracted in the color and grey domains after the damage has been segmented using a thresholding technique. The Gray Level Co-occurrence Matrix (GLCM) and Gray Level Run Length Matrix (GLRLM) techniques are applied to extract the texture features of the damage, and two Artificial Neural Network (ANN) architectures are reported to classify the extracted features into 5 classes (Eriophyid_Mite, Rhinoceros_Beetle, Red_Palm_Weevil, Rugose_Spiraling_White_fly, and Rugose_in_Mature) with an average testing accuracy of almost 100%. To compare the results with other machine learning techniques, the Support Vector Machine (SVM), Decision Tree (DT), and Naïve Bayes (NB) are also introduced for damage identification; the SVM methods also report almost 100% accuracy on the fused features of GLCM and GLRLM. The results of the ANN and SVM are compared with the DT and NB classifiers using the confusion matrix, precision, recall, and F1-score. The ANN and SVM outperform the other classifiers on all metrics and can be used as base models for further study of coconut pest damage identification using deep learning techniques.
Affiliation(s)
- Utpal Barman
- Department of CSE, The Assam Kaziranga University, Jorhat, Assam India
- Nirmal Kumar Mazumder
- Department of Plant Pathology, BN College of Agriculture, AAU, Biswanath Chariali, Assam India
8
Albataineh Z, Aldrweesh F, Alzubaidi MA. COVID-19 CT-images diagnosis and severity assessment using machine learning algorithm. Cluster Comput 2023:1-16. [PMID: 36712413 PMCID: PMC9871425 DOI: 10.1007/s10586-023-03972-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/10/2022] [Revised: 11/20/2022] [Accepted: 11/26/2022] [Indexed: 06/18/2023]
Abstract
As a pandemic, the primary evaluation tool for coronavirus (COVID-19) still has serious flaws. To improve the existing situation, all facilities and tools available in this field should be used to combat the pandemic. Reverse transcription polymerase chain reaction is used to evaluate whether or not a person has this virus, but it cannot establish the severity of the illness. In this paper, we propose a simple, reliable, and automatic system to diagnose the severity of COVID-19 from the CT scans into three stages: mild, moderate, and severe, based on the simple segmentation method and three types of features extracted from the CT images, which are ratio of infection, statistical texture features (mean, standard deviation, skewness, and kurtosis), GLCM and GLRLM texture features. Four machine learning techniques (decision trees (DT), K-nearest neighbors (KNN), support vector machines (SVM), and Naïve Bayes) are used to classify scans. 1801 scans are divided into four stages based on the CT findings in the scans and the description file found with the datasets. Our proposed model divides into four steps: preprocessing, feature extraction, classification, and performance evaluation. Four machine learning algorithms are used in the classification step: SVM, KNN, DT, and Naive Bayes. By SVM method, the proposed model achieves 99.12%, 98.24%, 98.73%, and 99.9% accuracy for COVID-19 infection segmentation at the normal, mild, moderate, and severe stages, respectively. The area under the curve of the model is 0.99. Finally, our proposed model achieves better performance than state-of-art models. This will help the doctors know the stage of the infection and thus shorten the time and give the appropriate dose of treatment for this stage.
Affiliation(s)
- Zaid Albataineh
- Department of Electronic Engineering, Yarmouk University, Irbid, 21163 Jordan
- Fatima Aldrweesh
- Department of Computer Engineering, Yarmouk University, Irbid, 21163 Jordan
9
Sebro R, la Garza-Ramos CD. Utilizing machine learning for opportunistic screening for low BMD using CT scans of the cervical spine. J Neuroradiol 2022; 50:293-301. [PMID: 36030924 DOI: 10.1016/j.neurad.2022.08.001] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Received: 04/22/2022] [Revised: 08/22/2022] [Accepted: 08/24/2022] [Indexed: 11/28/2022]
Abstract
BACKGROUND Computed Tomography (CT) scans of the cervical spine are often performed to evaluate patients for trauma and degenerative changes of the cervical spine. We hypothesized that the CT attenuation of the cervical vertebrae can be used to identify patients who should be screened for osteoporosis. METHODS Retrospective study of 253 patients (177 training/validation and 76 test) with unenhanced CT scans of the cervical spine and DXA studies within 12 months of each other. Volumetric segmentation of C1-T1, clivus, and first ribs was performed to obtain the CT attenuation of each bone. The correlations of the CT attenuations between the bones and with DXA measurements were evaluated. Univariate receiver operator characteristic (ROC) analyses, and multivariate classifiers (Random Forest (RF), XGBoost, Naïve Bayes (NB), and Support Vector Machines (SVM)) analyzing the CT attenuation of all bones, were utilized to predict patients with osteopenia/osteoporosis and femoral neck bone mineral density (BMD) T-scores <-1. RESULTS There were positive correlations between the CT attenuation of each bone, and with the DXA measurements. A CT attenuation threshold of 305.2 Hounsfield Units (HU) at C3 had the highest accuracy = 0.763 (AUC = 0.814) to detect femoral neck BMD T-scores ≤-1 and a CT attenuation threshold of 323.6 HU at C3 had the highest accuracy = 0.774 (AUC = 0.843) to detect osteopenia/osteoporosis. The SVM classifier (AUC = 0.756) had higher AUC than the RF (AUC = 0.692, P = 0.224), XGBoost (AUC = 0.736; P = 0.814), NB (AUC = 0.622, P = 0.133) and CT threshold of 305.2 HU at C3 (AUC = 0.704, P = 0.531) classifiers to identify patients with femoral neck BMD T-scores <-1. The SVM classifier (accuracy = 0.816) was more accurate than using the CT threshold of 305.2 HU at C3 (accuracy = 0.671) (McNemar's χ² with 1 degree of freedom = 7.55, P = 0.006). CONCLUSION Opportunistic screening for low BMD can be done using cervical spine CT scans. An SVM classifier was more accurate than using the CT threshold of 305.2 HU at C3.
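The McNemar comparison reported in this abstract can be checked numerically. The sketch below computes the upper-tail probability of a chi-square statistic with 1 degree of freedom and applies it to the reported value of 7.55. The discordant counts passed to `mcnemar` are hypothetical, since the abstract does not give them, and whether the authors applied a continuity correction is not stated.

```python
import math

def chi2_sf_1df(stat):
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(stat / 2.0))

def mcnemar(b, c, continuity=False):
    """McNemar's test on discordant counts.

    b: cases classifier A got right and classifier B got wrong
    c: cases classifier A got wrong and classifier B got right
    """
    if b + c == 0:
        raise ValueError("no discordant pairs; the test is undefined")
    diff = abs(b - c) - (1.0 if continuity else 0.0)
    stat = max(diff, 0.0) ** 2 / (b + c)
    return stat, chi2_sf_1df(stat)

# P value implied by the statistic quoted in the abstract.
p_reported = chi2_sf_1df(7.55)
print(round(p_reported, 3))

# Hypothetical discordant counts, for illustration only.
stat, p_value = mcnemar(15, 4)
print(round(stat, 3), round(p_value, 4))
```

Only the discordant pairs enter the statistic, which is why McNemar's test suits two classifiers evaluated on the same test patients.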
Affiliation(s)
- Ronnie Sebro
- Department of Radiology, Mayo Clinic, Jacksonville, FL 32224; Center for Augmented Intelligence, Mayo Clinic, Jacksonville, FL 32224.
10
Rabie AH, Mansour NA, Saleh AI, Takieldeen AE. Expecting individuals' body reaction to Covid-19 based on statistical Naïve Bayes technique. Pattern Recognit 2022; 128:108693. [PMID: 35400761 PMCID: PMC8983097 DOI: 10.1016/j.patcog.2022.108693] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 07/15/2021] [Revised: 02/01/2022] [Accepted: 04/03/2022] [Indexed: 06/14/2023]
Abstract
Covid-19, what a strange, unpredictable mutated virus. It has baffled many scientists, as no firm rule has yet been reached to predict the effect that the virus can inflict on people if they are infected with it. Recently, many researches have been introduced for diagnosing Covid-19; however, none of them pay attention to predict the effect of the virus on the person's body if the infection occurs but before the infection really takes place. Predicting the extent to which people will be affected if they are infected with the virus allows for some drastic precautions to be taken for those who will suffer from serious complications, while allowing some freedom for those who expect not to be affected badly. This paper introduces Covid-19 Prudential Expectation Strategy (CPES) as a new strategy for predicting the behavior of the person's body if he has been infected with Covid-19. The CPES composes of three phases called Outlier Rejection Phase (ORP), Feature Selection Phase (FSP), and Classification Phase (CP). For enhancing the classification accuracy in CP, CPES employs two proposed techniques for outlier rejection in ORP and feature selection in FSP, which are called Hybrid Outlier Rejection (HOR) method and Improved Binary Genetic Algorithm (IBGA) method respectively. In ORP, HOR rejects outliers in the training data using a hybrid method that combines standard division and Binary Gray Wolf Optimization (BGWO) method. On the other hand, in FSP, IBGA as a hybrid method selects the most useful features for the prediction process. IBGA includes Fisher Score (FScore) as a filter method to quickly select the features and BGA as a wrapper method to accurately select the features based on the average accuracy value from several classification models as a fitness function to guarantee the efficiency of the selected subset of features with any classifier. 
In CP, CPES has the ability to classify people based on their bodies' reaction to Covid-19 infection, which is built upon a proposed Statistical Naïve Bayes (SNB) classifier after performing the previous two phases. CPES has been compared against recent related strategies in terms of accuracy, error, recall, precision, and run-time using Covid-19 dataset [1]. This dataset contains routine blood tests collected from people before and after their infection with covid-19 through a Web-based form created by us. CPES outperforms the competing methods in experimental results because it provides the best results with values of 0.87, 0.13, 0.84, and 0.79 for accuracy, error, precision, and recall.
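The filter step in the feature selection phase ranks features by Fisher score before the wrapper search runs. The sketch below uses one common definition of the score (between-class scatter divided by within-class scatter, per feature); the formulation used in the paper may differ, and the toy data and the function name `fisher_scores` are hypothetical.

```python
from collections import defaultdict

def fisher_scores(X, y):
    """Fisher score per feature: sum_k n_k (mu_k - mu)^2 / sum_k n_k var_k."""
    n_features = len(X[0])
    rows_by_class = defaultdict(list)
    for row, label in zip(X, y):
        rows_by_class[label].append(row)
    scores = []
    for j in range(n_features):
        overall_mean = sum(row[j] for row in X) / len(X)
        between = 0.0
        within = 0.0
        for rows in rows_by_class.values():
            n_k = len(rows)
            mean_k = sum(r[j] for r in rows) / n_k
            var_k = sum((r[j] - mean_k) ** 2 for r in rows) / n_k
            between += n_k * (mean_k - overall_mean) ** 2
            within += n_k * var_k
        if within == 0.0:
            # Zero within-class spread: perfectly separating or constant feature.
            scores.append(float("inf") if between > 0.0 else 0.0)
        else:
            scores.append(between / within)
    return scores

# Hypothetical data: feature 0 separates the classes, feature 1 does not.
X = [[1.0, 5.0], [1.2, 3.0], [0.8, 4.0],
     [3.0, 4.0], [3.2, 5.0], [2.8, 3.0]]
y = [0, 0, 0, 1, 1, 1]

scores = fisher_scores(X, y)
ranking = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
print(scores, ranking)
```

A filter like this is cheap because it scores each feature independently, which is why it is typically run before a costlier wrapper method such as a genetic algorithm.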
Affiliation(s)
- Asmaa H Rabie
- Computers and Control Dept. faculty of engineering Mansoura University, Mansoura, Egypt
- Nehal A Mansour
- Nile Higher Institute for Engineering and Technology, Artificial intelligence Lab., Mansoura, Egypt
- Ahmed I Saleh
- Computers and Control Dept. faculty of engineering Mansoura University, Mansoura, Egypt
- Ali E Takieldeen
- IEEE Senior Member, Faculty of Artificial Intelligence, Delta University For Science and Technology, Egypt
11
Abu El-Magd SA, Maged A, Farhat HI. Hybrid-based Bayesian algorithm and hydrologic indices for flash flood vulnerability assessment in coastal regions: machine learning, risk prediction, and environmental impact. Environ Sci Pollut Res Int 2022; 29:57345-57356. [PMID: 35352224 PMCID: PMC9395492 DOI: 10.1007/s11356-022-19903-7] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 08/23/2021] [Accepted: 03/21/2022] [Indexed: 05/29/2023]
Abstract
Natural hazards and severe weather events pose a serious threat to humans, economic activities, and the environment. Flash floods are among the most devastating natural events around the world. Consequently, the prediction and precise assessment of flash flood-prone areas are mandatory for any flood mitigation strategy. In this study, a new hybrid approach combining a machine learning (ML) algorithm with hydrologic indices was adopted to detect impacted and highly vulnerable areas. The obtained models were trained and validated using a total of 189 locations from Wadi Ghoweiba and the surrounding area (case study). Various controlling factors, including stream transport index (STI), stream power index (SPI), lithological units, topographic wetness index (TWI), slope angle, stream density (SD), curvature, and slope aspect (SA), were utilized with hyper-parameter optimization to enhance the prediction performance of the proposed model. The hybrid machine learning (HML) model, developed by combining the naïve Bayes (NïB) approach and hydrologic indices, was successfully implemented and utilized to investigate flash flood risk, sediment accumulation, and erosion predictions in the studied site. The new hybrid model demonstrated an accuracy of 90.8% compared to 87.7% for the NïB model, confirming the superior performance of the obtained model. Furthermore, the proposed model can be successfully employed in large-scale prediction applications.
Affiliation(s)
- Sherif Ahmed Abu El-Magd
- Geology Department, Faculty of Science, Suez University, P.O. Box 43518, El Salam City, Suez Governorate, Egypt
- Ali Maged
- Geology Department, Faculty of Science, Suez University, P.O. Box 43518, El Salam City, Suez Governorate, Egypt
- Hassan I Farhat
- Geology Department, Faculty of Science, Suez University, P.O. Box 43518, El Salam City, Suez Governorate, Egypt
|
12
|
Kalezhi J, Chibuluma M, Chembe C, Chama V, Lungo F, Kunda D. Modelling Covid-19 infections in Zambia using data mining techniques. Results Eng 2022; 13:100363. [PMID: 35317385 PMCID: PMC8813672 DOI: 10.1016/j.rineng.2022.100363] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/29/2021] [Revised: 01/08/2022] [Accepted: 02/01/2022] [Indexed: 06/14/2023]
Abstract
The outbreak of Covid-19 pandemic has been declared a global health crisis by the World Health Organization since its emergence. Several researchers have proposed a number of techniques to understand how the pandemic affects the populations. Reported among these techniques are data mining models which have been successfully applied in a wide range of situations before the advent of Covid-19 pandemic. In this work, the researchers have applied a number of existing data mining methods (classifiers) available in the Waikato Environment for Knowledge Analysis (WEKA) machine learning library. WEKA was used to gain a better understanding on how the epidemic spread within Zambia. The classifiers used are J48 decision tree, Multilayer Perceptron and Naïve Bayes among others. The predictions of these techniques are compared against simpler classifiers and those reported in related works.
Affiliation(s)
- Josephat Kalezhi
- Department of Computer Engineering, Copperbelt University, Kitwe, Zambia
- Mathews Chibuluma
- Department of Information Technology/Systems, Copperbelt University, Kitwe, Zambia
- Victoria Chama
- Department of Computer Science and Information Technology, Mulungushi University, Kabwe, Zambia
- Francis Lungo
- School of Social Sciences, Mulungushi University, Kabwe, Zambia
|
13
|
Tiwari D, Bhati BS, Al‐Turjman F, Nagpal B. Pandemic coronavirus disease (Covid-19): World effects analysis and prediction using machine-learning techniques. Expert Syst 2022; 39:e12714. [PMID: 34177035 PMCID: PMC8209956 DOI: 10.1111/exsy.12714] [Citation(s) in RCA: 10] [Impact Index Per Article: 5.0] [Received: 04/05/2021] [Accepted: 04/26/2021] [Indexed: 05/09/2023]
Abstract
The novel coronavirus disease (Covid-19) is an infectious disease that spreads primarily through droplets of nasal discharge when sneezing and saliva when coughing; it was first reported in Wuhan, China, in December 2019. Covid-19 became a global pandemic with a harmful impact on the world. Many predictive models of Covid-19 are being proposed by academic researchers around the world to support key decisions and enforce appropriate control measures. Owing to the lack of accurate Covid-19 records and to uncertainty, standard techniques fail to correctly predict the global effects of the epidemic. To address this issue, we present an Artificial Intelligence (AI)-based meta-analysis to predict the trend of the Covid-19 epidemic worldwide. Three machine learning algorithms, namely Naïve Bayes, Support Vector Machine (SVM), and Linear Regression, were applied to a real time-series dataset that holds the global record of confirmed, recovered, death, and active cases of the Covid-19 outbreak. Statistical analysis was also conducted to present various facts regarding observed Covid-19 symptoms, a list of the top 20 coronavirus-affected countries, and the number of coactive cases worldwide. Among the three machine learning techniques investigated, Naïve Bayes produced promising results in predicting future Covid-19 trends, with lower Mean Absolute Error (MAE) and Mean Squared Error (MSE). The lower MAE and MSE values indicate the effectiveness of the Naïve Bayes regression technique, although the global footprint of this pandemic is still uncertain. This study demonstrates the various trends and future growth of the global pandemic to support a proactive response from citizens and governments. This paper sets an initial benchmark to demonstrate the capability of machine learning for outbreak prediction.
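The abstract compares models by MAE and MSE. The following minimal sketch shows how the two metrics are computed for a forecast; the case counts are made-up illustration values, not data from the study.

```python
def mean_absolute_error(actual, predicted):
    """Average absolute difference between observed and predicted values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def mean_squared_error(actual, predicted):
    """Average squared difference; penalizes large errors more than MAE."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

# Hypothetical daily case counts and a model's forecasts for the same days.
actual = [100, 120, 150]
predicted = [110, 115, 140]

print(mean_absolute_error(actual, predicted))
print(mean_squared_error(actual, predicted))  # 75.0
```

A model with lower values on both metrics tracks the observed series more closely, which is the sense in which the abstract ranks the three techniques.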
Affiliation(s)
- Dimple Tiwari
- Ambedkar Institute of Advanced Communication Technologies and Research, Govt of NCT of Delhi, Delhi, India
- Bhoopesh Singh Bhati
- Ambedkar Institute of Advanced Communication Technologies and Research, Govt of NCT of Delhi, Delhi, India
- Fadi Al‐Turjman
- Artificial Intelligence Engineering Department, Research Center for AI and IoT, Near East University, Nicosia, Turkey
- Bharti Nagpal
- Ambedkar Institute of Advanced Communication Technologies and Research, Govt of NCT of Delhi, Delhi, India
|
14
|
Abd-Elsalam SM, Ezz MM, Gamalel-Din S, Esmat G, Elakel W, ElHefnawi M. Derivation of "Egyptian varices prediction (EVP) index": A novel noninvasive index for diagnosing esophageal varices in HCV Patients. J Adv Res 2022; 35:87-97. [PMID: 35024195 PMCID: PMC8721354 DOI: 10.1016/j.jare.2021.02.005] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/07/2020] [Revised: 02/06/2021] [Accepted: 02/17/2021] [Indexed: 02/07/2023]
Abstract
Esophageal varices are a complication of chronic liver disease that causes deaths globally due to hemorrhage. Predicting the presence of esophageal varices is essential to prevent bleeding. Currently, the only diagnostic method for esophageal varices is upper gastrointestinal endoscopy, which has many disadvantages. Ten variables were the most significant for diagnosing varices: PLT, stiffness, PC, liver texture, spleen, HCV-RNA, albumin, gender, total bilirubin, and PV diameter. We evaluated the effectiveness of several noninvasive markers for predicting varices. We introduced a novel EVP index with acceptable performance for diagnosing varices; compared with existing indices, it could reduce the need for upper endoscopy by nearly 46.5%.
Introduction: Esophageal varices (EVs) are one of the most dangerous complications of liver fibrosis. Upper gastrointestinal (UGI) endoscopy is necessary for their diagnosis. Repeated examinations for EV screening severely burden endoscopic units in terms of cost and other side implications; moreover, the lack of public health resources in rural areas and primary hospitals should be considered, particularly in developing countries. An accurate noninvasive marker for EVs is therefore highly needed for liver disease patients.
Objectives: This study sought to evaluate how adequately several indices predict EVs and to build a novel, accurate prediction index.
Methods: Five thousand and thirteen patients were enrolled. Laboratory tests, abdominal ultrasonography, liver stiffness measurement using Fibro-scan, and UGI endoscopy were performed. Ten common indices were calculated: Fib-4 score, AST-to-platelet ratio index, Fibrosis index, AST/ALT ratio, Varices Prediction Rule, Baveno VI, APRI-Fib4 Combo, King score, "Model for End-Stage Liver Disease", and Lok Score. The significant predictors of EVs were identified using the "P-value Correlation-based Filter Selection" method, and a novel Egyptian Varices Prediction (EVP) index was developed using binary logistic regression. Diagnostic performance was evaluated by several parameters and the Area Under Curve (AUC).
Results: The EVP index was correlated to EVs at 0.5; it achieved higher performance (AUC 0.788, accuracy 73.3%, and sensitivity 78%) than the other indices at a cutoff point of 0.423.
Conclusion: The EVP index was a good noninvasive predictor. It had acceptable performance for diagnosing EVs and required only regular laboratory tests and imaging data. It can provide a tool for stratifying patients by priority for selective endoscopy and by degree of severity. It will also enable clinicians to concentrate on one marker instead of a wide set of parameters.
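The abstract reports accuracy and sensitivity at a cutoff of 0.423. As a hedged illustration of what evaluating an index at a cutoff involves, the sketch below thresholds a list of scores and computes both metrics. The scores and labels are invented, the function name is hypothetical, and treating a score equal to the cutoff as positive is an assumption.

```python
def evaluate_at_cutoff(scores, labels, cutoff=0.423):
    """Classify each score as positive when score >= cutoff and return
    (sensitivity, accuracy) against the true binary labels (1 = varices)."""
    tp = fp = tn = fn = 0
    for score, label in zip(scores, labels):
        predicted = 1 if score >= cutoff else 0
        if predicted == 1 and label == 1:
            tp += 1
        elif predicted == 1 and label == 0:
            fp += 1
        elif predicted == 0 and label == 0:
            tn += 1
        else:
            fn += 1
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    accuracy = (tp + tn) / len(labels)
    return sensitivity, accuracy

# Invented index values for six patients and their endoscopy findings.
scores = [0.9, 0.5, 0.3, 0.2, 0.6, 0.1]
labels = [1, 1, 1, 0, 0, 0]
sensitivity, accuracy = evaluate_at_cutoff(scores, labels)
print(f"sensitivity={sensitivity:.3f} accuracy={accuracy:.3f}")
```

Lowering the cutoff generally raises sensitivity at the cost of more false positives, which is why a reported cutoff always accompanies the metrics.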
Affiliation(s)
- Shimaa M Abd-Elsalam
- Systems and Information Department, Engineering Research Division, National Research Centre, Giza, Egypt; Biomedical Informatics in Cheminformatic Group, Centre of Excellence for Medical Research, National Research Centre, Giza, Egypt; Systems and Computers Engineering Department, Faculty of Engineering, Al-Azhar University, Cairo, Egypt
- Mohamed M Ezz
- Department of Computer Science, College of Computer and Information Sciences, Jouf University, Sakaka, Saudi Arabia; Systems and Computers Engineering Department, Faculty of Engineering, Al-Azhar University, Cairo, Egypt
- Shehab Gamalel-Din
- Systems and Computers Engineering Department, Faculty of Engineering, Al-Azhar University, Cairo, Egypt
- Gamal Esmat
- Endemic Medicine and Hepatology Department, Faculty of Medicine, Cairo University, Cairo, Egypt
- Wafaa Elakel
- Endemic Medicine and Hepatology Department, Faculty of Medicine, Cairo University, Cairo, Egypt
- Mahmoud ElHefnawi
- Systems and Information Department, Engineering Research Division, National Research Centre, Giza, Egypt; Biomedical Informatics in Cheminformatic Group, Centre of Excellence for Medical Research, National Research Centre, Giza, Egypt
|
15
|
Alshammari MM, Almuhanna A, Alhiyafi J. Mammography Image-Based Diagnosis of Breast Cancer Using Machine Learning: A Pilot Study. Sensors (Basel) 2021; 22:203. [PMID: 35009746 DOI: 10.3390/s22010203] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 12/06/2021] [Revised: 12/22/2021] [Accepted: 12/24/2021] [Indexed: 02/08/2023]
Abstract
A tumor is an abnormal tissue classified as either benign or malignant. A breast tumor is one of the most common tumors in women. Radiologists use mammograms to identify a breast tumor and classify it, which is a time-consuming process and prone to error due to the complexity of the tumor. In this study, we applied machine learning-based techniques to assist the radiologist in reading mammogram images and classifying the tumor in a very reasonable time interval. We extracted several features from the region of interest in the mammogram, which the radiologist manually annotated. These features are incorporated into a classification engine to train and build the proposed structure classification models. We used a dataset that was not previously seen in the model to evaluate the accuracy of the proposed system following the standard model evaluation schemes. Accordingly, this study found that various factors could affect the performance, which we avoided after experimenting all the possible ways. This study finally recommends using the optimized Support Vector Machine or Naïve Bayes, which produced 100% accuracy after integrating the feature selection and hyper-parameter optimization schemes.
|
16
|
Das S, Amin SA, Jha T. Insight into the structural requirement of aryl sulphonamide based gelatinases (MMP-2 and MMP-9) inhibitors - Part I: 2D-QSAR, 3D-QSAR topomer CoMFA and Naïve Bayes studies - First report of 3D-QSAR Topomer CoMFA analysis for MMP-9 inhibitors and jointly inhibitors of gelatinases together. SAR QSAR Environ Res 2021; 32:655-687. [PMID: 34355614 DOI: 10.1080/1062936x.2021.1955414] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 06/01/2021] [Accepted: 07/11/2021] [Indexed: 06/13/2023]
Abstract
Gelatinases [gelatinase A - matrix metalloproteinase-2 (MMP-2), gelatinase B - matrix metalloproteinase-9 (MMP-9)] play key roles in many disease conditions including cancer. Despite some research work on gelatinases inhibitors both jointly and individually had been reported, challenges still exist in achieving potency as well as selectivity. Here in part I of a series of work, we have reported the structural requirement of some arylsulfonamides. In particular, regression-based 2D-QSARs, topomer CoMFA (comparative molecular field analysis) and Bayesian classification models were constructed to refine structural features for attaining better gelatinase inhibitory activity. The 2D-QSAR models exhibited good statistical significance. The descriptors nsssN, SHBint6, SHBint7, PubchemFP629 were directly correlated with the MMP-2 binding affinities whereas nsssN, SHBint10 and AATS2i were directly proportional to MMP-9 binding affinities. The topomer CoMFA results indicated that the steric and electrostatic fields play key roles in gelatinase inhibition. The established Naïve Bayes prediction models were evaluated by fivefold cross validation and an external test set. Furthermore, important molecular descriptors related to MMP-2 and MMP-9 binding affinities and some active/inactive fragments were identified. Thus, these observations may be helpful for further work of aryl sulphonamide based gelatinase inhibitors in future.
Affiliation(s)
- S Das
- Natural Science Laboratory, Division of Medicinal and Pharmaceutical Chemistry, Department of Pharmaceutical Technology, Jadavpur University, Kolkata, India
- S A Amin
- Natural Science Laboratory, Division of Medicinal and Pharmaceutical Chemistry, Department of Pharmaceutical Technology, Jadavpur University, Kolkata, India
- T Jha
- Natural Science Laboratory, Division of Medicinal and Pharmaceutical Chemistry, Department of Pharmaceutical Technology, Jadavpur University, Kolkata, India
|
17
|
Jha AN, Chatterjee N, Tiwari G. A performance analysis of prediction techniques for impacting vehicles in hit-and-run road accidents. Accid Anal Prev 2021; 157:106164. [PMID: 33957476 DOI: 10.1016/j.aap.2021.106164] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 11/07/2019] [Revised: 01/12/2021] [Accepted: 04/27/2021] [Indexed: 06/12/2023]
Abstract
Road accidents are a globally recognized challenge. They are one of the significant causes of deaths and injuries, besides other direct and indirect losses. Countries and international organizations have designed technologies, systems, and policies to prevent accidents. However, hit-and-run accidents remain one of the most dangerous types of road accidents because information about the vehicle responsible for the accident remains unknown. Therefore, any mechanism that can provide information about the impacting vehicle in hit-and-run accidents will be useful in planning and executing preventive measures to address this road menace. Since several models exist to predict the unknown impacting vehicle, it becomes important to find which is the most accurate among those available. This research applies a process-based approach that identifies the most accurate model out of six supervised learning classification models, viz. Logistic Reasoning, Linear Discriminant Analysis, Naïve Bayes, Classification and Regression Trees, k-Nearest Neighbor, and Support Vector Machine. These models are implemented using five-fold and ten-fold cross validation on road accident data collected from five mid-sized Indian cities: Agra, Amritsar, Bhopal, Ludhiana, and Vizag (Vishakhapatnam). This study investigates the possible input factors that may affect the performance of the applied models. Based on the results of the experiment conducted in this study, Support Vector Machine was found to have the greatest potential to predict the unknown impacting vehicle type in hit-and-run accidents for all the cities except Amritsar. The results indicate that Classification and Regression Trees have the maximum accuracy for Amritsar. Naïve Bayes performed very poorly for the five cities. These recommendations will help in predicting unknown impacting vehicles in hit-and-run accidents. The outcome is useful for transportation authorities and policymakers to implement effective road safety measures for the safety of road users.
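The study evaluates its classifiers with five-fold and ten-fold cross validation. The sketch below shows one simple way to generate the train/test index splits for k folds; it is a generic illustration using only the standard library, and the function name and the fixed shuffling seed are assumptions rather than the authors' procedure.

```python
import random

def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_indices, test_indices) for each of k folds.

    Every sample appears in exactly one test fold; the remaining
    k - 1 folds form the training set for that round.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal shuffled indices round-robin so fold sizes differ by at most one.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

# Five-fold split of ten samples: each round trains on 8 and tests on 2.
for train, test in k_fold_indices(10, 5):
    print(len(train), len(test))
```

A model's cross-validated accuracy is then the average of its accuracy over the k test folds, which is how the six classifiers could be compared on each city's data.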
|
18
|
Savic N, Bovio N, Gilbert F, Paz J, Guseva Canu I. Procode: A Machine-Learning Tool to Support (Re-)coding of Free-Texts of Occupations and Industries. Ann Work Expo Health 2021; 66:113-118. [PMID: 34145882 DOI: 10.1093/annweh/wxab037] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 01/08/2021] [Revised: 03/30/2021] [Accepted: 05/07/2021] [Indexed: 11/13/2022]
Abstract
Procode is a free of charge web-tool that allows automatic coding of occupational data (free-texts) by implementing Complement Naïve Bayes (CNB) as a machine-learning technique. The paper describes the algorithm, performance evaluation, and future goals regarding the tool's development. Almost 30 000 free-texts with manually assigned classification codes of French classification of occupations (PCS) and French classification of activities (NAF) were used to train CNB. A 5-fold cross-validation found that Procode predicts correct classification codes in 57-81 and 63-83% cases for PCS and NAF, respectively. Procode also integrates recoding between two classifications. In the first version of Procode, this operation, however, is only a simple search function of recoding links in existing crosswalks. Future focus of the project will be collection of the data to support automatic coding to other classification and to establish a more advanced method for recoding.
Affiliation(s)
- Nenad Savic
- Department for Health, Work and Environment, Centre for Primary Care and Public Health (Unisanté), University of Lausanne, Route de la Corniche 2, CH-1066 Epalinges-Lausanne, Switzerland
- Nicolas Bovio
- Department for Health, Work and Environment, Centre for Primary Care and Public Health (Unisanté), University of Lausanne, Route de la Corniche 2, CH-1066 Epalinges-Lausanne, Switzerland
- Fabien Gilbert
- Research Institute for Environmental and Occupational Health, 28 rue Roger Amsler, CS 74521, 49045 Angers, France
- José Paz
- Department for Health, Work and Environment, Centre for Primary Care and Public Health (Unisanté), University of Lausanne, Route de la Corniche 2, CH-1066 Epalinges-Lausanne, Switzerland
- Irina Guseva Canu
- Department for Health, Work and Environment, Centre for Primary Care and Public Health (Unisanté), University of Lausanne, Route de la Corniche 2, CH-1066 Epalinges-Lausanne, Switzerland
|
19
|
Chatterjee A, Roy S, Das S. A Bi-fold Approach to Detect and Classify COVID-19 X-Ray Images and Symptom Auditor. SN Comput Sci 2021; 2:304. [PMID: 34075356 PMCID: PMC8160081 DOI: 10.1007/s42979-021-00701-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/10/2021] [Accepted: 05/12/2021] [Indexed: 11/17/2022]
Abstract
In this paper, we propose an ensemble-based transfer learning method to predict the X-ray image of a COVID-19 affected person. We have used a weighted Euclidean distance average as the parameter to ensemble the transfer learning model viz. ResNet50, VGG16, VGG19, Xception, and InceptionV3. Image augmentations have been carried out using generative adversarial network modelling. We took 784 training images, and 278 test images to validate our model accuracy, and the accuracy of our proposed model was around 98.67% for the training data set and 95.52% for the test data set. Along with that, we also propose a genetic algorithm optimized classification algorithm, to analyze the symptoms of COVID-19 for low, medium, and high-risk patients. The accuracy for the optimized set overshadowed the accuracy of un-optimized classification, and the optimized accuracy is as high as 88.96% for the optimized model. The novelty of this paper lies in the bi-sided model of the paper, i.e., we propose two major models, and one is the genetic algorithm optimized model to analyze the symptoms for a patient of varied risk and the other is to classify the X-ray image using an ensemble-based transfer learning model.
Affiliation(s)
- Ahan Chatterjee
- Department of Computer Science and Engineering, The Neotia University, Sarisha, West Bengal, India
- Swagatam Roy
- Department of Computer Science and Engineering, The Neotia University, Sarisha, West Bengal, India
- Sunanda Das
- Department of Computer Science and Engineering, SVCET, Chittoor, Andhra Pradesh, India
|
20
|
Bosc N, Felix E, Arcila R, Mendez D, Saunders MR, Green DVS, Ochoada J, Shelat AA, Martin EJ, Iyer P, Engkvist O, Verras A, Duffy J, Burrows J, Gardner JMF, Leach AR. MAIP: a web service for predicting blood-stage malaria inhibitors. J Cheminform 2021; 13:13. [PMID: 33618772 PMCID: PMC7898753 DOI: 10.1186/s13321-021-00487-2] [Citation(s) in RCA: 15] [Impact Index Per Article: 5.0] [Received: 07/13/2020] [Accepted: 01/20/2021] [Indexed: 12/17/2022]
Abstract
Malaria is a disease affecting hundreds of millions of people across the world, mainly in developing countries and especially in sub-Saharan Africa. It is the cause of hundreds of thousands of deaths each year and there is an ever-present need to identify and develop effective new therapies to tackle the disease and overcome increasing drug resistance. Here, we extend a previous study in which a number of partners collaborated to develop a consensus in silico model that can be used to identify novel molecules that may have antimalarial properties. The performance of machine learning methods generally improves with the number of data points available for training. One practical challenge in building large training sets is that the data are often proprietary and cannot be straightforwardly integrated. Here, this was addressed by sharing QSAR models, each built on a private data set. We describe the development of an open-source software platform for creating such models, a comprehensive evaluation of methods to create a single consensus model and a web platform called MAIP available at https://www.ebi.ac.uk/chembl/maip/ . MAIP is freely available for the wider community to make large-scale predictions of potential malaria inhibiting compounds. This project also highlights some of the practical challenges in reproducing published computational methods and the opportunities that open-source software can offer to the community.
Affiliation(s)
- Nicolas Bosc
- European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, CB10 1SD, Hinxton, Cambridge, United Kingdom
- Eloy Felix
- European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, CB10 1SD, Hinxton, Cambridge, United Kingdom
- Ricardo Arcila
- European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, CB10 1SD, Hinxton, Cambridge, United Kingdom
- David Mendez
- European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, CB10 1SD, Hinxton, Cambridge, United Kingdom
- Martin R Saunders
- Department of Molecular Design, Data and Computational Sciences, GlaxoSmithKline, Gunnels Wood Road, Hertfordshire, SG1 2NY, Stevenage, UK
- Darren V S Green
- Department of Molecular Design, Data and Computational Sciences, GlaxoSmithKline, Gunnels Wood Road, Hertfordshire, SG1 2NY, Stevenage, UK
- Jason Ochoada
- Department of Chemical Biology and Therapeutics, St. Jude Children's Research Hospital, 262 Danny Thomas Place, Tennessee, 38105, Memphis, USA
- Anang A Shelat
- Department of Chemical Biology and Therapeutics, St. Jude Children's Research Hospital, 262 Danny Thomas Place, Tennessee, 38105, Memphis, USA
- Eric J Martin
- Novartis Institute for Biomedical Research, 5300 Chiron Way, California, 94608-2916, Emeryville, USA
- Preeti Iyer
- Hit Discovery, Discovery Sciences, R&D, AstraZeneca, Gothenburg, Sweden
- Ola Engkvist
- Hit Discovery, Discovery Sciences, R&D, AstraZeneca, Gothenburg, Sweden
- Andreas Verras
- Schrodinger Inc, 120 West 45th Street, 10036-4041, New York, NY, USA
- James Duffy
- Medicines for Malaria Ventures Discovery, 1215, Geneva, Switzerland
- Jeremy Burrows
- Medicines for Malaria Ventures Discovery, 1215, Geneva, Switzerland
- J Mark F Gardner
- AMG Consultants Ltd, Discovery Park House, Discovery Park, Ramsgate Road, CT13 9ND, Sandwich, Kent, UK
- Andrew R Leach
- European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, CB10 1SD, Hinxton, Cambridge, United Kingdom
|
21
|
Jain N, Jhunthra S, Garg H, Gupta V, Mohan S, Ahmadian A, Salahshour S, Ferrara M. Prediction modelling of COVID using machine learning methods from B-cell dataset. Results Phys 2021; 21:103813. [PMID: 33495725 PMCID: PMC7816944 DOI: 10.1016/j.rinp.2021.103813] [Citation(s) in RCA: 26] [Impact Index Per Article: 8.7] [Received: 11/09/2020] [Revised: 12/25/2020] [Accepted: 12/30/2020] [Indexed: 05/03/2023]
Abstract
Coronavirus is a pandemic that has become a concern for the whole world. This disease has stepped out to its greatest extent and is expanding day by day. Coronavirus, termed as a worldwide disease, has caused more than 8 lakh deaths worldwide. The foremost cause of the spread of coronavirus is SARS-CoV and SARS-CoV-2, which are part of the coronavirus family. Thus, predicting the patients suffering from such pandemic diseases would help to formulate the difference in inaccurate and infeasible time duration. This paper mainly focuses on the prediction of SARS-CoV and SARS-CoV-2 using the B-cells dataset. The paper also proposes different ensemble learning strategies that came out to be beneficial while making predictions. The predictions are made using various machine learning models. The numerous machine learning models, such as SVM, Naïve Bayes, K-nearest neighbors, AdaBoost, Gradient boosting, XGBoost, Random forest, ensembles, and neural networks are used in predicting and analyzing the dataset. The most accurate result was obtained using the proposed algorithm with 0.919 AUC score and 87.248% validation accuracy for predicting SARS-CoV and 0.923 AUC and 87.7934% validation accuracy for predicting SARS-CoV-2 virus.
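The results above are reported as AUC scores. For readers unfamiliar with the metric, this minimal sketch computes ROC AUC as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, counting ties as one half. The scores and labels are invented for illustration, and the pairwise loop is chosen for clarity rather than speed.

```python
def roc_auc(scores, labels):
    """ROC AUC via pairwise comparison of positive and negative scores."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Invented classifier scores: three of the four positive/negative pairs
# are ranked correctly.
print(roc_auc([0.9, 0.8, 0.4, 0.3], [1, 0, 1, 0]))  # 0.75
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation, so values around 0.92 indicate that positives are ranked above negatives in most pairs.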
Affiliation(s)
- Nikita Jain
- Department of Computer Science & Engineering, Bharati Vidyapeeth's College of Engineering, 110063 New Delhi, India
- Srishti Jhunthra
- Department of Computer Science & Engineering, Bharati Vidyapeeth's College of Engineering, 110063 New Delhi, India
- Harshit Garg
- Department of Computer Science & Engineering, Bharati Vidyapeeth's College of Engineering, 110063 New Delhi, India
- Vedika Gupta
- Department of Computer Science & Engineering, Bharati Vidyapeeth's College of Engineering, 110063 New Delhi, India
- Senthilkumar Mohan
- School of Information Technology and Engineering, Vellore Institute of Technology, Vellore 632014, India
- Ali Ahmadian
- Institute of IR 4.0, The National University of Malaysia, Bangi 43600 UKM, Selangor, Malaysia
- School of Mathematical Sciences, College of Science and Technology, Wenzhou-Kean University, Wenzhou, China
- Soheil Salahshour
- Faculty of Engineering and Natural Sciences, Bahcesehir University, Istanbul, Turkey
- Massimiliano Ferrara
- ICRIOS - The Invernizzi Centre for Research in Innovation, Organization, Strategy and Entrepreneurship, Bocconi University - Department of Management and Technology, Via Sarfatti, 25, Milano (MI) 20136, Italy
|
22
|
Humayun F, Khan F, Fawad N, Shamas S, Fazal S, Khan A, Ali A, Farhan A, Wei DQ. Computational Method for Classification of Avian Influenza A Virus Using DNA Sequence Information and Physicochemical Properties. Front Genet 2021; 12:599321. [PMID: 33584824 PMCID: PMC7877484 DOI: 10.3389/fgene.2021.599321] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 08/26/2020] [Accepted: 01/04/2021] [Indexed: 11/30/2022]
Abstract
Accurate and fast characterization of the subtype sequences of Avian influenza A virus (AIAV) hemagglutinin (HA) and neuraminidase (NA) depends on expanding diagnostic services and is embedded in molecular epidemiological studies. A new approach for classifying the AIAV sequences of the HA and NA genes into subtypes using DNA sequence data and physicochemical properties is proposed. This method simply requires unaligned, full-length, or partial sequences of HA or NA DNA as input. It allows for quick and highly accurate assignments of HA sequences to subtypes H1–H16 and NA sequences to subtypes N1–N9. For feature extraction, k-gram, discrete wavelet transformation, and multivariate mutual information were used, and different classifiers were trained for prediction. Four different classifiers, Naïve Bayes, Support Vector Machine (SVM), K nearest neighbor (KNN), and Decision Tree, were compared using our feature selection method. This comparison is based on the 30% dataset separated from the original dataset for testing purposes. Among the four classifiers, Decision Tree was the best, and Precision, Recall, F1 score, and Accuracy were 0.9514, 0.9535, 0.9524, and 0.9571, respectively. Decision Tree had considerable improvements over the other three classifiers using our method. Results show that the proposed feature selection method, when trained with a Decision Tree classifier, gives the best results for accurate prediction of the AIAV subtype.
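The method takes unaligned DNA sequences and, among other features, uses k-grams. As a rough sketch of that one step, the code below turns a sequence into normalized k-gram (k-mer) frequencies and also shows the usual F1 formula. The function names, the choice of k = 3, and the toy sequence are assumptions; the paper's wavelet and mutual-information features are not reproduced here.

```python
from collections import Counter

def kmer_frequencies(sequence, k=3):
    """Map each overlapping k-gram in the sequence to its relative frequency."""
    seq = sequence.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    return {kmer: count / total for kmer, count in counts.items()} if total else {}

def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Toy sequence: windows are ATG, TGA, GAT, ATG.
print(kmer_frequencies("ATGATG"))

# Harmonic mean of the reported Decision Tree precision and recall.
print(round(f1_score(0.9514, 0.9535), 4))  # 0.9524
```

Because the features depend only on local k-gram counts, partial or unaligned sequences can be featurized the same way as full-length ones, which fits the input requirements stated in the abstract. The harmonic mean of the reported precision and recall also agrees with the reported F1 score of 0.9524.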
Affiliation(s)
- Fahad Humayun
- State Key Laboratory of Microbial Metabolism, Department of Bioinformatics and Biological Statistics, School of Life Sciences and Biotechnology, Shanghai Jiao Tong University, Shanghai, China
| | - Fatima Khan
- Department of Bioinformatics and Biosciences, Capital University of Science and Technology, Islamabad, Pakistan
| | - Nasim Fawad
- Poultry Research Institute, Rawalpindi, Pakistan
| | - Shazia Shamas
- Department of Zoology, University of Gujrat, Gujrat, Pakistan
| | - Sahar Fazal
- Department of Bioinformatics and Biosciences, Capital University of Science and Technology, Islamabad, Pakistan
| | - Abbas Khan
- State Key Laboratory of Microbial Metabolism, Department of Bioinformatics and Biological Statistics, School of Life Sciences and Biotechnology, Shanghai Jiao Tong University, Shanghai, China
| | - Arif Ali
- State Key Laboratory of Microbial Metabolism, Department of Bioinformatics and Biological Statistics, School of Life Sciences and Biotechnology, Shanghai Jiao Tong University, Shanghai, China
| | - Ali Farhan
- Department of Bioinformatics and Biotechnology, Government College University Faisalabad, Faisalabad, Pakistan
| | - Dong-Qing Wei
- State Key Laboratory of Microbial Metabolism, Department of Bioinformatics and Biological Statistics, School of Life Sciences and Biotechnology, Shanghai Jiao Tong University, Shanghai, China
|
23
|
Iqbal N, Sang J, Chen J, Xia X. Measuring Software Maintainability with Naïve Bayes Classifier. Entropy (Basel) 2021; 23:e23020136. [PMID: 33499278 PMCID: PMC7910974 DOI: 10.3390/e23020136] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 12/16/2020] [Revised: 01/17/2021] [Accepted: 01/19/2021] [Indexed: 11/16/2022]
Abstract
Software products in the market are changing due to changes in business processes, technology, or new requirements from the customers. Maintainability of legacy systems has always been an inspiring task for the software companies. In order to determine whether the software requires maintainability by reverse engineering or by forward engineering approach, a system assessment was done from diverse perspectives: quality, business value, type of errors, etc. In this research, the changes required in the existing software components of the legacy system were identified using a supervised learning approach. New interfaces for the software components were redesigned according to the new requirements and/or type of errors. Software maintainability was measured by applying a machine learning technique, i.e., Naïve Bayes classifier. The dataset was designed based on the observations such as component state, successful or error type in the component, line of code of error that exists in the component, component business value, and changes required for the component or not. The results generated by the Waikato Environment for Knowledge Analysis (WEKA) software confirm the effectiveness of the introduced methodology with an accuracy of 97.18%.
Affiliation(s)
| | - Jun Sang
- Correspondence: ; Tel.: +86-139-8369-7592
|
24
|
Lakretz Y, Ossmy O, Friedmann N, Mukamel R, Fried I. Single-cell activity in human STG during perception of phonemes is organized according to manner of articulation. Neuroimage 2021; 226:117499. [PMID: 33186717 DOI: 10.1016/j.neuroimage.2020.117499] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.0] [Received: 05/18/2020] [Revised: 09/29/2020] [Accepted: 10/21/2020] [Indexed: 11/23/2022] Open
Abstract
One of the central tasks of the human auditory system is to extract sound features from incoming acoustic signals that are most critical for speech perception. Specifically, phonological features and phonemes are the building blocks for more complex linguistic entities, such as syllables, words and sentences. Previous ECoG and EEG studies showed that various regions in the superior temporal gyrus (STG) exhibit selective responses to specific phonological features. However, electrical activity recorded by ECoG or EEG grids reflects average responses of large neuronal populations and is therefore limited in providing insights into activity patterns of single neurons. Here, we recorded spiking activity from 45 units in the STG from six neurosurgical patients who performed a listening task with phoneme stimuli. Fourteen units showed significant responsiveness to the stimuli. Using a Naïve-Bayes model, we find that single-cell responses to phonemes are governed by manner-of-articulation features and are organized according to sonority with two main clusters for sonorants and obstruents. We further find that 'neural similarity' (i.e. the similarity of evoked spiking activity between pairs of phonemes) is comparable to the 'perceptual similarity' (i.e. to what extent two phonemes are judged as sounding similar) based on perceptual confusion, assessed behaviorally in healthy subjects. Thus, phonemes that were perceptually similar also had similar neural responses. Taken together, our findings indicate that manner-of-articulation is the dominant organization dimension of phoneme representations at the single-cell level, suggesting a remarkable consistency across levels of analyses, from the single neuron level to that of large neuronal populations and behavior.
|
25
|
Tavazzi E, Daberdaku S, Vasta R, Calvo A, Chiò A, Di Camillo B. Exploiting mutual information for the imputation of static and dynamic mixed-type clinical data with an adaptive k-nearest neighbours approach. BMC Med Inform Decis Mak 2020; 20:174. [PMID: 32819346 PMCID: PMC7439551 DOI: 10.1186/s12911-020-01166-2] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Received: 05/27/2020] [Accepted: 06/24/2020] [Indexed: 11/12/2022] Open
Abstract
Background: Clinical registers constitute an invaluable resource in the medical data-driven decision making context. Accurate machine learning and data mining approaches on these data can lead to faster diagnosis, definition of tailored interventions, and improved outcome prediction. A typical issue when implementing such approaches is the almost unavoidable presence of missing values in the collected data. In this work, we propose an imputation algorithm based on a mutual information-weighted k-nearest neighbours approach, able to handle the simultaneous presence of missing information in different types of variables. We developed and validated the method on a clinical register, constituted by the information collected over subsequent screening visits of a cohort of patients affected by amyotrophic lateral sclerosis. Methods: For each subject with missing data to be imputed, we create a feature vector constituted by the information collected over his/her first three months of visits. This vector is used as a sample in a k-nearest neighbours procedure, in order to select, among the other patients, the ones with the most similar temporal evolution of the disease. An ad hoc similarity metric was implemented for the sample comparison, capable of handling the different nature of the data and the presence of multiple missing values, and of including the cross-information among features captured by the mutual information statistic. Results: We validated the proposed imputation method on an independent test set, comparing its performance with those of three state-of-the-art competitors, resulting in better performance. We further assessed the validity of our algorithm by comparing the performance of a survival classifier built on the data imputed with our method versus the one built on the data imputed with the best-performing competitor. Conclusions: Imputation of missing data is a crucial, and often mandatory, step when working with real-world datasets. The algorithm proposed in this work could effectively impute an amyotrophic lateral sclerosis clinical dataset, by handling the temporal and the mixed-type nature of the data and by exploiting the cross-information among features. We also showed how the imputation quality can affect a machine learning task.
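The abstract does not spell out the ad hoc similarity metric, so the following is only a rough sketch of the general idea: mutual information between discrete features can serve as a weight, and a distance can skip features that are missing in either sample. Both function names and the mismatch-based distance are assumptions made for illustration, not the authors' method.

```python
import math
from collections import Counter


def mutual_information(x, y):
    """Mutual information (in nats) between two equally long discrete sequences."""
    n = len(x)
    px, py, pxy = Counter(x), Counter(y), Counter(zip(x, y))
    return sum(
        (c / n) * math.log((c / n) / ((px[a] / n) * (py[b] / n)))
        for (a, b), c in pxy.items()
    )


def weighted_distance(u, v, weights):
    """Weighted mismatch rate between two samples, skipping missing values.

    Hypothetical metric: a feature contributes only when it is observed
    (not None) in both samples; weights could come from mutual_information.
    """
    num = den = 0.0
    for a, b, w in zip(u, v, weights):
        if a is None or b is None:
            continue
        num += w * (a != b)
        den += w
    return num / den if den else float("inf")
```

Normalizing by the sum of the weights actually used keeps distances comparable between pairs of samples that have different numbers of observed features.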
Affiliation(s)
- Erica Tavazzi
- Department of Information Engineering, University of Padua, Via Gradenigo 6/A, Padua, 35131, Italy
| | - Sebastian Daberdaku
- Department of Information Engineering, University of Padua, Via Gradenigo 6/A, Padua, 35131, Italy
| | - Rosario Vasta
- Department of Neurosciences "Rita Levi Montalcini", University of Turin, Via Cherasco 15, Turin, 10124, Italy
| | - Andrea Calvo
- Department of Neurosciences "Rita Levi Montalcini", University of Turin, Via Cherasco 15, Turin, 10124, Italy
| | - Adriano Chiò
- Department of Neurosciences "Rita Levi Montalcini", University of Turin, Via Cherasco 15, Turin, 10124, Italy
| | - Barbara Di Camillo
- Department of Information Engineering, University of Padua, Via Gradenigo 6/A, Padua, 35131, Italy.
|
26
|
Abstract
The problem of cancer risk analysis is of great importance to health-service providers and medical researchers. In this study, we propose a novel Artificial Neural Network (ANN) algorithm based on the probabilistic framework, which aims to investigate patient patterns associated with their disease development. Compared to the traditional ANN where input features are directly extracted from raw data, the proposed probabilistic ANN manipulates original inputs according to their probability distribution. More precisely, the Naïve Bayes and Markov chain models are used to approximate the posterior distribution of the raw inputs, which provides a useful estimation of subsequent disease development. Later, this distribution information is further leveraged as additional input to train ANN. Additionally, to reduce the training cost and to boost the generalization capability, a sparse training strategy is also introduced. Experimentally, one of the largest cancer-related datasets is employed in this study. Compared to state-of-the-art methods, the proposed algorithm achieves a much better outcome, in terms of the prediction accuracy of subsequent disease development. The result also reveals the potential impact of patients' disease sequence on their future risk management.
Affiliation(s)
- Chaoyu Yang
- School of Economics and Management, Anhui University of Science and Technology, Huainan, China
| | - Jie Yang
- Faculty of Engineering and Information Sciences, School of Computing and Information Technology, University of Wollongong, Wollongong, NSW, Australia
| | - Ying Liu
- School of Economics and Management, Anhui University of Science and Technology, Huainan, China
| | - Xianya Geng
- School of Mathematics and Physics, Anhui University of Science and Technology, Huainan, China
|
27
|
Saccà V, Sarica A, Novellino F, Barone S, Tallarico T, Filippelli E, Granata A, Chiriaco C, Bruno Bossio R, Valentino P, Quattrone A. Evaluation of machine learning algorithms performance for the prediction of early multiple sclerosis from resting-state FMRI connectivity data. Brain Imaging Behav 2020; 13:1103-1114. [PMID: 29992392 DOI: 10.1007/s11682-018-9926-9] [Citation(s) in RCA: 41] [Impact Index Per Article: 10.3] [Indexed: 02/06/2023]
Abstract
The application of machine learning to clinical data to support diagnosis and prognostic evaluation is attracting growing interest in the scientific community. However, the choice of the right algorithm is fundamental to performing reliable and robust classification. Our study aimed to explore whether different kinds of machine learning techniques could be effective in supporting early diagnosis of Multiple Sclerosis, and which of them performed best in distinguishing Multiple Sclerosis patients from control subjects. We selected the following algorithms: Random Forest, Support Vector Machine, Naïve-Bayes, K-nearest-neighbor and Artificial Neural Network. We applied Independent Component Analysis to the resting-state functional-MRI sequence to identify brain networks. We found 15 networks, from which we extracted the mean signals used in classification. We performed feature selection in all algorithms to obtain the most important variables. We showed that the best discriminant network between controls and early Multiple Sclerosis was the sensori-motor I, in accordance with the early manifestation of motor/sensorial deficits in Multiple Sclerosis. Moreover, in classification performance, Random Forest and Support Vector Machine showed the same 5-fold cross-validation accuracy (85.7%) using only this network, making them the best approaches. We believe that these findings could represent an encouraging step toward translation to clinical diagnosis and prognosis.
Affiliation(s)
- Valeria Saccà
- Department of Medical and Surgical Sciences, University "Magna Graecia", Catanzaro, Italy
| | - Alessia Sarica
- National Research Council, Institute of Bioimaging and Molecular Physiology (IBFM), Catanzaro, Italy
| | - Fabiana Novellino
- National Research Council, Institute of Bioimaging and Molecular Physiology (IBFM), Catanzaro, Italy.
| | - Stefania Barone
- Institute of Neurology, University Magna Graecia, Catanzaro, Italy
| | | | | | - Alfredo Granata
- Institute of Neurology, University Magna Graecia, Catanzaro, Italy
| | - Carmelina Chiriaco
- National Research Council, Institute of Bioimaging and Molecular Physiology (IBFM), Catanzaro, Italy
| | - Roberto Bruno Bossio
- Neurology Operating Unit Serraspiga, Provincial Health Authority, Cosenza, Italy
| | - Paola Valentino
- Institute of Neurology, University Magna Graecia, Catanzaro, Italy
| | - Aldo Quattrone
- National Research Council, Institute of Bioimaging and Molecular Physiology (IBFM), Catanzaro, Italy
- Institute of Neurology, University Magna Graecia, Catanzaro, Italy
|
28
|
Maniruzzaman M, Rahman MJ, Ahammed B, Abedin MM. Classification and prediction of diabetes disease using machine learning paradigm. Health Inf Sci Syst 2020; 8:7. [PMID: 31949894 DOI: 10.1007/s13755-019-0095-z] [Citation(s) in RCA: 45] [Impact Index Per Article: 11.3] [Received: 08/21/2019] [Accepted: 12/21/2019] [Indexed: 12/19/2022] Open
Abstract
Background and objectives: Diabetes is a chronic disease characterized by high blood sugar. It may cause many complicated diseases such as stroke, kidney failure, heart attack, etc. About 422 million people worldwide were affected by diabetes in 2014. The figure is expected to reach 642 million in 2040. The main objective of this study is to develop a machine learning (ML)-based system for predicting diabetic patients. Materials and methods: Logistic regression (LR) is used to identify the risk factors for diabetes based on p value and odds ratio (OR). We have adopted four classifiers, naïve Bayes (NB), decision tree (DT), Adaboost (AB), and random forest (RF), to predict the diabetic patients. Three types of partition protocols (K2, K5, and K10) were also adopted, and each protocol was repeated for 20 trials. Performances of these classifiers are evaluated using accuracy (ACC) and area under the curve (AUC). Results: We have used the diabetes dataset, collected in 2009-2012, derived from the National Health and Nutrition Examination Survey. The dataset consists of 6561 respondents with 657 diabetic and 5904 controls. The LR model demonstrates that 7 factors out of 14 (age, education, BMI, systolic BP, diastolic BP, direct cholesterol, and total cholesterol) are the risk factors for diabetes. The overall ACC of the ML-based system is 90.62%. The combination of LR-based feature selection and RF-based classifier gives 94.25% ACC and 0.95 AUC for the K10 protocol. Conclusion: The combination of LR and RF-based classifier performs better. This combination will be very helpful for predicting diabetic patients.
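The partition protocols can be pictured with a small sketch, assuming that K2, K5 and K10 denote 2-, 5- and 10-fold cross-validation and that each repeated trial reshuffles the data. The generator name `repeated_kfold` and the seeding scheme are hypothetical; the abstract does not describe how the splits were actually drawn.

```python
import random


def repeated_kfold(n_samples, k, trials, seed=0):
    """Yield (train, test) index lists for k-fold CV repeated over several trials.

    Assumed reading of the protocol: every trial reshuffles the sample
    indices and then holds out each of the k folds once.
    """
    rng = random.Random(seed)
    for _ in range(trials):
        idx = list(range(n_samples))
        rng.shuffle(idx)
        # Deal indices round-robin so fold sizes differ by at most one.
        folds = [idx[i::k] for i in range(k)]
        for i in range(k):
            test = folds[i]
            train = [j for f in folds[:i] + folds[i + 1:] for j in f]
            yield train, test


# K10 protocol with 20 trials on a toy set of 50 samples: 200 splits in total.
splits = list(repeated_kfold(50, k=10, trials=20))
```

Accuracy and AUC would then be averaged over all splits of a protocol, which reduces the dependence of the reported figures on any single partition.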
|
29
|
Chen W, Tsangaratos P, Ilia I, Duan Z, Chen X. Groundwater spring potential mapping using population-based evolutionary algorithms and data mining methods. Sci Total Environ 2019; 684:31-49. [PMID: 31150874 DOI: 10.1016/j.scitotenv.2019.05.312] [Citation(s) in RCA: 25] [Impact Index Per Article: 5.0] [Received: 03/01/2019] [Revised: 05/08/2019] [Accepted: 05/20/2019] [Indexed: 06/09/2023]
Abstract
Water scarcity in many regions of the world has become an unpleasant reality. Groundwater appears to be one of the main natural resources capable of reversing this situation. Uncovering the spatial patterns of groundwater occurrence is a crucial factor that could assist in carrying out successful water resources management projects. The main objective of the current study was to provide a novel methodological approach which utilized a Genetic Algorithm (GA) to perform feature selection and data mining methods to generate a groundwater spring potential map. Three data mining methods, Naïve Bayes (NB), Support Vector Machine (SVM) and Random Forest (RF), were utilized to construct a groundwater spring potential map that had over 0.81 probability of occurrence for Wuqi County, Shaanxi Province, China. Groundwater spring locations and sixteen related variables were analyzed, namely: lithology, soil cover, land use cover, normalized difference vegetation index (NDVI), elevation, slope angle, aspect, planform curvature, profile curvature, curvature, stream power index (SPI), stream transport index (STI), topographic wetness index (TWI), mean annual rainfall, distance from river network and distance from road network. The frequency ratio method was used to weight the variables, whereas a multi-collinearity analysis was performed to identify the relation between the parameters and to decide about their usage. The optimal set of parameters, which was determined by the GA, reduced the number of parameters to twelve, removing planform curvature, profile curvature, curvature and STI. The Receiver Operating Characteristic curve and the area under the curve (AUROC) were estimated so as to evaluate the predictive power of each model. The results indicated that the optimized models were superior in accuracy to the original models. The optimized RF model produced the best results (0.9572), followed by the optimized SVM (0.9529) and the optimized NB (0.8235). Overall, the current study highlights the necessity of applying feature selection techniques in groundwater spring assessments and also that data mining methods may be a highly powerful investigation approach for groundwater spring potential mapping.
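The frequency ratio weighting named in the abstract is commonly defined, for each class of a conditioning variable, as the share of spring occurrences in that class divided by the share of the study area covered by that class. A minimal sketch under that usual definition follows; the function name and the toy counts are invented for illustration and are not taken from the study.

```python
def frequency_ratio(class_cells, class_springs):
    """Frequency ratio per class: (springs share) / (area share).

    class_cells maps class name to number of map cells in that class (assumed
    non-zero); class_springs maps class name to number of springs observed there.
    Values above 1 indicate classes where springs are over-represented.
    """
    total_cells = sum(class_cells.values())
    total_springs = sum(class_springs.values())
    return {
        c: (class_springs.get(c, 0) / total_springs) / (class_cells[c] / total_cells)
        for c in class_cells
    }


# Toy example with two slope classes (made-up numbers).
fr = frequency_ratio({"gentle": 600, "steep": 400}, {"gentle": 10, "steep": 30})
```

In this toy case the steep class covers 40% of the cells but holds 75% of the springs, so its ratio is well above 1, while the gentle class falls below 1.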
Affiliation(s)
- Wei Chen
- College of Geology & Environment, Xi'an University of Science and Technology, Xi'an, Shaanxi 710054, China; Key Laboratory of Coal Resources Exploration and Comprehensive Utilization, Ministry of Land and Resources, Xi'an 710021, China
| | - Paraskevas Tsangaratos
- National Technical University of Athens, School of Mining and Metallurgical Engineering, Department of Geological Sciences, Laboratory of Engineering Geology and Hydrogeology, Zografou Campus, Heroon Polytechniou 9, 15780 Zografou, Greece.
| | - Ioanna Ilia
- National Technical University of Athens, School of Mining and Metallurgical Engineering, Department of Geological Sciences, Laboratory of Engineering Geology and Hydrogeology, Zografou Campus, Heroon Polytechniou 9, 15780 Zografou, Greece
| | - Zhao Duan
- College of Geology & Environment, Xi'an University of Science and Technology, Xi'an, Shaanxi 710054, China
| | - Xinjian Chen
- Department of Geological Engineering, Chang'an University, Xi'an, Shaanxi 710054, China
|
30
|
Dada EG, Bassi JS, Chiroma H, Abdulhamid SM, Adetunmbi AO, Ajibuwa OE. Machine learning for email spam filtering: review, approaches and open research problems. Heliyon 2019; 5:e01802. [PMID: 31211254 PMCID: PMC6562150 DOI: 10.1016/j.heliyon.2019.e01802] [Citation(s) in RCA: 135] [Impact Index Per Article: 27.0] [Received: 09/03/2018] [Revised: 02/25/2019] [Accepted: 05/20/2019] [Indexed: 11/18/2022] Open
Abstract
The upsurge in the volume of unwanted emails called spam has created an intense need for the development of more dependable and robust antispam filters. Machine learning methods have recently been used to successfully detect and filter spam emails. We present a systematic review of some of the popular machine learning based email spam filtering approaches. Our review covers a survey of the important concepts, attempts, efficiency, and the research trend in spam filtering. The preliminary discussion in the study background examines the applications of machine learning techniques to the email spam filtering process of the leading internet service providers (ISPs), such as the Gmail, Yahoo and Outlook spam filters. The general email spam filtering process and the various efforts by different researchers in combating spam through the use of machine learning techniques are discussed. Our review compares the strengths and drawbacks of existing machine learning approaches and the open research problems in spam filtering. We recommend deep learning and deep adversarial learning as the future techniques that can effectively handle the menace of spam emails.
Affiliation(s)
- Emmanuel Gbenga Dada
- Department of Computer Engineering, University of Maiduguri, Maiduguri, Nigeria
- Corresponding author.
| | | | - Haruna Chiroma
- Department of Computer Science, Federal College of Education (Technical), Gombe, Nigeria
|
31
|
He Q, Shahabi H, Shirzadi A, Li S, Chen W, Wang N, Chai H, Bian H, Ma J, Chen Y, Wang X, Chapi K, Ahmad BB. Landslide spatial modelling using novel bivariate statistical based Naïve Bayes, RBF Classifier, and RBF Network machine learning algorithms. Sci Total Environ 2019; 663:1-15. [PMID: 30708212 DOI: 10.1016/j.scitotenv.2019.01.329] [Citation(s) in RCA: 42] [Impact Index Per Article: 8.4] [Received: 10/05/2018] [Revised: 01/06/2019] [Accepted: 01/25/2019] [Indexed: 06/09/2023]
Abstract
Landslides are major hazards for human activities, often causing great damage to human lives and infrastructure. Therefore, the main aim of the present study is to evaluate and compare three machine learning algorithms (MLAs), namely Naïve Bayes (NB), radial basis function (RBF) Classifier, and RBF Network, for landslide susceptibility mapping (LSM) in the Longhai area in China. A total of 14 landslide conditioning factors were obtained from various data sources; the frequency ratio (FR) and support vector machine (SVM) methods were then used for the correlation analysis and for the selection of the most important factors for the modelling process, respectively. Subsequently, the resulting three models were validated and compared using statistical metrics including the area under the receiver operating characteristics (AUROC) curve, and Friedman and Wilcoxon signed-rank tests. The results indicated that the RBF Classifier model had the highest goodness-of-fit and performance based on the training and validation datasets. The RBF Classifier model (AUROC = 0.881) outperformed the NB (AUROC = 0.872) and the RBF Network (AUROC = 0.854) models. The obtained results indicate that the RBF Classifier model is a promising method for spatial prediction of landslides around the world.
Affiliation(s)
- Qingfeng He
- College of Geology & Environment, Xi'an University of Science and Technology, Xi'an, Shaanxi 710054, China
| | - Himan Shahabi
- Department of Geomorphology, Faculty of Natural Resources, University of Kurdistan, Sanandaj, Iran.
| | - Ataollah Shirzadi
- Department of Rangeland and Watershed Management, Faculty of Natural Resources, University of Kurdistan, Sanandaj, Iran
| | - Shaojun Li
- State Key Laboratory of Geomechanics and Geotechnical Engineering, Institute of Rock and Soil Mechanics, Chinese Academy of Sciences, Wuhan, Hubei 430071, China
| | - Wei Chen
- College of Geology & Environment, Xi'an University of Science and Technology, Xi'an, Shaanxi 710054, China
| | - Nianqin Wang
- College of Geology & Environment, Xi'an University of Science and Technology, Xi'an, Shaanxi 710054, China
| | - Huichan Chai
- School of earth and environment, Anhui University of science & technology, HuaiNan, AnHui 232001, China
| | - Huiyuan Bian
- College of Geology & Environment, Xi'an University of Science and Technology, Xi'an, Shaanxi 710054, China
| | - Jianquan Ma
- College of Geology & Environment, Xi'an University of Science and Technology, Xi'an, Shaanxi 710054, China
| | - Yingtao Chen
- College of Geology & Environment, Xi'an University of Science and Technology, Xi'an, Shaanxi 710054, China
| | - Xiaojing Wang
- College of Geology & Environment, Xi'an University of Science and Technology, Xi'an, Shaanxi 710054, China
| | - Kamran Chapi
- Department of Rangeland and Watershed Management, Faculty of Natural Resources, University of Kurdistan, Sanandaj, Iran
| | - Baharin Bin Ahmad
- Faculty of Built Environment and Surveying, Universiti Teknologi Malaysia (UTM), 81310 Johor Bahru, Malaysia
|
32
|
Abstract
The human immunodeficiency virus (HIV) causes over a million deaths every year and has a huge economic impact in many countries. The first class of drugs approved were nucleoside reverse transcriptase inhibitors. A newer generation of reverse transcriptase inhibitors have become susceptible to drug resistant strains of HIV, and hence, alternatives are urgently needed. We have recently pioneered the use of Bayesian machine learning to generate models with public data to identify new compounds for testing against different disease targets. The current study has used the NIAID ChemDB HIV, Opportunistic Infection and Tuberculosis Therapeutics Database for machine learning studies. We curated and cleaned data from HIV-1 wild-type cell-based and reverse transcriptase (RT) DNA polymerase inhibition assays. Compounds from this database with ≤1 μM HIV-1 RT DNA polymerase activity inhibition and cell-based HIV-1 inhibition are correlated (Pearson r = 0.44, n = 1137, p < 0.0001). Models were trained using multiple machine learning approaches (Bernoulli Naive Bayes, AdaBoost Decision Tree, Random Forest, support vector classification, k-Nearest Neighbors, and deep neural networks as well as consensus approaches) and then their predictive abilities were compared. Our comparison of different machine learning methods demonstrated that support vector classification, deep learning, and a consensus were generally comparable and not significantly different from each other using 5-fold cross validation and using 24 training and test set combinations. This study demonstrates findings in line with our previous studies for various targets that training and testing with multiple data sets does not demonstrate a significant difference between support vector machine and deep neural networks.
Affiliation(s)
- Kimberley M Zorn
- Collaborations Pharmaceuticals, Inc. , Main Campus Drive, Lab 3510 , Raleigh , North Carolina 27606 , United States
| | - Thomas R Lane
- Collaborations Pharmaceuticals, Inc. , Main Campus Drive, Lab 3510 , Raleigh , North Carolina 27606 , United States
| | - Daniel P Russo
- Collaborations Pharmaceuticals, Inc. , Main Campus Drive, Lab 3510 , Raleigh , North Carolina 27606 , United States.,The Rutgers Center for Computational and Integrative Biology , Camden , New Jersey 08102 , United States
| | - Alex M Clark
- Molecular Materials Informatics, Inc. , 2234 Duvernay Street , Montreal , Quebec H3J2Y3 , Canada
| | - Vadim Makarov
- Bach Institute of Biochemistry , Research Center of Biotechnology of the Russian Academy of Sciences , Leninsky Prospekt 33-2 , Moscow 119071 , Russia
| | - Sean Ekins
- Collaborations Pharmaceuticals, Inc. , Main Campus Drive, Lab 3510 , Raleigh , North Carolina 27606 , United States
|
33
|
Purushothaman G, Vikas R. Identification of a feature selection based pattern recognition scheme for finger movement recognition from multichannel EMG signals. Australas Phys Eng Sci Med 2018; 41:549-559. [PMID: 29744809 DOI: 10.1007/s13246-018-0646-7] [Citation(s) in RCA: 33] [Impact Index Per Article: 5.5] [Received: 07/26/2017] [Accepted: 05/01/2018] [Indexed: 11/26/2022]
Abstract
This paper focuses on identifying an effective pattern recognition scheme with the least number of time domain features for dexterous control of a prosthetic hand, recognizing the various finger movements from surface electromyogram (EMG) signals. Eight-channel EMG from 8 able-bodied subjects for 15 individual and combined finger activities has been considered in this work. An attempt has been made to recognize a number of classes with the least number of features. Therefore, EMG signals are pre-processed using the dual tree complex wavelet transform to improve the discriminating capability of features, and time domain features such as zero crossing, slope sign change, mean absolute value, and waveform length are extracted from the pre-processed data. The performance of the extracted features is studied with different classifiers such as linear discriminant analysis, naive Bayes classifier, quadratic support vector machine and cubic support vector machine, with and without feature selection algorithms. Feature selection has been studied using particle swarm optimization (PSO) and ant colony optimization (ACO) with different numbers of features to identify the effect of features. The results demonstrated that the naive Bayes classifier with ant colony optimization shows an average classification accuracy of 88.89% with a response time of 0.058025 ms for recognizing the 15 different finger movements with 16 features, with a significant difference in accuracy compared to the SVM classifier with feature selection at a significance level of 0.05. There is no significant difference in the accuracy, specificity and sensitivity of an SVM classifier with and without feature selection, but its processing time is significantly longer than that of the LDA and NB classifiers. The PSO and ACO results revealed that slope sign changes contribute to recognizing the activity. In PSO, mean absolute value has been found to be effective compared to waveform length, in contrast with ACO. Further, zero crossings have been found not to be effective in the classification of finger movements in both methods.
Affiliation(s)
| | - Raunak Vikas
- School of Electrical Engineering, VIT, Vellore, TN, 632 014, India
|
34
|
Periwal V, Scaria V. Machine Learning Approaches Toward Building Predictive Models for Small Molecule Modulators of miRNA and Its Utility in Virtual Screening of Molecular Databases. Methods Mol Biol 2017; 1517:155-68. [PMID: 27924481 DOI: 10.1007/978-1-4939-6563-2_11] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.5] [Indexed: 02/05/2023]
Abstract
The ubiquitous role of microRNAs (miRNAs) in a number of pathological processes has suggested that they could act as potential drug targets. RNA-binding small molecules offer an attractive means for modulating miRNA function. The availability of bioassay data sets for a variety of biological assays and molecules in public domain provides a new opportunity toward utilizing them to create models and further utilize them for in silico virtual screening approaches to prioritize or assign potential functions for small molecules. Here, we describe a computational strategy based on machine learning for creation of predictive models from high-throughput biological screens for virtual screening of small molecules with the potential to inhibit microRNAs. Such models could be potentially used for computational prioritization of small molecules before performing high-throughput biological assay.
|
35
|
Pal LR, Kundu K, Yin Y, Moult J. CAGI4 Crohn's exome challenge: Marker SNP versus exome variant models for assigning risk of Crohn disease. Hum Mutat 2017; 38:1225-1234. [PMID: 28512778 PMCID: PMC5576730 DOI: 10.1002/humu.23256] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.3] [Received: 12/22/2016] [Revised: 05/09/2017] [Accepted: 05/10/2017] [Indexed: 12/18/2022]
Abstract
Understanding the basis of complex trait disease is a fundamental problem in human genetics. The CAGI Crohn's Exome challenges are providing insight into the adequacy of current disease models by requiring participants to identify which of a set of individuals has been diagnosed with the disease, given exome data. For the CAGI4 round, we developed a method that used the genotypes from exome sequencing data only to impute the status of genome wide association studies marker SNPs. We then used the imputed genotypes as input to several machine learning methods that had been trained to predict disease status from marker SNP information. We achieved the best performance using Naïve Bayes and with a consensus machine learning method, obtaining an area under the curve of 0.72, larger than other methods used in CAGI4. We also developed a model that incorporated the contribution from rare missense variants in the exome data, but this performed less well. Future progress is expected to come from the use of whole genome data rather than exomes.
Affiliation(s)
- Lipika R. Pal
- Institute for Bioscience and Biotechnology Research, University of Maryland, 9600 Gudelsky Drive, Rockville, MD 20850
| | - Kunal Kundu
- Institute for Bioscience and Biotechnology Research, University of Maryland, 9600 Gudelsky Drive, Rockville, MD 20850
- Computational Biology, Bioinformatics and Genomics, Biological Sciences Graduate Program, University of Maryland, College Park, MD 20742, USA
| | - Yizhou Yin
- Institute for Bioscience and Biotechnology Research, University of Maryland, 9600 Gudelsky Drive, Rockville, MD 20850
- Computational Biology, Bioinformatics and Genomics, Biological Sciences Graduate Program, University of Maryland, College Park, MD 20742, USA
| | - John Moult
- Institute for Bioscience and Biotechnology Research, University of Maryland, 9600 Gudelsky Drive, Rockville, MD 20850
- Department of Cell Biology and Molecular Genetics, University of Maryland, College Park, MD 20742
|
36
|
Koutsoukas A, Monaghan KJ, Li X, Huan J. Deep-learning: investigating deep neural networks hyper-parameters and comparison of performance to shallow methods for modeling bioactivity data. J Cheminform 2017; 9:42. [PMID: 29086090 PMCID: PMC5489441 DOI: 10.1186/s13321-017-0226-y] [Citation(s) in RCA: 107] [Impact Index Per Article: 15.3] [Received: 09/30/2016] [Accepted: 05/27/2017] [Indexed: 01/03/2023] Open
Abstract
Background: In recent years, research in artificial neural networks has resurged, now under the deep-learning (DL) umbrella, and grown extremely popular. The recently reported success of DL techniques in crowd-sourced QSAR and predictive toxicology competitions has showcased these methods as powerful tools in drug-discovery and toxicology research. The aim of this work was twofold: first, a large number of hyper-parameter configurations were explored to investigate how they affect the performance of deep neural networks (DNNs) and could act as starting points when tuning DNNs; second, their performance was compared to popular methods widely employed in the field of cheminformatics, namely Naïve Bayes (NB), k-nearest neighbor (kNN), random forest (RF) and support vector machines (SVM). Moreover, the robustness of machine learning methods to different levels of artificially introduced noise was assessed. The open-source Caffe deep-learning framework and modern NVidia GPU units were utilized to carry out this study, allowing a large number of DNN configurations to be explored. Results: We show that feed-forward deep neural networks are capable of achieving strong classification performance and outperform shallow methods across diverse activity classes when optimized. The hyper-parameters found to play a critical role are the activation function, dropout regularization, number of hidden layers and number of neurons. When compared to the other methods, tuned DNNs were found to perform statistically better, with p value <0.01 based on the Wilcoxon statistical test. On average, DNNs achieved a Matthews correlation coefficient (MCC) 0.149 units higher than NB, 0.092 higher than kNN, 0.052 higher than SVM with a linear kernel, 0.021 higher than RF and 0.009 higher than SVM with a radial basis function kernel. When exploring robustness to noise, non-linear methods were found to perform well at low levels of noise (lower than or equal to 20%); however, at higher levels of noise (higher than 30%), the Naïve Bayes method was found to perform well and, at the highest noise level of 50%, even to outperform more sophisticated methods across several datasets. Electronic supplementary material: The online version of this article (doi:10.1186/s13321-017-0226-y) contains supplementary material, which is available to authorized users.
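The MCC differences reported above are on the scale of the Matthews correlation coefficient, which ranges from -1 to 1. A minimal sketch of the standard binary-classification formula is given below; the function name `mcc` and the convention of returning 0.0 for an undefined value are assumptions made here, not details from the paper.

```python
import math


def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        # Assumed convention: report 0.0 when a row or column of the
        # confusion matrix is empty and the coefficient is undefined.
        return 0.0
    return (tp * tn - fp * fn) / denom


# Hypothetical counts: 3 true positives, 3 true negatives,
# 1 false positive and 1 false negative.
example_score = mcc(tp=3, tn=3, fp=1, fn=1)
```

Because MCC uses all four confusion-matrix cells, it is less sensitive to class imbalance than plain accuracy, which is one common reason for preferring it on bioactivity data.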
Affiliation(s)
- Alexios Koutsoukas
- Department of Electrical Engineering and Computer Sciences, University of Kansas, Lawrence, KS, 66047-7621, USA
| | - Keith J Monaghan
- Department of Electrical Engineering and Computer Sciences, University of Kansas, Lawrence, KS, 66047-7621, USA
| | - Xiaoli Li
- Department of Electrical Engineering and Computer Sciences, University of Kansas, Lawrence, KS, 66047-7621, USA
| | - Jun Huan
- Department of Electrical Engineering and Computer Sciences, University of Kansas, Lawrence, KS, 66047-7621, USA.
|
37
|
Abstract
Statistical classification is a critical component of utilizing metabolomics data for examining the molecular determinants of phenotypes. Despite this, a comprehensive and rigorous evaluation of the accuracy of classification techniques for phenotype discrimination given metabolomics data has not been conducted. We conducted such an evaluation using both simulated and real metabolomics datasets, comparing Partial Least Squares-Discriminant Analysis (PLS-DA), Sparse PLS-DA, Random Forests, Support Vector Machines (SVM), Artificial Neural Network, k-Nearest Neighbors (k-NN), and Naïve Bayes classification techniques for discrimination. We evaluated the techniques on simulated data generated to mimic global untargeted metabolomics data by incorporating realistic block-wise correlation and partial correlation structures for mimicking the correlations and metabolite clustering generated by biological processes. Over the simulation studies, covariance structures, means, and effect sizes were stochastically varied to provide consistent estimates of classifier performance over a wide range of possible scenarios. The effects of the presence of non-normal error distributions, the introduction of biological and technical outliers, unbalanced phenotype allocation, missing values due to abundances below a limit of detection, and the effect of prior-significance filtering (dimension reduction) were evaluated via simulation. In each simulation, classifier parameters, such as the number of hidden nodes in a Neural Network, were optimized by cross-validation to minimize the probability of detecting spurious results due to poorly tuned classifiers. Classifier performance was then evaluated using real metabolomics datasets of varying sample medium, sample size, and experimental design. 
We report that in the most realistic simulation studies that incorporated non-normal error distributions, unbalanced phenotype allocation, outliers, missing values, and dimension reduction, classifier performance (least to greatest error) was ranked as follows: SVM, Random Forest, Naïve Bayes, sPLS-DA, Neural Networks, PLS-DA and k-NN classifiers. When non-normal error distributions were introduced, the performance of PLS-DA and k-NN classifiers deteriorated further relative to the remaining techniques. Over the real datasets, a trend of better performance of SVM and Random Forest classifier performance was observed.
Affiliation(s)
- Patrick J Trainor
- Division of Cardiovascular Medicine, Department of Medicine, University of Louisville, 580 S. Preston St., Louisville, KY 40202, USA.
| | - Andrew P DeFilippis
- Division of Cardiovascular Medicine, Department of Medicine, University of Louisville, 580 S. Preston St., Louisville, KY 40202, USA.
| | - Shesh N Rai
- Department of Bioinformatics and Biostatistics, University of Louisville, 505 S. Hancock St., Louisville, KY 40202, USA.
|
38
|
Jarecki JB, Meder B, Nelson JD. Naïve and Robust: Class-Conditional Independence in Human Classification Learning. Cogn Sci 2017; 42:4-42. [PMID: 28574602 DOI: 10.1111/cogs.12496] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.4] [Received: 03/06/2016] [Revised: 09/19/2016] [Accepted: 11/18/2017] [Indexed: 11/30/2022]
Abstract
Humans excel in categorization. Yet from a computational standpoint, learning a novel probabilistic classification task involves severe computational challenges. The present paper investigates one way to address these challenges: assuming class-conditional independence of features. This feature independence assumption simplifies the inference problem, allows for informed inferences about novel feature combinations, and performs robustly across different statistical environments. We designed a new Bayesian classification learning model (the dependence-independence structure and category learning model, DISC-LM) that incorporates varying degrees of prior belief in class-conditional independence, learns whether or not independence holds, and adapts its behavior accordingly. Theoretical results from two simulation studies demonstrate that classification behavior can appear to start simple, yet adapt effectively to unexpected task structures. Two experiments, designed using optimal experimental design principles, were conducted with human learners. Classification decisions of the majority of participants were best accounted for by a version of the model with very high initial prior belief in class-conditional independence, before adapting to the true environmental structure. Class-conditional independence may be a strong and useful default assumption in category learning tasks.
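The class-conditional independence assumption discussed in this abstract can be illustrated with a small posterior calculation. The sketch below is a plain naive Bayes computation for binary features, not the DISC-LM model; the function name, the class labels and all probabilities are hypothetical values chosen for illustration.

```python
from typing import Dict, List


def naive_bayes_posterior(
    priors: Dict[str, float],
    likelihoods: Dict[str, List[float]],
    x: List[int],
) -> Dict[str, float]:
    """Posterior over classes for a binary feature vector `x`.

    likelihoods[c][i] is P(feature i = 1 | class c). Under class-conditional
    independence the joint likelihood is the product of per-feature terms.
    """
    joint = {}
    for cls, prior in priors.items():
        p = prior
        for present, p_one in zip(x, likelihoods[cls]):
            p *= p_one if present else (1.0 - p_one)
        joint[cls] = p
    total = sum(joint.values())
    return {cls: p / total for cls, p in joint.items()}


# Hypothetical two-class, two-feature task.
posterior = naive_bayes_posterior(
    priors={"A": 0.5, "B": 0.5},
    likelihoods={"A": [0.8, 0.6], "B": [0.2, 0.4]},
    x=[1, 1],
)
```

The same per-feature table answers queries about feature combinations never observed together, which is the sense in which the assumption supports inferences about novel combinations.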
Affiliation(s)
- Jana B Jarecki
- Department of Psychology, University of Basel.,Center for Adaptive Behavior and Cognition, Max Planck Institute for Human Development
| | - Björn Meder
- Center for Adaptive Behavior and Cognition, Max Planck Institute for Human Development
| | - Jonathan D Nelson
- Center for Adaptive Behavior and Cognition, Max Planck Institute for Human Development.,School of Psychology, University of Surrey
|
39
|
Zeng F, Yang D, Xing X, Qi S. Evaluation of Bayesian approaches to identify DDT source contributions to soils in Southeast China. Chemosphere 2017; 176:32-38. [PMID: 28254712 DOI: 10.1016/j.chemosphere.2017.02.049] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 11/06/2016] [Revised: 02/06/2017] [Accepted: 02/08/2017] [Indexed: 06/06/2023]
Abstract
Dicofol application may be an important source of elevated dichlorodiphenyltrichloroethane (DDT) residues in soils in Fujian, Southeast China, following the ban on technical DDT, whose historical application also left DDT residues. The DDT residues varied geographically, corresponding to the varied potential sources of DDT. In this study, a novel approach based on the Bayesian method (BM) was developed to identify the source contributions of DDT to soils, which comprise both historical DDT and dicofol. The Naive Bayesian classifier was applied based on a subset of the samples, which were determined by chemical analysis independent of the Bayesian approach. The results show that the BM (95%) performed better than the ratio of o,p'-/p,p'-DDT (84%) in identifying DDT source contributions. A high detection rate (97%) of dicofol (p,p'-OH-DDT) was observed in the subset, showing that dicofol application influenced the DDX (DDT isomers and derivatives) levels in soils in Fujian. However, the contribution from the historical technical DDT source was greater than that from dicofol in Fujian, indicating that historical technical DDT was still an important pollution source for soils. In addition, both the DDX level and the dicofol contribution in non-agricultural soils were higher than in other agricultural land uses, especially in hilly regions; the potential cause may be atmospheric transport of dicofol-type DDT after daytime spraying, or regional differences in production and application.
Affiliation(s)
- Faming Zeng
- State Key Laboratory of Biogeology and Environmental Geology, China University of Geosciences, Wuhan, 430074, China; Institute of Karst Geology, Chinese Academy of Geological Sciences, Guilin 541004, China; School of Environmental Studies, China University of Geosciences, Wuhan, 430074, China
| | - Dan Yang
- Faculty of Engineering, China University of Geosciences, Wuhan, 430074, China
| | - Xinli Xing
- State Key Laboratory of Biogeology and Environmental Geology, China University of Geosciences, Wuhan, 430074, China; School of Environmental Studies, China University of Geosciences, Wuhan, 430074, China
| | - Shihua Qi
- State Key Laboratory of Biogeology and Environmental Geology, China University of Geosciences, Wuhan, 430074, China; School of Environmental Studies, China University of Geosciences, Wuhan, 430074, China.
|
40
|
Basu N, Bandyopadhyay SK. 2D Source area prediction based on physical characteristics of a regular, passive blood drip stain. Forensic Sci Int 2016; 266:39-53. [PMID: 27295073 DOI: 10.1016/j.forsciint.2016.04.024] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 12/16/2015] [Revised: 03/21/2016] [Accepted: 04/18/2016] [Indexed: 10/21/2022]
Abstract
Violent criminal acts are often accompanied by dynamic blood shedding events. Bloodstain pattern analysis deals in particular with estimating the dynamic blood shedding events from the static bloodstain patterns left at the scene. Of all the stain patterns present at a crime scene, drip stains are among those one would commonly expect to document at a violent crime scene. The paper documents statistically significant correlations between different physical parameters, such as fall height and the total number of spines associated with each stain. A statistically significant correlation between the angle of impact and the total number of spines associated with each stain pattern has also been established in this work. The paper propounds that the breadth of a regular drip stain is particularly significant in making predictions, both empirically and statistically, about the surface area from which blood has dripped to form a particular drip stain. A data model has been developed using machine learning techniques to predict the range of the surface radius from which blood has dripped and led to the formation of a particular drip stain (Accuracy: 97.53%, Sensitivity=0.9481, Specificity=1).
Affiliation(s)
- Nabanita Basu
- Department of Computer Science and Engineering, University of Calcutta, JD Block Salt Lake, Sector III, Kolkata, 700098, India.
| | - Samir Kumar Bandyopadhyay
- Department of Computer Science and Engineering, University of Calcutta, JD Block Salt Lake, Sector III, Kolkata, 700098, India.
|
41
|
Yao ZJ, Dong J, Che YJ, Zhu MF, Wen M, Wang NN, Wang S, Lu AP, Cao DS. TargetNet: a web service for predicting potential drug-target interaction profiling via multi-target SAR models. J Comput Aided Mol Des 2016; 30:413-24. [PMID: 27167132 DOI: 10.1007/s10822-016-9915-2] [Citation(s) in RCA: 185] [Impact Index Per Article: 23.1] [Received: 01/19/2016] [Accepted: 05/06/2016] [Indexed: 02/01/2023]
Abstract
Drug-target interactions (DTIs) are central to current drug discovery processes and public health fields. Analyzing the DTI profiling of drugs helps to infer drug indications, adverse drug reactions, drug-drug interactions, and drug modes of action. It is therefore of high importance to predict the DTI profiling of drugs reliably and quickly on a genome-scale level. Here, we develop the TargetNet server, which can make real-time DTI predictions based only on molecular structures, following the spirit of multi-target SAR methodology. Naïve Bayes models together with various molecular fingerprints were employed to construct prediction models. Ensemble learning from these fingerprints was also provided to improve the prediction ability. When the user submits a molecule, the server predicts the activity of the molecule across 623 human proteins using the established high-quality SAR models, thus generating a DTI profiling that can be used as a feature vector of chemicals for wide applications. The 623 SAR models related to the 623 human proteins were strictly evaluated and validated by several model validation strategies, resulting in AUC scores of 75-100%. We applied the generated DTI profiling to successfully predict potential targets, toxicity classification, drug-drug interactions, and drug mode of action, which sufficiently demonstrated the wide application value of the potential DTI profiling. The TargetNet webserver is built on the Django framework in Python and is freely accessible at http://targetnet.scbdd.com .
Affiliation(s)
- Zhi-Jiang Yao
- School of Pharmaceutical Sciences, Central South University, Changsha, 410013, People's Republic of China
- College of Chemistry and Chemical Engineering, Central South University, Changsha, 410083, People's Republic of China
| | - Jie Dong
- School of Pharmaceutical Sciences, Central South University, Changsha, 410013, People's Republic of China
| | - Yu-Jing Che
- School of Mathematics and Statistics, Central South University, Changsha, 410083, People's Republic of China
| | - Min-Feng Zhu
- School of Mathematics and Statistics, Central South University, Changsha, 410083, People's Republic of China
| | - Ming Wen
- College of Chemistry and Chemical Engineering, Central South University, Changsha, 410083, People's Republic of China
| | - Ning-Ning Wang
- School of Pharmaceutical Sciences, Central South University, Changsha, 410013, People's Republic of China
| | - Shan Wang
- College of Chemistry and Chemical Engineering, Central South University, Changsha, 410083, People's Republic of China
| | - Ai-Ping Lu
- Institute of Advancing Translational Medicine in Bone and Joint Diseases, School of Chinese Medicine, Hong Kong Baptist University, Hong Kong, SAR, People's Republic of China
| | - Dong-Sheng Cao
- School of Pharmaceutical Sciences, Central South University, Changsha, 410013, People's Republic of China.
- Institute of Advancing Translational Medicine in Bone and Joint Diseases, School of Chinese Medicine, Hong Kong Baptist University, Hong Kong, SAR, People's Republic of China.
|
42
|
Yin X, Hadjiloucas S, Zhang Y. Classification of THz pulse signals using two-dimensional cross-correlation feature extraction and non-linear classifiers. Comput Methods Programs Biomed 2016; 127:64-82. [PMID: 27000290 DOI: 10.1016/j.cmpb.2016.01.017] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.9] [Received: 08/19/2015] [Revised: 01/20/2016] [Accepted: 01/21/2016] [Indexed: 05/14/2023]
Abstract
This work provides a performance comparison of four different machine learning classifiers: multinomial logistic regression with ridge estimators (MLR), k-nearest neighbours (KNN), support vector machine (SVM) and naïve Bayes (NB), as applied to terahertz (THz) transient time domain sequences associated with pixelated images of different powder samples. Although the six substances considered have similar optical properties, their complex insertion loss in the THz part of the spectrum is significantly different because of differences in their frequency-dependent THz extinction coefficient as well as in their refractive index and scattering properties. As scattering can be unquantifiable in many spectroscopic experiments, classification based solely on differences in complex insertion loss can be inconclusive. The problem is addressed using two-dimensional (2-D) cross-correlations between background and sample interferograms; these ensure good noise suppression of the datasets and provide a range of statistical features that are subsequently used as inputs to the above classifiers. A cross-validation procedure is adopted to assess the performance of the classifiers. First, the measurements related to samples with a thickness of 2 mm were classified, then samples with thicknesses of 4 mm and after that 3 mm, and the success rate and consistency of each classifier were recorded. In addition, mixtures having thicknesses of 2 and 4 mm as well as mixtures of 2, 3 and 4 mm were presented simultaneously to all classifiers. This approach provided further cross-validation of the classification consistency of each algorithm. The results confirm the superiority in classification accuracy and robustness of the MLR (least accuracy 88.24%) and KNN (least accuracy 90.19%) algorithms, which consistently outperformed the SVM (least accuracy 74.51%) and NB (least accuracy 56.86%) classifiers for the same number of feature vectors across all studies. The work establishes a general methodology for assessing the performance of other hyperspectral dataset classifiers on the basis of 2-D cross-correlations in far-infrared spectroscopy or other parts of the electromagnetic spectrum. It also advances the wider proliferation of automated THz imaging systems across new application areas, e.g., biomedical imaging, industrial processing and quality control, where interpretation of hyperspectral images is still under development.
Affiliation(s)
- Xiaoxia Yin
- Centre for Applied Informatics, College of Engineering & Science, Victoria University, Melbourne, Australia.
| | - Sillas Hadjiloucas
- School of Systems Engineering, University of Reading, Reading RG6 6AY, UK.
| | - Yanchun Zhang
- Centre for Applied Informatics, College of Engineering & Science, Victoria University, Melbourne, Australia.
|
43
|
Bertke SJ, Meyers AR, Wurzelbacher SJ, Measure A, Lampl MP, Robins D. Comparison of methods for auto-coding causation of injury narratives. Accid Anal Prev 2016; 88:117-123. [PMID: 26745274 PMCID: PMC4915551 DOI: 10.1016/j.aap.2015.12.006] [Citation(s) in RCA: 27] [Impact Index Per Article: 3.4] [Received: 08/27/2015] [Revised: 11/13/2015] [Accepted: 12/07/2015] [Indexed: 05/30/2023]
Abstract
Manually reading free-text narratives in large databases to identify the cause of an injury can be very time consuming, and recently there has been much work on automating this process. In particular, variations of the naïve Bayes model have been used to successfully auto-code free-text narratives describing the event/exposure leading to the injury in a workers' compensation claim. This paper compared the naïve Bayes model with an alternative logistic model and found that this new model outperformed the naïve Bayesian model. Further modest improvements were found by adding sequences of keywords to the models, as opposed to considering only single keywords. The programs and weights used in this paper are available upon request to researchers without a training set who wish to automatically assign event codes to large datasets of text narratives. The utility of sharing this program was tested on an outside set of injury narratives provided by the Bureau of Labor Statistics, with promising results.
Affiliation(s)
- S J Bertke
- National Institute for Occupational Safety and Health, Division of Surveillance, Hazard Evaluations, and Field Studies, Industrywide Studies Branch, 1090 Tusculum Ave, Cincinnati, OH 45226, United States.
| | - A R Meyers
- National Institute for Occupational Safety and Health, Division of Surveillance, Hazard Evaluations, and Field Studies, Industrywide Studies Branch, Center for Workers' Compensation Studies, 1090 Tusculum Ave, Cincinnati, OH 45226, United States
| | - S J Wurzelbacher
- National Institute for Occupational Safety and Health, Division of Surveillance, Hazard Evaluations, and Field Studies, Industrywide Studies Branch, Center for Workers' Compensation Studies, 1090 Tusculum Ave, Cincinnati, OH 45226, United States
| | - A Measure
- Bureau of Labor Statistics, Occupational Safety and Health Statistics, 2 Massachusetts Avenue, Washington, DC 20212, United States
| | - M P Lampl
- Ohio Bureau of Workers' Compensation, Division of Safety & Hygiene, 13430 Yarmouth Drive, Pickerington, OH 43147, United States
| | - D Robins
- Ohio Bureau of Workers' Compensation, Division of Safety & Hygiene, 13430 Yarmouth Drive, Pickerington, OH 43147, United States
|
44
|
Carvajal G, Roser DJ, Sisson SA, Keegan A, Khan SJ. Modelling pathogen log10 reduction values achieved by activated sludge treatment using naïve and semi naïve Bayes network models. Water Res 2015; 85:304-315. [PMID: 26342914 DOI: 10.1016/j.watres.2015.08.035] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.4] [Received: 06/09/2015] [Revised: 08/03/2015] [Accepted: 08/19/2015] [Indexed: 06/05/2023]
Abstract
Risk management for wastewater treatment and reuse has led to growing interest in understanding and optimising pathogen reduction during biological treatment processes. However, modelling pathogen reduction is often limited by poor characterization of the relationships between variables and incomplete knowledge of removal mechanisms. The aim of this paper was to assess the applicability of Bayesian belief network models for representing associations between pathogen reduction and operating conditions and monitoring parameters, and for predicting activated sludge (AS) performance. Naïve Bayes and semi-naïve Bayes networks were constructed from an activated sludge dataset including operating and monitoring parameters, and removal efficiencies for two pathogens (native Giardia lamblia and seeded Cryptosporidium parvum) and five native microbial indicators (F-RNA bacteriophage, Clostridium perfringens, Escherichia coli, coliforms and enterococci). First, we defined the Bayesian network structures for the two pathogen log10 reduction value (LRV) class nodes, discretized into two states (< and ≥ 1 LRV), using two different learning algorithms. Eight metrics, such as Prediction Accuracy (PA) and Area Under the receiver operating Curve (AUC), provided a comparison of model prediction performance, certainty and goodness of fit. This comparison was used to select the optimum models. The optimum Tree Augmented naïve models predicted removal efficiency with high AUC when all system parameters were used simultaneously (AUCs for C. parvum and G. lamblia LRVs of 0.95 and 0.87 respectively). However, metrics for individual system parameters showed that only the C. parvum model was reliable. By contrast, individual parameters for G. lamblia LRV prediction typically obtained low AUC scores (AUC < 0.81). Useful predictors for C. parvum LRV included solids retention time, turbidity and total coliform LRV. The methodology developed appears applicable for predicting pathogen removal efficiency in water treatment systems generally.
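The AUC values quoted above can be read as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative one. A minimal sketch of that pairwise definition follows; the function name and the example scores are hypothetical, and the paper's own software and evaluation procedure are not reproduced here.

```python
from typing import Sequence


def pairwise_auc(pos_scores: Sequence[float], neg_scores: Sequence[float]) -> float:
    """Area under the ROC curve via pairwise comparison.

    Counts the fraction of (positive, negative) pairs in which the positive
    case has the higher score; ties count as half a win.
    """
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


# Hypothetical predicted probabilities of achieving >= 1 LRV.
auc_example = pairwise_auc(pos_scores=[0.9, 0.8], neg_scores=[0.7, 0.85])
```

This quadratic-time form is fine for small datasets; rank-based formulas give the same value more efficiently on large ones.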
Affiliation(s)
- Guido Carvajal
- UNSW Water Research Centre, School of Civil & Environmental Engineering, University of New South Wales, NSW, 2052, Australia.
| | - David J Roser
- UNSW Water Research Centre, School of Civil & Environmental Engineering, University of New South Wales, NSW, 2052, Australia.
| | - Scott A Sisson
- School of Mathematics & Statistics, University of New South Wales, NSW, 2052, Australia.
| | - Alexandra Keegan
- Australian Water Quality Centre, SA Water Corporation, Adelaide, SA, 5000, Australia.
| | - Stuart J Khan
- UNSW Water Research Centre, School of Civil & Environmental Engineering, University of New South Wales, NSW, 2052, Australia.
|
45
|
Marucci-Wellman HR, Lehto MR, Corns HL. A practical tool for public health surveillance: Semi-automated coding of short injury narratives from large administrative databases using Naïve Bayes algorithms. Accid Anal Prev 2015; 84:165-176. [PMID: 26412196 DOI: 10.1016/j.aap.2015.06.014] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.4] [Received: 04/29/2015] [Accepted: 06/30/2015] [Indexed: 06/05/2023]
Abstract
Public health surveillance programs in the U.S. are undergoing landmark changes with the availability of electronic health records and advancements in information technology. Injury narratives gathered from hospital records, workers compensation claims or national surveys can be very useful for identifying antecedents to injury or emerging risks. However, classifying narratives manually can become prohibitive for large datasets. The purpose of this study was to develop a human-machine system that could be relatively easily tailored to routinely and accurately classify injury narratives from large administrative databases such as workers compensation. We used a semi-automated approach based on two Naïve Bayesian algorithms to classify 15,000 workers compensation narratives into two-digit Bureau of Labor Statistics (BLS) event (leading to injury) codes. Narratives were filtered out for manual review if the algorithms disagreed or made weak predictions. This approach resulted in an overall accuracy of 87%, with consistently high positive predictive values across all two-digit BLS event categories including the very small categories (e.g., exposure to noise, needle sticks). The Naïve Bayes algorithms were able to identify and accurately machine code most narratives leaving only 32% (4853) for manual review. This strategy substantially reduces the need for resources compared with manual review alone.
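The filtering rule described in this abstract (send a narrative to manual review when the two algorithms disagree or predict weakly) can be sketched as below. The function name, the `(code, confidence)` representation and the `min_confidence` threshold are all assumptions for illustration; the abstract does not state how a weak prediction was defined.

```python
from typing import Optional, Tuple

# (predicted event code, model confidence in [0, 1]); a hypothetical format.
Prediction = Tuple[str, float]


def route_narrative(
    pred_a: Prediction,
    pred_b: Prediction,
    min_confidence: float = 0.5,
) -> Optional[str]:
    """Return the machine-assigned code, or None to flag manual review."""
    code_a, conf_a = pred_a
    code_b, conf_b = pred_b
    if code_a != code_b:
        return None  # the two classifiers disagree
    if min(conf_a, conf_b) < min_confidence:
        return None  # assumed rule: either prediction being weak is enough
    return code_a


# Hypothetical two-digit event codes and confidences.
accepted = route_narrative(("42", 0.9), ("42", 0.8))
disagreed = route_narrative(("42", 0.9), ("31", 0.9))
weak = route_narrative(("42", 0.9), ("42", 0.3))
```

Raising `min_confidence` trades a larger manual-review queue for higher accuracy on the machine-coded portion.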
Affiliation(s)
- Helen R Marucci-Wellman
- Center for Injury Epidemiology, Liberty Mutual Research Institute for Safety, Hopkinton, MA, USA.
| | - Mark R Lehto
- School of Industrial Engineering, Purdue University, West Lafayette, IN, USA
| | - Helen L Corns
- Center for Injury Epidemiology, Liberty Mutual Research Institute for Safety, Hopkinton, MA, USA
|
46
|
Mussa HY, Marcus D, Mitchell JBO, Glen RC. Verifying the fully "Laplacianised" posterior Naïve Bayesian approach and more. J Cheminform 2015; 7:27. [PMID: 26075027 PMCID: PMC4464057 DOI: 10.1186/s13321-015-0075-5] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Received: 01/26/2015] [Accepted: 05/12/2015] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND In a recent paper, Mussa, Mitchell and Glen (MMG) mathematically demonstrated that the "Laplacian Corrected Modified Naïve Bayes" (LCMNB) algorithm can be viewed as a variant of the so-called Standard Naïve Bayes (SNB) scheme in which the role played by the absence of compound features in assigning a compound to its appropriate class is ignored. MMG also proffered guidelines regarding the conditions under which this omission may hold. Utilising three data sets, the present paper examines the validity of these guidelines in practice. The paper also extends MMG's work and introduces a new version of the SNB classifier: "Tapered Naïve Bayes" (TNB). TNB does not discard the role of the absence of a feature out of hand, nor does it fully consider its role. Hence, TNB encapsulates both SNB and LCMNB.
RESULTS LCMNB, SNB and TNB performed differently on classifying 4,658, 5,031 and 1,149 ligands (all chosen from the ChEMBL Database) distributed over 31 enzymes, 23 membrane receptors, and one ion-channel, four transporters and one transcription factor as their target proteins. When the number of features utilised was equal to or smaller than the "optimal" number of features for a given data set, SNB classifiers systematically gave better classification results than those yielded by LCMNB classifiers. The opposite was true when the number of features employed was markedly larger than the "optimal" number of features for this data set. Nonetheless, these LCMNB performances were worse than the classification performance achieved by SNB when the "optimal" number of features for the data set was utilised. TNB classifiers systematically outperformed both SNB and LCMNB classifiers.
CONCLUSIONS The classification results obtained in this study concur with the mathematically based guidelines given in MMG's paper; that is, ignoring the role of the absence of a feature out of hand does not necessarily improve the classification performance of the SNB approach; if anything, it could make the performance of the SNB method worse. The results obtained also lend support to the rationale on which the TNB algorithm rests: handled judiciously, taking into account the absence of features can enhance (not impair) the discriminatory classification power of the SNB approach.
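To make the role of feature absence concrete, the sketch below scores a compound with a Bernoulli-style naïve Bayes in which the contribution of absent features is scaled by an `absence_weight`. A weight of 1 counts absence fully and a weight of 0 ignores it. This is only an illustration of the idea in the abstract: the weighting scheme, the add-one smoothing and all names are assumptions made here, and the code is not the LCMNB estimator or the paper's definition of TNB.

```python
import math
from typing import Dict, List, Set, Tuple


def train(compounds: List[Set[str]], labels: List[str],
          vocab: Set[str]) -> Tuple[Dict[str, float], Dict[str, Dict[str, float]]]:
    """Estimate class priors and add-one smoothed P(feature present | class)."""
    prior: Dict[str, float] = {}
    p_present: Dict[str, Dict[str, float]] = {}
    for c in sorted(set(labels)):
        members = [x for x, y in zip(compounds, labels) if y == c]
        prior[c] = len(members) / len(compounds)
        p_present[c] = {
            f: (sum(f in x for x in members) + 1) / (len(members) + 2)
            for f in vocab
        }
    return prior, p_present


def log_score(x: Set[str], c: str, prior, p_present, absence_weight: float) -> float:
    """Log-posterior (up to a constant) with absent features down-weighted."""
    score = math.log(prior[c])
    for f, p in p_present[c].items():
        if f in x:
            score += math.log(p)                         # feature present
        else:
            score += absence_weight * math.log(1.0 - p)  # feature absent
    return score


def classify(x: Set[str], prior, p_present, absence_weight: float = 1.0) -> str:
    return max(prior, key=lambda c: log_score(x, c, prior, p_present, absence_weight))
```

On a toy set where the only difference between classes is whether one feature is present, the two extreme weights can disagree, which is the kind of effect the abstract's guidelines are about.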
Affiliation(s)
- Hamse Y Mussa
- Department of Chemistry, Centre for Molecular Informatics, Lensfield Road, Cambridge, England CB2 1EW UK
- EaStCHEM School of Chemistry and Biomedical Sciences Research Complex, University of St Andrews, North Haugh, St Andrews, Scotland KY16 9ST UK
| | - David Marcus
- European Bioinformatics Institute (EMBL-EBI), European Molecular Biology Laboratory, Wellcome Trust Genome Campus, Hinxton, Cambridge, England CB10 1SD UK
| | - John B O Mitchell
- EaStCHEM School of Chemistry and Biomedical Sciences Research Complex, University of St Andrews, North Haugh, St Andrews, Scotland KY16 9ST UK
| | - Robert C Glen
- Department of Chemistry, Centre for Molecular Informatics, Lensfield Road, Cambridge, England CB2 1EW UK
|
47
|
Abstract
Feature rankings are often used for supervised dimension reduction, especially when the discriminating power of each feature is of interest, the dimensionality of the dataset is extremely high, or computational power is too limited for more complicated methods. In practice, it is recommended to start dimension reduction with simple methods such as feature rankings before applying more complex approaches. Single Variable Classifier (SVC) ranking is a feature ranking based on the predictive performance of a classifier built using only a single feature. While benefiting from the capabilities of classifiers, this ranking method is not as computationally intensive as wrappers. In this paper, we report the results of an extensive study on the bias and stability of this feature ranking method. We study whether the classifiers influence the SVC rankings or whether the discriminative power of the features themselves has the dominant impact on the final rankings. We show that the common intuition of using the same classifier for feature ranking and final classification does not always result in the best prediction performance. We then study whether heterogeneous classifier ensemble approaches provide less biased rankings and whether they improve final classification performance. Furthermore, we calculate the empirical loss in prediction performance, relative to the optimal choices, that results from using the same classifier for SVC feature ranking and final classification.
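SVC ranking as defined above can be sketched in a few lines: fit a classifier on each feature alone, score it, and sort the features by that score. In this minimal example the single-feature classifier is a one-threshold rule and the score is training accuracy, both chosen here for brevity; the names `stump_accuracy` and `svc_ranking` are hypothetical, and a real study would normally use held-out or cross-validated performance.

```python
from typing import Callable, List, Sequence


def stump_accuracy(values: Sequence[float], labels: Sequence[int]) -> float:
    """Best training accuracy of a one-threshold classifier on a single feature."""
    best = 0.0
    for t in sorted(set(values)):
        correct = sum((v >= t) == bool(y) for v, y in zip(values, labels))
        acc = correct / len(labels)
        best = max(best, acc, 1.0 - acc)  # 1 - acc covers the inverted rule
    return best


def svc_ranking(X: Sequence[Sequence[float]], y: Sequence[int],
                score: Callable[[Sequence[float], Sequence[int]], float] = stump_accuracy
                ) -> List[int]:
    """Return feature indices ordered from most to least predictive on their own."""
    n_features = len(X[0])
    scores = [score([row[j] for row in X], y) for j in range(n_features)]
    return sorted(range(n_features), key=lambda j: scores[j], reverse=True)
```

Passing a different `score` function swaps the underlying classifier, which is the degree of freedom whose influence on the ranking the abstract investigates.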
Affiliation(s)
- Shobeir Fakhraei
- Medical Image Analysis Laboratory, Department of Radiology, Henry Ford Health System, Detroit, MI 48202, USA ; Department of Computer Science, University of Maryland, College Park, MD 20740, USA
| | - Hamid Soltanian-Zadeh
- Medical Image Analysis Laboratory, Department of Radiology, Henry Ford Health System, Detroit, MI 48202, USA ; Control and Intelligent Processing Center of Excellence (CIPCE), School of Electrical and Computer Engineering, University of Tehran, Tehran 14395, Iran
| | - Farshad Fotouhi
- College of Engineering, Wayne State University, Detroit, MI 48202, USA
|
48
|
Abstract
Methods for extracting quantitative information regarding nuclear morphology from histopathology images have long been used to aid pathologists in determining the degree of differentiation in numerous malignancies. Most methods currently in use, however, employ the naïve Bayes approach to classify a set of nuclear measurements extracted from one patient. Hence, the statistical dependency between the samples (nuclear measurements) is often not directly taken into account. Here we describe a method that makes use of the statistical dependency between samples in thyroid tissue to improve patient classification accuracy relative to standard naïve Bayes approaches. We report results for two sample diagnostic challenges.
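The baseline this abstract argues against can be sketched as follows: each nuclear measurement from a patient is treated as an independent sample, and the per-sample log-likelihoods are summed to classify the patient. The single scalar measurement per nucleus, the Gaussian class-conditional model, the equal class priors and the class names used in the test are assumptions made here for illustration; the dependency-aware method itself is not described in enough detail in the abstract to sketch.

```python
import math
from typing import Dict, List, Tuple


def gaussian_logpdf(x: float, mean: float, var: float) -> float:
    """Log-density of a univariate normal distribution."""
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)


def classify_patient(measurements: List[float],
                     class_params: Dict[str, Tuple[float, float]]) -> str:
    """Pick the class with the largest summed per-nucleus log-likelihood.

    Summing the terms is exactly the independence assumption between
    samples that the abstract says is usually made. Equal priors assumed.
    """
    def total(c: str) -> float:
        mean, var = class_params[c]
        return sum(gaussian_logpdf(x, mean, var) for x in measurements)
    return max(class_params, key=total)
```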
Affiliation(s)
- Hu Huang
- Department of Biomedical Engineering, Carnegie Mellon University, Pittsburgh, PA 15213, USA
| | - Akif Burak Tosun
- Department of Biomedical Engineering, Carnegie Mellon University, Pittsburgh, PA 15213, USA
| | - Jia Guo
- Department of Biomedical Engineering, Carnegie Mellon University, Pittsburgh, PA 15213, USA
| | - Cheng Chen
- Department of Biomedical Engineering, Carnegie Mellon University, Pittsburgh, PA 15213, USA
| | - Wei Wang
- Department of Biomedical Engineering, Carnegie Mellon University, Pittsburgh, PA 15213, USA
| | - John A Ozolek
- Department of Pathology, Children's Hospital of Pittsburgh, Pittsburgh, PA 15224, USA
| | - Gustavo K Rohde
- Department of Biomedical Engineering, Carnegie Mellon University, Pittsburgh, PA 15213, USA ; Department of Electrical and Computer Engineering, Carnegie Mellon University, Pittsburgh, PA 15213, USA ; Lane Center for Computational Biology, Carnegie Mellon University, Pittsburgh, PA 15213, USA
|
49
|
Majid A, Ali S, Iqbal M, Kausar N. Prediction of human breast and colon cancers from imbalanced data using nearest neighbor and support vector machines. Comput Methods Programs Biomed 2014; 113:792-808. [PMID: 24472367 DOI: 10.1016/j.cmpb.2014.01.001] [Citation(s) in RCA: 42] [Impact Index Per Article: 4.2] [Received: 06/06/2013] [Revised: 12/29/2013] [Accepted: 01/03/2014] [Indexed: 06/03/2023]
Abstract
This study proposes a novel prediction approach for human breast and colon cancers using different feature spaces. The proposed scheme consists of two stages: the preprocessor and the predictor. In the preprocessor stage, the mega-trend diffusion (MTD) technique is employed to increase the samples of the minority class, thereby balancing the dataset. In the predictor stage, the machine-learning approaches of K-nearest neighbor (KNN) and support vector machines (SVM) are used to develop hybrid MTD-SVM and MTD-KNN prediction models. The MTD-SVM model provided the best values of accuracy, G-mean and Matthews correlation coefficient of 96.71%, 96.70% and 71.98% for cancer/non-cancer dataset, breast/non-breast cancer dataset and colon/non-colon cancer dataset, respectively. We found that hybrid MTD-SVM is the best with respect to prediction performance and computational cost. The MTD-KNN model achieved moderately better prediction than hybrid MTD-NB (Naïve Bayes) but at the expense of higher computing cost. The MTD-KNN model is faster than MTD-RF (random forest), but its prediction is not better than that of MTD-RF. To the best of our knowledge, the reported results are the best results, so far, for these datasets. The proposed scheme indicates that the developed models can be used as a tool for the prediction of cancer. This scheme may be useful for the study of any sequential information such as protein sequences or nucleic acid sequences.
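Two of the metrics named in this abstract, G-mean and the Matthews correlation coefficient, are commonly used for imbalanced data because plain accuracy can look good when the minority class is misclassified. A minimal sketch of their standard definitions from a binary confusion matrix is given below; the function names are chosen here, and the values returned are fractions, whereas the abstract reports percentages.

```python
import math


def g_mean(tp: int, fn: int, tn: int, fp: int) -> float:
    """Geometric mean of sensitivity and specificity."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return math.sqrt(sensitivity * specificity)


def mcc(tp: int, fn: int, tn: int, fp: int) -> float:
    """Matthews correlation coefficient, in [-1, 1]; 0.0 if undefined."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0
```

Both metrics drop sharply when one class is predicted poorly, which is why they are useful for judging whether oversampling the minority class actually helped.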
Affiliation(s)
- Abdul Majid
- Department of Computer & Information Sciences, Pakistan Institute of Engineering & Applied Sciences, Nilore, 45650 Islamabad, Pakistan.
| | - Safdar Ali
- Department of Computer & Information Sciences, Pakistan Institute of Engineering & Applied Sciences, Nilore, 45650 Islamabad, Pakistan.
| | - Mubashar Iqbal
- Department of Computer & Information Sciences, Pakistan Institute of Engineering & Applied Sciences, Nilore, 45650 Islamabad, Pakistan.
| | - Nabeela Kausar
- Department of Computer & Information Sciences, Pakistan Institute of Engineering & Applied Sciences, Nilore, 45650 Islamabad, Pakistan.
|
50
|
Bashyam V, Morioka C, El-Saden S, Bui AAT, Taira RK. Identifying relevant medical reports from an assorted report collection using the multinomial naïve Bayes classifier and the UMLS. Indian J Med Inform 2007; 2:2. [PMID: 36284749 PMCID: PMC9592058] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/13/2022]
Abstract
A patient's electronic medical record contains a large number of medical reports and imaging studies. Identifying the relevant information in order to make a diagnosis can be a time-consuming process that can easily overwhelm the physician. Summarizing key clinical information for physicians evaluating brain tumor patients is an ongoing research project at our institution. Notably, identifying documents associated with brain tumors is an important step in collecting the data relevant for summarization. Current electronic medical record systems lack meta-information that is useful in structuring heterogeneous medical information. Thus, reports relevant to a particular task cannot be easily retrieved from a structured database. This necessitates content analysis methods for identifying relevant reports. This paper reports a system designed to identify brain-tumor related reports from an assorted collection of clinical reports. A large collection of clinical reports was obtained from our university hospital database. A domain expert manually annotated the documents, classifying them into 'related' and 'unrelated' categories. A multinomial naïve Bayes classifier was trained to use word level and UMLS concept level features from the reports to identify brain tumor related reports from the assorted collection. The system was trained on 90% and tested on 10% of the manually annotated corpus. A ten-fold cross validation is reported. Performance of the system was best (f-score 94.7) when the system was trained using both word level and UMLS concept level features. Using UMLS concepts improved classifier accuracy.
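The combination of word level and concept level features in a multinomial naïve Bayes classifier can be sketched as below. This is a toy illustration under stated assumptions: the `CONCEPTS` dictionary is a hypothetical stand-in for a real UMLS lookup, its concept identifiers are made up, tokenization is a plain whitespace split, and add-one smoothing is chosen here rather than taken from the paper.

```python
import math
from collections import Counter, defaultdict
from typing import Dict, List

# Hypothetical word-to-concept lookup standing in for a real UMLS mapping.
CONCEPTS: Dict[str, str] = {
    "glioma": "C_TUMOR", "meningioma": "C_TUMOR", "mass": "C_LESION",
}


def features(text: str) -> List[str]:
    """Word tokens plus one concept token for every word found in CONCEPTS."""
    words = text.lower().split()
    return words + [CONCEPTS[w] for w in words if w in CONCEPTS]


class MultinomialNB:
    def fit(self, docs: List[str], labels: List[str]) -> "MultinomialNB":
        self.counts = defaultdict(Counter)  # label -> feature counts
        self.doc_counts = Counter(labels)   # label -> number of documents
        for doc, label in zip(docs, labels):
            self.counts[label].update(features(doc))
        self.vocab = {f for c in self.counts.values() for f in c}
        return self

    def predict(self, doc: str) -> str:
        n_docs = sum(self.doc_counts.values())

        def score(label: str) -> float:
            total = sum(self.counts[label].values()) + len(self.vocab)
            s = math.log(self.doc_counts[label] / n_docs)
            for f in features(doc):
                if f in self.vocab:  # features unseen in training are skipped
                    s += math.log((self.counts[label][f] + 1) / total)
            return s

        return max(self.doc_counts, key=score)
```

In this toy setup a report that mentions a tumor term never seen in training can still be scored through its concept token, which is one plausible reading of why the abstract found concept level features helpful.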
Affiliation(s)
- Vijayaraghavan Bashyam
- Department of Information Studies, University of California - Los Angeles, Los Angeles, CA 90024
| | - Craig Morioka
- Department of Radiological Sciences, University of California - Los Angeles, Los Angeles, CA 90024
| | - Suzie El-Saden
- Department of Radiological Sciences, University of California - Los Angeles, Los Angeles, CA 90024
| | - Alex AT Bui
- Department of Radiological Sciences, University of California - Los Angeles, Los Angeles, CA 90024
| | - Ricky K Taira
- Department of Radiological Sciences, University of California - Los Angeles, Los Angeles, CA 90024
|