1
|
Park J, Feng Y, Jeong SP. Developing an advanced prediction model for new employee turnover intention utilizing machine learning techniques. Sci Rep 2024; 14:1221. [PMID: 38216616 PMCID: PMC10786846 DOI: 10.1038/s41598-023-50593-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/01/2023] [Accepted: 12/21/2023] [Indexed: 01/14/2024] Open
Abstract
In recent years, turnover among new college graduates has been intensifying. The turnover of new employees creates many difficulties for businesses because the costs spent on hiring and training them are hard to recover. Therefore, it is necessary to promptly identify and effectively manage new employees who are inclined to change jobs. Previous studies of turnover intention have contributed to understanding the turnover of new employees by identifying the factors that influence it. However, these studies have not shown how well those factors can predict which employees are actually willing to change jobs. This study therefore proposes a method of developing a machine learning-based turnover intention prediction model to overcome that limitation. Data from the Korea Employment Information Service's Job Movement Path Survey for college graduates were used, and OLS regression analysis was performed to confirm the influence of predictors. Model learning and classification were then performed using logistic regression (LR), k-nearest neighbor (KNN), and extreme gradient boosting (XGB) classifiers. A novel finding of this research is the diminished or reversed influence of certain traditional factors, such as workload importance and the relevance of one's major field, on turnover intention. Instead, job security emerged as the most significant predictor. The model's accuracy rates, highest with XGB at 78.5%, demonstrate the efficacy of applying machine learning in turnover intention prediction, marking a significant advancement over traditional econometric models. This study breaks new ground by integrating advanced predictive analytics into turnover intention research, offering a more nuanced understanding of the factors influencing the turnover intentions of new college graduates.
The insights gained could guide organizations in effectively managing and retaining new talent, highlighting the need for a focus on job security and organizational satisfaction, and the shifting relevance of traditional factors like job preference.
Affiliation(s)
- Jungryeol Park
  - Technology Policy Research Division, Electronics and Telecommunications Research Institute (ETRI), Daejeon, South Korea
- Yituo Feng
  - Management Information Systems, Chungbuk National University, Cheongju, South Korea
- Seon-Phil Jeong
  - Department of Computer Science, BNU-HKBU United International College, Zhuhai, Guangdong, China
|
2
|
Ho IMK, Weldon A, Yong JTH, Lam CTT, Sampaio J. Using Machine Learning Algorithms to Pool Data from Meta-Analysis for the Prediction of Countermovement Jump Improvement. International Journal of Environmental Research and Public Health 2023; 20:ijerph20105881. [PMID: 37239607 DOI: 10.3390/ijerph20105881] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/29/2022] [Revised: 03/13/2023] [Accepted: 05/17/2023] [Indexed: 05/28/2023]
Abstract
To bridge the research-practice gap and take one step toward using big data with real-world evidence, the present study adopts a novel method that uses machine learning to pool findings from meta-analyses and predict the change in countermovement jump (CMJ). The data were collected from a total of 124 individual studies included in 16 recent meta-analyses. The performance of four selected machine learning algorithms was compared: support vector machine, random forest (RF) ensemble, light gradient boosted machine, and a neural network using a multi-layer perceptron. The RF yielded the highest accuracy (mean absolute error: 0.071 cm; R²: 0.985). Based on the feature importance calculated by the RF regressor, the baseline CMJ ("Pre-CMJ") was the most impactful predictor, followed by age ("Age"), the total number of training sessions received ("Total number of training_session"), controlled or non-controlled conditions ("Control (no training)"), whether the training program included squat, lunge, deadlift, or hip thrust exercises ("Squat_Lunge_Deadlift_Hipthrust_True", "Squat_Lunge_Deadlift_Hipthrust_False"), or "Plyometric (mixed fast/slow SSC)", and whether the athlete was from an Asia-Pacific region including Australia ("Race_Asian or Australian"). Successful predictions of CMJ improvement are shown using multiple simulated virtual cases, and the perceived benefits and limitations of using machine learning in a meta-analysis are discussed.
Affiliation(s)
- Indy Man Kit Ho
  - Department of Sports and Recreation, Technological and Higher Education Institute of Hong Kong (THEi), Chai Wan, Hong Kong, China
  - The Asian Academy for Sports and Fitness Professionals, Chai Wan, Hong Kong, China
- Anthony Weldon
  - Centre for Life and Sport Sciences, Birmingham City University, Birmingham B15 3TN, UK
- Jason Tze Ho Yong
  - Department of Sports and Recreation, Technological and Higher Education Institute of Hong Kong (THEi), Chai Wan, Hong Kong, China
- Candy Tze Tim Lam
  - Department of Sports and Recreation, Technological and Higher Education Institute of Hong Kong (THEi), Chai Wan, Hong Kong, China
- Jaime Sampaio
  - Research Center in Sports Sciences, Health Sciences and Human Development, CIDESD, CreativeLab Research Community, 5000-801 Vila Real, Portugal
|
3
|
Dutschmann TM, Kinzel L, Ter Laak A, Baumann K. Large-scale evaluation of k-fold cross-validation ensembles for uncertainty estimation. J Cheminform 2023; 15:49. [PMID: 37118768 PMCID: PMC10142532 DOI: 10.1186/s13321-023-00709-9] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 09/03/2022] [Accepted: 03/10/2023] [Indexed: 04/30/2023] Open
Abstract
It is insightful to report, in addition to a prediction itself, an estimate of how certain the model is in that prediction. For regression tasks, most approaches implement a variation of the ensemble method, with few exceptions. Instead of a single estimator, a group of estimators yields several predictions for an input. The uncertainty can then be quantified by measuring the disagreement between the predictions, for example by their standard deviation. In theory, ensembles should not only provide uncertainties but also boost predictive performance by reducing errors arising from variance. Despite the development of novel methods, ensembles are still considered the "gold standard" for quantifying the uncertainty of regression models. Subsampling-based methods to obtain ensembles can be applied to all models, regardless of whether they are related to deep learning or traditional machine learning. However, little attention has been given to the question of whether the ensemble method is applicable to virtually all scenarios occurring in the field of cheminformatics. In a broad and diversified evaluation, ensembles are assessed for 32 datasets of different sizes and modeling difficulty, ranging from physicochemical properties to biological activities. For increasing ensemble sizes of up to 200 members, the predictive performance as well as the applicability as an uncertainty estimator are shown for all combinations of five modeling techniques and four molecular featurizations. Useful recommendations were derived for practitioners regarding the success and minimum size of ensembles, depending on whether predictive performance or uncertainty quantification is more important for the task at hand.
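The ensemble idea described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the one-dimensional least-squares model, the toy data, and the names `fit_line`, `kfold_ensemble`, and `predict_with_uncertainty` are all assumptions made for illustration. Each ensemble member is trained on the training portion of one split of a repeated k-fold partition, and the standard deviation across member predictions serves as the uncertainty estimate.

```python
import random
import statistics


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept (1-D inputs)."""
    mean_x = statistics.fmean(xs)
    mean_y = statistics.fmean(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def kfold_ensemble(xs, ys, k=5, repeats=4, seed=0):
    """Train one member per k-fold training split, giving k * repeats members."""
    rng = random.Random(seed)
    indices = list(range(len(xs)))
    members = []
    for _ in range(repeats):
        rng.shuffle(indices)
        folds = [indices[i::k] for i in range(k)]
        for held_out in folds:
            held = set(held_out)
            # Each member sees every sample except the held-out fold.
            train = [i for i in indices if i not in held]
            members.append(fit_line([xs[i] for i in train],
                                    [ys[i] for i in train]))
    return members


def predict_with_uncertainty(members, x):
    """Return the ensemble mean and the standard deviation across members."""
    preds = [slope * x + intercept for slope, intercept in members]
    return statistics.fmean(preds), statistics.stdev(preds)


# Toy data, made up for illustration: y = 2x + 1 plus Gaussian noise.
noise = random.Random(42)
xs = [i / 10 for i in range(50)]
ys = [2 * x + 1 + noise.gauss(0, 0.3) for x in xs]

members = kfold_ensemble(xs, ys)
mean_pred, spread = predict_with_uncertainty(members, 2.5)
```

Here `mean_pred` is the ensemble prediction at `x = 2.5` and `spread` is the disagreement between members; a larger `spread` signals lower confidence in the prediction.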
Affiliation(s)
- Thomas-Martin Dutschmann
  - Institute of Medicinal and Pharmaceutical Chemistry, University of Technology Braunschweig, Beethovenstrasse 55, 38106, Brunswick, Germany
- Lennart Kinzel
  - Institute of Medicinal and Pharmaceutical Chemistry, University of Technology Braunschweig, Beethovenstrasse 55, 38106, Brunswick, Germany
- Antonius Ter Laak
  - Bayer AG, Research & Development, Pharmaceuticals, Muellerstrasse 178, 13353, Berlin, Germany
- Knut Baumann
  - Institute of Medicinal and Pharmaceutical Chemistry, University of Technology Braunschweig, Beethovenstrasse 55, 38106, Brunswick, Germany
|
4
|
Predicting Potent Compounds Using a Conditional Variational Autoencoder Based upon a New Structure-Potency Fingerprint. Biomolecules 2023; 13:biom13020393. [PMID: 36830761 PMCID: PMC9953226 DOI: 10.3390/biom13020393] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/14/2022] [Revised: 02/07/2023] [Accepted: 02/16/2023] [Indexed: 02/22/2023] Open
Abstract
Prediction of the potency of bioactive compounds generally relies on linear or nonlinear quantitative structure-activity relationship (QSAR) models. Nonlinear models are generated using machine learning methods. We introduce a novel approach for potency prediction that depends on a newly designed molecular fingerprint (FP) representation. This structure-potency fingerprint (SPFP) combines different modules accounting for the structural features of active compounds and their potency values in a single bit string, hence unifying structure and potency representation. This encoding makes it possible to derive a conditional variational autoencoder (CVAE) using SPFPs of training compounds and to apply the model to predict the SPFP potency module of test compounds using only their structure module as input. The SPFP-CVAE approach correctly predicts the potency values of compounds belonging to different activity classes with an accuracy comparable to support vector regression (SVR), representing the state-of-the-art in the field. In addition, highly potent compounds are predicted with accuracy very similar to that of SVR and deep neural networks.
|
5
|
Trajectory tracking of changes digital divide prediction factors in the elderly through machine learning. PLoS One 2023; 18:e0281291. [PMID: 36763570 PMCID: PMC9916605 DOI: 10.1371/journal.pone.0281291] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 06/26/2022] [Accepted: 01/19/2023] [Indexed: 02/11/2023] Open
Abstract
RESEARCH MOTIVATION: Recently, the digital divide problem among elderly individuals has been intensifying. A larger problem is that the level of use of digital technology varies from person to person. Therefore, a digital divide may even exist among elderly individuals. Considering the recent accelerating digital transformation in our society, it is highly likely that elderly individuals are experiencing many difficulties in their daily life. Therefore, it is necessary to quickly address and manage these difficulties.
RESEARCH OBJECTIVE: This study aims to predict the digital divide in the elderly population and provide essential insights into managing it. To this end, predictive analysis is performed using public data and machine learning techniques.
METHODS AND MATERIALS: This study used data from the '2020 Report on Digital Information Divide Survey' published by the Korea National Information Society Agency. In establishing the prediction model, various independent variables were used. Ten variables with high importance for predicting the digital divide were identified and used as critical, independent variables to increase the convenience of analyzing the model. The data were divided into 70% for training and 30% for testing. The model was trained on the training set, and the model's predictive accuracy was analyzed on the test set. The prediction accuracy was analyzed using logistic regression (LR), support vector machine (SVM), K-nearest neighbor (KNN), decision tree (DT), and eXtreme gradient boosting (XGBoost). A convolutional neural network (CNN) was used to further improve the accuracy. In addition, the importance of variables was analyzed using data from 2019 before the COVID-19 outbreak, and the results were compared with the results from 2020.
RESULTS: The study results showed that the variables with high importance in the 2020 data predicting the digital divide of elderly individuals were the demographic perspective, internet usage perspective, self-efficacy perspective, and social connectedness perspective. These variables, as well as the social support perspective, were highly important in 2019. The highest prediction accuracy was achieved using the CNN-based model (accuracy: 80.4%), followed by the XGBoost model (accuracy: 79%) and LR model (accuracy: 78.3%). The lowest accuracy (accuracy: 72.6%) was obtained using the DT model.
DISCUSSION: The results of this analysis suggest that support that can strengthen the practical connection of elderly individuals through digital devices is becoming more critical than ever in a situation where digital transformation is accelerating in various fields. In addition, it is necessary to comprehensively use classification algorithms from various academic fields when constructing a classification model to obtain higher prediction accuracy.
CONCLUSION: The academic significance of this study is that the CNN, which is often employed in image and video processing, was extended and applied to a social science field using structured data to improve the accuracy of the prediction model. The practical significance of this study is that the prediction models and the analytical methodologies proposed in this article can be applied to classify elderly people affected by the digital divide, and the trained models can be used to predict the people of younger generations who may be affected by the digital divide. Another practical significance of this study is that, as a method for managing individuals who are affected by a digital divide, the self-efficacy perspective about acquiring and using ICTs and the socially connected perspective are suggested in addition to the demographic perspective and the internet usage perspective.
|
6
|
Rodríguez-Pérez R, Bajorath J. Evolution of Support Vector Machine and Regression Modeling in Chemoinformatics and Drug Discovery. J Comput Aided Mol Des 2022; 36:355-362. [PMID: 35304657 PMCID: PMC9325859 DOI: 10.1007/s10822-022-00442-9] [Citation(s) in RCA: 43] [Impact Index Per Article: 14.3] [Received: 02/08/2022] [Accepted: 02/15/2022] [Indexed: 11/05/2022]
Abstract
The support vector machine (SVM) algorithm is one of the most widely used machine learning (ML) methods for predicting active compounds and molecular properties. In chemoinformatics and drug discovery, SVM has been a state-of-the-art ML approach for more than a decade. A unique attribute of SVM is that it operates in feature spaces of increasing dimensionality. Hence, SVM conceptually departs from the paradigm of low dimensionality that applies to many other methods for chemical space navigation. The SVM approach is applicable to compound classification and ranking, multi-class predictions, and (in algorithmically modified form) regression modeling. In the emerging era of deep learning (DL), SVM retains its relevance as one of the premier ML methods in chemoinformatics, for reasons discussed herein. We describe the SVM methodology, including its strengths and weaknesses, and discuss selected applications that have contributed to the evolution of SVM as a premier approach for compound classification, property predictions, and virtual compound screening.
Affiliation(s)
- Raquel Rodríguez-Pérez
  - Department of Life Science Informatics, B-IT, LIMES Program Unit Chemical Biology and Medicinal Chemistry, Rheinische Friedrich-Wilhelms-Universität, Friedrich-Hirzebruch-Allee 6, D-53115, Bonn, Germany
  - Novartis Institutes for Biomedical Research, Novartis Campus, CH-4002, Basel, Switzerland
- Jürgen Bajorath
  - Department of Life Science Informatics, B-IT, LIMES Program Unit Chemical Biology and Medicinal Chemistry, Rheinische Friedrich-Wilhelms-Universität, Friedrich-Hirzebruch-Allee 6, D-53115, Bonn, Germany
  - Novartis Institutes for Biomedical Research, Novartis Campus, CH-4002, Basel, Switzerland
|
7
|
Fan J, Huang G, Chi M, Shi Y, Jiang J, Feng C, Yan Z, Xu Z. Prediction of chemical reproductive toxicity to aquatic species using a machine learning model: An application in an ecological risk assessment of the Yangtze River, China. The Science of the Total Environment 2021; 796:148901. [PMID: 34265613 DOI: 10.1016/j.scitotenv.2021.148901] [Citation(s) in RCA: 23] [Impact Index Per Article: 5.8] [Received: 05/06/2021] [Revised: 06/29/2021] [Accepted: 07/04/2021] [Indexed: 06/13/2023]
Abstract
Endocrine disrupting chemicals (EDCs) have been at the forefront of environmental issues for over 20 years and are a principal factor considered in every ecological risk assessment, but this kind of risk assessment faces difficulties. The expense and time cost of in vivo tests and the lack of toxicity data are key factors limiting the ability to conduct ecological risk assessments of EDCs for aquatic species. In this study, a machine learning model, the support vector machine (SVM), was used to predict the reproductive toxicity of EDCs, and the performance of the models was evaluated. The results showed that the SVM model provided more accurate toxicity predictions than the interspecies correlation estimation (ICE) model developed in a previous study to predict reproductive toxicity. The predicted toxicity data were an important supplement to the observed data for the ecological risk assessment of EDCs in the Yangtze River, where estrogens and phenolic compounds have been found at some sampling sites in the middle and lower reaches. The results showed that the ecological risks of estrone, 17β-estradiol, and ethinyl estradiol were significant. This study revealed the application potential of machine learning models for predicting the reproductive toxicity effects of EDCs, which can provide reliable alternative toxicity data for the ecological risk assessments of EDCs.
Affiliation(s)
- Juntao Fan
  - State Key Laboratory of Environmental Criteria and Risk Assessment, Chinese Research Academy of Environmental Sciences, Beijing 100012, China
- Guoxian Huang
  - State Key Laboratory of Environmental Criteria and Risk Assessment, Chinese Research Academy of Environmental Sciences, Beijing 100012, China
- Minghui Chi
  - State Key Laboratory of Environmental Criteria and Risk Assessment, Chinese Research Academy of Environmental Sciences, Beijing 100012, China
- Yao Shi
  - State Key Laboratory of Environmental Criteria and Risk Assessment, Chinese Research Academy of Environmental Sciences, Beijing 100012, China
- Jinyuan Jiang
  - State Key Laboratory of Environmental Criteria and Risk Assessment, Chinese Research Academy of Environmental Sciences, Beijing 100012, China
- Chaoyang Feng
  - State Key Laboratory of Environmental Criteria and Risk Assessment, Chinese Research Academy of Environmental Sciences, Beijing 100012, China
- Zhenguang Yan
  - State Key Laboratory of Environmental Criteria and Risk Assessment, Chinese Research Academy of Environmental Sciences, Beijing 100012, China
- Zongxue Xu
  - College of Water Sciences, Beijing Normal University, Beijing 100875, China
|
8
|
Mohammed Rashid A, Midi H, Dhhan W, Arasan J. Detection of outliers in high-dimensional data using nu-support vector regression. J Appl Stat 2021; 49:2550-2569. [DOI: 10.1080/02664763.2021.1911965] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 10/21/2022]
Affiliation(s)
- Habshah Midi
  - Institute for Mathematical Research, Universiti Putra Malaysia, Serdang, Malaysia
  - Department of Mathematics, Faculty of Science, Universiti Putra Malaysia, Serdang, Malaysia
- Waleed Dhhan
  - Centre of Scientific Research, Nawroz University (NZU), Duhok, Iraq
  - Babylon Housing Department, Babylon Governorate, Babylon, Iraq
- Jayanthi Arasan
  - Department of Mathematics, Faculty of Science, Universiti Putra Malaysia, Serdang, Malaysia
|
9
|
Ho IMK, Cheong KY, Weldon A. Predicting student satisfaction of emergency remote learning in higher education during COVID-19 using machine learning techniques. PLoS One 2021; 16:e0249423. [PMID: 33798204 PMCID: PMC8018673 DOI: 10.1371/journal.pone.0249423] [Citation(s) in RCA: 25] [Impact Index Per Article: 6.3] [Received: 10/21/2020] [Accepted: 03/17/2021] [Indexed: 11/20/2022] Open
Abstract
Despite the wide adoption of emergency remote learning (ERL) in higher education during the COVID-19 pandemic, there is insufficient understanding of the factors that predict student satisfaction with this novel learning environment in a crisis. The present study investigated important predictors of the satisfaction of undergraduate students (N = 425) from multiple departments using ERL at a self-funded university in Hong Kong, where Moodle and Microsoft Teams were the key learning tools. Predictive accuracy was compared between multiple regression and machine learning models before and after the use of random forest recursive feature elimination; all models showed improved accuracy after feature elimination, and the most accurate model was the elastic net regression with 65.2% explained variance. The overall satisfaction score for ERL was only neutral (4.11 on a 7-point Likert scale). Even though the majority of students are competent in technology and have no obvious issue in accessing learning devices or Wi-Fi, face-to-face learning is preferred over ERL, and this preference was found to be the most important predictor. In addition, the level of effort made by instructors, agreement on the appropriateness of the adjusted assessment methods, and the perception of online learning being well delivered are shown to be highly important in determining the satisfaction scores. The results suggest the need to review the quality and quantity of modified assessments accommodated for ERL and to provide structured class delivery with a suitable amount of interactive learning according to the learning culture and program nature.
Affiliation(s)
- Indy Man Kit Ho
  - Technological and Higher Education Institute of Hong Kong (THEi), Chai Wan, Hong Kong
- Kai Yuen Cheong
  - Technological and Higher Education Institute of Hong Kong (THEi), Chai Wan, Hong Kong
- Anthony Weldon
  - Technological and Higher Education Institute of Hong Kong (THEi), Chai Wan, Hong Kong
|
10
|
Tekin AT, Çebi F. Click and sales prediction for OTAs' digital advertisements: Fuzzy clustering based approach. Journal of Intelligent & Fuzzy Systems 2020. [DOI: 10.3233/jifs-189123] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Indexed: 11/15/2022]
Abstract
Online travel agencies (OTAs) aim to use digital media advertisements as productively as possible to expand their market share. Metasearch engine platforms are among the digital media environments most consistently used by OTAs. Most OTAs place daily bids on metasearch engine platforms, paying per click for each hotel in order to obtain reservations. Managing bidding strategies is therefore critical for reducing costs and increasing revenue for online travel agencies. In this study, we tried to predict the number of impressions and the regular Click-Through-Rate (CTR) level of each hotel's advertising, as well as the daily sales amount. A significant contribution of our research is the use of an extended dataset generated by integrating the most informative features implemented in various related studies, such as rolling averages over different numbers of days and shifted values, for use in the proposed test stage for CTR, impression, and sales prediction. The data used in this study were provided by one of Turkey's largest OTAs, and we offer OTAs a real-world application. The results at each prediction stage show that enriching the training data with the most informative OTA-specific additional features and with sliding window techniques improves the generalization capability of the prediction models, and that tree-based boosting algorithms achieve the best results on this problem. Clustering the dataset according to its specifications also improves the results of the predictions.
Affiliation(s)
- Ahmet Tezcan Tekin
  - Istanbul Technical University Management Engineering Department, Besiktas, Istanbul, Turkey
- Ferhan Çebi
  - Istanbul Technical University Management Engineering Department, Besiktas, Istanbul, Turkey
|
11
|
Fan J, Wang S, Li H, Yan Z, Zhang Y, Zheng X, Wang P. Modeling the ecological status response of rivers to multiple stressors using machine learning: A comparison of environmental DNA metabarcoding and morphological data. Water Research 2020; 183:116004. [PMID: 32622231 DOI: 10.1016/j.watres.2020.116004] [Citation(s) in RCA: 22] [Impact Index Per Article: 4.4] [Received: 01/18/2020] [Revised: 05/29/2020] [Accepted: 05/30/2020] [Indexed: 06/11/2023]
Abstract
Understanding the ecological status response of rivers to multiple stressors is a precondition for river restoration and management. However, this requires the collection of appropriate data, including environmental variables and the status of aquatic organisms, and analysis via a suitable model that captures the nonlinear relationships between ecological status and various stressors. The morphological approach has been the standard data collection method employed for establishing the status of aquatic organisms. However, this approach is very laborious and restricted to a specific set of organisms. Recently, an environmental DNA (eDNA) metabarcoding data approach has been developed that is far more efficient than the morphological approach and potentially applicable to an unlimited set of organisms. However, it remains unclear how well eDNA metabarcoding data reflects the impacts of environmental stressors on aquatic ecosystems compared with morphological data, which is essential for clarifying the potential applications of eDNA metabarcoding data in the ecological monitoring and management of rivers. The present work addresses this issue by modeling organism diversity based on three indices with respect to multiple environmental variables in both the catchment and reach scales. This is done by corresponding support vector machine (SVM) models constructed from eDNA metabarcoding and morphological data on 24 sampling locations in the Taizi River basin, China. According to the mean absolute percent error (MAPE) between the measured diversity index values and the index values predicted by the SVM models, the SVM models constructed from eDNA metabarcoding data (MAPE = 3.87) provide more accurate predictions than the SVM models constructed from morphological data (MAPE = 28.36), revealing that the eDNA metabarcoding data better reflects environmental conditions. 
In addition, the sensitivity of SVM model predictions of the ecological indices for both catchment-scale and reach-scale stressors is evaluated, and the stressors having the greatest impact on the ecological status of rivers are identified. The results demonstrate that the ecological status of rivers is more sensitive to environmental stressors at the reach scale than to stressors at the catchment scale. Therefore, our study is helpful in exploring the potential applications of eDNA metabarcoding data and SVM modeling in the ecological monitoring and management of rivers.
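The abstract compares models by the mean absolute percent error (MAPE) between measured and predicted diversity index values. A minimal sketch of that metric follows, assuming the common definition expressed in percent (the abstract does not state its exact formula or units); the function name `mape` and the sample values are hypothetical.

```python
def mape(measured, predicted):
    """Mean absolute percent error, in percent; measured values must be nonzero."""
    if len(measured) != len(predicted) or not measured:
        raise ValueError("inputs must be non-empty and of equal length")
    total = sum(abs((m - p) / m) for m, p in zip(measured, predicted))
    return 100.0 * total / len(measured)


# Hypothetical diversity-index values, made up for illustration.
measured = [2.0, 4.0, 5.0]
predicted = [1.9, 4.2, 5.0]
error = mape(measured, predicted)
```

For these made-up values the per-sample relative errors are 5%, 5%, and 0%, so `error` is roughly 3.3. A lower MAPE means the predictions track the measured index values more closely.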
Affiliation(s)
- Juntao Fan
  - State Key Laboratory of Environmental Criteria and Risk Assessment, Chinese Research Academy of Environmental Sciences, Beijing, 100012, China
- Shuping Wang
  - State Key Laboratory of Environmental Criteria and Risk Assessment, Chinese Research Academy of Environmental Sciences, Beijing, 100012, China
- Hong Li
  - Lancaster Environment Centre, Lancaster University, LA1 4YQ, UK
  - UK Centre for Ecology & Hydrology, MacLean Building, Wallingford, OX10 8BB, UK
- Zhenguang Yan
  - State Key Laboratory of Environmental Criteria and Risk Assessment, Chinese Research Academy of Environmental Sciences, Beijing, 100012, China
- Yizhang Zhang
  - State Key Laboratory of Environmental Criteria and Risk Assessment, Chinese Research Academy of Environmental Sciences, Beijing, 100012, China
  - Chinese Research Academy of Environmental Sciences Tianjin Branch, Tianjin, 300457, China
- Xin Zheng
  - State Key Laboratory of Environmental Criteria and Risk Assessment, Chinese Research Academy of Environmental Sciences, Beijing, 100012, China
- Pengyuan Wang
  - State Key Laboratory of Environmental Criteria and Risk Assessment, Chinese Research Academy of Environmental Sciences, Beijing, 100012, China
|
12
|
Abstract
Laboratory analysis of concrete samples requires significant experimental time and cost. In addition, advances in data mining provide a valuable tool for researchers to extract information about the relations among experimental and physical properties in a more elaborate way, improving the performance of prediction models and guiding concrete mix design. A data set of 90 samples was developed and used in this research. The experiment was designed to study the effect of natural silica addition at different levels on the physical properties of concrete, mainly compressive strength. Compressive strength was measured after 3 and 28 days for different levels of milling time. Support vector regression (SVR) and artificial neural network (ANN) models were developed for predicting the compressive strength of concrete using five input variables, including the silica additive fraction. The SVR model metrics were compared with those of the ANN model; the SVR model showed a good correlation coefficient of 0.929, although lower than that of the ANN. The advantage of SVR over ANN lies in the developed regression model, which can be interpreted physically. The silica fraction variable ranked third after the curing time and cement ratio variables, which indicates its importance.
|
13
|
Rodríguez-Pérez R, Vogt M, Bajorath J. Support Vector Machine Classification and Regression Prioritize Different Structural Features for Binary Compound Activity and Potency Value Prediction. ACS Omega 2017; 2:6371-6379. [PMID: 30023518 PMCID: PMC6045367 DOI: 10.1021/acsomega.7b01079] [Citation(s) in RCA: 50] [Impact Index Per Article: 6.3] [Received: 07/27/2017] [Accepted: 09/22/2017] [Indexed: 05/15/2023]
Abstract
In computational chemistry and chemoinformatics, the support vector machine (SVM) algorithm is among the most widely used machine learning methods for the identification of new active compounds. In addition, support vector regression (SVR) has become a preferred approach for modeling nonlinear structure-activity relationships and predicting compound potency values. For the closely related SVM and SVR methods, fingerprints (i.e., bit string or feature set representations of chemical structure and properties) are generally preferred descriptors. Herein, we have compared SVM and SVR calculations for the same compound data sets to evaluate which features are responsible for predictions. On the basis of systematic feature weight analysis, rather surprising results were obtained. Fingerprint features were frequently identified that contributed differently to the corresponding SVM and SVR models. The overlap between feature sets determining the predictive performance of SVM and SVR was only very small. Furthermore, features were identified that had opposite effects on SVM and SVR predictions. Feature weight analysis in combination with feature mapping made it also possible to interpret individual predictions, thus balancing the black box character of SVM/SVR modeling.
|
14
|
Abstract
BACKGROUND: Predicting the response to a drug for cancer disease patients based on genomic information is an important problem in modern clinical oncology. This problem occurs in part because many available drug sensitivity prediction algorithms do not consider better quality cancer cell lines and the adoption of new feature representations; both lead to the accurate prediction of drug responses. By predicting accurate drug responses to cancer, oncologists gain a more complete understanding of the effective treatments for each patient, which is a core goal in precision medicine.
RESULTS: In this paper, we model cancer drug sensitivity as a link prediction, which is shown to be an effective technique. We evaluate our proposed link prediction algorithms and compare them with an existing drug sensitivity prediction approach based on clinical trial data. The experimental results based on the clinical trial data show the stability of our link prediction algorithms, which yield the highest area under the ROC curve (AUC) and are statistically significant.
CONCLUSIONS: We propose a link prediction approach to obtain new feature representation. Compared with an existing approach, the results show that incorporating the new feature representation to the link prediction algorithms has significantly improved the performance.
Affiliation(s)
- Turki Turki
  - Department of Computer Science, King Abdulaziz University, P.O. Box 80221, Jeddah, 21589, Saudi Arabia
  - Bioinformatics Program and Department of Computer Science, New Jersey Institute of Technology, Newark, NJ, 07102, USA
- Zhi Wei
  - Bioinformatics Program and Department of Computer Science, New Jersey Institute of Technology, Newark, NJ, 07102, USA
|