1. Unmasking the sky: high-resolution PM2.5 prediction in Texas using machine learning techniques. Journal of Exposure Science & Environmental Epidemiology 2024. PMID: 38561475; DOI: 10.1038/s41370-024-00659-w.
Abstract
BACKGROUND Although PM2.5 (fine particulate matter with an aerodynamic diameter of less than 2.5 µm) is an air pollutant of great concern in Texas, the limited number of regulatory monitors poses a significant challenge for decision-making and environmental studies. OBJECTIVE This study aimed to predict daily PM2.5 concentrations at a fine spatial scale using machine learning approaches that incorporate satellite-derived Aerosol Optical Depth (AOD) and a variety of weather and land-use variables. METHODS We compiled a comprehensive dataset in Texas from 2013 to 2017, including ground-level PM2.5 concentrations from regulatory monitors; AOD values at 1-km resolution based on images retrieved from the MODIS satellite; and weather, land-use, and population-density variables, among others. We built predictive models for each year separately to estimate PM2.5 concentrations using two machine learning approaches, gradient boosted trees and random forests. We evaluated model prediction performance using in-sample and out-of-sample validation. RESULTS Our predictive models demonstrate excellent in-sample performance, as indicated by high R2 values from the gradient boosting models (0.94-0.97) and random forest models (0.81-0.90). However, the out-of-sample R2 values fall within a range of 0.52-0.75 for gradient boosting models and 0.44-0.69 for random forest models. Model performance varies slightly across years, and a generally decreasing trend in predicted PM2.5 concentrations over time is observed in Eastern Texas. IMPACT STATEMENT We utilized machine learning approaches to predict PM2.5 levels in Texas. Both gradient boosting and random forest models perform well, with gradient boosting performing slightly better. Our models showed excellent in-sample prediction performance (R2 > 0.9).
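The in-sample versus out-of-sample contrast this abstract reports can be sketched with scikit-learn. Everything below is a synthetic stand-in (random predictors in place of AOD, weather, and land-use variables), not the Texas dataset: in-sample R2 is computed on the training data the model has already seen, out-of-sample R2 on held-out data.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(0)
X = rng.normal(size=(500, 6))  # stand-ins for AOD, weather, and land-use predictors
y = 2 * X[:, 0] + np.sin(X[:, 1]) + rng.normal(scale=0.5, size=500)  # synthetic "PM2.5"

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

results = {}
for name, model in [("gbm", GradientBoostingRegressor(random_state=0)),
                    ("rf", RandomForestRegressor(random_state=0))]:
    model.fit(X_tr, y_tr)
    results[name] = (r2_score(y_tr, model.predict(X_tr)),   # in-sample R2
                     r2_score(y_te, model.predict(X_te)))   # out-of-sample R2
```

The gap between the two R2 values is the overfitting the abstract describes: in-sample scores alone overstate how well the model will predict at unmonitored locations.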
2. Pseudo-value regression trees. Lifetime Data Analysis 2024; 30:439-471. PMID: 38403840; DOI: 10.1007/s10985-024-09618-x.
Abstract
This paper presents a semi-parametric modeling technique for estimating the survival function from a set of right-censored time-to-event data. Our method, named pseudo-value regression trees (PRT), is based on the pseudo-value regression framework, modeling individual-specific survival probabilities by computing pseudo-values and relating them to a set of covariates. The standard approach to pseudo-value regression is to fit a main-effects model using generalized estimating equations (GEE). PRT extend this approach by building a multivariate regression tree with pseudo-value outcome and by successively fitting a set of regularized additive models to the data in the nodes of the tree. Due to the combination of tree learning and additive modeling, PRT are able to perform variable selection and to identify relevant interactions between the covariates, thereby addressing several limitations of the standard GEE approach. In addition, PRT include time-dependent effects in the node-wise models. Interpretability of the PRT fits is ensured by controlling the tree depth. Based on the results of two simulation studies, we investigate the properties of the PRT method and compare it to several alternative modeling techniques. Furthermore, we illustrate PRT by analyzing survival in 3,652 patients enrolled for a randomized study on primary invasive breast cancer.
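The pseudo-value ingredient can be sketched in a few lines: the jackknife pseudo-value for the survival probability at a fixed time t is n*S(t) - (n-1)*S_without_i(t), computed from Kaplan-Meier estimates, and the pseudo-values are then regressed on covariates with a tree. This is a minimal illustration of the idea only, not the authors' PRT implementation (which combines tree learning with regularized additive node-wise models); the toy cohort is synthetic.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def km_survival(time, event, t):
    """Kaplan-Meier estimate of S(t) from right-censored data (event=1: observed)."""
    order = np.argsort(time)
    time, event = time[order], event[order]
    s, at_risk = 1.0, len(time)
    for i in range(len(time)):
        if time[i] > t:
            break
        if event[i]:
            s *= 1.0 - 1.0 / at_risk
        at_risk -= 1
    return s

def pseudo_values(time, event, t):
    """Jackknife pseudo-values for S(t): n*S_full - (n-1)*S_leave_one_out."""
    n = len(time)
    s_full = km_survival(time, event, t)
    mask = np.ones(n, dtype=bool)
    pv = np.empty(n)
    for i in range(n):
        mask[i] = False
        pv[i] = n * s_full - (n - 1) * km_survival(time[mask], event[mask], t)
        mask[i] = True
    return pv

# Toy cohort: covariate x shifts survival; event=0 marks right-censoring.
rng = np.random.RandomState(0)
n = 200
x = rng.normal(size=(n, 1))
true_t = rng.exponential(np.exp(0.8 * x[:, 0]))
cens_t = rng.exponential(2.0, size=n)
time = np.minimum(true_t, cens_t)
event = (true_t <= cens_t).astype(int)

pv = pseudo_values(time, event, t=1.0)
tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(x, pv)  # tree on pseudo-values
```

A useful sanity check: with no censoring, the pseudo-value for subject i reduces exactly to the indicator that subject i survived past t.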
3. Modeling Epidemiology Data with Machine Learning Technique to Detect Risk Factors for Gastric Cancer. J Gastrointest Cancer 2024; 55:287-296. PMID: 37428282; DOI: 10.1007/s12029-023-00952-1.
Abstract
PURPOSE Gastric cancer (GC) ranks as the 7th most common cancer worldwide and is a leading cause of cancer mortality. In Iran, stomach malignancies are the most common fatal cancers, with an incidence higher than the world average. In recent years, machine learning methods, which merge health questions with computational power and learning capacity, have attracted considerable attention for the prediction and diagnosis of diseases. In this study, we aimed to model GC data to find risk factors and identify GC cases in the Golestan Cohort Study (GCS), using gradient boosting as a machine learning technique. METHODS Since the GC class (280) was much smaller than the not-GC class (49,467), the Synthetic Minority Oversampling Technique (SMOTE) was used to balance the dataset. Seventy percent of the data was used to train the gradient boosting algorithm and find factors affecting gastric cancer, and the remaining 30% was used for accuracy assessment. RESULTS Our results indicated that out of 19 factors, age, socioeconomic status, tea temperature, body mass index, gender, and education were the top six effective factors, with impact rates of 0.24, 0.16, 0.13, 0.13, and 0.07, respectively. The trained model correctly classified 70 of the 72 GC patients in the test set. CONCLUSION The results indicate that this model can effectively detect gastric cancer by utilizing important risk factors, avoiding the need for invasive procedures. The model's performance is reliable when provided with an adequate amount of input data, and as the dataset expands, its accuracy and generalization improve. Overall, the trained system's success stems from its ability to identify risk factors and detect cancer patients.
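The oversample-then-boost pipeline described above can be sketched as follows. To keep the sketch dependency-light, minority oversampling is done with a simplified SMOTE-style interpolation between random minority pairs rather than the imbalanced-learn implementation, and the imbalanced cohort is synthetic. Note that only the training split is balanced; the test split keeps the real class ratio.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

def smote_like(X_min, n_new, rng):
    """Simplified SMOTE: interpolate between random pairs of minority samples."""
    i = rng.randint(len(X_min), size=n_new)
    j = rng.randint(len(X_min), size=n_new)
    lam = rng.rand(n_new, 1)
    return X_min[i] + lam * (X_min[j] - X_min[i])

rng = np.random.RandomState(0)
# Imbalanced synthetic cohort: ~2% "cases" (stand-in for GC vs. not-GC).
X_maj = rng.normal(0.0, 1.0, size=(2000, 6))
X_min = rng.normal(1.5, 1.0, size=(40, 6))
X = np.vstack([X_maj, X_min])
y = np.r_[np.zeros(2000), np.ones(40)]

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                          stratify=y, random_state=0)

# Balance the TRAINING set only (never the test set).
X_min_tr = X_tr[y_tr == 1]
n_new = int((y_tr == 0).sum() - (y_tr == 1).sum())
X_syn = smote_like(X_min_tr, n_new, rng)
X_bal = np.vstack([X_tr, X_syn])
y_bal = np.r_[y_tr, np.ones(n_new)]

clf = GradientBoostingClassifier(random_state=0).fit(X_bal, y_bal)
recall = float((clf.predict(X_te)[y_te == 1] == 1).mean())  # minority-class recall
```

Balancing before boosting keeps the classifier from trivially predicting the majority class, which is why the abstract's headline number is minority-class detection (70/72) rather than overall accuracy.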
4. Artificial intelligence-based prediction models for acute myeloid leukemia using real-life data: A DATAML registry study. Leuk Res 2024; 136:107437. PMID: 38215555; DOI: 10.1016/j.leukres.2024.107437.
Abstract
We designed artificial intelligence-based prediction models (AIPM) using 52 diagnostic variables from 3687 patients included in the DATAML registry who were treated with intensive chemotherapy (IC, N = 3030) or azacitidine (AZA, N = 657) for acute myeloid leukemia (AML). A neural network called a multilayer perceptron (MLP) achieved a prediction accuracy for overall survival (OS) of 68.5% and 62.1% in the IC and AZA cohorts, respectively. The Boruta algorithm could select the most important variables for prediction without decreasing accuracy. Thirteen features were retained with this algorithm in the IC cohort: age, cytogenetic risk, white blood cell count, LDH, platelet count, albumin, MPO expression, mean corpuscular volume, CD117 expression, NPM1 mutation, AML status (de novo or secondary), multilineage dysplasia, and ASXL1 mutation; and seven variables in the AZA cohort: blood blasts, serum ferritin, CD56, LDH, hemoglobin, CD13, and disseminated intravascular coagulation (DIC). We believe that AIPM could help hematologists deal with the huge amount of data available at diagnosis, giving them an OS estimate to guide their treatment choice. Our registry-based AIPM offers a large real-life dataset with original and exhaustive features and selects a small number of diagnostic features with equivalent prediction accuracy, making it more appropriate for routine practice.
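Boruta works by comparing each real feature's importance against "shadow" features (column-wise permuted copies that carry no signal by construction) over many random-forest iterations. A single-pass simplification of that idea, on synthetic data standing in for the registry variables, looks like this:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.RandomState(0)
n = 600
X_inf = rng.normal(size=(n, 3))    # informative stand-ins (e.g. age, cytogenetics, WBC)
X_noise = rng.normal(size=(n, 7))  # uninformative variables
y = (X_inf[:, 0] + X_inf[:, 1] - X_inf[:, 2]
     + rng.normal(scale=0.5, size=n) > 0).astype(int)
X = np.hstack([X_inf, X_noise])

# Shadow features: permute each column independently, destroying any signal.
X_shadow = np.apply_along_axis(rng.permutation, 0, X)
forest = RandomForestClassifier(n_estimators=300, random_state=0)
forest.fit(np.hstack([X, X_shadow]), y)

imp = forest.feature_importances_
threshold = imp[X.shape[1]:].max()                 # best any shadow feature achieved
selected = np.where(imp[:X.shape[1]] > threshold)[0]  # real features beating all shadows
```

Real Boruta repeats this over many forests and applies a statistical test before confirming or rejecting a feature; the single pass above only conveys the shadow-feature trick.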
5. An interpretable time series machine learning method for varying forecast and nowcast lengths in wastewater-based epidemiology. MethodsX 2023; 11:102382. PMID: 37822674; PMCID: PMC10562867; DOI: 10.1016/j.mex.2023.102382.
Abstract
Wastewater-based epidemiology has emerged as a viable tool for monitoring disease prevalence in a population. This paper details a time series machine learning (TSML) method for predicting COVID-19 cases from wastewater and environmental variables. The TSML method utilizes a number of techniques to create an interpretable, hypothesis-driven framework for machine learning that can handle different nowcast and forecast lengths. Some of the techniques employed include:
• Feature engineering to construct interpretable features, like site-specific lead times, hypothesized to be potential predictors of COVID-19 cases.
• Feature selection to identify features with the best predictive performance for the tasks of nowcasting and forecasting.
• Prequential evaluation to prevent data leakage while evaluating the performance of the machine learning algorithm.
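The techniques above can be sketched with scikit-learn: lagged wastewater values as engineered lead-time features, and prequential evaluation via TimeSeriesSplit, which always trains on the past and tests on the future so no future information leaks into training. The wastewater signal, lag choices, and case counts below are synthetic stand-ins.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import TimeSeriesSplit

rng = np.random.RandomState(0)
n = 400
wastewater = rng.gamma(2.0, 1.0, size=n)  # stand-in daily viral-load signal
# Cases trail the wastewater signal by ~7 days in this toy setup.
cases = np.roll(wastewater, 7) * 10 + rng.normal(scale=2.0, size=n)

# Feature engineering: lagged wastewater values around the hypothesized lead time.
lags = np.column_stack([np.roll(wastewater, k) for k in (5, 6, 7, 8, 9)])
X, y = lags[10:], cases[10:]  # drop rows contaminated by the wrap-around of roll()

# Prequential evaluation: each fold trains strictly before its test window.
maes = []
for tr, te in TimeSeriesSplit(n_splits=5).split(X):
    model = GradientBoostingRegressor(random_state=0).fit(X[tr], y[tr])
    maes.append(mean_absolute_error(y[te], model.predict(X[te])))
```

Shifting which lags enter X is how the same pipeline switches between nowcasting and longer forecast horizons.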
6. Identifying relapse predictors in individual participant data with decision trees. BMC Psychiatry 2023; 23:835. PMID: 37957596; PMCID: PMC10644580; DOI: 10.1186/s12888-023-05214-9.
Abstract
BACKGROUND Depression is a highly common and recurrent condition. Predicting who is most at risk of relapse or recurrence can inform clinical practice. Applying machine-learning methods to Individual Participant Data (IPD) holds promise for improving the accuracy of risk predictions. METHODS Individual data from four Randomized Controlled Trials (RCTs) evaluating antidepressant treatment compared to psychological interventions with tapering ([Formula: see text]) were used to identify predictors of relapse and/or recurrence. Ten baseline predictors were assessed. Decision trees with and without gradient boosting were applied, and a complementary logistic regression analysis was performed to study the robustness of the decision-tree classifications. RESULTS The combination of age, age of onset of depression, and depression severity significantly enhances the prediction of relapse risk compared to classifiers based solely on depression severity. The studied decision trees can (i) identify relapse patients at intake with an accuracy, specificity, and sensitivity of about 55% (without gradient boosting) and 58% (with gradient boosting), and (ii) slightly outperform classifiers based on logistic regression. CONCLUSIONS Decision tree classifiers based on multiple rather than single risk indicators may be useful for developing treatment stratification strategies. These classification models have the potential to contribute to methods for effectively prioritizing treatment for the individuals who need it most. Our results also underline existing gaps in understanding how to accurately predict depressive relapse.
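The accuracy/sensitivity/specificity comparison between a plain tree and a boosted model can be sketched from a confusion matrix; the predictors below are synthetic stand-ins for the IPD baseline variables, not the trial data.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(0)
n = 800
# Stand-ins for age, age of onset, and baseline depression severity.
X = rng.normal(size=(n, 3))
relapse = (0.8 * X[:, 2] + 0.4 * X[:, 0]
           + rng.normal(scale=1.0, size=n) > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, relapse, test_size=0.3, random_state=0)

def evaluate(clf):
    tn, fp, fn, tp = confusion_matrix(y_te, clf.fit(X_tr, y_tr).predict(X_te)).ravel()
    return {"accuracy": (tp + tn) / (tp + tn + fp + fn),
            "sensitivity": tp / (tp + fn),   # relapse cases caught
            "specificity": tn / (tn + fp)}   # non-relapse correctly cleared

plain = evaluate(DecisionTreeClassifier(max_depth=3, random_state=0))
boosted = evaluate(GradientBoostingClassifier(random_state=0))
```

Reporting all three metrics matters here because, at ~55-58% accuracy, a model could look acceptable overall while missing most relapse cases.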
7. Fishery catch records support machine learning-based prediction of illegal fishing off US West Coast. PeerJ 2023; 11:e16215. PMID: 37872950; PMCID: PMC10590572; DOI: 10.7717/peerj.16215.
Abstract
Illegal, unreported, and unregulated (IUU) fishing is a major problem worldwide, often made more challenging by a lack of at-sea and shoreside monitoring of commercial fishery catches. Off the US West Coast, as in many places, a primary concern for enforcement and management is whether vessels are illegally fishing in locations where they are not permitted to fish. We explored the use of supervised machine learning analysis in a partially observed fishery to identify potentially illicit behaviors when vessels did not have observers on board. We built classification models (random forest and gradient boosting ensemble tree estimators) using labeled data from nearly 10,000 fishing trips for which we had landing records (i.e., catch data) and observer data. We identified a set of variables related to catch (e.g., catch weights and species) and delivery port that could predict, with 97% accuracy, whether vessels fished in state versus federal waters. Notably, our model performances were robust to inter-annual variability in the fishery environments during recent anomalously warm years. We applied these models to nearly 60,000 unobserved landing records and identified more than 500 instances in which vessels may have illegally fished in federal waters. This project was developed at the request of fisheries enforcement investigators, and now an automated system analyzes all new unobserved landings records to identify those in need of additional investigation for potential violations. Similar approaches informed by the spatial preferences of species landed may support monitoring and enforcement efforts in any number of partially observed, or even totally unobserved, fisheries globally.
8. Machine learning methods for anomaly classification in wastewater treatment plants. Journal of Environmental Management 2023; 344:118594. PMID: 37473555; DOI: 10.1016/j.jenvman.2023.118594.
Abstract
Modern wastewater treatment plants base their biological processes on advanced control systems which ensure compliance with discharge limits and minimize energy consumption responding to information from on-line probes. The correct readings of probes are particularly crucial for intermittent aeration controllers, which rely on real-time measurements of ammonia and oxygen in biological tanks. These data are also an important resource for developing artificial intelligence algorithms that can identify process or sensor anomalies, thus guiding the choices of plant operators and automatic process controllers. However, using anomaly detection and classification algorithms in real-time wastewater treatment is challenging because of the noisy nature of sensor measurements, the difficulty of obtaining labeled real-plant data, and the complex and interdependent mechanisms that govern biological processes. This work aims at thoroughly exploring the performance of machine learning methods in detecting and classifying the main anomalies in plants operating with intermittent aeration. Using oxygen, ammonia and aeration power measurements from a set of plants in Italy, we perform both binary and multiclass classification, and we compare them through a rigorous validation procedure that includes a test on an unknown dataset, proposing a new evaluation protocol. The classification methods explored are support vector machine, multilayer perceptron, random forest, and two gradient boosting methods (LightGBM and XGBoost). The best performance was achieved using the gradient boosting ensemble algorithms, with up to 96% of anomalies detected and up to 84% and 62% of anomalies classified correctly on the first and second datasets respectively.
9. Diagnosis of Parkinson's disease based on voice signals using SHAP and hard voting ensemble method. Comput Methods Biomech Biomed Engin 2023:1-17. PMID: 37771234; DOI: 10.1080/10255842.2023.2263125.
Abstract
Parkinson's disease (PD) is the second most common progressive neurological condition after Alzheimer's disease. The significant number of individuals afflicted with this illness makes it essential to develop methods that can diagnose the condition in its early phases. PD is typically identified from motor symptoms or via neuroimaging techniques, which are expensive, time-consuming, not widely accessible, and not especially accurate. Another issue to be addressed is the black-box nature of machine learning methods, which requires interpretation. These issues motivated us to develop a novel technique using Shapley additive explanations (SHAP) and a hard voting ensemble method based on voice signals to diagnose PD more accurately. A further purpose of this study is to interpret the model's output and determine the most important features for diagnosing PD. The present article uses Pearson correlation coefficients to understand the relationship between the input features and the output; input features with high correlation are selected and then classified by Extreme Gradient Boosting, Light Gradient Boosting Machine, Gradient Boosting, and Bagging. The weights in the hard voting ensemble are determined by the performance of these classifiers, and at the final stage SHAP is used to determine the most important features in PD diagnosis. The effectiveness of the proposed method is validated on the 'Parkinson Dataset with Replicated Acoustic Features' from the UCI machine learning repository, achieving an accuracy of 85.42%. The findings demonstrate that the proposed method outperforms state-of-the-art approaches and can assist physicians in diagnosing Parkinson's cases.
10. Practical guidelines for the use of gradient boosting for molecular property prediction. J Cheminform 2023; 15:73. PMID: 37641120; PMCID: PMC10464382; DOI: 10.1186/s13321-023-00743-7.
Abstract
Decision tree ensembles are among the most robust, high-performing and computationally efficient machine learning approaches for quantitative structure-activity relationship (QSAR) modeling. Among them, gradient boosting has recently garnered particular attention, for its performance in data science competitions, virtual screening campaigns, and bioactivity prediction. However, different variants of gradient boosting exist, the most popular being XGBoost, LightGBM and CatBoost. Our study provides the first comprehensive comparison of these approaches for QSAR. To this end, we trained 157,590 gradient boosting models, which were evaluated on 16 datasets and 94 endpoints, comprising 1.4 million compounds in total. Our results show that XGBoost generally achieves the best predictive performance, while LightGBM requires the least training time, especially for larger datasets. In terms of feature importance, the models surprisingly rank molecular features differently, reflecting differences in regularization techniques and decision tree structures. Thus, expert knowledge must always be employed when evaluating data-driven explanations of bioactivity. Furthermore, our results show that the relevance of each hyperparameter varies greatly across datasets and that it is crucial to optimize as many hyperparameters as possible to maximize the predictive performance. In conclusion, our study provides the first set of guidelines for cheminformatics practitioners to effectively train, optimize and evaluate gradient boosting models for virtual screening and QSAR applications.
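The guideline to optimize as many hyperparameters as possible translates directly into a joint search rather than tuning one knob at a time. A minimal sketch, with scikit-learn's GradientBoostingRegressor standing in for XGBoost/LightGBM/CatBoost and random data standing in for molecular descriptors:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

rng = np.random.RandomState(0)
X = rng.normal(size=(300, 8))                                  # stand-in descriptors
y = X[:, 0] ** 2 + X[:, 1] + rng.normal(scale=0.3, size=300)   # stand-in endpoint

# Search several hyperparameters jointly; interactions between them matter.
grid = {"n_estimators": [100, 300],
        "learning_rate": [0.05, 0.1],
        "max_depth": [2, 3],
        "subsample": [0.7, 1.0]}
search = GridSearchCV(GradientBoostingRegressor(random_state=0), grid, cv=3,
                      scoring="neg_root_mean_squared_error").fit(X, y)
best = search.best_params_
```

For the larger grids the paper's 157,590-model benchmark implies, a randomized or Bayesian search over the same parameter space is the usual substitute for exhaustive enumeration.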
11. Predicting energy use in construction using Extreme Gradient Boosting. PeerJ Comput Sci 2023; 9:e1500. PMID: 37705620; PMCID: PMC10496006; DOI: 10.7717/peerj-cs.1500.
Abstract
Annual increases in global energy consumption are an unavoidable consequence of a growing global economy and population. Among the different sectors, the construction industry consumes an average of 20.1% of the world's total energy, so exploring methods for estimating the amount of energy used is critical. Several approaches have been developed to address this issue; such methods are expected to contribute to energy savings as well as reduce the risks of global warming. Existing computational approaches to predicting energy use fall into statistics-based, engineering-based, and machine learning-based categories, and machine learning-based frameworks have shown the best performance among them. In our study, we proposed using Extreme Gradient Boosting (XGB), a tree-based ensemble learning algorithm, to tackle the issue. We used a dataset of hourly energy-consumption records from an office building in Shanghai, China, covering January 1, 2015, to December 31, 2016. The experimental results demonstrated that the XGB model developed using both historical and date features worked better than models developed using only one type of feature. The best-performing model achieved RMSE and MAPE values of 109.00 and 0.24, respectively.
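Combining "historical" (lagged load) and "date" features, then scoring RMSE and MAPE on a chronological split, can be sketched as follows. The hourly load series is synthetic (a daily plus weekly cycle), not the Shanghai dataset, and scikit-learn's GradientBoostingRegressor stands in for XGBoost.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.RandomState(0)
hours = np.arange(24 * 365)
# Synthetic hourly office load: daily cycle + weekday bump + noise.
load = (50 + 30 * np.sin(2 * np.pi * (hours % 24) / 24)
        + 10 * ((hours // 24) % 7 < 5) + rng.normal(scale=3.0, size=hours.size))

# Date features (hour of day, day of week) plus historical features (lagged loads).
hour_of_day = hours % 24
day_of_week = (hours // 24) % 7
lag24 = np.roll(load, 24)    # same hour yesterday
lag168 = np.roll(load, 168)  # same hour last week
X = np.column_stack([hour_of_day, day_of_week, lag24, lag168])[168:]
y = load[168:]               # drop rows contaminated by roll() wrap-around

split = int(0.8 * len(y))    # chronological split: train on the past only
model = GradientBoostingRegressor(random_state=0).fit(X[:split], y[:split])
pred = model.predict(X[split:])
rmse = float(np.sqrt(np.mean((y[split:] - pred) ** 2)))
mape = float(np.mean(np.abs((y[split:] - pred) / y[split:])))
```

Dropping either feature family degrades this kind of model: date features alone miss load shifts, while lags alone miss the calendar structure, which is the abstract's central finding.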
12. Efficient data transmission on wireless communication through a privacy-enhanced blockchain process. PeerJ Comput Sci 2023; 9:e1308. PMID: 37346706; PMCID: PMC10280508; DOI: 10.7717/peerj-cs.1308.
Abstract
In modern medicine, wearables collect and manage specific data points such as resting heart rate, ECG voltage, SpO2, sleep patterns (length, interruptions, and intensity), and physical activity (type, duration, and level). These digital biomarkers are created mainly through passive data collection from various sensors. The critical issues with this method are time and sensitivity. To address this problem, we reviewed the newest wireless communication trends employed in hospitals using wearable technology, privacy protection, and blockchain. Based on sensors, this wireless technology manages the data gathered from numerous locations; in this study, the wearable sensors carry data from the various departments of the system. A gradient boosting method combined with a hybrid microwave transmission method is proposed to find the location and inform people, and patient health decisions are submitted to hybrid microwave transmission using gradient boosting. This helps trace mobile phones from calls placed by a threatening person, with the data gathered from the database during tracing. On this basis, the data analysis process relies on decision-making: statistical modeling of the detailed data produces exploratory data analysis to validate the data against the database. After removing unwanted data, the complete data is classified with a 97% outcome, yielding a 98% successful data classification.
13. A boosting first-hitting-time model for survival analysis in high-dimensional settings. Lifetime Data Analysis 2023; 29:420-440. PMID: 35476164; PMCID: PMC10006065; DOI: 10.1007/s10985-022-09553-9.
Abstract
In this paper we propose a boosting algorithm to extend the applicability of first hitting time models to high-dimensional frameworks. Based on an underlying stochastic process, first hitting time models do not require the proportional hazards assumption, which is hard to verify in the high-dimensional context, and represent a valid parametric alternative to the Cox model for modelling time-to-event responses. First hitting time models also offer a natural way to integrate low-dimensional clinical and high-dimensional molecular information in a prediction model, avoiding the complicated weighting schemes typical of current methods. The performance of our novel boosting algorithm is illustrated in three real data examples.
14. scEvoNet: a gradient boosting-based method for prediction of cell state evolution. BMC Bioinformatics 2023; 24:83. PMID: 36879200; PMCID: PMC9990205; DOI: 10.1186/s12859-023-05213-3.
Abstract
BACKGROUND Exploring the function or the developmental history of cells in various organisms provides insights into a given cell type's core molecular characteristics and putative evolutionary mechanisms. Numerous computational methods now exist for analyzing single-cell data and identifying cell states. These methods mostly rely on the expression of genes considered as markers for a given cell state. Yet, there is a lack of scRNA-seq computational tools to study the evolution of cell states, particularly how cell states change their molecular profiles. This can include novel gene activation or the novel deployment of programs already existing in other cell types, known as co-option. RESULTS Here we present scEvoNet, a Python tool for predicting cell type evolution in cross-species or cancer-related scRNA-seq datasets. ScEvoNet builds the confusion matrix of cell states and a bipartite network connecting genes and cell states. It allows a user to obtain a set of genes shared by the characteristic signature of two cell states even between distantly-related datasets. These genes can be used as indicators of either evolutionary divergence or co-option occurring during organism or tumor evolution. Our results on cancer and developmental datasets indicate that scEvoNet is a helpful tool for the initial screening of such genes as well as for measuring cell state similarities. CONCLUSION The scEvoNet package is implemented in Python and is freely available from https://github.com/monsoro/scEvoNet . Utilizing this framework and exploring the continuum of transcriptome states between developmental stages and species will help explain cell state dynamics.
15. A machine learning tool for identifying non-metastatic colorectal cancer in primary care. Eur J Cancer 2023; 182:100-106. PMID: 36758474; DOI: 10.1016/j.ejca.2023.01.011.
Abstract
BACKGROUND Primary health care (PHC) is often the first point of contact when diagnosing colorectal cancer (CRC). Human limitations in processing large amounts of information warrant the use of machine learning as a diagnostic prediction tool for CRC. AIM To develop a predictive model for identifying non-metastatic CRC (NMCRC) among PHC patients using diagnostic data analysed with machine learning. DESIGN AND SETTING A case-control study containing data on PHC visits for 542 patients >18 years old diagnosed with NMCRC in the Västra Götaland Region, Sweden, during 2011, and 2,139 matched controls. METHOD Stochastic gradient boosting (SGB) was used to construct a model for predicting the presence of NMCRC based on diagnostic codes from PHC consultations during the year before the date of cancer diagnosis and the total number of consultations. Variables with a normalised relative influence (NRI) >1% were considered having an important contribution to the model. Risks of having NMCRC were calculated using odds ratios of marginal effects. RESULTS Of the 361 variables used as predictors in the stochastic gradient boosting model, 184 had non-zero influence, with 16 variables having NRI >1% and a combined NRI of 63.3%. Variables representing anaemia and bleeding had a combined NRI of 27.6%. The model had a sensitivity of 73.3% and a specificity of 83.5%. Change in bowel habit had the highest odds ratios of marginal effects at 28.8. CONCLUSION Machine learning is useful for identifying variables of importance for predicting NMCRC in PHC. Malignant diagnoses may be hidden behind benign symptoms such as haemorrhoids.
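Normalized relative influence (NRI) is each variable's importance scaled so that all variables sum to 100%, with the abstract's >1% cutoff marking important contributors. A sketch with a gradient boosting classifier on synthetic binary diagnostic-code indicators (the coefficients and prevalences below are illustrative assumptions, not the Swedish data):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.RandomState(0)
n = 1000
# Stand-ins for 30 diagnostic-code indicators (e.g. anaemia, bleeding) per patient.
X = rng.binomial(1, 0.15, size=(n, 30)).astype(float)
# Only the first two codes actually raise cancer risk in this toy setup.
logit = 2.0 * X[:, 0] + 1.5 * X[:, 1] - 2.0
y = (rng.rand(n) < 1.0 / (1.0 + np.exp(-logit))).astype(int)

model = GradientBoostingClassifier(random_state=0).fit(X, y)
# Normalized relative influence: importances rescaled to percentages.
nri = 100.0 * model.feature_importances_ / model.feature_importances_.sum()
important = np.where(nri > 1.0)[0]  # keep variables with NRI > 1%
```

As in the abstract, many variables receive non-zero influence, but only a handful pass the 1% threshold and carry most of the combined influence.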
16. Two-stage RFID approach for localizing objects in smart homes based on gradient boosted decision trees with under- and over-sampling. Journal of Reliable Intelligent Environments 2023:1-10. PMID: 36684414; PMCID: PMC9838260; DOI: 10.1007/s40860-022-00199-w.
Abstract
Developing automated systems with a reasonable cost for long-term care for elders is a promising research direction. Such smart systems are based on recognizing activities of daily living (ADLs) to enable aging in place while preserving the quality of life of all inhabitants of smart homes. One research direction is based on localizing items used by elders to monitor their activities with fine-grained detail. In this paper, we shed light on this issue by presenting an approach for localizing items in smart homes. The presented method applies machine learning algorithms to Radio Frequency IDentification (RFID) tag readings. Our approach achieves the required task in two stages: the first stage detects which room the selected object is located in, and the second determines the exact position of the selected object inside the detected room. Additionally, we present an efficient approach based on gradient boosted decision trees for detecting the location of the selected object in a real-world smart home, and we employ over- and under-sampling techniques with data clustering to improve the performance of the presented techniques. Many experiments were conducted to evaluate the performance of the presented approach for localizing objects in a real smart home, and their results show that our approach performs remarkably well.
17. Detection of Covid-19 and other pneumonia cases from CT and X-ray chest images using deep learning based on feature reuse residual block and depthwise dilated convolutions neural network. Appl Soft Comput 2023; 133:109906. PMID: 36504726; PMCID: PMC9726212; DOI: 10.1016/j.asoc.2022.109906.
Abstract
Covid-19 has become a worldwide epidemic which has caused the death of millions in a very short time. This rapidly transmitted disease has mutated and different variants have emerged. Early diagnosis is important to prevent its spread. In this study, a new deep learning-based architecture is proposed for rapid detection of Covid-19 and other pneumonia cases from CT and X-ray chest images. This method, called CovidDWNet, is built on feature reuse residual block (FRB) and depthwise dilated convolution (DDC) units. The FRB and DDC units efficiently acquire diverse features from the chest scan images, significantly improving the architecture's performance. In addition, the feature maps obtained with the CovidDWNet architecture were classified with the Gradient Boosting (GB) algorithm. With the CovidDWNet+GB architecture, a combination of CovidDWNet and GB, a performance increase of approximately 7% on CT images and between 3% and 4% on X-ray images was achieved. The CovidDWNet+GB architecture achieved the highest success compared to other architectures, with accuracy rates of 99.84% and 100%, respectively, on different datasets containing binary-class (Covid-19 and Normal) CT images. Similarly, the proposed architecture showed the highest success with 96.81% accuracy on multi-class (Covid-19, Lung Opacity, Normal and Viral Pneumonia) X-ray images and 96.32% accuracy on the dataset containing both X-ray and CT images. In terms of prediction time, the CovidDWNet+GB method is fast, predicting thousands of CT or X-ray images within seconds.
|
18
|
Statistical and Machine Learning Methods for Discovering Prognostic Biomarkers for Survival Outcomes. Methods Mol Biol 2023; 2629:11-21. [PMID: 36929071 DOI: 10.1007/978-1-0716-2986-4_2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 03/17/2023]
Abstract
Discovering molecular biomarkers for predicting patient survival outcomes is an essential step toward improving prognosis and therapeutic decision-making in the treatment of severe diseases such as cancer. Due to the high-dimensional nature of omics datasets, statistical methods such as the least absolute shrinkage and selection operator (Lasso) have been widely applied for cancer biomarker discovery. Owing to their scalability and demonstrated prediction performance, machine learning methods such as XGBoost and neural network models have also recently been gaining popularity in the community. However, compared with more traditional survival methods such as Kaplan-Meier and Cox regression, high-dimensional methods for survival outcomes remain less well known to biomedical researchers. In this chapter, we discuss the key analytical procedures in employing these methods to identify biomarkers associated with survival data. We also identify important considerations that emerged from the analysis of actual omics data. Some typical instances of misapplication and misinterpretation of machine learning methods are also discussed. Using lung cancer and head and neck cancer datasets as demonstrations, we provide step-by-step instructions and sample R code for prioritizing prognostic biomarkers.
|
19
|
Gradient boosting with extreme-value theory for wildfire prediction. EXTREMES 2023; 26:273-299. [PMID: 37091211 PMCID: PMC10115709 DOI: 10.1007/s10687-022-00454-6] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Figures] [Subscribe] [Scholar Register] [Received: 01/03/2022] [Revised: 10/28/2022] [Accepted: 10/31/2022] [Indexed: 05/03/2023]
Abstract
This paper details the approach of the team Kohrrelation in the 2021 Extreme Value Analysis data challenge, dealing with the prediction of wildfire counts and sizes over the contiguous US. Our approach uses ideas from extreme-value theory in a machine learning context with theoretically justified loss functions for gradient boosting. We devise a spatial cross-validation scheme and show that in our setting it provides a better proxy for test set performance than naive cross-validation. The predictions are benchmarked against boosting approaches with different loss functions, and perform competitively in terms of the score criterion, finally placing second in the competition ranking.
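The core mechanism this entry relies on, gradient boosting with a theoretically justified loss, comes down to fitting each new tree to the negative gradient of that loss. A minimal hand-rolled sketch, using squared-error loss as a stand-in for the paper's extreme-value losses (only the `neg_gradient` function would change):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(1)
X = rng.uniform(-3, 3, size=(400, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=400)

def neg_gradient(y_true, pred):
    # For squared-error loss L = (y - f)^2 / 2 the negative gradient is the residual;
    # an extreme-value-theory loss would substitute its own derivative here.
    return y_true - pred

pred = np.full_like(y, y.mean())  # initial constant model
learning_rate = 0.1
for _ in range(100):
    tree = DecisionTreeRegressor(max_depth=2, random_state=0)
    tree.fit(X, neg_gradient(y, pred))       # fit a weak learner to pseudo-residuals
    pred += learning_rate * tree.predict(X)  # gradient-descent step in function space

mse = float(np.mean((y - pred) ** 2))
```

Swapping in a custom loss only requires its gradient, which is why boosting adapts so readily to non-standard objectives such as those from extreme-value theory.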
|
20
|
Correlates of quality of life, happiness and life satisfaction among European adults older than 50 years: A machine-learning approach. Arch Gerontol Geriatr 2022; 103:104791. [PMID: 35998473 DOI: 10.1016/j.archger.2022.104791] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/05/2022] [Revised: 07/26/2022] [Accepted: 08/12/2022] [Indexed: 12/24/2022]
Abstract
BACKGROUND AND OBJECTIVES Previous research has documented the role of different categories of psychosocial factors (i.e., sociodemographic factors, personality, subjective life circumstances, activity, physical health, and childhood circumstances) in predicting subjective well-being and quality of life among older adults. No previous study has simultaneously modeled a large number of these psychosocial factors using a well-powered sample and machine learning algorithms to predict quality of life, happiness, and life satisfaction among older adults. The aim of this paper was to investigate the correlates of quality of life, happiness, and life satisfaction among European adults older than 50 years using machine learning techniques. RESEARCH DESIGN AND METHODS Data drawn from the Survey of Health, Ageing and Retirement in Europe (SHARE) Wave 7 were used. Participants were 62,500 persons aged 50 years and over living in 26 Continental EU Member States, Switzerland, and Israel. Multiple machine learning regression approaches were used. RESULTS The algorithms captured 53%, 33%, and 18% of the variance of quality of life, life satisfaction, and happiness, respectively. The most important categories of correlates of quality of life and life satisfaction were physical health and subjective life circumstances. Sociodemographic factors (mostly country of residence) and psychological variables were the most important categories of correlates of happiness. DISCUSSION AND IMPLICATIONS This study highlights subjective poverty, self-perceived health, country of residence, subjective survival probability, and personality factors (especially neuroticism) as important correlates of quality of life, happiness, and life satisfaction. These findings provide evidence-based recommendations for practice and/or policy implications.
|
21
|
Interpretable machine learning to model biomass and waste gasification. BIORESOURCE TECHNOLOGY 2022; 364:128062. [PMID: 36202285 DOI: 10.1016/j.biortech.2022.128062] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Subscribe] [Scholar Register] [Received: 08/17/2022] [Revised: 09/27/2022] [Accepted: 09/29/2022] [Indexed: 06/16/2023]
Abstract
Machine learning has been regarded as a promising method for better modeling thermochemical processes such as gasification. However, the black-box nature of such models can limit how much one can trust and learn from them. Here, seven different machine learning methods have been adopted to model the gasification of biomass and waste across a wide range of operating conditions. Gradient boosting regression was found to outperform the other model types, with a coefficient of determination (R2) of 0.90 when averaged across ten key gasification outputs. Global and local model interpretability methods have been used to illuminate the developed black-box models. The studied models were most strongly influenced by the feedstock's particle size and the type of gasifying agent employed. By combining global and local interpretability methods, the understanding of black-box models has been improved, allowing policy makers and investors to make more educated decisions about gasification process design.
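One widely used route to the kind of global interpretability this entry discusses is permutation importance on a fitted boosting model; note the abstract does not name the specific method used, so this is an illustrative sketch on synthetic data, not a reproduction of the paper's analysis:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(6)
# Three inputs; only the first two matter for the synthetic target.
X = rng.normal(size=(400, 3))
y = 2 * X[:, 0] - X[:, 1] + rng.normal(scale=0.1, size=400)

model = GradientBoostingRegressor(random_state=0).fit(X, y)

# Permutation importance: shuffle one column at a time and measure the score drop.
result = permutation_importance(model, X, y, n_repeats=5, random_state=0)
ranking = np.argsort(result.importances_mean)[::-1]  # most important first
```

Because it only needs predictions, this technique works on any black-box model, which is what makes it attractive for gasification-style studies comparing many model types.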
|
22
|
Cognition-Enhanced Machine Learning for Better Predictions with Limited Data. Top Cogn Sci 2022; 14:739-755. [PMID: 34529347 PMCID: PMC9786646 DOI: 10.1111/tops.12574] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/11/2021] [Revised: 08/20/2021] [Accepted: 08/20/2021] [Indexed: 12/30/2022]
Abstract
The fields of machine learning (ML) and cognitive science have developed complementary approaches to computationally modeling human behavior. ML's primary concern is maximizing prediction accuracy; cognitive science's primary concern is explaining the underlying mechanisms. Cross-talk between these disciplines is limited, likely because the tasks and goals usually differ. The domain of e-learning and knowledge acquisition constitutes a fruitful intersection for the two fields' methodologies to be integrated because accurately tracking learning and forgetting over time and predicting future performance based on learning histories are central to developing effective, personalized learning tools. Here, we show how a state-of-the-art ML model can be enhanced by incorporating insights from a cognitive model of human memory. This was done by exploiting the predictive performance equation's (PPE) narrow but highly specialized domain knowledge with regard to the temporal dynamics of learning and forgetting. Specifically, the PPE was used to engineer timing-related input features for a gradient-boosted decision trees (GBDT) model. The resulting PPE-enhanced GBDT outperformed the default GBDT, especially under conditions in which limited data were available for training. Results suggest that integrating cognitive and ML models could be particularly productive if the available data are too high-dimensional to be explained by a cognitive model but not sufficiently large to effectively train a modern ML algorithm. Here, the cognitive model's insights pertaining to only one aspect of the data were enough to jump-start the ML model's ability to make predictions, a finding that holds promise for future explorations.
|
23
|
Time series-based PM 2.5 concentration prediction in Jing-Jin-Ji area using machine learning algorithm models. Heliyon 2022; 8:e10691. [PMID: 36185154 PMCID: PMC9519508 DOI: 10.1016/j.heliyon.2022.e10691] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/10/2022] [Revised: 07/24/2022] [Accepted: 09/14/2022] [Indexed: 12/03/2022] Open
Abstract
Globally, all countries encounter air pollution problems along their development path. As a significant indicator of air quality, PM2.5 concentration has long been proven to affect population death rates. Machine learning algorithms, proven to outperform traditional statistical approaches, are widely used in air pollution prediction. However, research on model selection and on the environmental interpretation of model prediction results is still scarce, and is urgently needed to guide policy making on air pollution control. Our research compared four types of machine learning algorithms (LinearSVR, K-Nearest Neighbors, Lasso regression, and Gradient Boosting) by examining their performance in predicting PM2.5 concentrations across different cities and seasons. The results show that the machine learning models are able to forecast the next day's PM2.5 concentration from the previous five days' data with good accuracy. The comparative experiments show that, at the city level, the Gradient Boosting model has the best prediction performance, with a mean absolute error (MAE) of 9 µg/m³ and root mean square error (RMSE) of 10.25–16.76 µg/m³, lower than the other three models; at the season level, all four models perform best in winter and worst in summer. More importantly, the demonstration of the models' differing performance across cities and seasons carries significant environmental policy implications.
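The setup this abstract describes, predicting the next day's PM2.5 from the previous five days' values, amounts to building lag features and fitting a regressor. A hedged sketch on a synthetic seasonal series (not the Jing-Jin-Ji data), with a chronological train/test split as is usual for time series:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error

rng = np.random.default_rng(2)
# Synthetic daily PM2.5-like series with weekly seasonality plus noise.
days = np.arange(600)
series = 60 + 20 * np.sin(2 * np.pi * days / 7) + rng.normal(scale=5, size=600)

LAGS = 5
# Each row holds the previous five days; the target is the next day's value.
X = np.column_stack([series[i:len(series) - LAGS + i] for i in range(LAGS)])
y = series[LAGS:]

# Chronological split: train on the past, evaluate on the future.
split = 500
model = GradientBoostingRegressor(random_state=0).fit(X[:split], y[:split])
mae = mean_absolute_error(y[split:], model.predict(X[split:]))
```

A chronological split matters here: random shuffling would leak near-duplicate neighboring days into the test set and overstate accuracy.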
|
24
|
Machine learning models in the prediction of 1-year mortality in patients with advanced hepatocellular cancer on immunotherapy: a proof-of-concept study. Hepatol Int 2022; 16:879-891. [PMID: 35779202 DOI: 10.1007/s12072-022-10370-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Submit a Manuscript] [Subscribe] [Scholar Register] [Received: 12/02/2021] [Accepted: 05/22/2022] [Indexed: 11/28/2022]
Abstract
INTRODUCTION Immunotherapy is a promising new treatment for patients with advanced hepatocellular carcinoma (HCC), but it is costly and potentially associated with considerable side effects. This study aimed to evaluate the role of machine learning (ML) models in predicting 1-year cancer-related mortality in advanced HCC patients treated with immunotherapy. METHOD 395 HCC patients who had received immunotherapy (including nivolumab, pembrolizumab or ipilimumab) between 2014 and 2019 in Hong Kong were included. The data set was randomly divided into a training set (n = 316) and an internal validation set (n = 79). The data set, including 47 clinical variables, was used to construct six different ML models for predicting the risk of 1-year mortality. The performance of the ML models was measured by the area under the receiver operating characteristic curve (AUC) and compared with the C-Reactive protein and Alpha Fetoprotein in ImmunoTherapY (CRAFITY) score and the albumin-bilirubin (ALBI) score. The ML models were further validated with an external cohort from 2020 to 2021. RESULTS The 1-year cancer-related mortality was 51.1%. Of the six ML models, random forest (RF) had the highest AUC at 0.92 (95% CI 0.87-0.98), better than logistic regression (0.82, p = 0.01) as well as the CRAFITY (0.68, p < 0.01) and ALBI scores (0.84, p = 0.04). RF had the lowest false positive (2.0%) and false negative (5.2%) rates, and performed better than the CRAFITY score in the external validation cohort (0.91 vs 0.66, p < 0.01). High baseline AFP, bilirubin and alkaline phosphatase were three common risk factors identified by all ML models. CONCLUSION ML models could predict 1-year cancer-related mortality in HCC patients treated with immunotherapy, which may help select patients who would benefit from this treatment.
|
25
|
Effectiveness of resampling methods in coping with imbalanced crash data: Crash type analysis and predictive modeling. ACCIDENT; ANALYSIS AND PREVENTION 2021; 159:106240. [PMID: 34144225 DOI: 10.1016/j.aap.2021.106240] [Citation(s) in RCA: 15] [Impact Index Per Article: 5.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Received: 11/11/2020] [Revised: 05/31/2021] [Accepted: 06/02/2021] [Indexed: 06/12/2023]
Abstract
Crash data analysis is commonly subjected to imbalanced data. Varied by facility and control types, some crash types are more frequent than others. However, uncommon crash types are routinely more severe and associated with higher economic and societal costs, and thus crucial to prevent. It is paramount to develop inferential models that can reliably predict crash types and identify attributing factors, especially for the severe types. The current process of modeling towards infrequent events generally disregards disparity in data representation, which can lead to biased models. Therefore, mitigating and managing imbalanced data is essential to the development of meaningful and robust models that help reveal effective countermeasures. This study focuses on comparing the effects of resampling techniques on the performance of both machine learning and classical statistical models for classifying and predicting different crash types on freeways. Specifically, a mixed sampling approach featuring a cluster-based under-sampling coupled with three popular over-sampling methods (i.e., random over-sampling, synthetic minority over-sampling, and adaptive synthetic sampling) were investigated with respect to four crash classification models, including three ensemble machine learning models (CatBoost, XGBoost, and Random Forests) and one classic statistical model (Nested Logit). This study concluded that all three resampling methods consistently enhanced the performance of all models. Among the three over-sampling methods, the adaptive synthetic sampling approach performed best and tremendously improved the prediction of minority crash types without impeding the prediction of the majority crash type. This is likely due to the density-based approach of adaptive synthetic sampling in creating synthetic instances that are more congruent with the underlying manifold structure embodied in the high-dimensional feature space.
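Of the over-sampling methods this study compares, random over-sampling is the simplest to sketch without extra dependencies (SMOTE and ADASYN interpolate synthetic points near minority instances instead of duplicating them). A purely illustrative example on synthetic imbalanced data:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(3)
# Imbalanced synthetic "crash type" data: 95% majority class, 5% minority.
X_maj = rng.normal(0, 1, size=(475, 4))
X_min = rng.normal(2, 1, size=(25, 4))
X = np.vstack([X_maj, X_min])
y = np.array([0] * 475 + [1] * 25)

# Random over-sampling: duplicate minority rows until the classes are balanced.
idx_min = np.flatnonzero(y == 1)
resample = rng.choice(idx_min, size=475 - 25, replace=True)
X_bal = np.vstack([X, X[resample]])
y_bal = np.concatenate([y, y[resample]])

clf = RandomForestClassifier(random_state=0).fit(X_bal, y_bal)
counts = np.bincount(y_bal)  # class counts after balancing
```

Only the training set should ever be resampled; evaluating on resampled data would inflate minority-class metrics, which is part of why the study's comparison protocol matters.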
|
26
|
Dairy management practices associated with multi-drug resistant fecal commensals and Salmonella in cull cows: a machine learning approach. PeerJ 2021; 9:e11732. [PMID: 34316397 PMCID: PMC8288115 DOI: 10.7717/peerj.11732] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/13/2021] [Accepted: 06/16/2021] [Indexed: 11/20/2022] Open
Abstract
Background Understanding the effects of herd management practices on the prevalence of multidrug-resistant pathogenic Salmonella and commensals Enterococcus spp. and Escherichia coli in dairy cattle is key in reducing antibacterial resistant infections in humans originating from food animals. Our objective was to explore the herd and cow level features associated with the multi-drug resistant, and resistance phenotypes shared between Salmonella, E. coli and Enterococcus spp. using machine learning algorithms. Methods Randomly collected fecal samples from cull dairy cows from six dairy farms in central California were tested for multi-drug resistance phenotypes of Salmonella, E. coli and Enterococcus spp. Using data on herd management practices collected from a questionnaire, we built three machine learning algorithms (decision tree classifier, random forest, and gradient boosting decision trees) to predict the cows shedding multidrug-resistant Salmonella and commensal bacteria. Results The decision tree classifier identified rolling herd average milk production as an important feature for predicting fecal shedding of multi-drug resistance in Salmonella or commensal bacteria. The number of culled animals, monthly culling frequency and percentage, herd size, and proportion of Holstein cows in the herd were found to be influential herd characteristics predicting fecal shedding of multidrug-resistant phenotypes based on random forest models for Salmonella and commensal bacteria. Gradient boosting models showed that higher culling frequency and monthly culling percentages were associated with fecal shedding of multidrug resistant Salmonella or commensal bacteria. In contrast, an overall increase in the number of culled animals on a culling day showed a negative trend with classifying a cow as shedding multidrug-resistant bacteria. 
Increasing rolling herd average milk production and spring season were positively associated with fecal shedding of multidrug-resistant Salmonella. Only six individual cows were detected sharing tetracycline resistance phenotypes between Salmonella and either of the commensal bacteria. Discussion Percent culled and culling rate reflect the increase in culling over time, adjusting for herd size, and were associated with shedding multidrug-resistant bacteria. In contrast, the number culled was negatively associated with shedding multidrug-resistant bacteria, which may reflect producer decisions to prioritize the culling of otherwise healthy but low-producing cows based on milk or beef prices (with respect to dairy beef), among other factors. Using a data-driven suite of machine learning algorithms, we identified generalizable and distinct associations between antimicrobial resistance in Salmonella and fecal commensal bacteria that can help develop a producer-friendly, data-informed risk assessment tool to reduce shedding of multidrug-resistant bacteria in cull dairy cows.
|
27
|
Economic Policy Uncertainty Index Meets Ensemble Learning. COMPUTATIONAL ECONOMICS 2021; 60:401-437. [PMID: 34305322 PMCID: PMC8280278 DOI: 10.1007/s10614-021-10153-2] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Figures] [Subscribe] [Scholar Register] [Accepted: 07/06/2021] [Indexed: 06/13/2023]
Abstract
We utilize a battery of ensemble learning techniques [ensemble linear regression (LM), random forest], as well as two gradient boosting techniques [Gradient Boosting Decision Tree and Extreme Gradient Boosting (XGBoost)] to scrutinize the possibilities of enhancing the predictive accuracy of Economic Policy Uncertainty (EPU) index. Applied to a data-rich environment of the Newsbank media database, our LM and XGBoost assessments mostly outperform the other two ensemble learning procedures, as well as the original EPU index. Our LM and XGBoost estimates bring EPU closer to the stylized facts of uncertainty than other uncertainty estimates. LM and XGBoost indicators are more countercyclical and have more pronounced leading properties. We find that EPU is more strongly correlated to financial volatility measures than to consumers' assessments of uncertainty. This corroborates that the media place a much higher weight on the financial sector than on the economic issues of consumers. Further on, we considerably widen the scope of search terms included in the calculation of EPU index. Using ensemble learning techniques on such a rich set of keywords, we mostly manage to outperform the standard EPU in terms of correlation with standard uncertainty proxies. We also find that the predictive accuracy of EPU index can be considerably increased using a more diversified set of uncertainty-related terms than the original EPU framework. Our estimates perform much better in a monthly setting (targeting the industrial production growth) than targeting quarterly GDP growth. This speaks in favor of uncertainty as a purely short-term phenomenon.
|
28
|
2018 Survey of factors associated with antimicrobial drug use and stewardship practices in adult cows on conventional California dairies: immediate post-Senate Bill 27 impact. PeerJ 2021; 9:e11596. [PMID: 34306825 PMCID: PMC8284309 DOI: 10.7717/peerj.11596] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/02/2020] [Accepted: 05/21/2021] [Indexed: 12/27/2022] Open
Abstract
BACKGROUND Antimicrobial drugs (AMD) are critical for the treatment, control, and prevention of diseases in humans and food animals. Good AMD stewardship practices and judicious use of AMD help preserve animal and human health against the threat of antimicrobial resistance. This study reports on changes in AMD use and stewardship practices on California (CA) dairies following the implementation of CA Senate Bill 27 (SB 27; codified as Food and Agricultural Code, FAC 14400-14408; hereafter referred to as SB 27), by modeling the associations between management practices on CA conventional dairies and seven outcome variables relating to AMD use and stewardship practices following SB 27. METHODS A survey questionnaire was mailed to 1,282 grade A licensed dairies in CA in spring of 2018. Responses from 132 conventional dairies in 16 counties were included in the analyses. Multivariate logistic regression models were specified to explore the associations between survey factors and six outcome variables: producers' familiarity with the Food and Drug Administration's (FDA; Silver Spring, MD, USA) medically important antimicrobial drugs (MIAD) term; change in over-the-counter (OTC) AMD use; initiation or increased use of alternatives to AMD; changes to prevent disease outbreaks; changes in AMD costs; and better animal health post SB 27. We employed machine learning classification models to determine which survey factors were the most important predictors of good-to-excellent AMD stewardship practices among CA conventional dairy producers. RESULTS Having a valid veterinary-client-patient relationship, involving a veterinarian in training employees on treatment protocols and in decisions on AMDs used to treat sick cows, tracking milk and/or meat withdrawal intervals for treated cows, and participating in dairy quality assurance programs were positively associated with producers' familiarity with MIADs.
Use or increased use of alternatives to AMDs since 2018 was associated with decreased use of AMDs that were previously available OTC prior to SB 27. Important variables associated with good-excellent AMD stewardship knowledge by CA conventional dairy producers included having written or computerized animal health protocols, keeping a drug inventory log, awareness that use of MIADs required a prescription following implementation of SB 27, involving a veterinarian in AMD treatment duration determination, and using selective dry cow treatment. CONCLUSIONS Our study identified management factors associated with reported AMD use and antimicrobial stewardship practices on conventional dairies in CA within a year from implementation of SB 27. Producers will benefit from extension outreach efforts that incorporate the findings of this survey by further highlighting the significance of these management practices and encouraging those that are associated with judicious AMD use and stewardship practices on CA conventional dairies.
|
29
|
Machine learning algorithm for characterizing risks of hypertension, at an early stage in Bangladesh. Diabetes Metab Syndr 2021; 15:877-884. [PMID: 33892404 DOI: 10.1016/j.dsx.2021.03.035] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.7] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Submit a Manuscript] [Subscribe] [Scholar Register] [Received: 02/18/2021] [Revised: 03/24/2021] [Accepted: 03/31/2021] [Indexed: 12/30/2022]
Abstract
BACKGROUND AND AIMS Hypertension has become a major public health issue, as the prevalence of hypertension and its associated risk of premature death and disability among adults have increased globally. The main objective is to characterize the risk factors of hypertension among adults in Bangladesh using machine learning (ML) algorithms. MATERIALS AND METHODS The hypertension data were derived from the Bangladesh Demographic and Health Survey, 2017-18, which included 6965 people aged 35 and above. Two of the most promising risk factor identification methods, namely the least absolute shrinkage and selection operator (LASSO) and support vector machine recursive feature elimination (SVMRFE), were implemented to detect the critical risk factors of hypertension. Additionally, four well-known ML algorithms, namely artificial neural network, decision tree, random forest, and gradient boosting (GB), were used to predict hypertension. Performance was evaluated by accuracy, precision, recall, F-measure, and area under the curve (AUC). RESULTS The results show that age, BMI, wealth index, working status, and marital status for LASSO, and age, BMI, marital status, diabetes and region for SVMRFE, appear to be the top five risk factors for hypertension. Our findings reveal that the SVMRFE-GB combination gives the maximum accuracy (66.98%), recall (97.92%), F-measure (78.99%), and AUC (0.669) compared to the others. CONCLUSION The GB-based algorithm is confirmed as the best performer for early-stage prediction of hypertension in Bangladesh. This study therefore suggests that policymakers use the SVMRFE-GB combination for controlling hypertension, to save time and reduce costs for Bangladeshi adults.
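The select-then-predict pipeline this abstract describes (a LASSO-style sparse model to pick risk factors, then a classifier trained on the survivors) can be sketched as follows. The data are synthetic, not the Bangladesh survey, and L1-penalized logistic regression stands in for LASSO since the outcome is binary:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(4)
# 10 candidate risk factors; only the first three actually drive the outcome.
X = rng.normal(size=(500, 10))
y = (X[:, 0] + X[:, 1] - X[:, 2] + rng.normal(scale=0.5, size=500) > 0).astype(int)

# L1 penalty shrinks irrelevant coefficients exactly to zero.
lasso = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(X, y)
selected = np.flatnonzero(lasso.coef_[0] != 0)

# Gradient boosting trained on the selected risk factors only.
gb = GradientBoostingClassifier(random_state=0).fit(X[:, selected], y)
acc = gb.score(X[:, selected], y)
```

Restricting the classifier to the selected factors mirrors the study's rationale: fewer variables to collect makes the screening tool cheaper to deploy.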
|
30
|
Machine learning transition temperatures from 2D structure. J Mol Graph Model 2021; 105:107848. [PMID: 33667863 DOI: 10.1016/j.jmgm.2021.107848] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/17/2020] [Revised: 01/11/2021] [Accepted: 01/19/2021] [Indexed: 10/22/2022]
Abstract
A priori knowledge of physicochemical properties such as melting and boiling could expedite materials discovery. However, theoretical modeling from first principles poses a challenge for efficient virtual screening of potential candidates. As an alternative, the tools of data science are becoming increasingly important for exploring chemical datasets and predicting material properties. Herein, we extend a molecular representation, or set of descriptors, first developed for quantitative structure-property relationship modeling by Yalkowsky and coworkers known as the Unified Physicochemical Property Estimation Relationships (UPPER). This molecular representation has group-constitutive and geometrical descriptors that map to enthalpy and entropy; two thermodynamic quantities that drive thermal phase transitions. We extend the UPPER representation to include additional information about sp2-bonded fragments. Additionally, instead of using the UPPER descriptors in a series of thermodynamically-inspired calculations, as per Yalkowsky, we use the descriptors to construct a vector representation for use with machine learning techniques. The concise and easy-to-compute representation, combined with a gradient-boosting decision tree model, provides an appealing framework for predicting experimental transition temperatures in a diverse chemical space. An application to energetic materials shows that the method is predictive, despite a relatively modest energetics reference dataset. We also report competitive results on diverse public datasets of melting points (i.e., OCHEM, Enamine, Bradley, and Bergström) comprised of over 47k structures. Open source software is available at https://github.com/USArmyResearchLab/ARL-UPPER.
|
31
|
Prediction modelling of COVID using machine learning methods from B-cell dataset. RESULTS IN PHYSICS 2021; 21:103813. [PMID: 33495725 PMCID: PMC7816944 DOI: 10.1016/j.rinp.2021.103813] [Citation(s) in RCA: 26] [Impact Index Per Article: 8.7] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Subscribe] [Scholar Register] [Received: 11/09/2020] [Revised: 12/25/2020] [Accepted: 12/30/2020] [Indexed: 05/03/2023]
Abstract
Coronavirus is a pandemic that has become a concern for the whole world. This disease has spread to a great extent and is expanding day by day. Coronavirus, now a worldwide disease, has caused more than 800,000 deaths worldwide. The foremost agents of the spread of coronavirus are SARS-CoV and SARS-CoV-2, which are part of the coronavirus family. Predicting which patients suffer from such pandemic diseases would therefore help where testing is inaccurate or infeasible within the available time. This paper mainly focuses on the prediction of SARS-CoV and SARS-CoV-2 using the B-cell dataset. The paper also proposes different ensemble learning strategies that proved beneficial when making predictions. The predictions are made using various machine learning models, such as SVM, Naïve Bayes, K-nearest neighbors, AdaBoost, Gradient Boosting, XGBoost, Random Forest, ensembles, and neural networks. The most accurate result was obtained using the proposed algorithm, with a 0.919 AUC score and 87.248% validation accuracy for predicting SARS-CoV, and a 0.923 AUC and 87.7934% validation accuracy for predicting the SARS-CoV-2 virus.
32
Filtering de novo indels in parent-offspring trios. BMC Bioinformatics 2020; 21:547. [PMID: 33323105 PMCID: PMC7739476 DOI: 10.1186/s12859-020-03900-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/29/2020] [Accepted: 11/19/2020] [Indexed: 12/02/2022] Open
Abstract
Background Identification of de novo indels from whole genome or exome sequencing data of parent-offspring trios is a challenging task in human disease studies and clinical practice. Existing computational approaches usually yield high false positive rates. Results In this study, we developed a gradient boosting approach for filtering de novo indels obtained by any computational approach. Applied to real genome sequencing data, our approach significantly reduced the false positive rate of de novo indel calls without significantly compromising sensitivity. Conclusions The software DNMFilter_Indel was written in a combination of Java and R and is freely available at https://github.com/yongzhuang/DNMFilter_Indel.
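The filtering idea, train a booster on labeled candidate calls and keep only calls scoring above a probability cutoff, might look like this. The features and the 0.5 cutoff are hypothetical stand-ins for the indel-call annotations the tool actually uses:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Synthetic candidate calls: 1 = true de novo indel, 0 = false positive.
# Real features could be read depth, allele balance, mapping quality, etc.
X, y = make_classification(n_samples=800, n_features=10, n_informative=5,
                           weights=[0.7, 0.3], random_state=2)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=2, stratify=y)

clf = GradientBoostingClassifier(random_state=2).fit(X_tr, y_tr)
proba = clf.predict_proba(X_te)[:, 1]

# Keep only candidates above the probability cutoff; lowering the cutoff
# trades fewer missed true indels for more retained false positives.
kept = proba >= 0.5
sensitivity = (y_te[kept] == 1).sum() / (y_te == 1).sum()
false_positive_rate = (y_te[kept] == 0).sum() / (y_te == 0).sum()
```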
33
Gradient boosting for Parkinson's disease diagnosis from voice recordings. BMC Med Inform Decis Mak 2020; 20:228. [PMID: 32933493 PMCID: PMC7493334 DOI: 10.1186/s12911-020-01250-7] [Citation(s) in RCA: 17] [Impact Index Per Article: 4.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/03/2020] [Accepted: 09/08/2020] [Indexed: 12/18/2022] Open
Abstract
Background Parkinson’s Disease (PD) is a clinically diagnosed neurodegenerative disorder that affects both motor and non-motor neural circuits. Speech deterioration (hypokinetic dysarthria) is a common symptom, which often presents early in the disease course. Machine learning can help movement disorders specialists improve their diagnostic accuracy using non-invasive and inexpensive voice recordings. Method We used the “Parkinson Dataset with Replicated Acoustic Features Data Set” from the UCI Machine Learning Repository. The dataset included 44 speech-test-based acoustic features from patients with PD and controls. We analyzed the data using various machine learning algorithms including Light and Extreme Gradient Boosting, Random Forest, Support Vector Machines, K-nearest neighbors, Least Absolute Shrinkage and Selection Operator Regression, as well as logistic regression. We also implemented a variable importance analysis to identify important variables for classifying patients with PD. Results The cohort included a total of 80 subjects: 40 patients with PD (55% men) and 40 controls (67.5% men). Disease duration was 5 years or less for all subjects, with a mean Unified Parkinson’s Disease Rating Scale (UPDRS) score of 19.6 (SD 8.1), and none were taking PD medication. The mean age for PD subjects and controls was 69.6 (SD 7.8) and 66.4 (SD 8.4), respectively. Our best-performing model used Light Gradient Boosting to provide an AUC of 0.951 with a 95% confidence interval of 0.946–0.955 in 4-fold cross-validation using only seven acoustic features. Conclusions Machine learning can accurately detect Parkinson’s disease using inexpensive and non-invasive voice recordings. Light Gradient Boosting outperformed the other machine learning algorithms. Such approaches could be used to inexpensively screen large patient populations for Parkinson’s disease.
34
LDNFSGB: prediction of long non-coding rna and disease association using network feature similarity and gradient boosting. BMC Bioinformatics 2020; 21:377. [PMID: 32883200 PMCID: PMC7469344 DOI: 10.1186/s12859-020-03721-0] [Citation(s) in RCA: 19] [Impact Index Per Article: 4.8] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/25/2020] [Accepted: 08/21/2020] [Indexed: 12/11/2022] Open
Abstract
BACKGROUND A large number of experimental studies show that the mutation and regulation of long non-coding RNAs (lncRNAs) are associated with various human diseases. Accurate prediction of lncRNA-disease associations can provide a new perspective for the diagnosis and treatment of diseases. The main function of many lncRNAs is still unclear and using traditional experiments to detect lncRNA-disease associations is time-consuming. RESULTS In this paper, we develop a novel and effective method for the prediction of lncRNA-disease associations using network feature similarity and gradient boosting (LDNFSGB). In LDNFSGB, we first construct a comprehensive feature vector to effectively extract the global and local information of lncRNAs and diseases through considering the disease semantic similarity (DISSS), the lncRNA function similarity (LNCFS), the lncRNA Gaussian interaction profile kernel similarity (LNCGS), the disease Gaussian interaction profile kernel similarity (DISGS), and the lncRNA-disease interaction (LNCDIS). Particularly, two methods are used to calculate the DISSS (LNCFS) for considering the local and global information of disease semantics (lncRNA functions) respectively. An autoencoder is then used to reduce the dimensionality of the feature vector to obtain the optimal feature parameter from the original feature set. Furthermore, we employ the gradient boosting algorithm to obtain the lncRNA-disease association prediction. CONCLUSIONS In this study, hold-out, leave-one-out cross-validation, and ten-fold cross-validation methods are implemented on three publicly available datasets to evaluate the performance of LDNFSGB. Extensive experiments show that LDNFSGB dramatically outperforms other state-of-the-art methods. The case studies on six diseases, including cancers and non-cancers, further demonstrate the effectiveness of our method in real-world applications.
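The LDNFSGB pipeline (concatenated similarity features, dimensionality reduction, then gradient boosting) can be sketched as below. Note that PCA replaces the paper's autoencoder purely to keep the example dependency-free, and the feature matrix is synthetic rather than built from the DISSS/LNCFS/LNCGS/DISGS/LNCDIS similarities:

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

# Hypothetical concatenated feature vectors, one row per lncRNA-disease pair.
X, y = make_classification(n_samples=500, n_features=120, n_informative=10, random_state=4)

# Dimensionality reduction of the raw feature vector; the paper uses an
# autoencoder, PCA is used here only as a dependency-free stand-in.
X_reduced = PCA(n_components=20, random_state=4).fit_transform(X)

# Ten-fold cross-validation of the gradient boosting predictor, as in the paper.
scores = cross_val_score(GradientBoostingClassifier(random_state=4),
                         X_reduced, y, cv=10, scoring="roc_auc")
mean_auc = scores.mean()
```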
35
Using gradient boosting with stability selection on health insurance claims data to identify disease trajectories in chronic obstructive pulmonary disease. Stat Methods Med Res 2020; 29:3684-3694. [PMID: 32646307 DOI: 10.1177/0962280220938088] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/17/2022]
Abstract
OBJECTIVE We propose a data-driven method to detect temporal patterns of disease progression in high-dimensional claims data based on gradient boosting with stability selection. MATERIALS AND METHODS We identified patients with chronic obstructive pulmonary disease in a German health insurance claims database with 6.5 million individuals and divided them into a group of patients with the highest disease severity and a group of control patients with lower severity. We then used gradient boosting with stability selection to determine variables correlating with a chronic obstructive pulmonary disease diagnosis of highest severity and subsequently model the temporal progression of the disease using the selected variables. RESULTS We identified a network of 20 diagnoses (e.g. respiratory failure), medications (e.g. anticholinergic drugs) and procedures associated with a subsequent chronic obstructive pulmonary disease diagnosis of highest severity. Furthermore, the network successfully captured temporal patterns, such as disease progressions from lower to higher severity grades. DISCUSSION The temporal trajectories identified by our data-driven approach are compatible with existing knowledge about chronic obstructive pulmonary disease showing that the method can reliably select relevant variables in a high-dimensional context. CONCLUSION We provide a generalizable approach for the automatic detection of disease trajectories in claims data. This could help to diagnose diseases early, identify unknown risk factors and optimize treatment plans.
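Stability selection with gradient boosting can be approximated by refitting on random half-samples and keeping only variables that are repeatedly important. The subsample count, top-10 rule, and 60% threshold below are illustrative choices, not the paper's exact settings:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic claims-style data: many candidate variables, few truly relevant.
# With shuffle=False, the informative variables are columns 0-4.
X, y = make_classification(n_samples=400, n_features=50, n_informative=5,
                           n_redundant=0, shuffle=False, random_state=5)

rng = np.random.default_rng(5)
n_runs = 30
selection_counts = np.zeros(X.shape[1])

# Stability selection (sketch): refit on random half-samples and count how
# often each variable lands among the booster's top-10 importances.
for _ in range(n_runs):
    idx = rng.choice(len(X), size=len(X) // 2, replace=False)
    gb = GradientBoostingClassifier(n_estimators=50, random_state=0).fit(X[idx], y[idx])
    top10 = np.argsort(gb.feature_importances_)[-10:]
    selection_counts[top10] += 1

# Keep variables selected in at least 60% of the runs.
stable = np.flatnonzero(selection_counts / n_runs >= 0.6)
```

The point of the resampling is that noise variables occasionally look important in a single fit but rarely do so consistently, so thresholding the selection frequency controls spurious picks in a high-dimensional setting.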
36
Detection of medications associated with Alzheimer's disease using ensemble methods and cooperative game theory. Int J Med Inform 2020; 141:104142. [PMID: 32531724 DOI: 10.1016/j.ijmedinf.2020.104142] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/28/2019] [Revised: 11/22/2019] [Accepted: 04/05/2020] [Indexed: 11/27/2022]
Abstract
OBJECTIVE To study the feasibility of evaluating feature importance with Shapley values and ensemble methods in the context of pharmacoepidemiology and medication safety. METHODS We detected medications associated with Alzheimer's disease (AD) by examining additive feature attribution with a combined approach of gradient boosting and Shapley values in the Medication use and Alzheimer's disease (MEDALZ) study, a nested case-control study of 70,719 verified AD cases in Finland. Our methodological approach is to perform binary classification using gradient boosting (an ensemble of weak classifiers) in a supervised learning manner. We then apply Shapley values (from cooperative game theory) to analyze how feature combinations affect the classification result. Medication use within a five- to one-year time window before AD diagnosis was ascertained from the Prescription Register. RESULTS Antipsychotics at low or medium dose, antidepressants at medium to high dose, and cardiovascular medications at medium to high dose were identified as the contributing features for separating cases with AD from controls. A medium to high amount of irregularity in the purchase pattern was an indicative feature for separating AD cases from controls. The similarity of medication purchases between AD cases and controls made the feature evaluation challenging. CONCLUSIONS The combined approach of gradient boosting and feature evaluation with Shapley values identified features that were consistent with findings from previous hypothesis-driven studies. Additionally, the results of the additive feature attribution identified new candidates for future studies on AD risk factors. Our approach also shows promise for observational studies where feature identification and interactions in populations are of interest, and for using Shapley values to evaluate feature relevance in pattern recognition tasks.
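The additive feature attribution can be made concrete with an exact Shapley computation over a small feature set, using mean imputation of absent features as the value function (a simplification of what dedicated SHAP implementations do). All data and modeling choices here are synthetic:

```python
import itertools
import math
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Small synthetic case-control table (e.g. coded medication-use features).
X, y = make_classification(n_samples=300, n_features=4, n_informative=3,
                           n_redundant=0, random_state=6)
model = GradientBoostingClassifier(random_state=6).fit(X, y)
baseline = X.mean(axis=0)

def value(instance, coalition):
    """Model output with features outside the coalition set to their mean."""
    z = baseline.copy()
    z[list(coalition)] = instance[list(coalition)]
    return model.predict_proba(z.reshape(1, -1))[0, 1]

def shapley(instance):
    """Exact Shapley attribution: weighted marginal contributions over coalitions."""
    n = len(instance)
    phi = np.zeros(n)
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            for S in itertools.combinations(others, size):
                w = math.factorial(size) * math.factorial(n - size - 1) / math.factorial(n)
                phi[i] += w * (value(instance, S + (i,)) - value(instance, S))
    return phi

phi = shapley(X[0])
# Efficiency property: attributions sum to f(x) minus the baseline output.
total = value(X[0], tuple(range(4))) - value(X[0], ())
```

Exhaustive enumeration is only feasible for a handful of features (2^n coalitions); real studies rely on sampling or tree-specific algorithms for the same quantity.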
37
Aberrant posterior cingulate connectivity classify first-episode schizophrenia from controls: A machine learning study. Schizophr Res 2020; 220:187-193. [PMID: 32220502 DOI: 10.1016/j.schres.2020.03.022] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.8] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Submit a Manuscript] [Subscribe] [Scholar Register] [Received: 11/15/2019] [Revised: 02/23/2020] [Accepted: 03/10/2020] [Indexed: 02/08/2023]
Abstract
BACKGROUND The posterior cingulate cortex (PCC) is a key node of the default mode network (DMN). Aberrant PCC functional connectivity (FC) is implicated in schizophrenia, but the potential of PCC-related changes as a biological classifier of schizophrenia has not yet been evaluated. METHODS We conducted a data-driven approach using resting-state functional MRI data to explore differences in PCC-based region- and voxel-wise FC patterns that distinguish between patients with first-episode schizophrenia (FES) and demographically matched healthy controls (HC). Discriminative PCC FCs were selected via false discovery rate estimation. A gradient boosting classifier was trained and validated on 100 FES patients vs. 93 HC. Subsequently, the classification models were tested in an independent dataset of 87 FES patients and 80 HC using resting-state data acquired on a different MRI scanner. RESULTS Patients with FES had reduced connectivity between the PCC and frontal areas, left parahippocampal regions, left anterior cingulate cortex, and right inferior parietal lobule, but hyperconnectivity with left lateral temporal regions. Predictive voxel-wise clusters were similar to the region-wise selected brain areas functionally connected with the PCC when discriminating FES from HC. Region-wise analysis of FCs yielded a relatively high predictive level for schizophrenia, with an average accuracy of 72.28% in the independent samples, while selected voxel-wise connectivity yielded an accuracy of 68.72%. CONCLUSION FES exhibited a pattern of both increased and decreased PCC-based connectivity, dominated by hypoconnectivity between the PCC and brain areas associated with the DMN; this pattern may be a useful differentiating feature that reveals underpinnings of the neuropathophysiology of schizophrenia.
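The analysis pattern here, select discriminative connections by false-discovery-rate control and then train a gradient boosting classifier tested on held-out subjects, might be sketched like this on synthetic stand-in features:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_selection import SelectFdr, f_classif
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Synthetic stand-in for PCC functional-connectivity features (rows = subjects).
X, y = make_classification(n_samples=380, n_features=100, n_informative=12, random_state=7)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.45, random_state=7, stratify=y)

# Select discriminative connections by FDR control on univariate F-tests,
# then train the gradient boosting classifier on the surviving features.
clf = make_pipeline(SelectFdr(f_classif, alpha=0.05),
                    GradientBoostingClassifier(random_state=7))
clf.fit(X_tr, y_tr)
accuracy = clf.score(X_te, y_te)  # accuracy on the held-out "independent" sample
```

Wrapping the selection step in the pipeline matters: fitting the FDR filter only on training subjects avoids leaking test-set information into feature selection.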
38
A distributed multitask multimodal approach for the prediction of Alzheimer's disease in a longitudinal study. Neuroimage 2020; 206:116317. [PMID: 31678502 DOI: 10.1016/j.neuroimage.2019.116317] [Citation(s) in RCA: 19] [Impact Index Per Article: 4.8] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/27/2019] [Revised: 10/24/2019] [Accepted: 10/26/2019] [Indexed: 01/19/2023] Open
Abstract
Predicting the progression of Alzheimer's Disease (AD) has been held back for decades due to the lack of sufficient longitudinal data required for the development of novel machine learning algorithms. This study proposes a novel machine learning algorithm for predicting the progression of Alzheimer's disease using a distributed multimodal, multitask learning method. More specifically, each individual task is defined as a regression model, which predicts cognitive scores at a single time point. Since the prediction tasks for multiple intervals are related to each other in chronological order, multitask regression models have been developed to track the relationship between subsequent tasks. Furthermore, since subjects have various combinations of recording modalities together with other genetic, neuropsychological and demographic risk factors, special attention is given to the fact that each modality may experience a specific sparsity pattern. The model is hence generalized by exploiting multiple individual multitask regression coefficient matrices for each modality. The outcome for each independent modality-specific learner is then integrated with complementary information, known as risk factor parameters, revealing the most prevalent trends of the multimodal data. This new feature space is then used as input to the gradient boosting kernel in search for a more accurate prediction. This proposed model not only captures the complex relationships between the different feature representations, but it also ignores any unrelated information which might skew the regression coefficients. Comparative assessments are made between the performance of the proposed method and several other well-established methods using different multimodal platforms. The results indicate that capturing the interrelatedness between the different modalities and extracting only the relevant information in the data, even in an incomplete longitudinal dataset, yields minimized prediction errors.
39
Machine learning algorithms can classify outdoor terrain types during running using accelerometry data. Gait Posture 2019; 74:176-181. [PMID: 31539798 DOI: 10.1016/j.gaitpost.2019.09.005] [Citation(s) in RCA: 18] [Impact Index Per Article: 3.6] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Submit a Manuscript] [Subscribe] [Scholar Register] [Received: 06/20/2019] [Revised: 08/02/2019] [Accepted: 09/04/2019] [Indexed: 02/02/2023]
Abstract
BACKGROUND Running is a popular physical activity that benefits health; however, running surface characteristics may influence loading impact and injury risk. Machine learning algorithms could automatically identify running surface from wearable motion sensors to quantify running exposures, and perhaps loading and injury risk for a runner. RESEARCH QUESTION (1) How accurately can machine learning algorithms identify surface type from three-dimensional accelerometer sensors? (2) Does the sensor count (single or two-sensor setup) affect model accuracy? METHODS Twenty-nine healthy adults (23.3 ± 3.6 years, 1.8 ± 0.1 m, and 63.6 ± 8.5 kg) participated in this study. Participants ran on three different surfaces (concrete, synthetic, woodchip) while fitted with two three-dimensional accelerometers (lower-back and right tibia). Summary features (n = 208) were extracted from the accelerometer signals. Feature-based Gradient Boosting (GB) and signal-based deep learning Convolutional Neural Network (CNN) models were developed. Models were trained on 90% of the data and tested on the remaining 10%. The process was repeated five times, with data randomly shuffled between train-test splits, to quantify model performance variability. RESULTS All models and configurations achieved greater than 90% average accuracy. The highest performing models were the two-sensor GB and tibia-sensor CNN (average accuracy of 97.0 ± 0.7 and 96.1 ± 2.6%, respectively). SIGNIFICANCE Machine learning algorithms trained on running data from a single- or dual-sensor accelerometer setup can accurately distinguish between surface types. Automatic identification of surfaces encountered during running activities could help runners and coaches better monitor training load, improve performance, and reduce injury rates.
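A toy version of the feature-based pipeline: extract summary features from simulated accelerometer windows, then classify surface with gradient boosting over repeated shuffled train/test splits. The signal model and the four features are invented for illustration (the study extracted 208 features):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(8)

def window_features(sig):
    """Summary features from one accelerometer window (a tiny subset of 208)."""
    return [sig.mean(), sig.std(), np.abs(np.diff(sig)).mean(), sig.max() - sig.min()]

# Synthetic windows for three surfaces, each with a different stride
# amplitude and high-frequency vibration level.
X, y = [], []
for label, (amp, noise) in enumerate([(1.0, 0.1), (1.5, 0.3), (0.7, 0.6)]):
    for _ in range(100):
        t = np.linspace(0, 1, 128)
        sig = amp * np.sin(2 * np.pi * 3 * t) + rng.normal(0, noise, 128)
        X.append(window_features(sig))
        y.append(label)
X, y = np.array(X), np.array(y)

# Repeated shuffled 90/10 splits, as in the paper, to gauge variability.
accs = []
for seed in range(5):
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.1,
                                              random_state=seed, stratify=y)
    accs.append(GradientBoostingClassifier(random_state=0).fit(X_tr, y_tr).score(X_te, y_te))
mean_acc = float(np.mean(accs))
```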
40
An ensemble-based model of PM 2.5 concentration across the contiguous United States with high spatiotemporal resolution. ENVIRONMENT INTERNATIONAL 2019; 130:104909. [PMID: 31272018 PMCID: PMC7063579 DOI: 10.1016/j.envint.2019.104909] [Citation(s) in RCA: 255] [Impact Index Per Article: 51.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Received: 01/07/2019] [Revised: 06/03/2019] [Accepted: 06/06/2019] [Indexed: 05/17/2023]
Abstract
Various approaches have been proposed to model PM2.5 in the recent decade, with satellite-derived aerosol optical depth, land-use variables, chemical transport model predictions, and several meteorological variables as major predictor variables. Our study used an ensemble model that integrated multiple machine learning algorithms and predictor variables to estimate daily PM2.5 at a resolution of 1 km × 1 km across the contiguous United States. We used a generalized additive model that accounted for geographic difference to combine PM2.5 estimates from neural network, random forest, and gradient boosting. The three machine learning algorithms were based on multiple predictor variables, including satellite data, meteorological variables, land-use variables, elevation, chemical transport model predictions, several reanalysis datasets, and others. The model training results from 2000 to 2015 indicated good model performance with a 10-fold cross-validated R2 of 0.86 for daily PM2.5 predictions. For annual PM2.5 estimates, the cross-validated R2 was 0.89. Our model demonstrated good performance up to 60 μg/m3. Using the trained PM2.5 model and predictor variables, we predicted daily PM2.5 from 2000 to 2015 at every 1 km × 1 km grid cell in the contiguous United States. We also used localized land-use variables within 1 km × 1 km grids to downscale PM2.5 predictions to 100 m × 100 m grid cells. To characterize uncertainty, we used meteorological variables, land-use variables, and elevation to model the monthly standard deviation of the difference between daily monitored and predicted PM2.5 for every 1 km × 1 km grid cell. This PM2.5 prediction dataset, including the downscaled and uncertainty predictions, allows epidemiologists to accurately estimate the adverse health effects of PM2.5. Compared with the performance of the individual base learners, the ensemble model achieves a better overall estimation.
It is worth exploring other ensemble model formats to synthesize estimations from different models or from different groups to improve overall performance.
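The general ensemble idea here (combine neural network, random forest, and gradient boosting predictions with a second-stage model) corresponds to stacking. The sketch below uses a plain linear meta-learner where the paper uses a geographically varying generalized additive model, and the data are synthetic:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import (GradientBoostingRegressor, RandomForestRegressor,
                              StackingRegressor)
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor

# Synthetic stand-in for the PM2.5 training table (AOD, meteorology, land use, ...).
X, y = make_regression(n_samples=500, n_features=15, noise=5.0, random_state=9)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=9)

# Stacking: the meta-learner is fit on cross-validated base-model predictions,
# so it learns how to weight each learner without overfitting to them.
ensemble = StackingRegressor(
    estimators=[
        ("nn", MLPRegressor(hidden_layer_sizes=(32,), max_iter=2000, random_state=9)),
        ("rf", RandomForestRegressor(n_estimators=100, random_state=9)),
        ("gb", GradientBoostingRegressor(random_state=9)),
    ],
    final_estimator=LinearRegression(),
)
ensemble.fit(X_tr, y_tr)
r2 = ensemble.score(X_te, y_te)  # held-out R^2 of the combined model
```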
41
Predictive analytics with gradient boosting in clinical medicine. ANNALS OF TRANSLATIONAL MEDICINE 2019; 7:152. [PMID: 31157273 DOI: 10.21037/atm.2019.03.29] [Citation(s) in RCA: 78] [Impact Index Per Article: 15.6] [Reference Citation Analysis] [Abstract] [Key Words] [Subscribe] [Scholar Register] [Indexed: 11/06/2022]
Abstract
Predictive analytics play an important role in clinical research. An accurate predictive model can help clinicians stratify risk, thereby allowing the identification of a target population which might benefit from a certain intervention. Conventionally, predictive analytics is performed using parametric modeling, which comes with a number of assumptions. For example, generalized linear regression models require linearity and additivity to hold for the underlying data. However, these assumptions may not hold in practice. Especially in the era of big data, a large number of covariates or features can be extracted from an electronic database, which might have complex interactions and higher-order terms among the covariates. Conventional modeling methods have trouble capturing such high-dimensional relationships. However, some sophisticated machine learning techniques have been invented to handle this situation. Gradient boosting is one of these techniques: it recursively fits a weak learner to the residual so as to improve model performance with a gradually increasing number of iterations. It can automatically discover complex data structure, including nonlinearity and high-order interactions, even in the context of hundreds, thousands, or tens-of-thousands of potential predictors. This paper aims to introduce how gradient boosting works. The principles behind this learning machine are explained with a small example in a step-by-step manner. The formal implementation of gradient tree boosting is then illustrated with the caret package. In the simulated example, complexity of the data structure is created by generating certain interactions between the covariates. This example shows that gradient boosting can better capture these complex relationships than a generalized linear model-based approach.
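The residual-fitting loop described here can be written out directly for squared-error loss: start from the mean prediction and repeatedly fit a shallow tree to the current residuals (the negative gradient). This is a from-scratch sketch in Python rather than the paper's caret/R illustration:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(10)

# Nonlinear data with an interaction, the situation the paper highlights.
X = rng.uniform(-2, 2, size=(400, 2))
y = np.sin(X[:, 0]) * X[:, 1] + rng.normal(0, 0.1, 400)

# Gradient boosting for squared error, by hand.
learning_rate, trees = 0.1, []
f0 = y.mean()                       # initial constant prediction
pred = np.full(len(y), f0)
for _ in range(200):
    residual = y - pred             # negative gradient of 1/2 (y - pred)^2
    tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, residual)
    trees.append(tree)
    pred += learning_rate * tree.predict(X)

def boosted_predict(X_new):
    """Sum the shrunken contributions of all weak learners."""
    out = np.full(len(X_new), f0)
    for tree in trees:
        out += learning_rate * tree.predict(X_new)
    return out

mse = float(np.mean((boosted_predict(X) - y) ** 2))
```

Each depth-2 tree is a weak learner that can split on both covariates at once, which is how the ensemble gradually discovers the sin(x1)·x2 interaction that a purely additive linear model cannot represent.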
42
Development and validation of a novel prediction model to identify patients in need of specialized trauma care during field triage: design and rationale of the GOAT study. Diagn Progn Res 2019; 3:12. [PMID: 31245626 PMCID: PMC6584978 DOI: 10.1186/s41512-019-0058-5] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.8] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Submit a Manuscript] [Subscribe] [Scholar Register] [Received: 01/04/2019] [Accepted: 04/14/2019] [Indexed: 12/13/2022] Open
Abstract
BACKGROUND Adequate field triage of trauma patients is crucial to transport patients to the right hospital. Mistriage and subsequent interhospital transfers should be minimized to reduce avoidable mortality, life-long disabilities, and costs. Availability of a prehospital triage tool may help to identify patients in need of specialized trauma care and to determine the optimal transportation destination. METHODS The GOAT (Gradient Boosted Trauma Triage) study is a prospective, multi-site, cross-sectional diagnostic study. Patients transported by at least five ground Emergency Medical Services to any receiving hospital within the Netherlands are eligible for inclusion. The reference standards for the need of specialized trauma care are an Injury Severity Score ≥ 16 and early critical resource use, which will both be assessed by trauma registrars after the final diagnosis is made. Variable selection will be based on ease of use in practice and clinical expertise. A gradient boosting decision tree algorithm will be used to develop the prediction model. Model accuracy will be assessed in terms of discrimination (c-statistic) and calibration (intercept, slope, and plot) on individual participant data from each participating cluster (i.e., Emergency Medical Service) through internal-external cross-validation. A reference model will be externally validated on each cluster as well. The resulting model statistics will be investigated, compared, and summarized through an individual participant data meta-analysis. DISCUSSION The GOAT study protocol describes the development of a new prediction model for identifying patients in need of specialized trauma care. The aim is to attain acceptable undertriage rates and to minimize mortality rates and life-long disabilities.
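Internal-external cross-validation, as planned in this protocol, holds out one whole cluster (here, an EMS service) at a time and reports discrimination per held-out cluster. A sketch with synthetic data and the c-statistic (AUC) as the measure:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score

# Synthetic trauma registry: five EMS clusters sharing one outcome model.
X, y = make_classification(n_samples=1000, n_features=8, n_informative=5, random_state=11)
cluster = np.repeat(np.arange(5), 200)

# Internal-external cross-validation: train on four clusters, evaluate the
# c-statistic on the fifth, and rotate through all clusters.
c_stats = {}
for held_out in range(5):
    test = cluster == held_out
    clf = GradientBoostingClassifier(random_state=11).fit(X[~test], y[~test])
    c_stats[held_out] = roc_auc_score(y[test], clf.predict_proba(X[test])[:, 1])
mean_c = float(np.mean(list(c_stats.values())))
```

Unlike ordinary k-fold CV, this design tests geographic/organizational transportability: each evaluation is on a cluster the model never saw, which is what the protocol's individual participant data meta-analysis then summarizes.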
43
An open-source tool to identify active travel from hip-worn accelerometer, GPS and GIS data. Int J Behav Nutr Phys Act 2018; 15:91. [PMID: 30241483 PMCID: PMC6150970 DOI: 10.1186/s12966-018-0724-y] [Citation(s) in RCA: 15] [Impact Index Per Article: 2.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/24/2018] [Accepted: 09/07/2018] [Indexed: 11/22/2022] Open
Abstract
Background Increases in physical activity through active travel have the potential to have large beneficial effects on populations, through both better health outcomes and reduced motorized traffic. However, accurately identifying travel mode in large datasets is problematic. Here we provide an open source tool to quantify time spent stationary and in four travel modes (walking, cycling, train, motorised vehicle) from accelerometer-measured physical activity data, combined with GPS and GIS data. Methods The Examining Neighbourhood Activities in Built Living Environments in London study evaluates the effect of the built environment on health behaviours, including physical activity. Participants wore accelerometers and GPS receivers on the hip for 7 days. We time-matched accelerometer and GPS data, and then extracted data from the commutes of 326 adult participants, using stated commute times and modes, which were manually checked to confirm stated travel mode. This yielded examples of five travel modes: walking, cycling, motorised vehicle, train and stationary. We used this example data to train a gradient boosted tree, a form of supervised machine learning algorithm, on each data point (131,537 points), rather than on journeys. Accuracy during training was assessed using five-fold cross-validation. We also manually identified the travel behaviour of both 21 participants from ENABLE London (402,749 points), and 10 participants from a separate study (STAMP-2, 210,936 points), who were not included in the training data. We compared our predictions against this manual identification to further test accuracy and test generalisability. Results Applying the algorithm, we correctly identified travel mode 97.3% of the time in cross-validation (mean sensitivity 96.3%, mean active travel sensitivity 94.6%).
We showed 96.0% agreement between manual identification and prediction of 21 individuals’ travel modes (mean sensitivity 92.3%, mean active travel sensitivity 84.9%) and 96.5% agreement between the STAMP-2 study and predictions (mean sensitivity 85.5%, mean active travel sensitivity 78.9%). Conclusion We present a generalizable tool that identifies time spent stationary and time spent walking with very high precision, time spent in trains or vehicles with good precision, and time spent cycling with moderate precision. In studies where both accelerometer and GPS data are available, this tool complements analyses of physical activity, showing whether differences in physical activity may be explained by differences in travel mode. All code necessary to replicate, fit and predict to other datasets is provided to facilitate use by other researchers. Electronic supplementary material The online version of this article (10.1186/s12966-018-0724-y) contains supplementary material, which is available to authorized users.
44
Clinical prediction of HBV and HCV related hepatic fibrosis using machine learning. EBioMedicine 2018; 35:124-132. [PMID: 30100397 PMCID: PMC6154783 DOI: 10.1016/j.ebiom.2018.07.041] [Citation(s) in RCA: 47] [Impact Index Per Article: 7.8] [Reference Citation Analysis] [Abstract] [Key Words] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/22/2018] [Revised: 07/25/2018] [Accepted: 07/30/2018] [Indexed: 12/14/2022] Open
Abstract
Clinical prediction of advanced hepatic fibrosis (HF) and cirrhosis has long been challenging due to the gold standard, liver biopsy, being an invasive approach with certain limitations. Less invasive blood tests in tandem with cutting-edge machine learning algorithms show promising diagnostic potential. In this study, we constructed and compared machine learning methods with the FIB-4 score in a discovery dataset (n = 490) of hepatitis B virus (HBV) patients. Models were validated in an independent HBV dataset (n = 86). We further employed these models on two independent hepatitis C virus (HCV) datasets (n = 254 and 230) to examine their applicability. In the discovery data, gradient boosting (GB) stably outperformed other methods as well as FIB-4 scores (p < .001) in the prediction of advanced HF and cirrhosis. In the HBV validation dataset, for classification between early and advanced HF, the area under the receiver operating characteristic curve (AUROC) of the GB model was 0.918, while FIB-4 was 0.841; for classification between non-cirrhosis and cirrhosis, GB showed an AUROC of 0.871, while FIB-4 was 0.830. Additionally, GB-based prediction demonstrated good classification capacity on two HCV datasets, while higher cutoffs for both GB and FIB-4 scores were required to achieve comparable specificity and sensitivity. Using the same parameters as FIB-4, the GB-based prediction system demonstrated steady improvements relative to FIB-4 in HBV and HCV cohorts with different cutoff values required in different etiological groups. A user-friendly web tool, LiveBoost, makes our prediction models freely accessible for further clinical studies and applications.
45
Cyanotoxin level prediction in a reservoir using gradient boosted regression trees: a case study. ENVIRONMENTAL SCIENCE AND POLLUTION RESEARCH INTERNATIONAL 2018; 25:22658-22671. [PMID: 29846899 DOI: 10.1007/s11356-018-2219-4] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Received: 08/24/2017] [Accepted: 05/03/2018] [Indexed: 06/08/2023]
Abstract
Cyanotoxins are toxins produced by cyanobacteria; they are poisonous and pose a health threat in waters that could be used for drinking or recreational purposes. Thus, it is necessary to predict their presence to avoid risks. This paper presents a nonparametric machine learning approach using a gradient boosted regression tree model (GBRT) for prediction of cyanotoxin contents from cyanobacterial concentrations determined experimentally in a reservoir located in the north of Spain. GBRT models obtain good predictions in highly nonlinear problems, like the one treated here, where the studied variable presents low concentrations of cyanotoxins mixed with high concentration peaks. Two types of results were obtained: first, the model ranks the input variables according to their importance in the model; second, the high performance and the simplicity of the model make the gradient boosted tree method attractive compared with conventional forecasting techniques.
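The variable-ranking result can be reproduced in miniature: fit a gradient boosted regression tree to peaky synthetic data and rank predictors by impurity-based importance. The variable names and data-generating process are hypothetical, not the reservoir dataset:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(12)

# Hypothetical limnology-style predictors.
names = ["cyanobacteria", "temperature", "turbidity", "nitrate", "phosphate", "pH"]
X = rng.random((300, len(names)))

# Low baseline toxin levels punctuated by high peaks driven by one predictor,
# mimicking the "low concentrations mixed with high peaks" regime.
y = np.where(X[:, 0] > 0.8, 50 * X[:, 0], 0.5) + rng.normal(0, 0.1, 300)

gbrt = GradientBoostingRegressor(n_estimators=300, max_depth=3, random_state=12).fit(X, y)

# Rank the input variables by their contribution to the fitted model.
ranking = sorted(zip(names, gbrt.feature_importances_), key=lambda t: -t[1])
top_variable = ranking[0][0]
```

Tree ensembles handle this threshold-plus-peak structure naturally, since each split is itself a threshold, which is one reason GBRT suits highly nonlinear problems better than smooth parametric regressions.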
46
Supervised signal detection for adverse drug reactions in medication dispensing data. COMPUTER METHODS AND PROGRAMS IN BIOMEDICINE 2018; 161:25-38. [PMID: 29852965 DOI: 10.1016/j.cmpb.2018.03.021] [Citation(s) in RCA: 16] [Impact Index Per Article: 2.7] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Received: 10/24/2017] [Revised: 03/12/2018] [Accepted: 03/20/2018] [Indexed: 06/08/2023]
Abstract
MOTIVATION Adverse drug reactions (ADRs) are one of the leading causes of morbidity and mortality and thus should be detected early to reduce consequences on health outcomes. Medication dispensing data are comprehensive sources of information about medicine use that can be utilized for the signal detection of ADRs. Sequence symmetry analysis (SSA) has been employed in previous studies to detect signals of ADRs from medication dispensing data, but it has moderate sensitivity and tends to miss some ADR signals. With successful applications in various areas, supervised machine learning (SML) methods are promising for detecting ADR signals. Gold standards of known ADRs and non-ADRs from previous studies create opportunities to take additional domain knowledge into account to improve ADR signal detection with SML. OBJECTIVE We assess the utility of SML as a signal detection tool for ADRs in medication dispensing data, incorporating domain knowledge from DrugBank and MedDRA. We compare the best performing SML method with SSA. METHODS We model ADR signal detection as a supervised machine learning problem by linking medication dispensing data with domain knowledge bases. Suspected ADR signals are extracted from the Australian Pharmaceutical Benefits Scheme (PBS) medication dispensing data from 2013 to 2016. We construct predictive features for each signal candidate based on its occurrences in medication dispensing data as well as its pharmacological properties. Pharmaceutical knowledge bases including DrugBank and MedDRA are employed to provide pharmacological features for each signal candidate. Given a gold standard of known ADRs and non-ADRs, SML learns to differentiate between known ADRs and non-ADRs based on their combined predictive features from the linked sources, and then predicts whether a new case is a potential ADR signal.
RESULTS We evaluate the performance of six widely used SML methods against two gold standards of known ADRs and non-ADRs from previous studies. On average, the gradient boosting classifier achieves a sensitivity of 77%, a specificity of 81%, a positive predictive value of 76%, a negative predictive value of 82%, an area under the precision-recall curve of 81%, and an area under the receiver operating characteristic curve of 82%, most of which are higher than those of the other SML methods. In particular, the gradient boosting classifier achieves 21% higher sensitivity than SSA with comparable specificity. Furthermore, the gradient boosting classifier detects 10% more unknown potential ADR signals than SSA. CONCLUSIONS Our study demonstrates that the gradient boosting classifier is a promising supervised signal detection tool for ADRs in medication dispensing data, complementing SSA.
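The sensitivity/specificity/PPV figures reported above fall out of a confusion matrix on held-out candidates. A rough sketch of that evaluation loop follows; the synthetic feature matrix stands in for the (drug, reaction) candidate features built from PBS dispensing records and DrugBank/MedDRA, which are not reproduced here.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic stand-in for labeled (drug, reaction) signal candidates with
# combined dispensing and pharmacological features
X, y = make_classification(n_samples=1000, n_features=12, n_informative=6,
                           weights=[0.6, 0.4], random_state=2)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=2)

clf = GradientBoostingClassifier(random_state=2).fit(X_tr, y_tr)
tn, fp, fn, tp = confusion_matrix(y_te, clf.predict(X_te)).ravel()

sensitivity = tp / (tp + fn)   # recall on known ADRs
specificity = tn / (tn + fp)   # recall on known non-ADRs
ppv = tp / (tp + fp)           # positive predictive value
print(f"sensitivity={sensitivity:.2f} specificity={specificity:.2f} ppv={ppv:.2f}")
```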
|
47
|
Commercial truck crash injury severity analysis using gradient boosting data mining model. JOURNAL OF SAFETY RESEARCH 2018; 65:115-124. [PMID: 29776520 DOI: 10.1016/j.jsr.2018.03.002] [Citation(s) in RCA: 32] [Impact Index Per Article: 5.3] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Received: 09/20/2017] [Revised: 02/08/2018] [Accepted: 03/06/2018] [Indexed: 06/08/2023]
Abstract
INTRODUCTION Truck crashes contribute to a large number of injuries and fatalities. This study seeks to identify the contributing factors affecting truck crash severity using 2010 to 2016 North Dakota and Colorado crash data provided by the Federal Motor Carrier Safety Administration. METHOD To fill a gap in previous studies, a broad set of company and driver characteristics, such as company size and driver's license class, is examined alongside vehicle types and crash characteristics. Gradient boosting, a data mining technique, is applied to comprehensively analyze the relationship between crash severities and a set of heterogeneous risk factors. RESULTS Twenty-five variables were tested, and 22 of them were identified as significant contributors to injury severity; however, the top 11 variables account for more than 80% of the injury forecasting power. A relative variable importance analysis is conducted, and the marginal effects of all contributing factors are also illustrated. Several factors, such as trucking company attributes (e.g., company size), safety inspection values, trucking company commerce status (e.g., interstate or intrastate), time of day, driver's age, first harmful events, and registration condition, are found to be significantly associated with crash injury severity. Even though most of the identified contributing factors are significant for all four levels of crash severity, their relative importance and marginal effects differ. CONCLUSIONS For the first time, trucking company and driver characteristics are shown to have a significant impact on truck crash injury severity. Some of the results in this study reinforce previous studies' conclusions. PRACTICAL APPLICATIONS Findings in this study can help transportation agencies reduce injury severity and develop efficient strategies to improve safety.
|
48
|
A CBR framework with gradient boosting based feature selection for lung cancer subtype classification. Comput Biol Med 2017; 86:98-106. [PMID: 28527352 DOI: 10.1016/j.compbiomed.2017.05.010] [Citation(s) in RCA: 18] [Impact Index Per Article: 2.6] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/16/2017] [Revised: 05/10/2017] [Accepted: 05/10/2017] [Indexed: 11/19/2022]
Abstract
Molecular subtype classification represents a challenging field in lung cancer diagnosis. Although different methods have been proposed for biomarker selection, efficient discrimination between adenocarcinoma and squamous cell carcinoma in clinical practice presents several difficulties, especially when the latter is poorly differentiated. This is an area of growing importance, since certain treatments and other medical decisions are based on molecular and histological features. An urgent need exists for a system and a set of biomarkers that provide an accurate diagnosis. In this paper, a novel Case Based Reasoning framework with gradient boosting based feature selection is proposed and applied to the task of squamous cell carcinoma and adenocarcinoma discrimination, aiming to provide accurate diagnosis with a reduced set of genes. The proposed method was trained and evaluated on two independent datasets to validate its generalization capability. Furthermore, it achieved accuracy rates greater than those of traditional microarray analysis techniques, incorporating the advantages inherent to the Case Based Reasoning methodology (e.g. learning over time, adaptability, interpretability of solutions, etc.).
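The two-stage idea above (gradient boosting ranks genes, then a similarity-based retrieval step classifies by the most similar stored cases) can be approximated as follows. The expression matrix is synthetic, and a k-nearest-neighbors retrieval is used as a simplified stand-in for the paper's full Case Based Reasoning cycle.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic "expression matrix": many genes, few informative, two subtypes
X, y = make_classification(n_samples=300, n_features=200, n_informative=10,
                           n_redundant=0, random_state=4)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=4)

# Step 1: gradient boosting ranks the genes; keep only the top handful
gb = GradientBoostingClassifier(random_state=4).fit(X_tr, y_tr)
top = np.argsort(gb.feature_importances_)[::-1][:15]

# Step 2: k-NN retrieval stands in for the case-based reasoning step,
# classifying a query by its most similar stored cases in the reduced gene space
cbr = KNeighborsClassifier(n_neighbors=5).fit(X_tr[:, top], y_tr)
acc = cbr.score(X_te[:, top], y_te)
print(f"accuracy with {len(top)} selected genes: {acc:.2f}")
```

Restricting retrieval to the boosted-tree-selected genes is what keeps the final recommendation both accurate and interpretable with a small biomarker panel, which is the abstract's stated goal.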
|
49
|
SimBoost: a read-across approach for predicting drug-target binding affinities using gradient boosting machines. J Cheminform 2017; 9:24. [PMID: 29086119 PMCID: PMC5395521 DOI: 10.1186/s13321-017-0209-z] [Citation(s) in RCA: 142] [Impact Index Per Article: 20.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/03/2016] [Accepted: 03/30/2017] [Indexed: 02/06/2023] Open
Abstract
Computational prediction of the interaction between drugs and targets is a standing challenge in the field of drug discovery. A number of rather accurate predictions were reported for various binary drug–target benchmark datasets. However, a notable drawback of a binary representation of interaction data is that missing endpoints for non-interacting drug–target pairs are not differentiated from inactive cases, and that predicted levels of activity depend on pre-defined binarization thresholds. In this paper, we present a method called SimBoost that predicts continuous (non-binary) values of binding affinities of compounds and proteins and thus incorporates the whole interaction spectrum from true negative to true positive interactions. Additionally, we propose a version of the method called SimBoostQuant which computes a prediction interval in order to assess the confidence of the predicted affinity, thus defining the Applicability Domain metrics explicitly. We evaluate SimBoost and SimBoostQuant on two established drug–target interaction benchmark datasets and one new dataset that we propose to use as a benchmark for read-across cheminformatics applications. We demonstrate that our methods outperform the previously reported models across the studied datasets.
|
50
|
Patterns of waste generation: A gradient boosting model for short-term waste prediction in New York City. WASTE MANAGEMENT (NEW YORK, N.Y.) 2017; 62:3-11. [PMID: 28216080 DOI: 10.1016/j.wasman.2017.01.037] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.1] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Received: 10/10/2016] [Revised: 01/04/2017] [Accepted: 01/25/2017] [Indexed: 05/06/2023]
Abstract
Historical municipal solid waste (MSW) collection data supplied by the New York City Department of Sanitation (DSNY) was used in conjunction with other datasets related to New York City to forecast municipal solid waste generation across the city. Spatiotemporal tonnage data from the DSNY was combined with external datasets, including the Longitudinal Employer Household Dynamics data, the American Community Survey, the New York City Department of Finance's Primary Land Use and Tax Lot Output data, and historical weather data, to build a Gradient Boosting Regression Model. The model was trained on historical data from 2005 to 2011, and validation was performed both temporally and spatially. With this model, we are able to accurately (R2>0.88) forecast weekly MSW generation tonnages for each of the 232 geographic sections in NYC across the three waste streams of refuse, paper, and metal/glass/plastic. Importantly, the model identifies the regularity of urban waste generation and is also able to capture very short-timescale fluctuations associated with holidays, special events, seasonal variations, and weather-related events. This research shows New York City's waste generation trends and the importance of comprehensive data collection (especially weather patterns) in order to accurately predict waste generation.
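The temporal validation described above (fit on earlier years, score on a held-out later period) is the key design choice for a forecasting model. A compact sketch with a synthetic weekly tonnage series and simple calendar features follows; the real model also folds in census, land-use, and weather inputs that are omitted here.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import r2_score

rng = np.random.default_rng(6)
weeks = np.arange(364)
# Synthetic weekly tonnage for one district: trend + yearly seasonality + noise
tonnage = 100 + 0.02 * weeks + 10 * np.sin(2 * np.pi * weeks / 52) \
          + rng.normal(0, 2, 364)

# Simple calendar features (week-of-year and its sine/cosine encoding)
X = np.column_stack([weeks % 52,
                     np.sin(2 * np.pi * weeks / 52),
                     np.cos(2 * np.pi * weeks / 52)])

# Temporal validation: train on the earlier years, test on the final year
train, test = weeks < 312, weeks >= 312
gbr = GradientBoostingRegressor(n_estimators=300, learning_rate=0.05,
                                random_state=6)
gbr.fit(X[train], tonnage[train])
r2 = r2_score(tonnage[test], gbr.predict(X[test]))
print(f"out-of-time R^2: {r2:.2f}")
```

Splitting by time rather than at random is what makes the reported R2 an honest measure of forecasting skill, since randomly shuffled splits would leak future weeks into training.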
|