1
|
Zhao JO, Patel BK, Krishack P, Stutz MR, Pearson SD, Lin J, Lecompte-Osorio PA, Dugan KC, Kim S, Gras N, Pohlman A, Kress JP, Hall JB, Sperling AI, Adegunsoye A, Verhoef PA, Wolfe KS. Identification of Clinically Significant Cytokine Signature Clusters in Patients With Septic Shock. Crit Care Med 2023; 51:e253-e263. [PMID: 37678209 PMCID: PMC10840934 DOI: 10.1097/ccm.0000000000006032] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 09/09/2023]
Abstract
OBJECTIVES To identify cytokine signature clusters in patients with septic shock. DESIGN Prospective observational cohort study. SETTING Single academic center in the United States. PATIENTS Adult (≥ 18 yr old) patients admitted to the medical ICU with septic shock requiring vasoactive medication support. INTERVENTIONS None. MEASUREMENTS AND MAIN RESULTS One hundred fourteen patients with septic shock completed cytokine measurement at time of enrollment (t1) and 24 hours later (t2). Unsupervised random forest analysis of the change in cytokines over time, defined as delta (t2 - t1), identified three clusters with distinct cytokine profiles. Patients in cluster 1 had the lowest initial levels of circulating cytokines, which decreased over time. Patients in cluster 2 and cluster 3 had higher initial levels, which decreased over time in cluster 2 and increased in cluster 3. Patients in clusters 2 and 3 had higher mortality than patients in cluster 1 (cluster 1, 11%; cluster 2, 31%, odds ratio [OR], 3.56 [1.10-14.23]; cluster 3, 54%, OR, 9.23 [2.89-37.22]). Cluster 3 was independently associated with in-hospital mortality (hazard ratio, 5.24; p = 0.005) in multivariable analysis. There were no significant differences in initial clinical severity scoring or steroid use between the clusters. Analysis of either t1 or t2 cytokine measurements alone or in combination did not reveal clusters with clear clinical significance. CONCLUSIONS Longitudinal measurement of cytokine profiles at initiation of vasoactive medications and 24 hours later revealed three distinct cytokine signature clusters that correlated with clinical outcomes.
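The delta defined in this abstract (t2 minus t1 for each cytokine) can be sketched as below. This is only an illustration: the cytokine names and values are hypothetical, the function name `cytokine_deltas` is an assumption, and the study's actual preprocessing is not described in the abstract.

```python
def cytokine_deltas(t1, t2):
    """Return delta = t2 - t1 for every cytokine measured at both time points."""
    shared = t1.keys() & t2.keys()
    return {name: t2[name] - t1[name] for name in sorted(shared)}


# Hypothetical measurements for one patient (arbitrary units), illustration only.
t1 = {"IL-6": 120.0, "IL-10": 40.0, "TNF-alpha": 15.0}
t2 = {"IL-6": 80.0, "IL-10": 55.0, "TNF-alpha": 15.0}

deltas = cytokine_deltas(t1, t2)
print(deltas)
```

A negative delta means the level fell over the 24 hours and a positive delta means it rose. Per-patient delta vectors of this kind would be the input to whatever clustering step follows.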
Affiliation(s)
- Jack O Zhao: Pulmonary and Critical Care, University of Chicago Medical Center, Chicago, IL
- Bhakti K Patel: Pulmonary and Critical Care, University of Chicago Medical Center, Chicago, IL
- Paulette Krishack: Pulmonary and Critical Care, University of Chicago Medical Center, Chicago, IL
- Matthew R Stutz: Pulmonary and Critical Care, University of Chicago Medical Center, Chicago, IL
- Steven D Pearson: Pulmonary and Critical Care, University of Chicago Medical Center, Chicago, IL
- Julie Lin: Pulmonary Medicine, MD Anderson Cancer Center, The University of Texas, Houston, TX
- Seoyoen Kim: Pulmonary and Critical Care, University of Chicago Medical Center, Chicago, IL
- Nicole Gras: Pulmonary and Critical Care, University of Chicago Medical Center, Chicago, IL
- Anne Pohlman: Pulmonary and Critical Care, University of Chicago Medical Center, Chicago, IL
- John P Kress: Pulmonary and Critical Care, University of Chicago Medical Center, Chicago, IL
- Jesse B Hall: Pulmonary and Critical Care, University of Chicago Medical Center, Chicago, IL
- Anne I Sperling: Pulmonary & Critical Care, University of Virginia, Charlottesville, VA
- Ayodeji Adegunsoye: Pulmonary and Critical Care, University of Chicago Medical Center, Chicago, IL
- Philip A Verhoef: Critical Care Medicine, Hawaii Permanente Medical Group, Honolulu, HI
- Krysta S Wolfe: Pulmonary and Critical Care, University of Chicago Medical Center, Chicago, IL
|
2
|
A new method based on ensemble time series for fast and accurate clustering. Data Technologies and Applications 2023. [DOI: 10.1108/dta-08-2022-0300] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 03/18/2023]
Abstract
Purpose: The common methods for clustering time series are the use of specific distance criteria or the use of standard clustering algorithms. Ensemble clustering is one of the common techniques used in data mining to increase the accuracy of clustering. In this study, based on segmentation, selecting the best segments, and using ensemble clustering for selected segments, a multistep approach has been developed for the whole clustering of time series data. Design/methodology/approach: First, this approach divides the time series dataset into equal segments. In the next step, using one or more internal clustering criteria, the best segments are selected, and then the selected segments are combined for final clustering. By using a loop and how to select the best segments for the final clustering (using one criterion or several criteria simultaneously), two algorithms have been developed in different settings. A logarithmic relationship limits the number of segments created in the loop. Findings: According to Rand's external criteria and statistical tests, at first, the best setting of the two developed algorithms has been selected. Then this setting has been compared to different algorithms in the literature on clustering accuracy and execution time. The obtained results indicate more accuracy and less execution time for the proposed approach. Originality/value: This paper proposed a fast and accurate approach for time series clustering in three main steps. This is the first work that uses a combination of segmentation and ensemble clustering. More accuracy and less execution time are the remarkable achievements of this study.
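The first step described here, dividing a series into equal segments with a logarithmic cap on the segment count, might look like the sketch below. The abstract does not give the exact logarithmic relationship, so the cap `floor(log2(length))` and both function names are assumptions made for illustration, not the authors' algorithm.

```python
import math


def max_segments(series_length):
    """Assumed cap on the number of segments: floor(log2(length)), at least 1."""
    return max(1, int(math.log2(series_length)))


def split_equal_segments(series, n_segments):
    """Split a series into n_segments contiguous pieces of equal length.

    Trailing points that do not fill a whole segment are dropped.
    """
    seg_len = len(series) // n_segments
    if seg_len == 0:
        raise ValueError("more segments than data points")
    return [series[i * seg_len:(i + 1) * seg_len] for i in range(n_segments)]


series = [float(v) for v in range(16)]
k = max_segments(len(series))
segments = split_equal_segments(series, k)
print(k, segments)
```

Each segment would then be scored with an internal clustering criterion, and only the best-scoring segments would be passed to the ensemble step.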
|
3
|
Rudar J, Golding GB, Kremer SC, Hajibabaei M. Decision Tree Ensembles Utilizing Multivariate Splits Are Effective at Investigating Beta Diversity in Medically Relevant 16S Amplicon Sequencing Data. Microbiol Spectr 2023; 11:e0206522. [PMID: 36877086 PMCID: PMC10100742 DOI: 10.1128/spectrum.02065-22] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/07/2022] [Accepted: 02/11/2023] [Indexed: 03/07/2023] Open
Abstract
Developing an understanding of how microbial communities vary across conditions is an important analytical step. We used 16S rRNA data isolated from human stool samples to investigate whether learned dissimilarities, such as those produced using unsupervised decision tree ensembles, can be used to improve the analysis of the composition of bacterial communities in patients suffering from Crohn's disease and adenomas/colorectal cancers. We also introduce a workflow capable of learning dissimilarities, projecting them into a lower dimensional space, and identifying features that impact the location of samples in the projections. For example, when used with the centered log ratio transformation, our new workflow (TreeOrdination) could identify differences in the microbial communities of Crohn's disease patients and healthy controls. Further investigation of our models elucidated the global impact amplicon sequence variants (ASVs) had on the locations of samples in the projected space and how each ASV impacted individual samples in this space. Furthermore, this approach can be used to integrate patient data easily into the model and results in models that generalize well to unseen data. Models employing multivariate splits can improve the analysis of complex high-throughput sequencing data sets because they are better able to learn about the underlying structure of the data set. IMPORTANCE There is an ever-increasing level of interest in accurately modeling and understanding the roles that commensal organisms play in human health and disease. We show that learned representations can be used to create informative ordinations. We also demonstrate that the application of modern model introspection algorithms can be used to investigate and quantify the impacts of taxa in these ordinations, and that the taxa identified by these approaches have been associated with immune-mediated inflammatory diseases and colorectal cancer.
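The centered log ratio transformation mentioned in this abstract maps each count to its log minus the mean log across all features in the sample. A minimal sketch follows; the pseudocount of 0.5 used to handle zeros and the example counts are assumptions for illustration, not values taken from the paper.

```python
import math


def clr(counts, pseudocount=0.5):
    """Centered log-ratio: log of each part minus the mean log across parts.

    The pseudocount (an assumed value) keeps zero counts from producing log(0).
    """
    logs = [math.log(c + pseudocount) for c in counts]
    mean_log = sum(logs) / len(logs)
    return [v - mean_log for v in logs]


# Hypothetical ASV counts for one stool sample.
sample = [120, 30, 0, 850]
transformed = clr(sample)
print(transformed)
```

The transformed values of a sample always sum to zero, which removes the dependence on sequencing depth before any dissimilarity is learned.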
Affiliation(s)
- Josip Rudar: Department of Integrative Biology & Centre for Biodiversity Genomics, University of Guelph, Guelph, Ontario, Canada
- G. Brian Golding: Department of Biology, McMaster University, Hamilton, Ontario, Canada
- Stefan C. Kremer: School of Computer Science, University of Guelph, Guelph, Ontario, Canada
- Mehrdad Hajibabaei: Department of Integrative Biology & Centre for Biodiversity Genomics, University of Guelph, Guelph, Ontario, Canada
|
4
|
Lee J, Suttiratana SC, Sen I, Kong G. E-Cigarette Marketing on Social Media: A Scoping Review. Current Addiction Reports 2023. [DOI: 10.1007/s40429-022-00463-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/18/2023]
|
5
|
McCandlish JA, Ayer T, Chhatwal J. Cost-Effectiveness and Value-of-Information Analysis Using Machine Learning-Based Metamodeling: A Case of Hepatitis C Treatment. Med Decis Making 2023; 43:68-77. [PMID: 36113098 DOI: 10.1177/0272989x221125418] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/14/2022]
Abstract
BACKGROUND Metamodels can address some of the limitations of complex simulation models by formulating a mathematical relationship between input parameters and simulation model outcomes. Our objective was to develop and compare the performance of a machine learning (ML)-based metamodel against a conventional metamodeling approach in replicating the findings of a complex simulation model. METHODS We constructed 3 ML-based metamodels using random forest, support vector regression, and artificial neural networks and a linear regression-based metamodel from a previously validated microsimulation model of the natural history of hepatitis C virus (HCV) consisting of 40 input parameters. Outcomes of interest included societal costs and quality-adjusted life-years (QALYs), the incremental cost-effectiveness ratio (ICER) of HCV treatment versus no treatment, cost-effectiveness acceptability curve (CEAC), and expected value of perfect information (EVPI). We evaluated metamodel performance using root mean squared error (RMSE) and Pearson's R² on the normalized data. RESULTS The R² values for the linear regression metamodel for QALYs without treatment, QALYs with treatment, societal cost without treatment, societal cost with treatment, and ICER were 0.92, 0.98, 0.85, 0.92, and 0.60, respectively. The corresponding R² values for our ML-based metamodels were 0.96, 0.97, 0.90, 0.95, and 0.49 for support vector regression; 0.99, 0.83, 0.99, 0.99, and 0.82 for artificial neural network; and 0.99, 0.99, 0.99, 0.99, and 0.98 for random forest. Similar trends were observed for RMSE. The CEAC and EVPI curves produced by the random forest metamodel matched the results of the simulation output more closely than the linear regression metamodel. CONCLUSIONS ML-based metamodels generally outperformed traditional linear regression metamodels at replicating results from complex simulation models, with random forest metamodels performing best.
HIGHLIGHTS Decision-analytic models are frequently used by policy makers and other stakeholders to assess the impact of new medical technologies and interventions. However, complex models can impose limitations on conducting probabilistic sensitivity analysis and value-of-information analysis, and may not be suitable for developing online decision-support tools. Metamodels, which accurately formulate a mathematical relationship between input parameters and model outcomes, can replicate complex simulation models and address the above limitation. The machine learning-based random forest model can outperform linear regression in replicating the findings of a complex simulation model. Such a metamodel can be used for conducting cost-effectiveness and value-of-information analyses or developing online decision support tools.
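The two performance measures named in this abstract, RMSE and Pearson's R², can be computed as in the sketch below. The function names and the example numbers are hypothetical and stand in for normalized simulation outputs and metamodel predictions; they are not results from the study.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between simulation outputs and predictions."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def pearson_r2(actual, predicted):
    """Square of the Pearson correlation between the two sequences."""
    n = len(actual)
    mean_a = sum(actual) / n
    mean_p = sum(predicted) / n
    cov = sum((a - mean_a) * (p - mean_p) for a, p in zip(actual, predicted))
    var_a = sum((a - mean_a) ** 2 for a in actual)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return (cov * cov) / (var_a * var_p)


# Hypothetical normalized outcomes: simulation model vs. metamodel prediction.
simulated = [0.2, 0.4, 0.6, 0.8]
metamodel = [0.25, 0.35, 0.65, 0.75]

print(rmse(simulated, metamodel), pearson_r2(simulated, metamodel))
```

A metamodel that reproduced the simulation exactly would give an RMSE of 0 and an R² of 1, which is the benchmark the reported values are read against.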
Affiliation(s)
- Turgay Ayer: Georgia Institute of Technology, Atlanta, Georgia
- Jagpreet Chhatwal: Massachusetts General Hospital Institute for Technology Assessment, Boston, Massachusetts; Harvard Medical School, Boston, Massachusetts
|
6
|
Rudar J, Porter TM, Wright M, Golding GB, Hajibabaei M. LANDMark: an ensemble approach to the supervised selection of biomarkers in high-throughput sequencing data. BMC Bioinformatics 2022; 23:110. [PMID: 35361114 PMCID: PMC8969335 DOI: 10.1186/s12859-022-04631-z] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 08/12/2021] [Accepted: 03/07/2022] [Indexed: 11/10/2022] Open
Abstract
Background Identification of biomarkers, which are measurable characteristics of biological datasets, can be challenging. Although amplicon sequence variants (ASVs) can be considered potential biomarkers, identifying important ASVs in high-throughput sequencing datasets is challenging. Noise, algorithmic failures to account for specific distributional properties, and feature interactions can complicate the discovery of ASV biomarkers. In addition, these issues can impact the replicability of various models and elevate false-discovery rates. Contemporary machine learning approaches can be leveraged to address these issues. Ensembles of decision trees are particularly effective at classifying the types of data commonly generated in high-throughput sequencing (HTS) studies due to their robustness when the number of features in the training data is orders of magnitude larger than the number of samples. In addition, when combined with appropriate model introspection algorithms, machine learning algorithms can also be used to discover and select potential biomarkers. However, the construction of these models could introduce various biases which potentially obfuscate feature discovery. Results We developed a decision tree ensemble, LANDMark, which uses oblique and non-linear cuts at each node. In synthetic and toy tests LANDMark consistently ranked as the best classifier and often outperformed the Random Forest classifier. When trained on the full metabarcoding dataset obtained from Canada’s Wood Buffalo National Park, LANDMark was able to create highly predictive models and achieved an overall balanced accuracy score of 0.96 ± 0.06. The use of recursive feature elimination did not impact LANDMark’s generalization performance and, when trained on data from the BE amplicon, it was able to outperform the Linear Support Vector Machine, Logistic Regression models, and Stochastic Gradient Descent models (p ≤ 0.05). 
Finally, LANDMark distinguishes itself due to its ability to learn smoother non-linear decision boundaries. Conclusions Our work introduces LANDMark, a meta-classifier which blends the characteristics of several machine learning models into a decision tree and ensemble learning framework. To our knowledge, this is the first study to apply this type of ensemble approach to amplicon sequencing data and we have shown that analyzing these datasets using LANDMark can produce highly predictive and consistent models. Supplementary Information The online version contains supplementary material available at 10.1186/s12859-022-04631-z.
Affiliation(s)
- Josip Rudar: Department of Integrative Biology & Centre for Biodiversity Genomics, University of Guelph, 50 Stone Road East, Guelph, ON, N1G 2W1, Canada
- Teresita M Porter: Department of Integrative Biology & Centre for Biodiversity Genomics, University of Guelph, 50 Stone Road East, Guelph, ON, N1G 2W1, Canada
- Michael Wright: Department of Integrative Biology & Centre for Biodiversity Genomics, University of Guelph, 50 Stone Road East, Guelph, ON, N1G 2W1, Canada
- G Brian Golding: Department of Biology, McMaster University, 1280 Main St. West, Hamilton, ON, L8S 4K1, Canada
- Mehrdad Hajibabaei: Department of Integrative Biology & Centre for Biodiversity Genomics, University of Guelph, 50 Stone Road East, Guelph, ON, N1G 2W1, Canada
|
7
|
Lin Z, Laska E, Siegel C. A general iterative clustering algorithm. Stat Anal Data Min 2022; 15:433-446. [PMID: 36061078 PMCID: PMC9438941 DOI: 10.1002/sam.11573] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/24/2022]
Abstract
The quality of a cluster analysis of unlabeled units depends on the quality of the between units dissimilarity measures. Data‐dependent dissimilarity is more objective than data independent geometric measures such as Euclidean distance. As suggested by Breiman, many data driven approaches are based on decision tree ensembles, such as a random forest (RF), that produce a proximity matrix that can easily be transformed into a dissimilarity matrix. An RF can be obtained using labels that distinguish units with real data from units with synthetic data. The resulting dissimilarity matrix is input to a clustering program and units are assigned labels corresponding to cluster membership. We introduce a general iterative cluster (GIC) algorithm that improves the proximity matrix and clusters of the base RF. The cluster labels are used to grow a new RF yielding an updated proximity matrix, which is entered into the clustering program. The process is repeated until convergence. The same procedure can be used with many base procedures such as the extremely randomized tree ensemble. We evaluate the performance of the GIC algorithm using benchmark and simulated data sets. The properties measured by the Silhouette score are substantially superior to the base clustering algorithm. The GIC package has been released in R:
https://cran.r-project.org/web/packages/GIC/index.html.
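The proximity-to-dissimilarity step that this abstract builds on can be sketched as follows. The sketch assumes the leaf index that each tree assigns to each unit is already available (the `leaf_ids` input is hypothetical), and it uses 1 minus proximity as the dissimilarity, which is one common choice; the abstract does not state which transformation the GIC package applies.

```python
def proximity_matrix(leaf_ids):
    """Fraction of trees in which each pair of units falls in the same leaf.

    leaf_ids[t][i] is the leaf that tree t assigns to unit i.
    """
    n_trees = len(leaf_ids)
    n_units = len(leaf_ids[0])
    counts = [[0] * n_units for _ in range(n_units)]
    for leaves in leaf_ids:
        for i in range(n_units):
            for j in range(n_units):
                if leaves[i] == leaves[j]:
                    counts[i][j] += 1
    return [[c / n_trees for c in row] for row in counts]


def dissimilarity_matrix(prox):
    """Turn proximities into dissimilarities using 1 - proximity (assumed)."""
    return [[1.0 - p for p in row] for row in prox]


# Hypothetical leaf assignments: 4 trees, 4 units.
leaf_ids = [
    [0, 0, 1, 1],
    [0, 1, 1, 1],
    [2, 2, 2, 5],
    [0, 0, 1, 1],
]
prox = proximity_matrix(leaf_ids)
diss = dissimilarity_matrix(prox)
print(diss)
```

In the iterative scheme described above, the dissimilarity matrix would be clustered, the cluster labels would be used to grow a new forest, and the proximity matrix would be recomputed until the labels stop changing.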
Affiliation(s)
- Ziqiang Lin: Department of Psychiatry, New York University Langone School of Medicine, New York, NY, USA
- Eugene Laska: Department of Psychiatry, New York University Langone School of Medicine, New York, NY, USA; Department of Population Health, Division of Biostatistics, New York University Langone School of Medicine, New York, NY, USA; One Park Avenue, New York, NY 10016, USA
- Carole Siegel: Department of Psychiatry, New York University Langone School of Medicine, New York, NY, USA; Department of Population Health, Division of Biostatistics, New York University Langone School of Medicine, New York, NY, USA
|
8
|
Distance-based clustering challenges for unbiased benchmarking studies. Sci Rep 2021; 11:18988. [PMID: 34556686 PMCID: PMC8460803 DOI: 10.1038/s41598-021-98126-1] [Citation(s) in RCA: 7] [Impact Index Per Article: 2.3] [Received: 03/05/2021] [Accepted: 09/02/2021] [Indexed: 02/08/2023] Open
Abstract
Benchmark datasets with predefined cluster structures and high-dimensional biomedical datasets outline the challenges of cluster analysis: clustering algorithms are limited in their clustering ability in the presence of clusters defining distance-based structures resulting in a biased clustering solution. Data sets might not have cluster structures. Clustering yields arbitrary labels and often depends on the trial, leading to varying results. Moreover, recent research indicated that all partition comparison measures can yield the same results for different clustering solutions. Consequently, algorithm selection and parameter optimization by unsupervised quality measures (QM) are always biased and misleading. Only if the predefined structures happen to meet the particular clustering criterion and QM, can the clusters be recovered. Results are presented based on 41 open-source algorithms which are particularly useful in biomedical scenarios. Furthermore, comparative analysis with mirrored density plots provides a significantly more detailed benchmark than that with the typically used box plots or violin plots.
|