1
Huang AA, Huang SY. Dendrogram of transparent feature importance machine learning statistics to classify associations for heart failure: A reanalysis of a retrospective cohort study of the Medical Information Mart for Intensive Care III (MIMIC-III) database. PLoS One 2023; 18:e0288819. [PMID: 37471315 PMCID: PMC10358877 DOI: 10.1371/journal.pone.0288819] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/13/2023] [Accepted: 07/04/2023] [Indexed: 07/22/2023] Open
Abstract
BACKGROUND There is a continual push to develop accurate predictors of in-hospital mortality for heart failure (HF) patients admitted to the Intensive Care Unit (ICU). OBJECTIVE The study aimed to use transparent machine learning to create a hierarchical clustering of key predictors based on the model importance statistics gain, cover, and frequency. METHODS Patients in the MIMIC-III database who were in the ICU with HF and had complete information for in-hospital mortality were randomly divided into a training set (n = 941, 80%) and a test set (n = 235, 20%). A grid search was used to find hyperparameters. XGBoost was used to predict mortality, followed by feature importance analysis with Shapley Additive Explanations (SHAP) and hierarchical clustering of model metrics with a dendrogram and heat map. RESULTS Of the 1,176 heart failure ICU patients who met the inclusion criteria, 558 (47.5%) were male. The mean age was 74.05 years (SD = 12.85). The XGBoost model had an area under the receiver operating characteristic curve of 0.662. The highest overall SHAP explanations were for urine output, leukocytes, bicarbonate, and platelets. Average urine output was 1899.28 (SD = 1272.36) mL/day, with the hospital mortality group having 1345.97 (SD = 1136.58) mL/day and the group without hospital mortality having 1986.91 (SD = 1271.16) mL/day. The average leukocyte count in the cohort was 10.72 (SD = 5.23) cells per microliter. For the hospital mortality group the leukocyte count was 13.47 (SD = 7.42) cells per microliter, and for the group without hospital mortality it was 10.28 (SD = 4.66) cells per microliter. The average bicarbonate value was 26.91 (SD = 5.17) mEq/L. In the group with hospital mortality the average bicarbonate value was 24.00 (SD = 5.42) mEq/L, and in the group without hospital mortality it was 27.37 (SD = 4.98) mEq/L. The average platelet value was 241.52 platelets per microliter. For the group with hospital mortality the average platelet value was 216.21 platelets per microliter, and for the group without hospital mortality it was 245.47 platelets per microliter. Cluster 1 of the dendrogram grouped temperature, platelets, urine output, saturation of partial pressure of oxygen (SPO2), leukocyte count, lymphocyte count, bicarbonate, anion gap, respiratory rate, PCO2, BMI, and age as most similar in having the highest aggregate gain, cover, and frequency metrics. CONCLUSION Machine learning models that incorporate dendrograms and heat maps can offer additional summaries of model statistics for differentiating factors associated with in-patient ICU mortality in heart failure patients.
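The clustering step described in the abstract above can be illustrated with a minimal sketch. It assumes that each feature is represented by a (gain, cover, frequency) triple and that features are merged by average-linkage agglomerative clustering on Euclidean distance. The feature values, the linkage choice, and the helper names `average_linkage` and `agglomerate` are all hypothetical; none of them come from the study.

```python
from itertools import combinations
from math import dist

# Made-up (gain, cover, frequency) triples for illustration only.
# These are NOT values reported by the study.
importance = {
    "urine_output": (0.30, 0.25, 0.20),
    "leukocytes": (0.28, 0.22, 0.18),
    "bicarbonate": (0.25, 0.24, 0.19),
    "heart_rate": (0.05, 0.06, 0.08),
    "sodium": (0.04, 0.05, 0.07),
}


def average_linkage(a, b, points):
    """Mean Euclidean distance over all cross-cluster feature pairs."""
    total = sum(dist(points[i], points[j]) for i in a for j in b)
    return total / (len(a) * len(b))


def agglomerate(points, n_clusters):
    """Merge the closest pair of clusters until n_clusters remain.

    Returns (clusters, merges); merges lists (left, right, height) in the
    order a dendrogram would draw them, from the bottom up.
    """
    clusters = [(name,) for name in points]
    merges = []
    while len(clusters) > n_clusters:
        a, b = min(
            combinations(clusters, 2),
            key=lambda pair: average_linkage(pair[0], pair[1], points),
        )
        merges.append((a, b, average_linkage(a, b, points)))
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return clusters, merges


clusters, merges = agglomerate(importance, n_clusters=2)
for left, right, height in merges:
    print(f"merge {left} + {right} at height {height:.4f}")
print(clusters)
```

With these toy numbers the features with uniformly high statistics end up in one cluster and the low-importance features in the other, which is the kind of grouping a dendrogram over importance metrics is meant to expose.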
Affiliation(s)
- Alexander A. Huang
  - Department of MD Education, Northwestern University Feinberg School of Medicine, Chicago, IL, United States of America
- Samuel Y. Huang
  - Department of Internal Medicine, Virginia Commonwealth University School of Medicine, Richmond, VA, United States of America
2
Wei ZG, Zhang XD, Cao M, Liu F, Qian Y, Zhang SW. Comparison of Methods for Picking the Operational Taxonomic Units From Amplicon Sequences. Front Microbiol 2021; 12:644012. [PMID: 33841367 PMCID: PMC8024490 DOI: 10.3389/fmicb.2021.644012] [Citation(s) in RCA: 14] [Impact Index Per Article: 4.7] [Received: 12/19/2020] [Accepted: 02/17/2021] [Indexed: 12/31/2022] Open
Abstract
With the advent of next-generation sequencing technology, it has become convenient and cost-efficient to thoroughly characterize the microbial diversity and taxonomic composition of various environmental samples. Millions of sequences can be generated, and how to utilize this enormous sequence resource has become a critical concern for microbial ecologists. One particular challenge is picking operational taxonomic units (OTUs) in 16S rRNA sequence analysis. Luckily, this challenge can be directly addressed by sequence clustering, which attempts to group similar sequences. Numerous clustering methods have therefore been proposed to cluster 16S rRNA sequences into OTUs. However, each method has its own clustering mechanism, and different methods produce diverse outputs. Even a slight parameter change for the same method can generate distinct results, and choosing an appropriate method has become a challenge for inexperienced users. A lot of time and resources can be wasted in selecting clustering tools and analyzing the clustering results. In this study, we review recent advances in clustering methods for OTU picking, focusing mainly on three aspects: (i) the principles of existing clustering algorithms, (ii) benchmark dataset construction for OTU picking and evaluation metrics, and (iii) the performance of different methods with various distance thresholds on benchmark datasets. This paper aims to help biological researchers select reasonable clustering methods for analyzing their collected sequences and to help algorithm developers design more efficient sequence clustering methods.
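To make the sensitivity to the distance threshold concrete, here is a small sketch of one common family of approaches, greedy centroid-based clustering. It is an illustration under stated assumptions, not any specific tool reviewed in the paper: the reads are invented, the helper names `mismatch_fraction` and `greedy_otus` are hypothetical, and the distance is a simple per-position mismatch fraction on equal-length strings rather than the alignment-based distances real OTU pickers use.

```python
def mismatch_fraction(a: str, b: str) -> float:
    """Fraction of positions that differ between two equal-length sequences.

    A toy stand-in for an alignment-based distance.
    """
    if len(a) != len(b):
        raise ValueError("this toy distance needs equal-length sequences")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def greedy_otus(seqs, threshold):
    """Greedy clustering: each sequence joins the first centroid that lies
    within `threshold`; otherwise it becomes the centroid of a new OTU."""
    otus = {}  # centroid sequence -> list of member sequences
    for s in seqs:
        for centroid in otus:
            if mismatch_fraction(s, centroid) <= threshold:
                otus[centroid].append(s)
                break
        else:
            otus[s] = [s]
    return otus


# Invented 10-base reads, for illustration only.
reads = ["ACGTACGTAC", "ACGTACGTAA", "ACGTACGAAA", "TTGTCCGTAC", "TTGTCCGTAA"]

strict = greedy_otus(reads, threshold=0.10)
loose = greedy_otus(reads, threshold=0.20)
print(len(strict), "OTUs at threshold 0.10")
print(len(loose), "OTUs at threshold 0.20")
```

On these toy reads the third sequence is two mismatches away from the first centroid, so it seeds its own OTU at the stricter threshold and is absorbed at the looser one. The result also depends on input order, since the first sequence seen always becomes a centroid.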
Affiliation(s)
- Ze-Gang Wei
  - Institute of Physics and Optoelectronics Technology, Baoji University of Arts and Sciences, Baoji, China
  - Key Laboratory of Information Fusion Technology of Ministry of Education, School of Automation, Northwestern Polytechnical University, Xi’an, China
- Xiao-Dan Zhang
  - Institute of Physics and Optoelectronics Technology, Baoji University of Arts and Sciences, Baoji, China
- Ming Cao
  - Faculty of Electronic and Information Engineering, Xi’an Jiaotong University, Xi’an, China
  - School of Mathematics and Statistics, Shaanxi Xueqian Normal University, Xi’an, China
- Fei Liu
  - Institute of Physics and Optoelectronics Technology, Baoji University of Arts and Sciences, Baoji, China
- Yu Qian
  - Institute of Physics and Optoelectronics Technology, Baoji University of Arts and Sciences, Baoji, China
- Shao-Wu Zhang
  - Key Laboratory of Information Fusion Technology of Ministry of Education, School of Automation, Northwestern Polytechnical University, Xi’an, China
3
Zhang P, Liu C, Zheng X, Wu L, Liu Z, Liao B, Shi Y, Li X, Xu J, Chen S. Full-Length Multi-Barcoding: DNA Barcoding from Single Ingredient to Complex Mixtures. Genes (Basel) 2019; 10:E343. [PMID: 31067783 PMCID: PMC6562688 DOI: 10.3390/genes10050343] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Received: 04/02/2019] [Revised: 04/22/2019] [Accepted: 04/29/2019] [Indexed: 11/22/2022] Open
Abstract
DNA barcoding has been used for decades, although it has mostly been applied to single-species samples. For traditional Chinese medicine (TCM), which is mainly used in the form of combinations (one type of multi-species mixture), identification is crucial for clinical usage. Next-generation sequencing (NGS) has been used to address this authentication issue for the past few years, but conventional NGS technology is hampered in application by its short sequencing reads and systematic errors. Here, a novel method, full-length multi-barcoding (FLMB) via long-read sequencing, is employed for the identification of biological compositions in herbal compound formulas in adequate and well-controlled studies. By directly sequencing the full-length amplicons of ITS2 and psbA-trnH through single-molecule real-time (SMRT) technology, the biological composition of a classical prescription, Sheng-Mai-San (SMS), was analyzed. At the same time, clone-dependent Sanger sequencing was carried out as a parallel control. Further, another formula, Sanwei-Jili-San (SJS), was analyzed with the ITS2 and CO1 genes. All the ingredients in the samples of SMS and SJS were successfully authenticated at the species level, and 11 exogenous species were also detected, some of which were considered common contaminants in these products. Methodology analysis demonstrated that this method was sensitive, accurate and reliable. FLMB, a superior but feasible approach for the identification of biological complex mixtures, was established and elucidated, demonstrating an interpretation of DNA barcoding that could extend its application to multi-species mixtures.
Affiliation(s)
- Peng Zhang
  - School of Chinese Materia Medica, Beijing University of Chinese Medicine, Beijing 102488, China.
  - Key Laboratory of Beijing for Identification and Safety Evaluation of Chinese Medicine, Institute of Chinese Materia Medica, China Academy of Chinese Medical Sciences, Beijing 100700, China.
- Chunsheng Liu
  - School of Chinese Materia Medica, Beijing University of Chinese Medicine, Beijing 102488, China.
- Xiasheng Zheng
  - Guangdong Provincial Key Laboratory of New Drug Development and Research of Chinese Medicine, Guangzhou University of Chinese Medicine, Guangzhou 510006, China.
- Lan Wu
  - Key Laboratory of Beijing for Identification and Safety Evaluation of Chinese Medicine, Institute of Chinese Materia Medica, China Academy of Chinese Medical Sciences, Beijing 100700, China.
- Zhixiang Liu
  - Key Laboratory of Beijing for Identification and Safety Evaluation of Chinese Medicine, Institute of Chinese Materia Medica, China Academy of Chinese Medical Sciences, Beijing 100700, China.
- Baosheng Liao
  - Key Laboratory of Beijing for Identification and Safety Evaluation of Chinese Medicine, Institute of Chinese Materia Medica, China Academy of Chinese Medical Sciences, Beijing 100700, China.
- Yuhua Shi
  - Key Laboratory of Beijing for Identification and Safety Evaluation of Chinese Medicine, Institute of Chinese Materia Medica, China Academy of Chinese Medical Sciences, Beijing 100700, China.
- Xiwen Li
  - Key Laboratory of Beijing for Identification and Safety Evaluation of Chinese Medicine, Institute of Chinese Materia Medica, China Academy of Chinese Medical Sciences, Beijing 100700, China.
- Jiang Xu
  - Key Laboratory of Beijing for Identification and Safety Evaluation of Chinese Medicine, Institute of Chinese Materia Medica, China Academy of Chinese Medical Sciences, Beijing 100700, China.
- Shilin Chen
  - Key Laboratory of Beijing for Identification and Safety Evaluation of Chinese Medicine, Institute of Chinese Materia Medica, China Academy of Chinese Medical Sciences, Beijing 100700, China.
4
Dresch P, Falbesoner J, Ennemoser C, Hittorf M, Kuhnert R, Peintner U. Emerging from the ice-fungal communities are diverse and dynamic in earliest soil developmental stages of a receding glacier. Environ Microbiol 2019; 21:1864-1880. [PMID: 30888722 PMCID: PMC6849718 DOI: 10.1111/1462-2920.14598] [Citation(s) in RCA: 23] [Impact Index Per Article: 4.6] [Received: 11/09/2018] [Revised: 03/04/2019] [Accepted: 03/04/2019] [Indexed: 11/30/2022]
Abstract
We used amplicon sequencing and isolation of fungi from in-growth mesh bags to identify active fungi in the three earliest stages of soil development (SSD) at a glacier forefield (0-3, 9-14, 18-25 years after retreat of glacial ice). Soil organic matter and nutrient concentrations were extremely low, but the fungal diversity was high [220 operational taxonomic units (OTUs)/138 cultivated OTUs]. A clear successional trend was observed along SSDs, and species richness increased with time. Distinct changes in fungal community composition occurred with the advent of vascular plants. Fungal communities of recently deglaciated soil are the most distinctive and rather similar to communities typical of cryoconite or ice. This indicates that meltwater is an important inoculum for native soil. Moreover, distinct seasonal differences were detected in fungal communities. Some fungal taxa, especially of the class Microbotryomycetes, showed a clear preference for winter and early SSD. Our results provide insight into new facets of the ecology of fungal taxa, for example, by showing that many fungal taxa might have an alternative, saprobial lifestyle in snow-covered soil, as supposed for a few biotrophic plant pathogens of the class Pucciniomycetes. The isolated fungi include a high proportion of unknown species, which can be formally described and used for experimental approaches.
Affiliation(s)
- Philipp Dresch
  - Institute of Microbiology, University Innsbruck, Innsbruck, Austria
- Regina Kuhnert
  - Institute of Microbiology, University Innsbruck, Innsbruck, Austria
- Ursula Peintner
  - Institute of Microbiology, University Innsbruck, Innsbruck, Austria
5
Cai Y, Zheng W, Yao J, Yang Y, Mai V, Mao Q, Sun Y. ESPRIT-Forest: Parallel clustering of massive amplicon sequence data in subquadratic time. PLoS Comput Biol 2017; 13:e1005518. [PMID: 28437450 PMCID: PMC5421816 DOI: 10.1371/journal.pcbi.1005518] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.6] [Received: 10/23/2016] [Revised: 05/08/2017] [Accepted: 04/13/2017] [Indexed: 12/30/2022] Open
Abstract
The rapid development of sequencing technology has led to an explosive accumulation of genomic sequence data. Clustering is often the first step in sequence analysis, and hierarchical clustering is one of the most commonly used approaches for this purpose. However, it is currently computationally expensive to perform hierarchical clustering of extremely large sequence datasets because of its quadratic time and space complexity. In this paper we developed a new algorithm called ESPRIT-Forest for parallel hierarchical clustering of sequences. The algorithm achieves subquadratic time and space complexity and maintains a high clustering accuracy comparable to the standard method. The basic idea is to organize sequences into a pseudo-metric based partitioning tree for sub-linear time searching of nearest neighbors, and then use a new multiple-pair merging criterion to construct clusters in parallel using multiple threads. The new algorithm was tested on the Human Microbiome Project (HMP) dataset, currently one of the largest published microbial 16S rRNA sequence datasets. Our experiment demonstrated that with the power of parallel computing it is now computationally feasible to perform hierarchical clustering analysis of tens of millions of sequences. The software is available at http://www.acsu.buffalo.edu/~yijunsun/lab/ESPRIT-Forest.html.
Affiliation(s)
- Yunpeng Cai
  - Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen, China
  - * E-mail: (YC); (YS)
- Wei Zheng
  - Department of Computer Science and Engineering, The State University of New York at Buffalo, Buffalo, New York, United States of America
- Jin Yao
  - Department of Microbiology and Immunology, The State University of New York at Buffalo, Buffalo, New York, United States of America
- Yujie Yang
  - Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen, China
- Volker Mai
  - Department of Epidemiology, University of Florida, Gainesville, Florida, United States of America
- Qi Mao
  - Department of Microbiology and Immunology, The State University of New York at Buffalo, Buffalo, New York, United States of America
- Yijun Sun
  - Department of Computer Science and Engineering, The State University of New York at Buffalo, Buffalo, New York, United States of America
  - Department of Microbiology and Immunology, The State University of New York at Buffalo, Buffalo, New York, United States of America
  - Department of Biostatistics, The State University of New York at Buffalo, Buffalo, New York, United States of America
  - * E-mail: (YC); (YS)