1
Kidd M, Drozdov IA, Chirindel A, Nicolas G, Imagawa D, Gulati A, Tsuchikawa T, Prasad V, Halim AB, Strosberg J. NETest® 2.0-A decade of innovation in neuroendocrine tumor diagnostics. J Neuroendocrinol 2025; 37:e70002. [PMID: 39945192 PMCID: PMC11975799 DOI: 10.1111/jne.70002] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/05/2024] [Revised: 01/27/2025] [Accepted: 01/31/2025] [Indexed: 04/09/2025]
Abstract
Gastroenteropancreatic neuroendocrine neoplasms (GEP-NENs) are challenging to diagnose and manage. Because there is a critical need for a reliable biomarker, we previously developed the NETest, a liquid biopsy test that quantifies the expression of 51 neuroendocrine tumor (NET)-specific genes in blood using real-time PCR (NETest 1.0). In this study, we have combined our well-established laboratory approach (blood collection, RNA isolation, qPCR) with contemporary supervised machine learning methods and expanded training and testing sets to improve the discrimination and calibration of the NETest algorithm (NETest 2.0). qPCR measurements of RNA-stabilized blood-derived gene expression of 51 NET markers were used to train two supervised classifiers. The first classifier, trained on 78 controls and 162 NETs, distinguished NETs from controls; the second, trained on 134 stable disease samples and 61 progressive disease samples, differentiated stable from progressive NET disease. In all cases, 80% of the data was retained for model training, while the remaining 20% was used for performance evaluation. The predictive performance of the AI system was assessed using sensitivity, specificity, and the area under the receiver operating characteristic curve (AUROC). The algorithm with the highest performance was retained for validation in two independent validation sets. Validation Cohort #I consisted of 277 patients and 186 healthy controls from the United States, Latin America, Europe, Africa and Asia, while Validation Cohort #II comprised 291 European patients from the Swiss NET Registry. A specificity cohort of 147 gastrointestinal, pancreatic and lung malignancies (non-NETs) was also evaluated. NETest 2.0 Algorithm #1 (Random Forest/gene expression normalized to ATG4B) achieved an AUROC of 0.91 for distinguishing NETs from controls (Validation Cohort #I), with a sensitivity of 95% and specificity of 81%. In Validation Cohort #II, 92% of NETs with image-positive disease were detected.
The AUROC for differentiating NETs from other malignancies was 0.95; the sensitivity was 92% and specificity 90%. NETest 2.0 Algorithm #2 (Random Forest/gene expression normalized to ALG9) demonstrated an AUROC of 0.81 in Validation Cohort #I and 0.82 in Validation Cohort #II for differentiating stable from progressive disease, with specificities of 81% and 82%, respectively. Model performance was not affected by gender, ethnicity or age. Substantial improvements in performance for both algorithms were identified in head-to-head comparisons with NETest 1.0 (diagnostic: p = 1.73 × 10⁻⁹; prognostic: p = 1.02 × 10⁻¹⁰). NETest 2.0 exhibited improved diagnostic and prognostic capabilities over NETest 1.0. The assay also demonstrated improved sensitivity for differentiating NETs from other gastrointestinal, pancreatic and lung malignancies. The validation of this tool in geographically diverse cohorts highlights its potential for widespread clinical use.
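The three metrics named in this abstract (sensitivity, specificity, AUROC) can be sketched as follows. This is a minimal illustration on made-up toy data, not the authors' code: the function names, the 0.5 decision threshold, and the example labels and scores are all assumptions introduced here.

```python
def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity) for binary labels (1 = NET, 0 = control)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return tp / (tp + fn), tn / (tn + fp)


def auroc(y_true, scores):
    """AUROC as the probability that a random positive outscores a random
    negative, with ties counted as one half (the rank-based definition)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: four cases and four controls with classifier scores.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.6, 0.3, 0.7, 0.4, 0.2, 0.1]

# Assumed operating point of 0.5 for turning scores into class calls.
y_pred = [1 if s >= 0.5 else 0 for s in scores]

sens, spec = sensitivity_specificity(y_true, y_pred)
area = auroc(y_true, scores)
print(sens, spec, area)  # 0.75 0.75 0.8125
```

Sensitivity and specificity depend on the chosen threshold, whereas AUROC summarizes ranking quality across all thresholds, which is why abstracts of this kind typically report both.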
Affiliation(s)
- M. Kidd
  - Wren Laboratories, Branford, Connecticut, USA
- D. Imagawa
  - University of California, Irvine, Orange, California, USA
- A. Gulati
  - Bennett Cancer Center, Stamford, Connecticut, USA
- V. Prasad
  - Mallinckrodt Institute of Radiology, Washington University in St. Louis, St. Louis, Missouri, USA
2
Stathopoulou KM, Georgakopoulos S, Tasoulis S, Plagianakos VP. Investigating the overlap of machine learning algorithms in the final results of RNA-seq analysis on gene expression estimation. Health Inf Sci Syst 2024; 12:14. [PMID: 38435719 PMCID: PMC10904690 DOI: 10.1007/s13755-023-00265-4] [Citation(s) in RCA: 5] [Impact Index Per Article: 5.0] [Received: 06/04/2023] [Accepted: 12/05/2023] [Indexed: 03/05/2024] Open
Abstract
Advances in computer science in combination with next-generation sequencing have introduced a new era in biology, enabling advanced state-of-the-art analysis of complex biological data. Bioinformatics is evolving as a field uniting computer science and biology, enabling the representation, storage, management, analysis and exploration of many types of data with a plethora of machine learning algorithms and computing tools. In this study, we used machine learning algorithms to detect differentially expressed genes between different types of cancer and to show that their results overlap with the final results of RNA-sequencing analysis. The datasets were obtained from the National Center for Biotechnology Information resource. Specifically, we used dataset GSE68086, which corresponds to PMID:200,068,086. This dataset consists of 171 blood platelet samples collected from patients with six different tumors and healthy individuals. All steps for RNA-sequencing analysis (preprocessing, read alignment, transcriptome reconstruction, expression quantification and differential expression analysis) were followed. Machine learning-based Random Forest and Gradient Boosting algorithms were applied to predict significant genes. The RStudio statistical tool was used for the analysis.
Affiliation(s)
- Kalliopi-Maria Stathopoulou
  - Department of Computer Science and Biomedical Informatics, University of Thessaly, Papasiopoulou 2-4, 35100 Lamia, Greece
- Sotiris Tasoulis
  - Department of Computer Science and Biomedical Informatics, University of Thessaly, Papasiopoulou 2-4, 35100 Lamia, Greece
- Vassilis P. Plagianakos
  - Department of Computer Science and Biomedical Informatics, University of Thessaly, Papasiopoulou 2-4, 35100 Lamia, Greece
3
Cai K, Fu W, Liu H, Yang X, Wang Z, Zhao X. Leveraging Bioinformatics and Machine Learning for Identifying Prognostic Biomarkers and Predicting Clinical Outcomes in Lung Adenocarcinoma. Genes (Basel) 2024; 15:1497. [PMID: 39766765 PMCID: PMC11675206 DOI: 10.3390/genes15121497] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/15/2024] [Revised: 11/06/2024] [Accepted: 11/21/2024] [Indexed: 01/11/2025] Open
Abstract
Background/Objectives: There exist significant challenges for lung adenocarcinoma (LUAD) due to its poor prognosis and limited treatment options, particularly in the advanced stages. It is crucial to identify genetic biomarkers for improving outcome predictions and guiding personalized therapies. Methods: In this study, we utilize a multi-step approach that combines principled sure independence screening, penalized regression methods and information gain to identify the key genetic features of the ultra-high dimensional RNA-sequencing data from LUAD patients. We then evaluate three methods of survival analysis: the Cox model, survival tree, and random survival forests (RSFs), to compare their predictive performance. Additionally, a protein-protein interaction network is used to explore the biological significance of identified genes. Results: DKK1 and TNS4 are consistently selected as significant predictors across all feature selection methods. The Kaplan-Meier method shows that high expression levels of these genes are strongly correlated with poorer survival outcomes, suggesting their potential as prognostic biomarkers. RSF outperforms Cox and survival tree methods, showing higher AUC and C-index values. The protein-protein interaction network highlights key nodes such as VEGFC and LAMA3, which play central roles in LUAD progression. Conclusions: Our findings provide valuable insights into the genetic mechanisms of LUAD. These results contribute to the development of more accurate prognostic tools and personalized treatment strategies for LUAD.
Affiliation(s)
- Kaida Cai
  - Department of Epidemiology and Biostatistics, School of Public Health, Southeast University, Nanjing 210009, China
  - Department of Statistics and Actuarial Science, School of Mathematics, Southeast University, Nanjing 211189, China
  - Key Laboratory of Environmental Medicine Engineering, Ministry of Education, School of Public Health, Southeast University, Nanjing 210009, China
- Wenzhi Fu
  - Department of Statistics and Actuarial Science, School of Mathematics, Southeast University, Nanjing 211189, China
- Hanwen Liu
  - Department of Statistics and Actuarial Science, School of Mathematics, Southeast University, Nanjing 211189, China
- Xiaofang Yang
  - Department of Statistics and Actuarial Science, School of Mathematics, Southeast University, Nanjing 211189, China
- Zhengyan Wang
  - Department of Statistics and Actuarial Science, School of Mathematics, Southeast University, Nanjing 211189, China
- Xin Zhao
  - Department of Statistics and Actuarial Science, School of Mathematics, Southeast University, Nanjing 211189, China
  - Key Laboratory of Measurement and Control of Complex Systems of Engineering, Ministry of Education, Southeast University, Nanjing 210096, China
4
Modlin IM, Kidd M, Drozdov IA, Boegemann M, Bodei L, Kunikowska J, Malczewska A, Bernemann C, Koduru SV, Rahbar K. Development of a multigenomic liquid biopsy (PROSTest) for prostate cancer in whole blood. Prostate 2024; 84:850-865. [PMID: 38571290 DOI: 10.1002/pros.24704] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/05/2024] [Revised: 03/04/2024] [Accepted: 03/25/2024] [Indexed: 04/05/2024]
Abstract
INTRODUCTION We describe the development of a molecular assay from publicly available tumor tissue mRNA databases using machine learning and present preliminary evidence of functionality as a diagnostic and monitoring tool for prostate cancer (PCa) in whole blood. MATERIALS AND METHODS We assessed 1055 PCas (public microarray data sets) to identify putative mRNA biomarkers. Specificity was confirmed against 32 different solid and hematological cancers from The Cancer Genome Atlas (n = 10,990). This defined a 27-gene panel which was validated by qPCR in 50 histologically confirmed PCa surgical specimens and matched blood. An ensemble classifier (Random Forest, Support Vector Machines, XGBoost) was trained in age-matched PCas (n = 294), and in 72 controls and 64 BPH. Classifier performance was validated in two independent sets (n = 263 PCas; n = 99 controls). We assessed the panel as a postoperative disease monitor in a radical prostatectomy cohort (RPC: n = 47). RESULTS A PCa-specific 27-gene panel was identified. Matched blood and tumor gene expression levels were concordant (r = 0.72, p < 0.0001). The ensemble classifier ("PROSTest") was scaled 0%-100% and the industry-standard operating point of ≥50% used to define a PCa. Using this, the PROSTest exhibited an 85% sensitivity and 95% specificity for PCa versus controls. In two independent sets, the metrics were 92%-95% sensitivity and 100% specificity. In the RPCs (n = 47), PROSTest scores decreased from 72% ± 7% to 33% ± 16% (p < 0.0001, Mann-Whitney test). PROSTest was 26% ± 8% in 37 with normal postoperative PSA levels (<0.1 ng/mL). In 10 with elevated postoperative PSA, PROSTest was 60% ± 4%. CONCLUSION A 27-gene whole blood signature for PCa is concordant with tissue mRNA levels. Measuring blood expression provides a minimally invasive genomic tool that may facilitate prostate cancer management.
Affiliation(s)
- Irvin M Modlin
  - Yale University School of Medicine, New Haven, Connecticut, USA
- Mark Kidd
  - Wren Laboratories LLC, Branford, Connecticut, USA
- Martin Boegemann
  - Department of Urology, Münster University Hospital, Münster, Germany
- Lisa Bodei
  - Department of Radiology, Molecular Imaging and Therapy Service, Memorial Sloan Kettering Cancer Center, New York, New York, USA
- Jolanta Kunikowska
  - Department of Nuclear Medicine, Medical University of Warsaw, Warsaw, Poland
- Anna Malczewska
  - Department of Endocrinology, Medical University of Silesia, Katowice, Poland
- Kambiz Rahbar
  - Department of Nuclear Medicine, Münster University Hospital, Münster, Germany
5
Lee YH, Chang J, Lee JE, Jung YS, Lee D, Lee HS. Essential elements of physical fitness analysis in male adolescent athletes using machine learning. PLoS One 2024; 19:e0298870. [PMID: 38564629 PMCID: PMC10986970 DOI: 10.1371/journal.pone.0298870] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/26/2023] [Accepted: 02/01/2024] [Indexed: 04/04/2024] Open
Abstract
Physical fitness (PF) includes various factors that significantly impact athletic performance. Analyzing PF is critical in developing customized training methods for athletes based on the sports in which they compete. Previous approaches to analyzing PF have relied on statistical or machine learning algorithms that focus on predicting athlete injury or performance. In this study, six machine learning algorithms were used to analyze the PF of 1,489 male adolescent athletes across five sports: track & field, football, baseball, swimming, and badminton. Furthermore, the machine learning models were used to analyze the essential elements of PF using XGBoost feature importance and SHAP values. XGBoost achieved the highest performance, with an average accuracy of 90.14, an area under the curve of 0.86, and an F1-score of 0.87, demonstrating the similarity between the sports. XGBoost feature importance and SHAP values provided a quantitative assessment of the relative importance of PF in sports by comparing two sports within each of the five sports. This analysis is expected to be useful in analyzing the essential PF elements of athletes in various sports and recommending personalized exercise methods accordingly.
Affiliation(s)
- Yun-Hwan Lee
  - Department of Exercise and Medical Science, Graduate School, Dankook University, Cheonan, Republic of Korea
  - Institute of Medical-Sports, Dankook University, Cheonan, Republic of Korea
- Jisuk Chang
  - Department of Sports Management, Dankook University, Cheonan, Republic of Korea
- Ji-Eun Lee
  - Department of Exercise and Medical Science, Graduate School, Dankook University, Cheonan, Republic of Korea
- Yeon-Sung Jung
  - The Sport Science Center in Gyeonggi, Seoul, Republic of Korea
- Dongheon Lee
  - Department of Biomedical Engineering, Chungnam National University Hospital, Daejeon, Republic of Korea
  - Department of Biomedical Engineering, Chungnam National University College of Medicine, Daejeon, Republic of Korea
- Ho-Seong Lee
  - Department of Exercise and Medical Science, Graduate School, Dankook University, Cheonan, Republic of Korea
  - Institute of Medical-Sports, Dankook University, Cheonan, Republic of Korea
6
Fuller GW, Hasan M, Hodkinson P, McAlpine D, Goodacre S, Bath PA, Sbaffi L, Omer Y, Wallis L, Marincowitz C. Training and testing of a gradient boosted machine learning model to predict adverse outcome in patients presenting to emergency departments with suspected covid-19 infection in a middle-income setting. PLOS Digital Health 2023; 2:e0000309. [PMID: 37729117 PMCID: PMC10511129 DOI: 10.1371/journal.pdig.0000309] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/04/2023] [Accepted: 06/27/2023] [Indexed: 09/22/2023]
Abstract
COVID-19 infection rates remain high in South Africa. Clinical prediction models may be helpful for rapid triage, and supporting clinical decision making, for patients with suspected COVID-19 infection. The Western Cape, South Africa, has integrated electronic health care data facilitating large-scale linked routine datasets. The aim of this study was to develop a machine learning model to predict adverse outcome in patients presenting with suspected COVID-19 suitable for use in a middle-income setting. A retrospective cohort study was conducted using linked, routine data, from patients presenting with suspected COVID-19 infection to public-sector emergency departments (EDs) in the Western Cape, South Africa between 27th August 2020 and 31st October 2021. The primary outcome was death or critical care admission at 30 days. An XGBoost machine learning model was trained and internally tested using split-sample validation. External validation was performed in 3 test cohorts: Western Cape patients presenting during the Omicron COVID-19 wave, a UK cohort during the ancestral COVID-19 wave, and a Sudanese cohort during ancestral and Eta waves. A total of 282,051 cases were included in a complete case training dataset. The prevalence of 30-day adverse outcome was 4.0%. The most important features for predicting adverse outcome were the requirement for supplemental oxygen, peripheral oxygen saturations, level of consciousness and age. Internal validation using split-sample test data revealed excellent discrimination (C-statistic 0.91, 95% CI 0.90 to 0.91) and calibration (CITL of 1.05). The model achieved C-statistics of 0.84 (95% CI 0.84 to 0.85), 0.72 (95% CI 0.71 to 0.73), and 0.62 (95% CI 0.59 to 0.65) in the Omicron, UK, and Sudanese test cohorts, respectively. Results were materially unchanged in sensitivity analyses examining missing data.
An XGBoost machine learning model achieved good discrimination and calibration in prediction of adverse outcome in patients presenting with suspected COVID-19 to Western Cape EDs. Performance was reduced in temporal and geographical external validation.
Affiliation(s)
- Gordon Ward Fuller
  - Centre for Urgent and Emergency Care Research (CURE), Health Services Research School of Health and Related Research, University of Sheffield, Sheffield, United Kingdom
- Madina Hasan
  - Centre for Urgent and Emergency Care Research (CURE), Health Services Research School of Health and Related Research, University of Sheffield, Sheffield, United Kingdom
- Peter Hodkinson
  - Division of Emergency Medicine, University of Cape Town, Cape Town, South Africa
- David McAlpine
  - Division of Emergency Medicine, University of Cape Town, Cape Town, South Africa
- Steve Goodacre
  - Centre for Urgent and Emergency Care Research (CURE), Health Services Research School of Health and Related Research, University of Sheffield, Sheffield, United Kingdom
- Peter A. Bath
  - Centre for Urgent and Emergency Care Research (CURE), Health Services Research School of Health and Related Research, University of Sheffield, Sheffield, United Kingdom
  - Information School, University of Sheffield, Sheffield, United Kingdom
- Laura Sbaffi
  - Information School, University of Sheffield, Sheffield, United Kingdom
- Yasein Omer
  - Division of Emergency Medicine, University of Cape Town, Cape Town, South Africa
- Lee Wallis
  - Division of Emergency Medicine, University of Cape Town, Cape Town, South Africa
- Carl Marincowitz
  - Centre for Urgent and Emergency Care Research (CURE), Health Services Research School of Health and Related Research, University of Sheffield, Sheffield, United Kingdom
7
Dimitsaki S, Gavriilidis GI, Dimitriadis VK, Natsiavas P. Benchmarking of Machine Learning classifiers on plasma proteomic for COVID-19 severity prediction through interpretable artificial intelligence. Artif Intell Med 2023; 137:102490. [PMID: 36868685 PMCID: PMC9846931 DOI: 10.1016/j.artmed.2023.102490] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/18/2022] [Revised: 01/10/2023] [Accepted: 01/11/2023] [Indexed: 01/19/2023]
Abstract
The SARS-CoV-2 pandemic highlighted the need for software tools that could facilitate patient triage regarding potential disease severity or even death. In this article, an ensemble of Machine Learning (ML) algorithms is evaluated in terms of predicting the severity of patients' condition using plasma proteomics and clinical data as input. An overview of AI-based technical developments to support COVID-19 patient management is presented, outlining the landscape of relevant technical developments. Based on this review, an ensemble of ML algorithms that analyze clinical and biological data (i.e., plasma proteomics) of COVID-19 patients is designed and deployed to evaluate the potential use of AI for early COVID-19 patient triage. The proposed pipeline is evaluated using three publicly available datasets for training and testing. Three ML "tasks" are defined, and several algorithms are tested through a hyperparameter tuning method to identify the highest-performance models. As overfitting is one of the typical pitfalls for such approaches (mainly due to the size of the training/validation datasets), a variety of evaluation metrics are used to mitigate this risk. In the evaluation procedure, recall scores ranged from 0.6 to 0.74 and F1-score from 0.62 to 0.75. The best performance is observed via Multi-Layer Perceptron (MLP) and Support Vector Machines (SVM) algorithms. Additionally, input data (proteomics and clinical data) were ranked based on corresponding Shapley additive explanation (SHAP) values and evaluated for their prognostic capacity and immuno-biological credence. This "interpretable" approach revealed that our ML models could discern critical COVID-19 cases predominantly based on patient's age and plasma proteins on B cell dysfunction, hyper-activation of inflammatory pathways like Toll-like receptors, and hypo-activation of developmental and immune pathways like SCF/c-Kit signaling.
Finally, the computational workflow presented here is corroborated in an independent dataset, which confirms both MLP superiority and the implication of the abovementioned predictive biological pathways. Regarding limitations of the presented ML pipeline, the datasets used in this study contain fewer than 1000 observations and a significant number of input features, hence constituting a high-dimensional low-sample (HDLS) dataset which could be sensitive to overfitting. An advantage of the proposed pipeline is that it combines biological data (plasma proteomics) with clinical-phenotypic data. Thus, in principle, the presented approach could enable patient triage in a timely fashion if used on already trained models. However, larger datasets and further systematic validation are needed to confirm the potential clinical value of this approach. The code is available on Github: https://github.com/inab-certh/Predicting-COVID-19-severity-through-interpretable-AI-analysis-of-plasma-proteomics.
Affiliation(s)
- Stella Dimitsaki
  - Institute of Applied Biosciences, Centre for Research & Technology Hellas, Thermi, Thessaloniki, Greece
- George I Gavriilidis
  - Institute of Applied Biosciences, Centre for Research & Technology Hellas, Thermi, Thessaloniki, Greece
- Vlasios K Dimitriadis
  - Institute of Applied Biosciences, Centre for Research & Technology Hellas, Thermi, Thessaloniki, Greece
- Pantelis Natsiavas
  - Institute of Applied Biosciences, Centre for Research & Technology Hellas, Thermi, Thessaloniki, Greece
8
Sen Puliparambil B, Tomal JH, Yan Y. A Novel Algorithm for Feature Selection Using Penalized Regression with Applications to Single-Cell RNA Sequencing Data. Biology 2022; 11:1495. [PMID: 36290397 PMCID: PMC9598401 DOI: 10.3390/biology11101495] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/11/2022] [Revised: 09/21/2022] [Accepted: 09/30/2022] [Indexed: 11/05/2022]
Abstract
With the emergence of single-cell RNA sequencing (scRNA-seq) technology, scientists are able to examine gene expression at single-cell resolution. Analysis of scRNA-seq data has its own challenges, which stem from its high dimensionality. Machine learning offers the potential for gene (feature) selection from high-dimensional scRNA-seq data. Even though there exist multiple machine learning methods that appear to be suitable for feature selection, such as penalized regression, there is no rigorous comparison of their performances across data sets, where each poses its own challenges. Therefore, in this paper, we analyzed and compared multiple penalized regression methods for scRNA-seq data. Given the scRNA-seq data sets we analyzed, the results show that sparse group lasso (SGL) outperforms the other six methods (ridge, lasso, elastic net, drop lasso, group lasso, and big lasso) using the metrics area under the receiver operating characteristic curve (AUC) and computation time. Building on these findings, we proposed a new algorithm for feature selection using penalized regression methods. The proposed algorithm works by selecting a small subset of genes and applying SGL to select the differentially expressed genes in scRNA-seq data. By using hierarchical clustering to group genes, the proposed method bypasses the need for domain-specific knowledge for gene grouping information. In addition, the proposed algorithm provided consistently better AUC for the data sets used.
Affiliation(s)
- Bhavithry Sen Puliparambil
  - Master of Science in Data Science Program, Thompson Rivers University, 805 TRU Way, Kamloops, BC V2C 0C8, Canada
- Jabed H. Tomal
  - Department of Mathematics and Statistics, Thompson Rivers University, 805 TRU Way, Kamloops, BC V2C 0C8, Canada
- Yan Yan
  - Department of Computing Science, Thompson Rivers University, 805 TRU Way, Kamloops, BC V2C 0C8, Canada
9
Le H, Peng B, Uy J, Carrillo D, Zhang Y, Aevermann BD, Scheuermann RH. Machine learning for cell type classification from single nucleus RNA sequencing data. PLoS One 2022; 17:e0275070. [PMID: 36149937 PMCID: PMC9506651 DOI: 10.1371/journal.pone.0275070] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 02/03/2022] [Accepted: 09/09/2022] [Indexed: 11/18/2022] Open
Abstract
With the advent of single cell/nucleus RNA sequencing (sc/snRNA-seq), the field of cell phenotyping is now a data-driven exercise providing statistical evidence to support cell type/state categorization. However, the task of classifying cells into specific, well-defined categories with the empirical data provided by sc/snRNA-seq remains nontrivial due to the difficulty in determining specific differences between related cell types with close transcriptional similarities, resulting in challenges with matching cell types identified in separate experiments. To investigate possible approaches to overcome these obstacles, we explored the use of supervised machine learning methods (logistic regression, support vector machines, random forests, neural networks, and light gradient boosting machine [LightGBM]) as approaches to classify cell types using snRNA-seq datasets from human brain middle temporal gyrus (MTG) and human kidney. Classification accuracy was evaluated using an F-beta score weighted in favor of precision to account for technical artifacts of gene expression dropout. We examined the impact of hyperparameter optimization and feature selection methods on F-beta score performance. We found that the best performing model for granular cell type classification in both datasets is a multinomial logistic regression classifier and that an effective feature selection step was the most influential factor in optimizing the performance of the machine learning pipelines.
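For readers unfamiliar with the F-beta score mentioned in this abstract, the sketch below shows how a beta below 1 weights the score toward precision. The abstract does not state the beta value used, so the value 0.5, the function name `fbeta`, and the confusion-matrix counts are assumptions chosen purely for illustration.

```python
def fbeta(tp, fp, fn, beta):
    """F-beta score from confusion-matrix counts.

    beta < 1 emphasizes precision, beta > 1 emphasizes recall,
    and beta == 1 gives the ordinary F1 score.
    """
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Hypothetical counts: precision = 8/10 = 0.8, recall = 8/16 = 0.5.
tp, fp, fn = 8, 2, 8

f_half = fbeta(tp, fp, fn, beta=0.5)  # assumed precision-weighted setting
f_one = fbeta(tp, fp, fn, beta=1.0)   # standard F1 for comparison
f_two = fbeta(tp, fp, fn, beta=2.0)   # recall-weighted for comparison

print(f_half, f_one, f_two)
```

Because precision exceeds recall in this toy case, the precision-weighted score is the highest of the three, which illustrates why such a weighting is more forgiving of missed calls than of false ones.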
Affiliation(s)
- Huy Le
  - Department of Bioengineering, University of California, San Diego, CA, United States of America
- Beverly Peng
  - Department of Bioengineering, University of California, San Diego, CA, United States of America
- Janelle Uy
  - Department of Bioengineering, University of California, San Diego, CA, United States of America
- Daniel Carrillo
  - Department of Bioengineering, University of California, San Diego, CA, United States of America
- Yun Zhang
  - Department of Informatics, J. Craig Venter Institute, La Jolla, CA, United States of America
- Brian D. Aevermann
  - Department of Informatics, J. Craig Venter Institute, La Jolla, CA, United States of America
- Richard H. Scheuermann
  - Department of Informatics, J. Craig Venter Institute, La Jolla, CA, United States of America
  - Department of Pathology, University of California, San Diego, CA, United States of America
  - La Jolla Institute for Immunology, San Diego, CA, United States of America