1
Hartoyo A, Argasiński J, Trenk A, Przybylska K, Błasiak A, Crimi A. Synergistic eigenanalysis of covariance and Hessian matrices for enhanced binary classification on health datasets. Comput Biol Med 2025; 190:109985. [PMID: 40132299 DOI: 10.1016/j.compbiomed.2025.109985] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/13/2024] [Revised: 10/06/2024] [Accepted: 03/03/2025] [Indexed: 03/27/2025]
Abstract
Covariance and Hessian matrices have been analyzed separately in the literature for classification problems. However, integrating these matrices has the potential to enhance their combined power in improving classification performance. We present a novel approach that combines the eigenanalysis of a covariance matrix evaluated on a training set with a Hessian matrix evaluated on a deep learning model to achieve optimal class separability in binary classification tasks. Our approach is substantiated by formal proofs that establish its capability to maximize between-class mean distance (the concept of separation) and minimize within-class variances (the concept of compactness), which together define the two linear discriminant analysis (LDA) criteria, particularly under ideal data conditions such as isotropy around class means and dominant leading eigenvalues. By projecting data into the combined space of the most relevant eigendirections from both matrices, we achieve optimal class separability as per these LDA criteria. Empirical validation across neural and health datasets consistently supports our theoretical framework and demonstrates that our method outperforms established methods. Our method stands out by addressing both separation and compactness criteria, unlike PCA and the Hessian method, which predominantly emphasize one criterion each. This comprehensive approach captures intricate patterns and relationships, enhancing classification performance. Furthermore, through the utilization of both LDA criteria, our method outperforms LDA itself by leveraging higher-dimensional feature spaces, in accordance with Cover's theorem, which favors linear separability in higher dimensions. Additionally, our approach sheds light on complex DNN decision-making, rendering them comprehensible within a 2D space.
Affiliation(s)
- Agus Hartoyo
  - Sano - Centre for Computational Personalised Medicine, International Research Foundation, Krakow, Poland; School of Computing, Telkom University, Bandung, Indonesia
- Jan Argasiński
  - Sano - Centre for Computational Personalised Medicine, International Research Foundation, Krakow, Poland; Department of Human-Centered Artificial Intelligence, Institute of Applied Computer Science, Faculty of Physics, Astronomy and Applied Computer Science, Jagiellonian University, Krakow, Poland
- Aleksandra Trenk
  - Department of Neurophysiology and Chronobiology, Institute of Zoology and Biomedical Research, Faculty of Biology, Jagiellonian University, Krakow, Poland
- Kinga Przybylska
  - Department of Neurophysiology and Chronobiology, Institute of Zoology and Biomedical Research, Faculty of Biology, Jagiellonian University, Krakow, Poland; Doctoral School of Exact and Natural Sciences, Jagiellonian University, Krakow, Poland
- Anna Błasiak
  - Department of Neurophysiology and Chronobiology, Institute of Zoology and Biomedical Research, Faculty of Biology, Jagiellonian University, Krakow, Poland
2
Wang P, Zhang J. Prediction of Composite Clinical Outcomes for Childhood Neuroblastoma Using Multi-Omics Data and Machine Learning. Int J Mol Sci 2024; 26:136. [PMID: 39795994 PMCID: PMC11720239 DOI: 10.3390/ijms26010136] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/22/2024] [Revised: 05/20/2024] [Accepted: 05/22/2024] [Indexed: 01/13/2025] Open
Abstract
Neuroblastoma is a common malignant tumor in childhood that seriously endangers the health and lives of children, making it essential to find effective prognostic markers to accurately predict clinical outcomes. The development of high-throughput technology in the biomedical field has made it possible to obtain multi-omics data, whose integration can compensate for missing or unreliable information in a single data source. In this study, we integrated clinical data and two types of omics data, i.e., gene expression and DNA methylation data, to study the prognosis of neuroblastoma. Since the features in omics data are redundant, it is crucial to conduct feature selection on them. We proposed a two-step feature selection (TSFS) method to quickly and accurately select the optimal features, where the first step selects candidate features and the second step removes redundant features among them using our proposed maximal association coefficient (MAC). Our goal is to predict composite clinical outcomes for neuroblastoma patients, i.e., their survival time and vital status at the last follow-up, which were validated to be two inter-correlated tasks. We conducted a series of experiments and evaluated the results using accuracy and AUC (area under the ROC curve), which indicated that combining the integration of the three types of data, our proposed TSFS method, and a multi-task learning method synergistically improves the reliability and accuracy of the prediction models.
Affiliation(s)
- Junying Zhang
  - School of Computer Science and Technology, Xidian University, Xi’an 710126, China
3
Qiao W, Xie T, Lu J, Jia T. Development of machine learning models for the prediction of the skin sensitization potential of cosmetic compounds. PeerJ 2024; 12:e18672. [PMID: 39686995 PMCID: PMC11648681 DOI: 10.7717/peerj.18672] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/18/2024] [Accepted: 11/19/2024] [Indexed: 12/18/2024] Open
Abstract
Background: To enhance the accuracy of allergen detection in cosmetic compounds, we developed a co-culture system that combines HaCaT keratinocytes (transfected with a luciferase plasmid driven by the AKR1C2 promoter) and THP-1 cells for machine learning applications. Methods: Following chemical exposure, cell cytotoxicity was assessed using CCK-8 to determine appropriate stimulation concentrations. RNA-Seq was subsequently employed to analyze THP-1 cells, followed by differentially expressed gene (DEG) analysis and weighted gene co-expression network analysis (WGCNA). Using two data preprocessing methods and three feature extraction techniques, we constructed and validated models with eight machine learning algorithms. Results: Our results demonstrated the effectiveness of this integrated approach. The best performing models were random forest (RF) and voom-based diagonal quadratic discriminant analysis (voomDQDA), both achieving 100% accuracy. Support vector machine (SVM) and voom-based nearest shrunken centroids (voomNSC) showed excellent performance with 96.7% test accuracy, followed by voom-based diagonal linear discriminant analysis (voomDLDA) at 95.2%. Nearest shrunken centroids (NSC), Poisson linear discriminant analysis (PLDA) and negative binomial linear discriminant analysis (NBLDA) achieved 90.5% and 90.2% accuracy, respectively. K-nearest neighbors (KNN) showed the lowest accuracy at 85.7%. Conclusion: This study highlights the potential of integrating co-culture systems, RNA-Seq, and machine learning to develop more accurate and comprehensive in vitro methods for skin sensitization testing. Our findings contribute to the advancement of cosmetic safety assessments, potentially reducing the reliance on animal testing.
Affiliation(s)
- Wu Qiao
  - Pigeon Manufacturing (Shanghai) Co., Ltd., Shanghai, China
- Tong Xie
  - Pigeon Manufacturing (Shanghai) Co., Ltd., Shanghai, China
- Jing Lu
  - Pigeon Manufacturing (Shanghai) Co., Ltd., Shanghai, China
- Tinghan Jia
  - Pigeon Manufacturing (Shanghai) Co., Ltd., Shanghai, China
4
Feng S, Wang Z, Jin Y, Xu S. TabDEG: Classifying differentially expressed genes from RNA-seq data based on feature extraction and deep learning framework. PLoS One 2024; 19:e0305857. [PMID: 39037985 PMCID: PMC11262683 DOI: 10.1371/journal.pone.0305857] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/21/2023] [Accepted: 06/05/2024] [Indexed: 07/24/2024] Open
Abstract
Traditional models for identifying differentially expressed genes (DEGs) have limitations on small-sample-size datasets because they require distribution assumptions to be met; otherwise, sample variation leads to high false-positive and false-negative rates. In contrast, tabular data models based on deep learning (DL) frameworks do not need to consider the data distribution type or sample variation. However, applying DL to RNA-Seq data is still a challenge due to the lack of proper labeling and the small sample size compared to the number of genes. Data augmentation (DA) extracts data features using different methods and procedures, which can significantly increase complementary pseudo-values from limited data without significant additional cost. Based on this, we combine DA with a DL framework-based tabular data model and propose a model, TabDEG, to predict DEGs and their up-regulation/down-regulation directions from gene expression data obtained from the Cancer Genome Atlas database. Compared to five counterpart methods, TabDEG has high sensitivity and low misclassification rates. Experiments show that TabDEG is robust and effective in enhancing data features to facilitate classification of high-dimensional, small-sample-size datasets, and validate that TabDEG-predicted DEGs map to important gene ontology terms and pathways associated with cancer.
Affiliation(s)
- Sifan Feng
  - School of Mathematics and Statistics, Guangdong University of Technology, Guangzhou, Guangdong, China
- Zhenyou Wang
  - School of Mathematics and Statistics, Guangdong University of Technology, Guangzhou, Guangdong, China
- Yinghua Jin
  - School of Mathematics and Statistics, Guangdong University of Technology, Guangzhou, Guangdong, China
- Shengbin Xu
  - School of Mathematics and Statistics, Guangdong University of Technology, Guangzhou, Guangdong, China
5
Lee H, Ma T, Ke H, Ye Z, Chen S. dCCA: detecting differential covariation patterns between two types of high-throughput omics data. Brief Bioinform 2024; 25:bbae288. [PMID: 38888456 PMCID: PMC11184902 DOI: 10.1093/bib/bbae288] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/21/2024] [Revised: 05/01/2024] [Accepted: 06/03/2024] [Indexed: 06/20/2024] Open
Abstract
MOTIVATION The advent of multimodal omics data has provided an unprecedented opportunity to systematically investigate underlying biological mechanisms from distinct yet complementary angles. However, the joint analysis of multi-omics data remains challenging because it requires modeling interactions between multiple sets of high-throughput variables. Furthermore, these interaction patterns may vary across different clinical groups, reflecting disease-related biological processes. RESULTS We propose a novel approach called Differential Canonical Correlation Analysis (dCCA) to capture differential covariation patterns between two multivariate vectors across clinical groups. Unlike classical Canonical Correlation Analysis, which maximizes the correlation between two multivariate vectors, dCCA aims to maximally recover differentially expressed multivariate-to-multivariate covariation patterns between groups. We have developed computational algorithms and a toolkit to sparsely select paired subsets of variables from two sets of multivariate variables while maximizing the differential covariation. Extensive simulation analyses demonstrate the superior performance of dCCA in selecting variables of interest and recovering differential correlations. We applied dCCA to the Pan-Kidney cohort from the Cancer Genome Atlas Program database and identified differentially expressed covariations between noncoding RNAs and gene expressions. AVAILABILITY AND IMPLEMENTATION The R package that implements dCCA is available at https://github.com/hwiyoungstat/dCCA.
Affiliation(s)
- Hwiyoung Lee
  - Maryland Psychiatric Research Center, School of Medicine, University of Maryland, Baltimore, MD 21201, United States
  - The University of Maryland Institute for Health Computing (UM-IHC), North Bethesda, MD 20852, United States
- Tianzhou Ma
  - Department of Epidemiology and Biostatistics, University of Maryland, College Park, MD 20742, United States
- Hongjie Ke
  - Department of Epidemiology and Biostatistics, University of Maryland, College Park, MD 20742, United States
- Zhenyao Ye
  - The University of Maryland Institute for Health Computing (UM-IHC), North Bethesda, MD 20852, United States
  - Division of Biostatistics and Bioinformatics, Department of Epidemiology and Public Health, School of Medicine, University of Maryland, Baltimore, MD 21201, United States
- Shuo Chen
  - Maryland Psychiatric Research Center, School of Medicine, University of Maryland, Baltimore, MD 21201, United States
  - The University of Maryland Institute for Health Computing (UM-IHC), North Bethesda, MD 20852, United States
  - Division of Biostatistics and Bioinformatics, Department of Epidemiology and Public Health, School of Medicine, University of Maryland, Baltimore, MD 21201, United States
6
Genç M. Penalized logistic regression with prior information for microarray gene expression classification. Int J Biostat 2024; 20:107-122. [PMID: 36427223 DOI: 10.1515/ijb-2022-0025] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 02/15/2022] [Accepted: 11/07/2022] [Indexed: 02/17/2024]
Abstract
Cancer classification and gene selection are important applications in DNA microarray gene expression data analysis. Since DNA microarray data suffers from the high-dimensionality problem, automatic gene selection methods are used to enhance the classification performance of expert classifier systems. In this paper, a new penalized logistic regression method that performs simultaneous gene coefficient estimation and variable selection in DNA microarray data is discussed. The method employs prior information about the gene coefficients to improve the classification accuracy of the underlying model. The coordinate descent algorithm with screening rules is given to obtain the gene coefficient estimates of the proposed method efficiently. The performance of the method is examined on five high-dimensional cancer classification datasets using the area under the curve, the number of selected genes, misclassification rate and F-score measures. The real data analysis results indicate that the proposed method achieves a good cancer classification performance with a small misclassification rate, large area under the curve and F-score by trading off some sparsity level of the underlying model. Hence, the proposed method can be seen as a reliable penalized logistic regression method in the scope of high-dimensional cancer classification.
Affiliation(s)
- Murat Genç
  - Department of Management Information Systems, Faculty of Economics and Administrative Sciences, Tarsus University Mersin, Mersin 33400, Türkiye
7
Nie F, Chen H, Xiang S, Zhang C, Yan S, Li X. On the Equivalence of Linear Discriminant Analysis and Least Squares Regression. IEEE Transactions on Neural Networks and Learning Systems 2024; 35:5710-5720. [PMID: 36306294 DOI: 10.1109/tnnls.2022.3208944] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/16/2023]
Abstract
Studying the relationship between linear discriminant analysis (LDA) and least squares regression (LSR) is of great theoretical and practical significance. It is well known that two-class LDA is equivalent to an LSR problem; directly casting multiclass LDA as an LSR problem, however, is more challenging. A recent study reveals that the equivalence between multiclass LDA and LSR can be established based on a special class indicator matrix, but only under a mild condition that may not hold in scenarios with low-dimensional or oversampled data. In this article, we show that the equivalence between multiclass LDA and LSR can be established based on arbitrary linearly independent class indicator vectors and without any condition. In addition, we show that LDA is also equivalent to a constrained LSR based on data-dependent indicator vectors. It can be concluded that, under exactly the same mild condition, these two regressions are both equivalent to the null space LDA method. Motivated by the equivalence of LDA and LSR, we propose a direct LDA classifier to replace the conventional framework of LDA plus an extra classifier. Extensive experiments validate the above theoretical analysis.
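The two-class equivalence that the abstract calls well known can be checked numerically. The following is a minimal sketch, not code from the article: it assumes synthetic Gaussian data, 0/1 class labels, and hypothetical helper names (`lda_direction`, `lsr_direction`, `cosine`). It compares the Fisher direction with the least squares weight vector fitted on centered features and centered labels; the two should be parallel.

```python
import numpy as np

def lda_direction(X, y):
    """Fisher direction S_w^{-1} (m1 - m0) for labels y in {0, 1}."""
    X0, X1 = X[y == 0], X[y == 1]
    m0, m1 = X0.mean(axis=0), X1.mean(axis=0)
    # Within-class scatter: summed outer products of class-centered samples.
    Sw = (X0 - m0).T @ (X0 - m0) + (X1 - m1).T @ (X1 - m1)
    return np.linalg.solve(Sw, m1 - m0)

def lsr_direction(X, y):
    """Least squares weights for centered features and centered 0/1 targets."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    w, *_ = np.linalg.lstsq(Xc, yc, rcond=None)
    return w

def cosine(a, b):
    """Cosine of the angle between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)  # synthetic data, purely illustrative
X = np.vstack([rng.normal(0.0, 1.0, (60, 4)), rng.normal(1.0, 1.0, (40, 4))])
y = np.concatenate([np.zeros(60), np.ones(40)])

similarity = cosine(lda_direction(X, y), lsr_direction(X, y))
print(f"cosine similarity between LDA and LSR directions: {similarity:.6f}")
```

The directions agree up to a positive scale factor because the total scatter equals the within-class scatter plus a rank-one term along the mean difference. The multiclass results described in the abstract are not covered by this sketch.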
8
Ho IL, Li CY, Wang F, Zhao L, Liu J, Yen EY, Dyke CA, Shah R, Liu Z, Çetin AO, Chu Y, Citron F, Attanasio S, Corti D, Darbaniyan F, Del Poggetto E, Loponte S, Liu J, Soeung M, Chen Z, Jiang S, Jiang H, Inoue A, Gao S, Deem A, Feng N, Ying H, Kim M, Giuliani V, Genovese G, Zhang J, Futreal A, Maitra A, Heffernan T, Wang L, Do KA, Gargiulo G, Draetta G, Carugo A, Lin R, Viale A. Clonal dominance defines metastatic dissemination in pancreatic cancer. Science Advances 2024; 10:eadd9342. [PMID: 38478609 DOI: 10.1126/sciadv.add9342] [Citation(s) in RCA: 5] [Impact Index Per Article: 5.0] [Received: 07/15/2022] [Accepted: 02/08/2024] [Indexed: 02/08/2025]
Abstract
Tumors represent ecosystems where subclones compete during tumor growth. While extensively investigated, a comprehensive picture of the interplay of clonal lineages during dissemination is still lacking. Using patient-derived pancreatic cancer cells, we created orthotopically implanted clonal replica tumors to trace clonal dynamics of unperturbed tumor expansion and dissemination. This model revealed the multifaceted nature of tumor growth, with rapid changes in clonal fitness leading to continuous reshuffling of tumor architecture and alternating clonal dominance as a distinct feature of cancer growth. Regarding dissemination, a large fraction of tumor lineages could be found at secondary sites, each having distinctive organ growth patterns as well as numerous undescribed behaviors such as abortive colonization. Paired analysis of primary and secondary sites revealed fitness as a major contributor to dissemination. From the analysis of pro- and nonmetastatic isogenic subclones, we identified a transcriptomic signature able to identify metastatic cells in human tumors and predict patients' survival.
Affiliation(s)
- I-Lin Ho
  - The University of Texas MD Anderson Cancer Center UTHealth Graduate School of Biomedical Sciences, Houston, TX, USA
  - Department of Genomic Medicine, The University of Texas MD Anderson Cancer Center, Houston, TX, USA
- Chieh-Yuan Li
  - The University of Texas MD Anderson Cancer Center UTHealth Graduate School of Biomedical Sciences, Houston, TX, USA
  - Department of Genomic Medicine, The University of Texas MD Anderson Cancer Center, Houston, TX, USA
- Fuchenchu Wang
  - Department of Biostatistics, The University of Texas MD Anderson Cancer Center, Houston, TX, USA
- Li Zhao
  - Department of Genomic Medicine, The University of Texas MD Anderson Cancer Center, Houston, TX, USA
- Jingjing Liu
  - Department of Genomic Medicine, The University of Texas MD Anderson Cancer Center, Houston, TX, USA
- Er-Yen Yen
  - The University of Texas MD Anderson Cancer Center UTHealth Graduate School of Biomedical Sciences, Houston, TX, USA
  - Department of Genomic Medicine, The University of Texas MD Anderson Cancer Center, Houston, TX, USA
- Charles A Dyke
  - Department of Genomic Medicine, The University of Texas MD Anderson Cancer Center, Houston, TX, USA
- Rutvi Shah
  - The University of Texas MD Anderson Cancer Center UTHealth Graduate School of Biomedical Sciences, Houston, TX, USA
  - Department of Genomic Medicine, The University of Texas MD Anderson Cancer Center, Houston, TX, USA
- Zhaoliang Liu
  - Department of Genomic Medicine, The University of Texas MD Anderson Cancer Center, Houston, TX, USA
- Ali Osman Çetin
  - Max-Delbrück-Center for Molecular Medicine in the Helmholtz Association (MDC), Robert-Rössle-Str. 10, 13125 Berlin, Germany
- Yanshuo Chu
  - Department of Genomic Medicine, The University of Texas MD Anderson Cancer Center, Houston, TX, USA
- Francesca Citron
  - Department of Genomic Medicine, The University of Texas MD Anderson Cancer Center, Houston, TX, USA
- Sergio Attanasio
  - Department of Genomic Medicine, The University of Texas MD Anderson Cancer Center, Houston, TX, USA
- Denise Corti
  - Department of Genomic Medicine, The University of Texas MD Anderson Cancer Center, Houston, TX, USA
- Faezeh Darbaniyan
  - Department of Biostatistics, The University of Texas MD Anderson Cancer Center, Houston, TX, USA
- Edoardo Del Poggetto
  - Department of Genomic Medicine, The University of Texas MD Anderson Cancer Center, Houston, TX, USA
- Sara Loponte
  - Department of Genomic Medicine, The University of Texas MD Anderson Cancer Center, Houston, TX, USA
- Jintan Liu
  - The University of Texas MD Anderson Cancer Center UTHealth Graduate School of Biomedical Sciences, Houston, TX, USA
  - Department of Genomic Medicine, The University of Texas MD Anderson Cancer Center, Houston, TX, USA
- Melinda Soeung
  - The University of Texas MD Anderson Cancer Center UTHealth Graduate School of Biomedical Sciences, Houston, TX, USA
  - Department of Genomic Medicine, The University of Texas MD Anderson Cancer Center, Houston, TX, USA
- Ziheng Chen
  - Department of Genomic Medicine, The University of Texas MD Anderson Cancer Center, Houston, TX, USA
- Shan Jiang
  - TRACTION platform, The University of Texas MD Anderson Cancer Center, Houston, TX, USA
- Hong Jiang
  - Department of Genomic Medicine, The University of Texas MD Anderson Cancer Center, Houston, TX, USA
- Akira Inoue
  - Department of Genomic Medicine, The University of Texas MD Anderson Cancer Center, Houston, TX, USA
- Sisi Gao
  - Department of Genomic Medicine, The University of Texas MD Anderson Cancer Center, Houston, TX, USA
  - TRACTION platform, The University of Texas MD Anderson Cancer Center, Houston, TX, USA
- Angela Deem
  - Department of Genomic Medicine, The University of Texas MD Anderson Cancer Center, Houston, TX, USA
- Ningping Feng
  - TRACTION platform, The University of Texas MD Anderson Cancer Center, Houston, TX, USA
- Haoqiang Ying
  - Department of Cellular and Molecular Oncology, The University of Texas MD Anderson Cancer Center, Houston, TX, USA
- Michael Kim
  - Department of Surgical Oncology, The University of Texas MD Anderson Cancer Center, Houston, TX, USA
- Virginia Giuliani
  - Max-Delbrück-Center for Molecular Medicine in the Helmholtz Association (MDC), Robert-Rössle-Str. 10, 13125 Berlin, Germany
- Giannicola Genovese
  - Department of Genitourinary Medical Oncology, The University of Texas MD Anderson Cancer Center, Houston, TX, USA
- Jianhua Zhang
  - Department of Genomic Medicine, The University of Texas MD Anderson Cancer Center, Houston, TX, USA
- Andrew Futreal
  - Department of Genomic Medicine, The University of Texas MD Anderson Cancer Center, Houston, TX, USA
- Anirban Maitra
  - Department of Pathology, The University of Texas MD Anderson Cancer Center, Houston, TX, USA
- Timothy Heffernan
  - TRACTION platform, The University of Texas MD Anderson Cancer Center, Houston, TX, USA
- Linghua Wang
  - Department of Genomic Medicine, The University of Texas MD Anderson Cancer Center, Houston, TX, USA
- Kim-Anh Do
  - Department of Biostatistics, The University of Texas MD Anderson Cancer Center, Houston, TX, USA
- Gaetano Gargiulo
  - Max-Delbrück-Center for Molecular Medicine in the Helmholtz Association (MDC), Robert-Rössle-Str. 10, 13125 Berlin, Germany
- Giulio Draetta
  - Department of Genomic Medicine, The University of Texas MD Anderson Cancer Center, Houston, TX, USA
- Alessandro Carugo
  - TRACTION platform, The University of Texas MD Anderson Cancer Center, Houston, TX, USA
- Ruitao Lin
  - Department of Biostatistics, The University of Texas MD Anderson Cancer Center, Houston, TX, USA
- Andrea Viale
  - Department of Genomic Medicine, The University of Texas MD Anderson Cancer Center, Houston, TX, USA
9
Peng M, Lin B, Zhang J, Zhou Y, Lin B. scFSNN: a feature selection method based on neural network for single-cell RNA-seq data. BMC Genomics 2024; 25:264. [PMID: 38459442 PMCID: PMC10924397 DOI: 10.1186/s12864-024-10160-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/26/2023] [Accepted: 02/25/2024] [Indexed: 03/10/2024] Open
Abstract
While single-cell RNA sequencing (scRNA-seq) allows researchers to analyze gene expression in individual cells, its unique characteristics, such as over-dispersion, zero-inflation, high gene-gene correlation, and large data volume with many features, pose challenges for most existing feature selection methods. In this paper, we present a feature selection method based on a neural network (scFSNN) to solve classification problems for scRNA-seq data. scFSNN is an embedded method that can automatically select features (genes) during model training, control the false discovery rate of selected features, and adaptively determine the number of features to be eliminated. Extensive simulation and real data studies demonstrate its excellent feature selection ability and predictive performance.
Affiliation(s)
- Minjiao Peng
  - School of Mathematical Sciences, Shenzhen University, Nanshan, Shenzhen, 518060, Guangdong, China
  - School of Mathematics and Statistics and KLAS, Northeast Normal University, Renmin Street, Changchun, 130000, Jilin, China
- Baoqin Lin
  - Experimental Center, The First Affiliated Hospital of Guangzhou University of Chinese Medicine, Guangzhou, Guangdong, 510405, China
- Jun Zhang
  - School of Mathematical Sciences, Shenzhen University, Nanshan, Shenzhen, 518060, Guangdong, China
- Yan Zhou
  - School of Mathematical Sciences, Shenzhen University, Nanshan, Shenzhen, 518060, Guangdong, China
- Bingqing Lin
  - School of Mathematical Sciences, Shenzhen University, Nanshan, Shenzhen, 518060, Guangdong, China
10
Senar N, van de Wiel M, Zwinderman AH, Hof MH. TOSCCA: a framework for interpretation and testing of sparse canonical correlations. Bioinformatics Advances 2024; 4:vbae021. [PMID: 38456127 PMCID: PMC10919946 DOI: 10.1093/bioadv/vbae021] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/03/2023] [Revised: 01/24/2024] [Accepted: 02/14/2024] [Indexed: 03/09/2024]
Abstract
Summary: In clinical and biomedical research, multiple high-dimensional datasets are nowadays routinely collected from omics and imaging devices. Multivariate methods, such as Canonical Correlation Analysis (CCA), integrate two (or more) datasets to discover and understand underlying biological mechanisms. For an explorative method like CCA, interpretation is key. We present a sparse CCA method based on soft-thresholding that produces near-orthogonal components, allows for browsing over various sparsity levels, and supports permutation-based hypothesis testing. Our soft-thresholding approach avoids tuning of a penalty parameter. Such tuning is computationally burdensome and may yield unintelligible results. In addition, unlike alternative approaches, our method is less dependent on the initialization. We examined the performance of our approach with simulations and illustrated its use on real cancer genomics data from drug sensitivity screens. Moreover, we compared its performance to Penalized Matrix Analysis (PMA), which is a popular alternative for sparse CCA with a focus on yielding interpretable results. Compared to PMA, our method offers improved interpretability of the results, while not compromising, or even improving, signal discovery. Availability and implementation: The software and simulation framework are available at https://github.com/nuria-sv/toscca.
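For readers unfamiliar with soft-thresholding, the generic operator can be sketched as follows. This is an illustration only and is not taken from the TOSCCA package: the helper `keep_top_k`, which picks the threshold from a target number of nonzero weights instead of a penalty value, is a hypothetical way to express "browsing over sparsity levels" and is an assumption, not the authors' algorithm.

```python
import numpy as np

def soft_threshold(v, lam):
    """Shrink each entry of v toward zero by lam; entries with |v_i| <= lam become 0."""
    return np.sign(v) * np.maximum(np.abs(v) - lam, 0.0)

def keep_top_k(v, k):
    """Soft-threshold v so that at most k entries stay nonzero (hypothetical helper)."""
    if k >= v.size:
        return v.copy()
    # Use the (k+1)-th largest magnitude as the threshold.
    lam = np.sort(np.abs(v))[::-1][k]
    return soft_threshold(v, lam)

# Illustrative canonical weight vector, not real data.
weights = np.array([0.9, -0.1, 0.4, -0.7, 0.05])
sparse = keep_top_k(weights, 2)
print(sparse)
```

Here the two largest-magnitude weights survive, each shrunk by the threshold, and the rest are set to zero. In a sparse CCA iteration, an operator of this kind would typically be applied to each weight vector between update steps.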
Affiliation(s)
- Nuria Senar
  - Department of Epidemiology & Data Science, Amsterdam School of Public Health, Amsterdam UMC, 1105 AZ Nord-Holland, The Netherlands
- Mark van de Wiel
  - Department of Epidemiology & Data Science, Amsterdam School of Public Health, Amsterdam UMC, 1105 AZ Nord-Holland, The Netherlands
- Aeilko H Zwinderman
  - Department of Epidemiology & Data Science, Amsterdam School of Public Health, Amsterdam UMC, 1105 AZ Nord-Holland, The Netherlands
- Michel H Hof
  - Department of Epidemiology & Data Science, Amsterdam School of Public Health, Amsterdam UMC, 1105 AZ Nord-Holland, The Netherlands
11
Bao W, Liu Y, Chen B. Oral_voting_transfer: classification of oral microorganisms' function proteins with voting transfer model. Front Microbiol 2024; 14:1277121. [PMID: 38384719 PMCID: PMC10879614 DOI: 10.3389/fmicb.2023.1277121] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/14/2023] [Accepted: 12/19/2023] [Indexed: 02/23/2024] Open
Abstract
Introduction: The oral microbial community is a typical example of a highly complex microbial ecosystem in the human body. Oral microorganisms take part in human diseases, including oral cavity inflammation, mucosal disease, periodontal disease, tooth decay, and oral cancer. Oral microbes can also cause endocrine, digestive, and nerve function disorders, such as diabetes, digestive system diseases, and Alzheimer's disease. It has been noted that the proteins of oral microbes play significant roles in these serious diseases. A good knowledge of oral microbes can be helpful in analyzing the progression of related diseases. Moreover, high-dimensional features and imbalanced data add to the complexity of oral microbial problems, which can hardly be solved with traditional experimental methods. Methods: To deal with these challenges, we proposed a novel method, oral_voting_transfer, for such classification problems in the field of oral microorganisms. The method employed three features to classify the five oral microorganisms, including Streptococcus mutans, Staphylococcus aureus, abiotrophy adjacent, bifidobacterial, and Capnocytophaga. First, we took a highly effective model that successfully classifies organelle proteins and transferred it to the oral microorganisms. Next, several classification methods were treated as local classifiers. Finally, the results were obtained by voting among the transfer classifiers and the voting ones. Results and discussion: The proposed method achieved good performance on the five oral microorganisms. The oral_voting_transfer is a standalone tool, and all its source code is publicly available at https://github.com/baowz12345/voting_transfer.
Affiliation(s)
- Wenzheng Bao
- School of Information Engineering, Xuzhou University of Technology, Xuzhou, China
| | - Yujun Liu
- School of Information Engineering, Xuzhou University of Technology, Xuzhou, China
| | - Baitong Chen
- The Affiliated Xuzhou Municipal Hospital of Xuzhou Medical University, Xuzhou, China
- Department of Stomatology, Xuzhou First People’s Hospital, Xuzhou, China
| |
|
12
|
Ahmed FF, Podder A, Bulbul MF, Hossain MA, Hasan M, Sarkar MAR, Kim D. Investigating the Precise Identification of Citrullination Sites with High-Performance Score Metrics Using a Powerful Computation Predicting Tool. Comb Chem High Throughput Screen 2024; 27:1381-1393. [PMID: 37702240 DOI: 10.2174/1386207326666230912151932] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/01/2023] [Revised: 06/18/2023] [Accepted: 08/02/2023] [Indexed: 09/14/2023]
Abstract
BACKGROUND To elucidate the detailed mechanisms of citrullination at the molecular level and design drugs applicable to major human diseases, predicting protein citrullination sites (PCSs) is essential. Using experimental approaches to identify PCSs is time-consuming and costly. However, current PCS predictors are limited in scope, and most of them have limited performance scores. OBJECTIVE This work aims to provide an improved predictor of citrullination sites using a benchmark dataset in a machine learning platform. METHODS This study presents a reliable citrullination site predictor based on a benchmark dataset containing a 1:1 ratio of positive and negative samples. We classified citrullination sites using the Composition of the K-Spaced Amino Acid Pairs (CKSAAP) and Support Vector Machine (SVM). RESULTS We developed PCS predictors using integrated machine-learning methods that produced the highest average scores. Using 10-fold cross-validation on test datasets, the True Positive Rate (TPR) was 98.34%, the True Negative Rate (TNR) was 99.44%, the accuracy was 98.89%, the Matthews Correlation Coefficient (MCC) was 98.21%, the Area Under the ROC Curve (AUC) was 0.999, and the partial Area Under the ROC Curve (pAUC) was 0.1968. CONCLUSION In terms of overall performance, our predictor scores significantly higher than the current tools on the same benchmark dataset. Moreover, it showed better performance metrics on both test and training datasets. Our predictor is promising and can be used as a complementary technique for fast and precise identification of citrullination sites.
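The CKSAAP encoding named in the abstract above can be illustrated with a minimal sketch. This follows the commonly used definition (for each gap k, the frequency of every ordered residue pair separated by k positions); it is not the authors' code, and the function name `cksaap`, the `k_max` parameter, and the toy peptide are assumptions made for illustration only.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# All 400 ordered pairs of the 20 standard residues, in a fixed order.
PAIRS = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]


def cksaap(sequence, k_max=2):
    """Return CKSAAP features for gaps k = 0..k_max (400 values per gap)."""
    features = []
    for k in range(k_max + 1):
        counts = dict.fromkeys(PAIRS, 0)
        # Number of residue pairs that are exactly k positions apart.
        n_windows = len(sequence) - k - 1
        for i in range(max(n_windows, 0)):
            pair = sequence[i] + sequence[i + k + 1]
            if pair in counts:  # ignore pairs containing non-standard residues
                counts[pair] += 1
        denom = n_windows if n_windows > 0 else 1
        features.extend(counts[p] / denom for p in PAIRS)
    return features


# Hypothetical 7-residue peptide, encoded with gaps k = 0 and k = 1.
vec = cksaap("ARNDRRA", k_max=1)
print(len(vec))  # 2 gaps x 400 pairs
```

A vector like `vec` would then be passed to a classifier such as an SVM; the window length and the range of k used in the paper are not stated in the abstract.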
Affiliation(s)
- Fee Faysal Ahmed
- Department of Mathematics, Jashore University of Science and Technology, Jashore, 7408, Bangladesh
| | - Anamika Podder
- Department of Mathematics, Jashore University of Science and Technology, Jashore, 7408, Bangladesh
| | - Md Farhad Bulbul
- Department of Mathematics, Jashore University of Science and Technology, Jashore, 7408, Bangladesh
- Department of Computer Science & Engineering, Pohang University of Science and Technology (POSTECH), 77 Cheongam, Pohang 37673, Korea
| | - Md Amzad Hossain
- Department of Electrical and Electronic Engineering, Jashore University of Science and Technology, Jashore -7408, Bangladesh
| | - Mahedi Hasan
- Department of Computer Science and Engineering, Jashore University of Science and Technology, Jashore, 7408, Bangladesh
| | - Md Abdur Rauf Sarkar
- Department of Genetic Engineering and Biotechnology, Jashore University of Science and Technology, Jashore 7408, Bangladesh
| | - Daijin Kim
- Department of Computer Science & Engineering, Pohang University of Science and Technology (POSTECH), 77 Cheongam, Pohang 37673, Korea
| |
|
13
|
Qiao M. Factorized discriminant analysis for genetic signatures of neuronal phenotypes. Front Neuroinform 2023; 17:1265079. [PMID: 38156117 PMCID: PMC10752939 DOI: 10.3389/fninf.2023.1265079] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 07/21/2023] [Accepted: 11/06/2023] [Indexed: 12/30/2023] Open
Abstract
Navigating the complex landscape of single-cell transcriptomic data presents significant challenges. Central to this challenge is the identification of a meaningful representation of high-dimensional gene expression patterns that sheds light on the structural and functional properties of cell types. Pursuing model interpretability and computational simplicity, we often look for a linear transformation of the original data that aligns with key phenotypic features of cells. In response to this need, we introduce factorized linear discriminant analysis (FLDA), a novel method for linear dimensionality reduction. The crux of FLDA lies in identifying a linear function of gene expression levels that is highly correlated with one phenotypic feature while minimizing the influence of others. To augment this method, we integrate it with a sparsity-based regularization algorithm. This integration is crucial as it selects a subset of genes pivotal to a specific phenotypic feature or a combination thereof. To illustrate the effectiveness of FLDA, we apply it to transcriptomic datasets from neurons in the Drosophila optic lobe. We demonstrate that FLDA not only captures the inherent structural patterns aligned with phenotypic features but also uncovers key genes associated with each phenotype.
|
14
|
Zengin HY, Karabulut E. Biomarker detection using corrected degree of domesticity in hybrid social network feature selection for improving classifier performance. BMC Bioinformatics 2023; 24:407. [PMID: 37904081 PMCID: PMC10617059 DOI: 10.1186/s12859-023-05540-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/06/2023] [Accepted: 10/20/2023] [Indexed: 11/01/2023] Open
Abstract
BACKGROUND Dimension reduction, especially feature selection, is an important step in improving classification performance for high-dimensional data. Particularly in cancer research, when reducing the number of features, i.e., genes, it is important to select the most informative features/potential biomarkers that could affect the diagnostic accuracy. Therefore, researchers continuously try to explore more efficient ways to reduce the large number of features/genes to a small but informative subset before the classification task. Hybrid methods have been extensively investigated for this purpose, and research to find the optimal approach is ongoing. Social network analysis is used as a part of a hybrid method, although there are several issues that have arisen when using social network tools, such as using a single environment for computing, constructing an adjacency matrix or computing network measures. Therefore, in our study, we apply a hybrid feature selection method consisting of several machine learning algorithms in addition to social network analysis with our proposed network metric, called the corrected degree of domesticity, in a single environment, R, to improve the support vector machine classifier's performance. In addition, we evaluate and compare the performances of several combinations used in the different steps of the method with a simulation experiment. RESULTS The proposed method improves the classifier's performance compared to using the whole feature set in all the cases we investigate. Additionally, in terms of the area under the receiver operating characteristic (ROC) curve, our approach improves classification performance compared to several approaches in the literature. CONCLUSION When using the corrected degree of domesticity as a network degree centrality measure, it is important to use our correction to compare nodes/features with no connection outside of their community since it provides a more accurate ranking among the features. Due to the nature of the hybrid method, which includes social network analysis, it is necessary to investigate possible combinations to provide an optimal solution for the microarray data used in the research.
Affiliation(s)
- Hatice Yağmur Zengin
- Department of Biostatistics, Hacettepe University Faculty of Medicine, Sıhhiye, 06230, Ankara, Türkiye.
| | - Erdem Karabulut
- Department of Biostatistics, Hacettepe University Faculty of Medicine, Sıhhiye, 06230, Ankara, Türkiye
| |
|
15
|
Khatun R, Akter M, Islam MM, Uddin MA, Talukder MA, Kamruzzaman J, Azad AKM, Paul BK, Almoyad MAA, Aryal S, Moni MA. Cancer Classification Utilizing Voting Classifier with Ensemble Feature Selection Method and Transcriptomic Data. Genes (Basel) 2023; 14:1802. [PMID: 37761941 PMCID: PMC10530870 DOI: 10.3390/genes14091802] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/15/2023] [Revised: 09/10/2023] [Accepted: 09/12/2023] [Indexed: 09/29/2023] Open
Abstract
Biomarker-based cancer identification and classification tools are widely used in bioinformatics and machine learning fields. However, the high dimensionality of microarray gene expression data poses a challenge for identifying important genes in cancer diagnosis. Many feature selection algorithms optimize cancer diagnosis by selecting optimal features. This article proposes an ensemble rank-based feature selection method (EFSM) and an ensemble weighted average voting classifier (VT) to overcome this challenge. The EFSM uses a ranking method that aggregates features from individual selection methods to efficiently discover the most relevant and useful features. The VT combines support vector machine, k-nearest neighbor, and decision tree algorithms to create an ensemble model. The proposed method was tested on three benchmark datasets and compared to existing built-in ensemble models. The results show that our model achieved higher accuracy, with 100% for leukaemia, 94.74% for colon cancer, and 94.34% for the 11-tumor dataset. This study concludes by identifying a subset of the most important cancer-causing genes and demonstrating their significance compared to the original data. The proposed approach surpasses existing strategies in accuracy and stability, significantly impacting the development of ML-based gene analysis. It detects vital genes with higher precision and stability than other existing methods.
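The weighted average voting idea described above can be sketched in a few lines. This is a generic soft-voting illustration, not the authors' implementation: the function name `weighted_soft_vote`, the example probabilities, and the weights are all hypothetical, and the abstract does not say how the paper chooses its weights.

```python
def weighted_soft_vote(probabilities, weights):
    """Combine per-classifier class probabilities by a weighted average.

    probabilities: one probability vector per base classifier.
    weights: one non-negative weight per base classifier.
    Returns (predicted_class_index, averaged_probabilities).
    """
    if len(probabilities) != len(weights):
        raise ValueError("need exactly one weight per classifier")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    n_classes = len(probabilities[0])
    averaged = [
        sum(w * p[c] for p, w in zip(probabilities, weights)) / total
        for c in range(n_classes)
    ]
    # The ensemble predicts the class with the largest averaged probability.
    predicted = max(range(n_classes), key=averaged.__getitem__)
    return predicted, averaged


# Hypothetical outputs of three base classifiers (e.g. SVM, k-NN, decision tree)
# for one sample and two classes; the first classifier is trusted most.
probs = [[0.6, 0.4], [0.3, 0.7], [0.2, 0.8]]
label, avg = weighted_soft_vote(probs, weights=[3, 1, 1])
print(label, avg)
```

Here the two lower-weighted classifiers still outvote the first one, because their combined confidence in the second class exceeds the weighted confidence of the first classifier in the first class.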
Affiliation(s)
- Rabea Khatun
- Department of Computer Science and Engineering, Green University of Bangladesh, Dhaka 1207, Bangladesh
| | - Maksuda Akter
- Department of Computer Science and Engineering, Jagannath University, Dhaka 1100, Bangladesh
| | - Md. Manowarul Islam
- Department of Computer Science and Engineering, Jagannath University, Dhaka 1100, Bangladesh
| | - Md. Ashraf Uddin
- School of Information Technology, Deakin University, Waurn Ponds Campus, Geelong, VIC 3125, Australia
| | - Md. Alamin Talukder
- Department of Computer Science and Engineering, Jagannath University, Dhaka 1100, Bangladesh
| | - Joarder Kamruzzaman
- Centre for Smart Analytics, Federation University Australia, Ballarat, VIC 3842, Australia
| | - AKM Azad
- Department of Mathematics and Statistics, College of Science, Imam Mohammad Ibn Saud Islamic University (IMSIU), Riyadh 11564, Saudi Arabia
| | - Bikash Kumar Paul
- Department of Information and Communication Technology, Mawlana Bhashani Science and Technology University, Tangail 1902, Bangladesh
- Department of Software Engineering, Daffodil International University (DIU), Dhaka 1342, Bangladesh
| | - Muhammad Ali Abdulllah Almoyad
- Department of Basic Medical Sciences, College of Applied Medical Sciences in Khamis Mushyt King Khalid University, Abha 61412, Saudi Arabia
| | - Sunil Aryal
- School of Information Technology, Deakin University, Waurn Ponds Campus, Geelong, VIC 3125, Australia
| | - Mohammad Ali Moni
- Artificial Intelligence & Data Science, School of Health and Rehabilitation Sciences, Faculty of Health and Behavioural Sciences, The University of Queensland, St Lucia, QLD 4072, Australia
| |
|
16
|
Dousti Mousavi N, Aldirawi H, Yang J. Categorical Data Analysis for High-Dimensional Sparse Gene Expression Data. BIOTECH 2023; 12:52. [PMID: 37606439 PMCID: PMC10443356 DOI: 10.3390/biotech12030052] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/01/2023] [Revised: 07/15/2023] [Accepted: 07/24/2023] [Indexed: 08/23/2023] Open
Abstract
Categorical data analysis becomes challenging when high-dimensional sparse covariates are involved, which is often the case for omics data. We introduce a statistical procedure based on multinomial logistic regression analysis for such scenarios, including variable screening, model selection, order selection for response categories, and variable selection. We perform our procedure on high-dimensional gene expression data with 801 patients, 2426 genes, and five types of cancerous tumors. As a result, we recommend three finalized models: one with 74 genes achieves extremely low cross-entropy loss and zero predictive error rate based on a five-fold cross-validation; and two other models with 31 and 4 genes, respectively, are recommended for prognostic multi-gene signatures.
Affiliation(s)
- Niloufar Dousti Mousavi
- Department of Mathematics, Statistics, and Computer Science, University of Illinois at Chicago, Chicago, IL 60607, USA
| | - Hani Aldirawi
- Department of Mathematics, California State University—San Bernardino, San Bernardino, CA 92407, USA
| | - Jie Yang
- Department of Mathematics, Statistics, and Computer Science, University of Illinois at Chicago, Chicago, IL 60607, USA
| |
|
17
|
Yuan S, Chen YC, Tsai CH, Chen HW, Shieh GS. Feature selection translates drug response predictors from cell lines to patients. Front Genet 2023; 14:1217414. [PMID: 37519889 PMCID: PMC10382684 DOI: 10.3389/fgene.2023.1217414] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/05/2023] [Accepted: 06/26/2023] [Indexed: 08/01/2023] Open
Abstract
Targeted therapies and chemotherapies are prevalent in cancer treatment. Identification of predictive markers to stratify cancer patients who will respond to these therapies remains challenging because patient drug response data are limited. As large amounts of drug response data have been generated by cell lines, methods to efficiently translate cell-line-trained predictors to human tumors will be useful in clinical practice. Here, we propose versatile feature selection procedures that can be combined with any classifier. For demonstration, we combined the feature selection procedures with a (linear) logit model and a (non-linear) K-nearest neighbor and trained these on cell lines to result in LogitDA and KNNDA, respectively. We show that LogitDA/KNNDA significantly outperforms existing methods, e.g., a logistic model and a deep learning method trained by thousands of genes, in prediction AUC (0.70-1.00 for seven of the ten drugs tested) and is interpretable. This may be due to the fact that sample sizes are often limited in the area of drug response prediction. We further derive a novel adjustment on the prediction cutoff for LogitDA to yield a prediction accuracy of 0.70-0.93 for seven drugs, including erlotinib and cetuximab, whose pathways relevant to anti-cancer therapies are also uncovered. These results indicate that our methods can efficiently translate cell-line-trained predictors into tumors.
Affiliation(s)
- Shinsheng Yuan
- Institute of Statistical Science, Academia Sinica, Taipei, Taiwan
- Bioinformatics Program, Taiwan International Graduate Program, Academia Sinica, Taipei, Taiwan
| | - Yen-Chou Chen
- Institute of Statistical Science, Academia Sinica, Taipei, Taiwan
| | - Chi-Hsuan Tsai
- Institute of Statistical Science, Academia Sinica, Taipei, Taiwan
| | - Huei-Wen Chen
- College of Medicine, Graduate Institute of Toxicology, National Taiwan University, Taipei, Taiwan
| | - Grace S. Shieh
- Institute of Statistical Science, Academia Sinica, Taipei, Taiwan
- Bioinformatics Program, Taiwan International Graduate Program, Academia Sinica, Taipei, Taiwan
- Genome and Systems Biology Degree Program, Academia Sinica and National Taiwan University, Taipei, Taiwan
- Data Science Degree Program, Academia Sinica and National Taiwan University, Taipei, Taiwan
| |
|
18
|
Guan C, Aflalo T, Kadlec K, Gámez de Leon J, Rosario ER, Bari A, Pouratian N, Andersen RA. Decoding and geometry of ten finger movements in human posterior parietal cortex and motor cortex. J Neural Eng 2023; 20:036020. [PMID: 37160127 PMCID: PMC10209510 DOI: 10.1088/1741-2552/acd3b1] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Received: 12/02/2022] [Revised: 03/24/2023] [Accepted: 05/09/2023] [Indexed: 05/11/2023]
Abstract
Objective. Enable neural control of individual prosthetic fingers for participants with upper-limb paralysis. Approach. Two tetraplegic participants were each implanted with a 96-channel array in the left posterior parietal cortex (PPC). One of the participants was additionally implanted with a 96-channel array near the hand knob of the left motor cortex (MC). Across tens of sessions, we recorded neural activity while the participants attempted to move individual fingers of the right hand. Offline, we classified attempted finger movements from neural firing rates using linear discriminant analysis with cross-validation. The participants then used the neural classifier online to control individual fingers of a brain-machine interface (BMI). Finally, we characterized the neural representational geometry during individual finger movements of both hands. Main Results. The two participants achieved 86% and 92% online accuracy during BMI control of the contralateral fingers (chance = 17%). Offline, a linear decoder achieved ten-finger decoding accuracies of 70% and 66% using respective PPC recordings and 75% using MC recordings (chance = 10%). In MC and in one PPC array, a factorized code linked corresponding finger movements of the contralateral and ipsilateral hands. Significance. This is the first study to decode both contralateral and ipsilateral finger movements from PPC. Online BMI control of contralateral fingers exceeded that of previous finger BMIs. PPC and MC signals can be used to control individual prosthetic fingers, which may contribute to a hand restoration strategy for people with tetraplegia.
Affiliation(s)
- Charles Guan
- California Institute of Technology, Pasadena, CA, United States of America
| | - Tyson Aflalo
- California Institute of Technology, Pasadena, CA, United States of America
- T&C Chen Brain-Machine Interface Center at Caltech, Pasadena, CA, United States of America
| | - Kelly Kadlec
- California Institute of Technology, Pasadena, CA, United States of America
| | | | - Emily R Rosario
- Casa Colina Hospital and Centers for Healthcare, Pomona, CA, United States of America
| | - Ausaf Bari
- David Geffen School of Medicine at UCLA, Los Angeles, CA, United States of America
| | - Nader Pouratian
- University of Texas Southwestern Medical Center, Dallas, TX, United States of America
| | - Richard A Andersen
- California Institute of Technology, Pasadena, CA, United States of America
- T&C Chen Brain-Machine Interface Center at Caltech, Pasadena, CA, United States of America
| |
|
19
|
Bajo-Morales J, Castillo-Secilla D, Herrera LJ, Caba O, Prados JC, Rojas I. Predicting COVID-19 Severity Integrating RNA-Seq Data Using Machine Learning Techniques. Curr Bioinform 2023; 18:221-231. [DOI: 10.2174/1574893617666220718110053] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 04/13/2022] [Revised: 05/21/2022] [Accepted: 05/31/2022] [Indexed: 11/22/2022]
Abstract
Abstract:
A fundamental challenge in the fight against COVID-19 is the development of reliable and accurate tools to predict disease progression in a patient. This information can be extremely useful in distinguishing hospitalized patients at higher risk of needing ICU care from patients with low severity. How SARS-CoV-2 infection will evolve is still unclear.
Methods:
A novel pipeline was developed that can integrate RNA-Seq data from different databases to obtain a genetic biomarker COVID-19 severity index using an artificial intelligence algorithm. Our pipeline ensures robustness through multiple cross-validation processes in different steps.
Results:
CD93, RPS24, PSCA, and CD300E were identified as a COVID-19 severity gene signature. Furthermore, using the obtained gene signature, an effective multi-class classifier capable of discriminating between control, outpatient, inpatient, and ICU COVID-19 patients was optimized, achieving an accuracy of 97.5%.
Conclusion:
In summary, during this research, a new intelligent pipeline was implemented with the goal of developing a specific gene signature that can detect the severity of patients suffering from COVID-19. Our approach to clinical decision support systems achieved excellent results, even when processing unseen samples. Our system can be of great clinical utility for planning, organizing and managing human and material resources, as well as for automatically classifying the severity of patients affected by COVID-19.
Affiliation(s)
- Javier Bajo-Morales
- Department of Computer Architecture and Technology, University of Granada, C.I.T.I.C., Periodista Rafael Gómez Montero, 2, 18014, Granada, Spain
- Deuser Tech Group, Calle Islandia, 182-NAV 24A, Córdoba, 14014, Córdoba, Spain
| | - Daniel Castillo-Secilla
- Department of Computer Architecture and Technology, University of Granada, C.I.T.I.C., Periodista Rafael Gómez Montero, 2, 18014, Granada, Spain
- Fujitsu Technology Solutions S.A, CoE Data Intelligence, Camino del Cerro de los Gamos, 1, Pozuelo de Alarcón, 28224, Madrid, Spain
| | - Luis Javier Herrera
- Department of Computer Architecture and Technology, University of Granada, C.I.T.I.C., Periodista Rafael Gómez Montero, 2, 18014, Granada, Spain
| | - Octavio Caba
- Nuclear Medicine Department, IMIBIC, University Hospital Reina Sofia, Menéndez Pidal Avenue, 14004, Córdoba, Spain
| | - Jose Carlos Prados
- Nuclear Medicine Department, IMIBIC, University Hospital Reina Sofia, Menéndez Pidal Avenue, 14004, Córdoba, Spain
| | - Ignacio Rojas
- Department of Computer Architecture and Technology, University of Granada, C.I.T.I.C., Periodista Rafael Gómez Montero, 2, 18014, Granada, Spain
| |
|
20
|
Wang J, Swartz CL, Huang K. Data-driven supply chain monitoring using canonical variate analysis. Comput Chem Eng 2023. [DOI: 10.1016/j.compchemeng.2023.108228] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 03/29/2023]
|
21
|
Song X, Li R, Wang K, Bai Y, Xiao Y, Wang YP. Joint Sparse Collaborative Regression on Imaging Genetics Study of Schizophrenia. IEEE/ACM Trans Comput Biol Bioinform 2023; 20:1137-1146. [PMID: 35503837 PMCID: PMC10321021 DOI: 10.1109/tcbb.2022.3172289] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/04/2023]
Abstract
The imaging genetics approach generates a large amount of high-dimensional, multi-modal data, providing complementary information for the comprehensive study of schizophrenia, a complex mental disease. At the same time, the variety of these data in structure, resolution, and format makes their integrative study a forbidding task. In this paper, we propose a novel model called Joint Sparse Collaborative Regression (JSCoReg), which can extract class-specific features from different health conditions/disease classes. We first evaluate the performance of feature selection in terms of the receiver operating characteristic (ROC) curve and the area under the ROC curve in a simulation experiment. We demonstrate that the JSCoReg model can achieve higher accuracy than similar models, including Joint Sparse Canonical Correlation Analysis and Sparse Collaborative Regression. We then apply the JSCoReg model to the analysis of a schizophrenia dataset collected from the Mind Clinical Imaging Consortium. JSCoReg enables us to better identify biomarkers associated with schizophrenia, which are verified to be both biologically and statistically significant.
Affiliation(s)
- Xueli Song
- School of Sciences, Chang’an University, Xi’an, 710064, China
| | - Rongpeng Li
- School of Sciences, Chang’an University, Xi’an, 710064, China
| | - Kaiming Wang
- School of Sciences, Chang’an University, Xi’an, 710064, China
| | - Yuntong Bai
- Biomedical Engineering Department, Tulane University, New Orleans, LA 70118, USA
| | - Yuzhu Xiao
- School of Sciences, Chang’an University, Xi’an, 710064, China
| | - Yu-ping Wang
- Biomedical Engineering Department, Tulane University, New Orleans, LA 70118, USA
| |
|
22
|
Huang EP, Pennello G, deSouza NM, Wang X, Buckler AJ, Kinahan PE, Barnhart HX, Delfino JG, Hall TJ, Raunig DL, Guimaraes AR, Obuchowski NA. Multiparametric Quantitative Imaging in Risk Prediction: Recommendations for Data Acquisition, Technical Performance Assessment, and Model Development and Validation. Acad Radiol 2023; 30:196-214. [PMID: 36273996 PMCID: PMC9825642 DOI: 10.1016/j.acra.2022.09.018] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Received: 06/02/2022] [Revised: 09/12/2022] [Accepted: 09/17/2022] [Indexed: 01/11/2023]
Abstract
Combinations of multiple quantitative imaging biomarkers (QIBs) are often able to predict the likelihood of an event of interest such as death or disease recurrence more effectively than single imaging measurements can alone. The development of such multiparametric quantitative imaging and evaluation of its fitness of use differs from the analogous processes for individual QIBs in several key aspects. A computational procedure to combine the QIB values into a model output must be specified. The output must also be reproducible and be shown to have reasonably strong ability to predict the risk of an event of interest. Attention must be paid to statistical issues not often encountered in the single QIB scenario, including overfitting and bias in the estimates of model performance. This is the fourth in a five-part series on statistical methodology for assessing the technical performance of multiparametric quantitative imaging. Considerations for data acquisition are discussed and recommendations from the literature on methodology to construct and evaluate QIB-based models for risk prediction are summarized. The findings in the literature upon which these recommendations are based are demonstrated through simulation studies. The concepts in this manuscript are applied to a real-life example involving prediction of major adverse cardiac events using automated plaque analysis.
Affiliation(s)
- Erich P Huang
- Division of Cancer Treatment and Diagnosis, National Cancer Institute, National Institutes of Health, 9609 Medical Center Drive, MSC 9735, Bethesda, MD 20892-9735.
| | - Gene Pennello
- Center for Devices and Radiological Health, US Food and Drug Administration
| | - Nandita M deSouza
- Division of Radiotherapy and Imaging, The Institute of Cancer Research (London, UK), European Imaging Biomarkers Alliance
| | - Xiaofeng Wang
- Department of Quantitative Health Sciences, Lerner Research Institute, Cleveland Clinic Foundation
| | | | | | | | - Jana G Delfino
- Center for Devices and Radiological Health, US Food and Drug Administration
| | - Timothy J Hall
- Department of Medical Physics, University of Wisconsin, Madison
| | - David L Raunig
- Data Science Institute, Statistical and Quantitative Sciences, Takeda
| | | | - Nancy A Obuchowski
- Department of Quantitative Health Sciences, Lerner Research Institute, Cleveland Clinic Foundation
| |
|
23
|
Huang EP, O'Connor JPB, McShane LM, Giger ML, Lambin P, Kinahan PE, Siegel EL, Shankar LK. Criteria for the translation of radiomics into clinically useful tests. Nat Rev Clin Oncol 2023; 20:69-82. [PMID: 36443594 PMCID: PMC9707172 DOI: 10.1038/s41571-022-00707-0] [Citation(s) in RCA: 109] [Impact Index Per Article: 54.5] [Accepted: 11/02/2022] [Indexed: 11/29/2022]
Abstract
Computer-extracted tumour characteristics have been incorporated into medical imaging computer-aided diagnosis (CAD) algorithms for decades. With the advent of radiomics, an extension of CAD involving high-throughput computer-extracted quantitative characterization of healthy or pathological structures and processes as captured by medical imaging, interest in such computer-extracted measurements has increased substantially. However, despite the thousands of radiomic studies, the number of settings in which radiomics has been successfully translated into a clinically useful tool or has obtained FDA clearance is comparatively small. This relative dearth might be attributable to factors such as the varying imaging and radiomic feature extraction protocols used from study to study, the numerous potential pitfalls in the analysis of radiomic data, and the lack of studies showing that acting upon a radiomic-based tool leads to a favourable benefit-risk balance for the patient. Several guidelines on specific aspects of radiomic data acquisition and analysis are already available, although a similar roadmap for the overall process of translating radiomics into tools that can be used in clinical care is needed. Herein, we provide 16 criteria for the effective execution of this process in the hopes that they will guide the development of more clinically useful radiomic tests in the future.
Affiliation(s)
- Erich P Huang
- Division of Cancer Treatment and Diagnosis, National Cancer Institute, National Institutes of Health, Rockville, MD, USA.
| | - James P B O'Connor
- Division of Radiotherapy and Imaging, Institute of Cancer Research, London, UK
| | - Lisa M McShane
- Division of Cancer Treatment and Diagnosis, National Cancer Institute, National Institutes of Health, Rockville, MD, USA
| | | | - Philippe Lambin
- Department of Precision Medicine, Maastricht University, Maastricht, Netherlands
| | - Paul E Kinahan
- Department of Radiology, University of Washington, Seattle, WA, USA
| | - Eliot L Siegel
- Department of Diagnostic Radiology, University of Maryland, Baltimore, MD, USA
| | - Lalitha K Shankar
- Division of Cancer Treatment and Diagnosis, National Cancer Institute, National Institutes of Health, Rockville, MD, USA
| |
|
24
|
Obuchowski NA, Huang E, deSouza NM, Raunig D, Delfino J, Buckler A, Hatt C, Wang X, Moskowitz C, Guimaraes A, Giger M, Hall TJ, Kinahan P, Pennello G. A Framework for Evaluating the Technical Performance of Multiparameter Quantitative Imaging Biomarkers (mp-QIBs). Acad Radiol 2023; 30:147-158. [PMID: 36180328 PMCID: PMC9825639 DOI: 10.1016/j.acra.2022.08.031] [Citation(s) in RCA: 8] [Impact Index Per Article: 4.0] [Received: 06/02/2022] [Revised: 08/19/2022] [Accepted: 08/26/2022] [Indexed: 01/11/2023]
Abstract
Multiparameter quantitative imaging incorporates anatomical, functional, and/or behavioral biomarkers to characterize tissue, detect disease, identify phenotypes, define longitudinal change, or predict outcome. Multiple imaging parameters are sometimes considered separately but ideally are evaluated collectively. Often, they are transformed as Likert interpretations, ignoring the correlations of quantitative properties that may result in better reproducibility or outcome prediction. In this paper we present three use cases of multiparameter quantitative imaging: i) multidimensional descriptor, ii) phenotype classification, and iii) risk prediction. A fourth application based on data-driven markers from radiomics is also presented. We describe the technical performance characteristics and their metrics common to all use cases, and provide a structure for the development, estimation, and testing of multiparameter quantitative imaging. This paper serves as an overview for a series of individual articles on the four applications, providing the statistical framework for multiparameter imaging applications in medicine.
Affiliation(s)
- Nancy A Obuchowski
- Quantitative Health Sciences/JJN3, Cleveland Clinic Foundation, 9500 Euclid Ave., Cleveland, OH 44195
| | - Erich Huang
- Biometric Research Program, Division of Cancer Treatment and Diagnosis, National Cancer Institute, National Institutes of Health, Rockville, Maryland
| | - Nandita M deSouza
- Division of Radiotherapy and Imaging, The Institute of Cancer Research and Royal Marsden NHS Foundation Trust, London, United Kingdom; European Imaging Biomarkers Alliance (EIBALL), European Society of Radiology (ESR), Vienna, Austria
| | - David Raunig
- Data Science Institute, Takeda, New Hope, PA
| | - Jana Delfino
- Center for Devices and Radiological Health, US Food and Drug Administration, Silver Spring, Maryland
| | | | - Charles Hatt
- Radiology, University of Michigan, Ann Arbor, MI
| | - Xiaofeng Wang
- Quantitative Health Sciences, Cleveland Clinic Foundation, Cleveland, OH
| | - Chaya Moskowitz
- Memorial Sloan Kettering Cancer Institute, NYC, NY
| | - Alexander Guimaraes
- Department of Radiology, Oregon Health and Science University, Portland, Oregon
| | - Maryellen Giger
- Department of Radiology, University of Chicago, Chicago, IL
| | - Timothy J Hall
- Department of Medical Physics, University of Wisconsin, Madison, WI
| | | | - Gene Pennello
- Division of Biostatistics, Center for Devices and Radiological Health, FDA, Silver Spring, Maryland
| |
|
25
|
Lai J, Wang X, Zhao K, Zheng S. Block-diagonal test for high-dimensional covariance matrices. TEST-SPAIN 2022. [DOI: 10.1007/s11749-022-00842-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/27/2022]
|
26
|
Novoselova N, Tom I. Hybrid Classification Model for Biomedical Data Analysis. INFORMATION TECHNOLOGY AND MANAGEMENT SCIENCE 2022. [DOI: 10.7250/itms-2022-0003] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/14/2022]
Abstract
The paper describes a method for constructing a hybrid classification model that allows combining several sources of biological information in order to build a classifier to identify subtypes of complex diseases. The distinctive feature of the method is its adaptive nature, i.e. the ability to build efficient classifiers regardless of data types, as well as a multi-criteria approach to evaluate the effectiveness of a classification. The testing results on real biomedical data showed the advantages of the proposed hybrid model in comparison with individual classifiers.
Affiliation(s)
| | - Igor Tom
- United Institute of Informatics Problems, Minsk, Belarus
| |
|
27
|
Mesa-Rodríguez A, Gonzalez A, Estevez-Rams E, Valdes-Sosa PA. Cancer Segmentation by Entropic Analysis of Ordered Gene Expression Profiles. ENTROPY (BASEL, SWITZERLAND) 2022; 24:1744. [PMID: 36554151 PMCID: PMC9777913 DOI: 10.3390/e24121744] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/22/2022] [Revised: 11/24/2022] [Accepted: 11/24/2022] [Indexed: 06/17/2023]
Abstract
The availability of massive gene expression data poses challenges in how to curate, process, and extract useful information. Here, we describe the use of entropic measures as discriminating criteria in cancer using the whole data set of gene expression levels. These methods were applied to classify samples as tumor or normal for 13 types of tumors with a high success ratio. Using gene expression ordered by pathways results in complexity-entropy diagrams. The map allows clustering of the tumor and normal samples, with a high success rate for nine of the thirteen cancer types studied. Further analysis using information distance also shows good discriminating behavior and, more importantly, allows discrimination between cancer types. Together, our results allow the classification of tissues without the need to identify relevant genes or impose a particular cancer model. The procedure can be extended to classification problems beyond the reported results.
Affiliation(s)
- Ania Mesa-Rodríguez
- The Clinical Hospital of Chengdu Brain Science Institute, University of Electronic Sciences and Technology of China, Chengdu 610054, China
- Facultad de Matemática, Universidad de La Habana, San Lazaro y L, La Habana 10400, Cuba
| | - Augusto Gonzalez
- The Clinical Hospital of Chengdu Brain Science Institute, University of Electronic Sciences and Technology of China, Chengdu 610054, China
- Instituto de Cibernética, Matemática y Física, La Habana 10400, Cuba
| | - Ernesto Estevez-Rams
- Facultad de Física, Instituto de Ciencias y Tecnología de Materiales (IMRE), Universidad de La Habana, San Lazaro y L, La Habana 10400, Cuba
| | - Pedro A. Valdes-Sosa
- The Clinical Hospital of Chengdu Brain Science Institute, University of Electronic Sciences and Technology of China, Chengdu 610054, China
- Centro de Neurociencias, BioCubaFarma, La Habana 10400, Cuba
| |
|
28
|
Anzarmou Y, Mkhadri A, Oualkacha K. Sparse overlapped linear discriminant analysis. TEST-SPAIN 2022. [DOI: 10.1007/s11749-022-00839-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/30/2022]
|
29
|
Independence index sufficient variable screening for categorical responses. Comput Stat Data Anal 2022. [DOI: 10.1016/j.csda.2022.107530] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/21/2022]
|
30
|
Survival Analysis with High-Dimensional Omics Data Using a Threshold Gradient Descent Regularization-Based Neural Network Approach. Genes (Basel) 2022; 13:1674. [PMID: 36140842 PMCID: PMC9498566 DOI: 10.3390/genes13091674] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/18/2022] [Revised: 09/13/2022] [Accepted: 09/16/2022] [Indexed: 11/17/2022]
Abstract
Analysis of data with a censored survival response and high-dimensional omics measurements is now common. Most of the existing analyses are based on specific (semi)parametric models, in particular the Cox model. Such analyses may be limited by not having sufficient flexibility, for example, in accommodating nonlinearity. For categorical and continuous responses, neural networks (NNs) have provided a highly competitive alternative. Comparatively, NNs for censored survival data remain limited. Omics measurements are usually high-dimensional, and only a small subset is expected to be survival-associated. As such, regularized estimation and selection are needed. In the existing NN studies, this is usually achieved via penalization. In this article, we propose adopting the threshold gradient descent regularization (TGDR) technique, which has competitive performance (for example, when compared to penalization) and unique advantages in regression analysis, but has not been adopted with NNs. The TGDR-based NN has a highly sensible formulation and an architecture different from the unregularized and penalization-based ones. Simulations show its satisfactory performance. Its practical effectiveness is further established via the analysis of two cancer omics datasets. Overall, this study can provide a practical and useful new way in the NN paradigm for survival analysis with high-dimensional omics measurements.
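The thresholding idea behind TGDR can be illustrated with a minimal sketch. The abstract does not give the neural-network formulation, so the example below is an assumption-laden stand-in: it applies threshold gradient descent, as it is commonly described for regression, to a plain linear least-squares model. The function name `tgdr_linear`, the parameters `tau`, `step` and `n_iter`, and the toy data are all hypothetical.

```python
def tgdr_linear(X, y, tau=0.5, step=0.5, n_iter=50):
    """Threshold gradient descent for a linear least-squares model (illustrative only).

    At each iteration, only coefficients whose gradient magnitude is at least
    tau times the largest gradient magnitude are updated; the rest stay put,
    which keeps weakly associated coefficients at zero.
    """
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    for _ in range(n_iter):
        # Residuals of the current linear fit.
        resid = [y[i] - sum(X[i][j] * beta[j] for j in range(p)) for i in range(n)]
        # Gradient of the loss (1 / 2n) * sum(resid ** 2) with respect to beta.
        grad = [-sum(X[i][j] * resid[i] for i in range(n)) / n for j in range(p)]
        largest = max(abs(g) for g in grad)
        for j in range(p):
            # Thresholding step: skip coefficients with comparatively small gradients.
            if abs(grad[j]) >= tau * largest:
                beta[j] -= step * grad[j]
    return beta


# Toy data (hypothetical): the response depends only on the first feature.
X = [[1.0, 1.0], [1.0, -1.0], [-1.0, 1.0], [-1.0, -1.0]]
y = [2.0, 2.0, -2.0, -2.0]
beta = tgdr_linear(X, y)
```

In this toy setting the first coefficient approaches 2 while the second is never moved away from zero. Setting `tau` to 0 would recover ordinary gradient descent, and values near 1 update only the coefficients with the largest gradients.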
|
31
|
Fazzari MJ, Guerra MM, Salmon J, Kim MY. Adverse pregnancy outcomes in women with systemic lupus erythematosus: can we improve predictions with machine learning? Lupus Sci Med 2022; 9:e000769. [PMID: 36104120 PMCID: PMC9476149 DOI: 10.1136/lupus-2022-000769] [Citation(s) in RCA: 15] [Impact Index Per Article: 5.0] [Received: 06/30/2022] [Accepted: 09/01/2022] [Indexed: 11/03/2022]
Abstract
OBJECTIVES Nearly 20% of pregnancies in patients with SLE result in an adverse pregnancy outcome (APO). We previously developed an APO prediction model using logistic regression and data from Predictors of pRegnancy Outcome: bioMarkers In Antiphospholipid Antibody Syndrome and Systemic Lupus Erythematosus (PROMISSE), a large multicentre study of pregnant women with mild/moderate SLE and/or antiphospholipid antibodies. Our goal was to determine whether machine learning (ML) approaches improve APO prediction and identify other risk factors. METHODS The PROMISSE data included 41 predictors from 385 subjects; 18.4% had APO (preterm delivery due to placental insufficiency/pre-eclampsia, fetal/neonatal death, fetal growth restriction). Logistic regression with stepwise selection (LR-S), least absolute shrinkage and selection operator (LASSO), random forest (RF), neural network (NN), support vector machines (SVM-RBF), gradient boosting (GB) and SuperLearner (SL) were compared by cross-validated area under the ROC curve (AUC) and calibration. RESULTS Previously identified APO risk factors, antihypertensive medication use, low platelets, SLE disease activity and lupus anticoagulant (LAC), were confirmed as important with each algorithm. LASSO additionally revealed potential interactions between LAC and anticardiolipin IgG, among others. SL performed the best (AUC=0.78), but was statistically indistinguishable from LASSO, SVM-RBF and RF (AUC=0.77 for all). LR-S, NN and GB had worse AUC (0.71-0.74) and calibration scores. CONCLUSIONS We predicted APO with reasonable accuracy using variables routinely assessed prior to the 12th week of pregnancy. LASSO and some ML methods performed better than a standard logistic regression approach. Substantial improvement in APO prediction will likely be realised, not with increasingly complex algorithms but by the discovery of new biomarkers and APO risk factors.
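The comparison metric named in the METHODS, cross-validated AUC, can be sketched as follows. This is not the study's pipeline: the helper names `roc_auc` and `cross_validated_auc`, the fold layout and the toy scores are assumptions for illustration, and the scores are presumed to be out-of-fold predictions from a model fitted elsewhere.

```python
def roc_auc(labels, scores):
    """AUC as the probability that a random positive outranks a random negative.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for lab, s in zip(labels, scores) if lab == 1]
    neg = [s for lab, s in zip(labels, scores) if lab == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def cross_validated_auc(labels, scores, folds):
    """Average the AUC over held-out folds, each given as a list of sample indices."""
    aucs = [
        roc_auc([labels[i] for i in idx], [scores[i] for i in idx])
        for idx in folds
    ]
    return sum(aucs) / len(aucs)


# Hypothetical out-of-fold predictions for eight subjects, split into two folds.
labels = [1, 0, 1, 0, 1, 0, 0, 1]
scores = [0.9, 0.2, 0.6, 0.7, 0.8, 0.1, 0.4, 0.3]
folds = [[0, 1, 2, 3], [4, 5, 6, 7]]
pooled_auc = roc_auc(labels, scores)
mean_fold_auc = cross_validated_auc(labels, scores, folds)
```

Averaging per-fold AUCs and pooling all out-of-fold predictions can give different values, as they do here, so a comparison across algorithms should use the same convention for every model.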
Affiliation(s)
- Melissa J Fazzari
- Epidemiology and Population Health, Albert Einstein College of Medicine, Bronx, New York, USA
| | - Marta M Guerra
- Rheumatology, Hospital for Special Surgery, New York, New York, USA
| | - Jane Salmon
- Rheumatology, Hospital for Special Surgery, New York, New York, USA
| | - Mimi Y Kim
- Epidemiology and Population Health, Albert Einstein College of Medicine, Bronx, New York, USA
| |
|
32
|
Irigoien I, Cormand B, Soler-Artigas M, Sanchez-Mora C, Ramos-Quiroga JA, Arenas C. New Distance-Based approach for Genome-Wide Association Studies. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2022; 19:2938-2949. [PMID: 34181548 DOI: 10.1109/tcbb.2021.3092812] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/13/2023]
Abstract
With the rise of genome-wide association studies (GWAS), the analysis of typical GWAS data sets with thousands of single-nucleotide polymorphisms (SNPs) has become crucial in biomedical research. Here, we propose a new method to identify SNPs related to disease in case-control studies. The method, based on genetic distances between individuals, takes into account possible population substructure and avoids the issues of multiple testing. The method provides two ordered lists of SNPs: one with SNPs whose minor alleles can be considered risk alleles for the disease, and another with SNPs whose minor alleles can be considered protective. These two lists help the researcher decide where to focus attention in a first stage.
|
33
|
Machine Learning Algorithms for Classification of MALDI-TOF MS Spectra from Phylogenetically Closely Related Species Brucella melitensis, Brucella abortus and Brucella suis. Microorganisms 2022; 10:1658. [PMID: 36014076 PMCID: PMC9416640 DOI: 10.3390/microorganisms10081658] [Citation(s) in RCA: 10] [Impact Index Per Article: 3.3] [Received: 06/30/2022] [Revised: 07/29/2022] [Accepted: 08/08/2022] [Indexed: 11/25/2022]
Abstract
(1) Background: MALDI-TOF mass spectrometry (MS) is the gold standard for microbial fingerprinting; however, for phylogenetically closely related species, the resolution power drops to the genus level. In this study, we analyzed MALDI-TOF spectra from 44 strains of B. melitensis, B. suis and B. abortus to identify the optimal classification method among popular supervised and unsupervised machine learning (ML) algorithms. (2) Methods: A consensus feature selection strategy was applied to pinpoint, from among the 500 MS features, those that yielded the best ML model and that may play a role in species differentiation. Unsupervised k-means and hierarchical agglomerative clustering were evaluated using the silhouette coefficient, while the supervised classifiers Random Forest, Support Vector Machine, Neural Network, and Multinomial Logistic Regression were fine-tuned using nested k-fold cross validation (CV) with a feature reduction step between the two CV loops. (3) Results: Sixteen differentially expressed peaks were identified and used to feed ML classifiers. Unsupervised and optimized supervised models displayed excellent predictive performances with 100% accuracy. The suitability of the consensus feature selection strategy for learning system accuracy was shown. (4) Conclusion: A meaningful ML approach is introduced here to enhance Brucella spp. classification using MALDI-TOF MS data.
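The silhouette coefficient used to evaluate the unsupervised clusterings can be sketched in a few lines. This is a generic illustration rather than the study's code: the function name `silhouette_coefficient`, the Euclidean distance choice and the one-dimensional toy points are assumptions.

```python
import math


def silhouette_coefficient(points, labels):
    """Mean silhouette value over all points, using Euclidean distance.

    For each point, a is the mean distance to the other members of its own
    cluster and b is the smallest mean distance to any other cluster; the
    silhouette is (b - a) / max(a, b). Singleton clusters contribute 0.
    """
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")
    total = 0.0
    for i, lab in enumerate(labels):
        own = [j for j in clusters[lab] if j != i]
        if not own:
            continue  # singleton cluster: silhouette defined as 0
        a = sum(math.dist(points[i], points[j]) for j in own) / len(own)
        b = min(
            sum(math.dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        denom = max(a, b)
        if denom > 0:
            total += (b - a) / denom
    return total / len(points)


# Hypothetical one-dimensional "spectra" forming two well-separated groups.
points = [(0.0,), (1.0,), (10.0,), (11.0,)]
good = silhouette_coefficient(points, [0, 0, 1, 1])
bad = silhouette_coefficient(points, [0, 1, 0, 1])
```

Values close to 1 indicate compact, well-separated clusters, while values near or below 0 suggest that points sit between clusters or are assigned to the wrong one, as in the deliberately poor labeling above.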
|
34
|
Bayesian nonnegative matrix factorization in an incremental manner for data representation. APPL INTELL 2022. [DOI: 10.1007/s10489-022-03522-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 02/01/2023]
|
35
|
Discovering Biomarkers for Non-Alcoholic Steatohepatitis Patients with and without Hepatocellular Carcinoma Using Fecal Metaproteomics. Int J Mol Sci 2022; 23:8841. [PMID: 36012106 PMCID: PMC9408600 DOI: 10.3390/ijms23168841] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 07/01/2022] [Revised: 08/01/2022] [Accepted: 08/03/2022] [Indexed: 11/18/2022]
Abstract
High-calorie diets lead to hepatic steatosis and to the development of non-alcoholic fatty liver disease (NAFLD), which can evolve over many years into the inflammatory form of non-alcoholic steatohepatitis (NASH), posing a risk for the development of hepatocellular carcinoma (HCC). Due to diet and liver alteration, the axis between liver and gut is disturbed, resulting in gut microbiome alterations. Consequently, detecting these gut microbiome alterations represents a promising strategy for early NASH and HCC detection. We analyzed medical parameters and the fecal metaproteome of 19 healthy controls, 32 NASH patients, and 29 HCC patients, targeting the discovery of diagnostic biomarkers. Here, NASH and HCC resulted in increased inflammation status and shifts within the composition of the gut microbiome. An increased abundance of kielin/chordin, E3 ubiquitin ligase, and nucleophosmin 1 represented valuable fecal biomarkers, indicating disease-related changes in the liver. Although a single biomarker failed to separate NASH and HCC, machine learning-based classification algorithms provided an 86% accuracy in distinguishing between controls, NASH, and HCC. Fecal metaproteomics enables early detection of NASH and HCC by providing single biomarkers and machine learning-based metaprotein panels.
|
36
|
Huang HC, Wu Y, Yang Q, Qin LX. PRECISION.array: An R Package for Benchmarking microRNA Array Data Normalization in the Context of Sample Classification. Front Genet 2022; 13:838679. [PMID: 35938023 PMCID: PMC9354575 DOI: 10.3389/fgene.2022.838679] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/18/2021] [Accepted: 06/10/2022] [Indexed: 11/13/2022]
Abstract
We present a new R package, PRECISION.array, for assessing the performance of data normalization methods in connection with methods for sample classification. It includes two microRNA microarray datasets for the same set of tumor samples, a re-sampling-based algorithm for simulating additional paired datasets under various designs of sample-to-array assignment and levels of signal-to-noise ratio, and a collection of numerical and graphical tools for method performance assessment. The package allows users to specify their own methods for normalization and classification, in addition to implementing three methods for training data normalization, seven methods for test data normalization, seven methods for classifier training, and two methods for classifier validation. It enables an objective and systematic evaluation of the operating characteristics of normalization and classification methods in microRNA microarrays. To our knowledge, this is the first such tool available. The R package can be downloaded freely at https://github.com/LXQin/PRECISION.array.
|
37
|
Hacking SM, Yakirevich E, Wang Y. From Immunohistochemistry to New Digital Ecosystems: A State-of-the-Art Biomarker Review for Precision Breast Cancer Medicine. Cancers (Basel) 2022; 14:3469. [PMID: 35884530 PMCID: PMC9315712 DOI: 10.3390/cancers14143469] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.7] [Received: 06/08/2022] [Revised: 07/13/2022] [Accepted: 07/15/2022] [Indexed: 02/04/2023]
Abstract
Breast cancers represent complex ecosystem-like networks of malignant cells and their associated microenvironment. Estrogen receptor (ER), progesterone receptor (PR), and human epidermal growth factor receptor 2 (HER2) are biomarkers ubiquitous to clinical practice in evaluating prognosis and predicting response to therapy. Recent feats in breast cancer have led to a new digital era, and advanced clinical trials have resulted in a growing number of personalized therapies with corresponding biomarkers. In this state-of-the-art review, we included the latest 10-year updated recommendations for ER, PR, and HER2, along with the most salient information on tumor-infiltrating lymphocytes (TILs), Ki-67, PD-L1, and several prognostic/predictive biomarkers at genomic, transcriptomic, and proteomic levels recently developed for selection and optimization of breast cancer treatment. Looking forward, the multi-omic landscape of the tumor ecosystem could be integrated with computational findings from whole slide images and radiomics in predictive machine learning (ML) models. These are new digital ecosystems on the road to precision breast cancer medicine.
Affiliation(s)
| | | | - Yihong Wang
- Department of Pathology and Laboratory Medicine, Warren Alpert Medical School, Brown University, Rhode Island Hospital and Lifespan Medical Center, 593 Eddy Street, Providence, RI 02903, USA
| |
|
38
|
Liu J, Xu Y, Liu S, Yu S, Yu Z, Low SS. Application and Progress of Chemometrics in Voltammetric Biosensing. BIOSENSORS 2022; 12:494. [PMID: 35884297 PMCID: PMC9313226 DOI: 10.3390/bios12070494] [Citation(s) in RCA: 18] [Impact Index Per Article: 6.0] [Received: 06/20/2022] [Revised: 07/03/2022] [Accepted: 07/06/2022] [Indexed: 12/14/2022]
Abstract
The voltammetric electrochemical sensing method combined with biosensors and multi-sensor systems can quickly, accurately, and reliably analyze the concentration of the main analyte and the overall characteristics of complex samples. In addition, the high-dimensional voltammogram contains the rich electrochemical features of the detected substances. Chemometric methods are important tools for mining valuable information from voltammetric data. Chemometrics can aid voltammetric biosensor calibration and multi-element detection in complex matrix conditions. This review introduces the voltammetric analysis techniques commonly used in research on voltammetric biosensors and electronic tongues. Then, the research on optimizing voltammetric biosensor results using classical chemometrics is summarized. The incorporation of machine learning and deep learning has also brought new opportunities to further improve the detection performance of biosensors in complex samples. Finally, smartphones connected with miniaturized voltammetric biosensors and chemometric methods provide a high-quality portable analysis platform that shows great potential in point-of-care testing.
Affiliation(s)
- Jingjing Liu
- College of Automation Engineering, Northeast Electric Power University, Jilin 132012, China
- Correspondence: (J.L.); (S.S.L.)
| | - Yifei Xu
- College of Automation Engineering, Northeast Electric Power University, Jilin 132012, China
| | - Shikun Liu
- College of Automation Engineering, Northeast Electric Power University, Jilin 132012, China
| | - Shixin Yu
- College of Automation Engineering, Northeast Electric Power University, Jilin 132012, China
| | - Zhirun Yu
- College of Law, The Australian National University, Canberra 2600, Australia
| | - Sze Shin Low
- Research Centre of Life Science and HealthCare, China Beacons Institute, University of Nottingham Ningbo China, 199 Taikang East Road, Ningbo 315100, China
- Correspondence: (J.L.); (S.S.L.)
| |
|
39
|
Mathai AM, Provost SB. On the singular gamma, Wishart, and beta matrix‐variate density functions. CAN J STAT 2022. [DOI: 10.1002/cjs.11710] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/08/2022]
Affiliation(s)
- Arak M. Mathai
- Department of Mathematics and Statistics McGill University Montreal Quebec Canada
| | - Serge B. Provost
- Department of Statistical and Actuarial Sciences The University of Western Ontario London Ontario Canada
| |
|
40
|
Bhutia S, Patra B, Ray M. A hybrid approach for cancer classification based on squirrel search. JOURNAL OF INFORMATION & OPTIMIZATION SCIENCES 2022. [DOI: 10.1080/02522667.2022.2091095] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/05/2022]
Affiliation(s)
- Santosini Bhutia
- Department of Computer Science & Engineering, Siksha ‘O’ Anusandhan (Deemed to be University), Bhubaneswar, Odisha, India
| | - Bichitrananda Patra
- Department of Computer Application, Siksha ‘O’ Anusandhan (Deemed to be University), Bhubaneswar, Odisha, India
| | - Mitrabinda Ray
- Department of Computer Application, Siksha ‘O’ Anusandhan (Deemed to be University), Bhubaneswar, Odisha, India
| |
|
41
|
Recognition of cancer mediating biomarkers using rough approximations enabled intuitionistic fuzzy soft sets based similarity measure. Appl Soft Comput 2022. [DOI: 10.1016/j.asoc.2022.109052] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 01/12/2023]
|
42
|
Rahman T, Huang HE, Li Y, Tai AS, Hseih WP, McClung CA, Tseng G. A sparse negative binomial classifier with covariate adjustment for RNA-seq data. Ann Appl Stat 2022. [DOI: 10.1214/21-aoas1532] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/19/2022]
Affiliation(s)
- Tanbin Rahman
- Department of Biostatistics, University of Pittsburgh
| | - Hsin-En Huang
- Institute of Statistics, National Tsing Hua University
| | - Yujia Li
- Department of Biostatistics, University of Pittsburgh
| | - An-Shun Tai
- Institute of Statistics, National Tsing Hua University
| | | | | | - George Tseng
- Department of Biostatistics, University of Pittsburgh
| |
|
43
|
Reassessment of Reliability and Reproducibility for Triple-Negative Breast Cancer Subtyping. Cancers (Basel) 2022; 14:2571. [PMID: 35681552 PMCID: PMC9179838 DOI: 10.3390/cancers14112571] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/07/2022] [Revised: 05/05/2022] [Accepted: 05/06/2022] [Indexed: 11/17/2022]
Abstract
Simple Summary: Triple-negative breast cancer (TNBC) is a heterogeneous disease. A proper classification system is needed to develop targetable biomarkers and guide personalized treatment in clinical practice. However, there has been no consensus on the molecular subtypes of TNBC, probably due to discrepancies in the technical and computational methods chosen by different research groups. In this paper, we reassessed each major step of TNBC subtyping and provided suggestions that promote rational workflow design and ensure reliable and reproducible results in future studies. We applied a recommended pipeline to the existing data, validated established TNBC subtypes with a larger sample size, and revealed two intermediate subtypes with prognostic significance. This work provides perspectives on issues and limitations regarding TNBC subtyping, indicating promising directions for developing targeted therapy based on the molecular characteristics of each TNBC subtype.
Abstract: Triple-negative breast cancer (TNBC) is a heterogeneous disease with diverse, often poor prognoses and treatment responses. In order to identify targetable biomarkers and guide personalized care, scientists have developed multiple molecular classification systems for TNBC based on transcriptomic profiling. However, there is no consensus on the molecular subtypes of TNBC, likely due to discrepancies in the technical and computational methods used by different research groups. Here, we reassessed the major steps of TNBC subtyping, validated the reproducibility of established TNBC subtypes, and identified two more subtypes with a larger sample size. By comparing results from different workflows, we demonstrated the limitations of formalin-fixed, paraffin-embedded samples, as well as of batch effect removal across microarray platforms. We also refined the usage of computational tools for TNBC subtyping. Furthermore, we integrated high-quality multi-institutional TNBC datasets (discovery set: n = 457; validation set: n = 165). Performing unsupervised clustering on the discovery and validation sets independently, we validated four previously discovered subtypes: luminal androgen receptor, mesenchymal, immunomodulatory, and basal-like immunosuppressed. Additionally, we identified two potential intermediate states of TNBC tumors based on their resemblance to more than one well-characterized subtype. In summary, we addressed the issues and limitations of previous TNBC subtyping through comprehensive analyses. Our results promote the rational design of future subtyping studies and provide new insights into TNBC patient stratification.
|
44
|
Hébert F, Causeur D, Emily M. Adaptive Handling of Dependence in High-Dimensional Regression Modeling. J Comput Graph Stat 2022. [DOI: 10.1080/10618600.2022.2076687] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/18/2022]
Affiliation(s)
- Florian Hébert
- Institut Agro, Univ. Rennes, CNRS, IRMAR, 35000 Rennes, France
| | - David Causeur
- Institut Agro, Univ. Rennes, CNRS, IRMAR, 35000 Rennes, France
| | - Mathieu Emily
- Institut Agro, Univ. Rennes, CNRS, IRMAR, 35000 Rennes, France
| |
|
45
|
Chu X, Jiang M, Liu ZJ. Biomarker interaction selection and disease detection based on multivariate gain ratio. BMC Bioinformatics 2022; 23:176. [PMID: 35550010 PMCID: PMC9103137 DOI: 10.1186/s12859-022-04699-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/11/2021] [Accepted: 04/14/2022] [Indexed: 11/30/2022]
Abstract
Background Disease detection is an important aspect of biotherapy. With the development of biotechnology and computer technology, there are many methods to detect disease based on single biomarker. However, biomarker does not influence disease alone in some cases. It’s the interaction between biomarkers that determines disease status. The existing influence measure I-score is used to evaluate the importance of interaction in determining disease status, but there is a deviation about the number of variables in interaction when applying I-score. To solve the problem, we propose a new influence measure Multivariate Gain Ratio (MGR) based on Gain Ratio (GR) of single-variate, which provides us with multivariate combination called interaction. Results We propose a preprocessing verification algorithm based on partial predictor variables to select an appropriate preprocessing method. In this paper, an algorithm for selecting key interactions of biomarkers and applying key interactions to construct a disease detection model is provided. MGR is more credible than I-score in the case of interaction containing small number of variables. Our method behaves better with average accuracy 93.13% than I-score of 91.73% in Breast Cancer Wisconsin (Diagnostic) Dataset. Compared to the classification results 89.80% based on all predictor variables, MGR identifies the true main biomarkers and realizes the dimension reduction. In Leukemia Dataset, the experiment results show the effectiveness of MGR with the accuracy of 97.32% compared to I-score with accuracy 89.11%. The results can be explained by the nature of MGR and I-score mentioned above because every key interaction contains a small number of variables in Leukemia Dataset. Conclusions MGR is effective for selecting important biomarkers and biomarker interactions even in high-dimension feature space in which the interaction could contain more than two biomarkers. The prediction ability of interactions selected by MGR is better than I-score in the case of interaction containing small number of variables. MGR is generally applicable to various types of biomarker datasets including cell nuclei, gene, SNPs and protein datasets.
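As a rough illustration of the quantity this abstract builds on, the sketch below computes the standard single-variable gain ratio (information gain divided by split information) for discrete data. The abstract does not give the MGR formula, so the `joint_gain_ratio` helper, which simply treats a combination of variables as one joint variable, is an assumption made for illustration and should not be read as the authors' definition. All function names are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Shannon entropy (in bits) of a sequence of discrete values."""
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())


def gain_ratio(feature, labels):
    """Standard gain ratio: information gain divided by split information."""
    n = len(labels)
    groups = {}
    for x, y in zip(feature, labels):
        groups.setdefault(x, []).append(y)
    # Conditional entropy of the labels given the feature value.
    conditional = sum(len(g) / n * entropy(g) for g in groups.values())
    info_gain = entropy(labels) - conditional
    split_info = entropy(feature)
    # A constant feature has zero split information; report zero gain ratio.
    return info_gain / split_info if split_info > 0 else 0.0


def joint_gain_ratio(columns, labels):
    """ASSUMPTION: score a variable combination by treating the tuple of
    values as a single joint variable. Not the paper's MGR definition."""
    joint = list(zip(*columns))
    return gain_ratio(joint, labels)


# Toy XOR-style data: neither variable is informative alone, the pair is.
a = [0, 0, 1, 1]
b = [0, 1, 0, 1]
y = [0, 1, 1, 0]
print(gain_ratio(a, y))             # 0.0
print(joint_gain_ratio([a, b], y))  # 0.5
```

The toy data mirrors the abstract's point that disease status may depend on an interaction rather than on any single biomarker: each variable scores zero on its own, while the pair scores above zero.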
Affiliation(s)
- Xiao Chu: Academy of Mathematics and Systems Science Chinese Academy of Sciences, University of Chinese Academy of Sciences, Beijing, China
- Mao Jiang: Academy of Mathematics and Systems Science Chinese Academy of Sciences, University of Chinese Academy of Sciences, Beijing, China
- Zhuo-Jun Liu: Academy of Mathematics and Systems Science Chinese Academy of Sciences, Beijing, China
|
46
|
Yu B, Dai W, Pang L, Sang Q, Li F, Yu J, Feng H, Li J, Hou J, Yan C, Su L, Zhu Z, Li YY, Liu B. The dynamic alteration of transcriptional regulation by crucial TFs during tumorigenesis of gastric cancer. Mol Med 2022; 28:41. [PMID: 35421923 PMCID: PMC9008954 DOI: 10.1186/s10020-022-00468-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/12/2021] [Accepted: 04/04/2022] [Indexed: 11/26/2022]
Abstract
Background The mechanisms of gastric cancer (GC) initiation and progression are complicated, at least partly owing to the dynamic changes of gene regulation during carcinogenesis. Thus, investigating the changes in regulatory networks can improve the understanding of cancer development and provide novel insights into the molecular mechanisms of cancer. Methods Differential co-expression analysis (DCEA), differential gene regulation network (GRN) modeling and differential regulation analysis (DRA) were integrated to detect differential transcriptional regulation events between gastric normal mucosa and cancer samples based on the GSE54129 dataset. Cytological experiments and IHC staining assays were used to validate the dynamic changes of CREB1-regulated targets in different stages. Results A total of 1955 differentially regulated genes (DRGs) were identified and prioritized in a quantitative way. Among the top 1% DRGs, 14 out of 19 genes have been reported to be GC relevant. The four transcription factors (TFs) among the top 1% DRGs, including CREB1, BPTF, GATA6 and CEBPA, were regarded as crucial TFs relevant to GC progression. The differentially regulated links (DRLs) around the four crucial TFs were then prioritized to generate testable hypotheses on the differential regulation mechanisms of gastric carcinogenesis. To validate the dynamic alterations of gene regulation patterns of crucial TFs during GC progression, we took CREB1 as an example to screen its differentially regulated targets by using cytological and IHC staining assays. Eventually, TCEAL2 and MBNL1 were shown to be differentially regulated by CREB1 during tumorigenesis of gastric cancer. Conclusions By combining differential networking information with verification by molecular cell experiments, testable hypotheses on the regulation mechanisms of GC around the core TFs and their top ranked DRLs were generated. Since TCEAL2 and MBNL1 have been reported to be potential therapeutic targets in SCLC and breast cancer respectively, their translational value in GC is worthy of further investigation. Supplementary Information The online version contains supplementary material available at 10.1186/s10020-022-00468-7.
|
47
|
Bajo-Morales J, Prieto-Prieto JC, Herrera LJ, Rojas I, Castillo-Secilla D. COVID-19 Biomarkers Recognition & Classification Using Intelligent Systems. Curr Bioinform 2022. [DOI: 10.2174/1574893617666220328125029] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/08/2023]
Abstract
Background:
SARS-CoV-2 has paralyzed mankind due to its high transmissibility and its associated mortality, causing millions of infections and deaths worldwide. The search for gene expression biomarkers from the host transcriptional response to infection may help understand the underlying mechanisms by which the virus causes COVID-19. This research proposes a smart methodology integrating different RNA-Seq datasets from SARS-CoV-2, other respiratory diseases, and healthy patients.
Methods:
The proposed pipeline exploits the functionality of the ‘KnowSeq’ R/Bioc package, integrating different data sources and attaining a significantly larger gene expression dataset, thus endowing the results with higher statistical significance and robustness in comparison with previous studies in the literature. A detailed preprocessing step was carried out to homogenize the samples and build a clinical decision system for SARS-CoV-2. It uses machine learning techniques such as a feature selection algorithm and a supervised classification system. This clinical decision system uses the most differentially expressed genes among different diseases (including SARS-CoV-2) to develop a four-class classifier.
Results:
The multiclass classifier designed can discern SARS-CoV-2 samples, reaching an accuracy equal to 91.5%, a mean F1-Score equal to 88.5%, and a SARS-CoV-2 AUC equal to 94% by using only 15 genes as predictors. A biological interpretation of the gene signature extracted reveals relations with processes involved in viral responses.
Conclusion:
This work proposes a COVID-19 gene signature composed of 15 genes, selected after applying the ‘minimum Redundancy Maximum Relevance’ feature selection algorithm. The integration of several RNA-Seq datasets was a success, allowing for a considerably larger number of samples and therefore providing greater statistical significance to the results than previous studies. Biological interpretation of the selected genes was also provided.
Affiliation(s)
- Javier Bajo-Morales: Department of Computer Architecture and Technology, University of Granada. C.I.T.I.C., Periodista Rafael Gómez Montero, 2, 18014, Granada, Spain
- Juan Carlos Prieto-Prieto: Nuclear Medicine Department, IMIBIC, University Hospital Reina Sofia, Menéndez Pidal Avenue, 14004, Córdoba, Spain
- Luis Javier Herrera: Department of Computer Architecture and Technology, University of Granada. C.I.T.I.C., Periodista Rafael Gómez Montero, 2, 18014, Granada, Spain
- Ignacio Rojas: Department of Computer Architecture and Technology, University of Granada. C.I.T.I.C., Periodista Rafael Gómez Montero, 2, 18014, Granada, Spain
- Daniel Castillo-Secilla: Department of Computer Architecture and Technology, University of Granada. C.I.T.I.C., Periodista Rafael Gómez Montero, 2, 18014, Granada, Spain
|
48
|
Profiling (Non-)Nascent Entrepreneurs in Hungary Based on Machine Learning Approaches. Sustainability 2022. [DOI: 10.3390/su14063571] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 02/01/2023]
Abstract
In our study, we examined the characteristics of nascent entrepreneurs using the 2021 Global Entrepreneurship Monitor nationally representative data in Hungary. We examined our topic based on Arenius and Minitti’s four-category theory framework. In our research, we examined system-level feature sets with four machine learning modeling algorithms: multivariate adaptive regression spline (MARS), support vector machine (SVM), random forest (RF), and AdaBoost. Our results show that each machine learning algorithm can predict nascent entrepreneurs with over 90% accuracy (ACC). Furthermore, the adaptation of the categories of variables based on the theory of Arenius and Minitti provides an appropriate framework for obtaining reliable predictions. Based on our results, it can be concluded that perceptual factors have different importance and weight along the optimal models, and if we include further reliability measures in the model validation, we cannot pinpoint only one algorithm that can adequately identify nascent entrepreneurs. Accurate forecasting requires a careful and predictor-level analysis of the algorithms’ models, which also includes the systemic relationship between the affecting factors. An important but unexpected result of our study is that we identified that Hungarian NEs have very specific previous entrepreneurial and business ownership experience; thus, they can be defined not as a beginner but as a novice enterprise.
|
49
|
Yang ZY, Ye ZF, Xiao YJ, Hsieh CY, Zhang SY. SPLDExtraTrees: robust machine learning approach for predicting kinase inhibitor resistance. Brief Bioinform 2022; 23:6543900. [PMID: 35262669 DOI: 10.1093/bib/bbac050] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 10/18/2021] [Revised: 01/17/2022] [Accepted: 01/31/2022] [Indexed: 12/25/2022]
Abstract
Drug resistance is a major threat to global health and a significant concern throughout the clinical treatment of diseases and drug development. Mutations in proteins that are related to drug binding are a common cause of adaptive drug resistance. Therefore, quantitative estimates of how mutations affect the interaction between a drug and its target protein would be of vital significance for drug development and clinical practice. Computational methods that rely on molecular dynamics simulations, Rosetta protocols, as well as machine learning methods have been proven to be capable of predicting ligand affinity changes upon protein mutation. However, the overfitting and generalization issues induced by severely limited sample sizes and heavy noise have impeded wide adoption of machine learning for studying drug resistance. In this paper, we propose a robust machine learning method, termed SPLDExtraTrees, which can accurately predict ligand binding affinity changes upon protein mutation and identify resistance-causing mutations. Specifically, the proposed method ranks training data following a specific scheme that starts with easy-to-learn samples and gradually incorporates harder and more diverse samples into the training, and then iterates between sample weight recalculations and model updates. In addition, we calculate additional physics-based structural features to provide the machine learning model with valuable domain knowledge on proteins for these data-limited predictive tasks. The experiments substantiate the capability of the proposed method for predicting kinase inhibitor resistance under three scenarios and achieve predictive accuracy comparable with that of molecular dynamics and Rosetta methods at much lower computational cost.
Affiliation(s)
- Zi-Yi Yang: Tencent Quantum Laboratory, Shenzhen, 518057, Guangdong, China
- Zhao-Feng Ye: Tencent Quantum Laboratory, Shenzhen, 518057, Guangdong, China
- Yi-Jia Xiao: Tencent Quantum Laboratory, Shenzhen, 518057, Guangdong, China; Department of Computer Science and Technology, Tsinghua University, 100084, Beijing, China
- Chang-Yu Hsieh: Tencent Quantum Laboratory, Shenzhen, 518057, Guangdong, China
- Sheng-Yu Zhang: Tencent Quantum Laboratory, Shenzhen, 518057, Guangdong, China
|
50
|
|