1
Borah K, Das HS, Budhathoki RK, Aurangzeb K, Mallik S. DOMSCNet: a deep learning model for the classification of stomach cancer using multi-layer omics data. Brief Bioinform 2025; 26:bbaf115. [PMID: 40178281 PMCID: PMC11966610 DOI: 10.1093/bib/bbaf115] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/08/2024] [Revised: 01/31/2025] [Accepted: 02/20/2025] [Indexed: 04/05/2025] Open
Abstract
The rapid advancement of next-generation sequencing (NGS) technology and the expanding availability of NGS datasets have led to a significant surge in biomedical research. NGS data analysis is crucial for better understanding the molecular processes underlying cancer and its development, and for supporting its diagnosis, prediction, and therapy. However, high-dimensional multi-layer omics NGS datasets are highly complex. Several computational methods have recently been developed for cancer omics data interpretation. However, many existing methods face challenges in accounting for diverse types of cancer omics data and struggle to effectively extract informative features for the integrated identification of core units. To address these challenges, we propose a hybrid feature selection (HFS) technique to detect optimal features from multi-layer omics datasets. Subsequently, this study proposes a novel hybrid deep recurrent neural network-based model, DOMSCNet, to classify stomach cancer. The proposed model was made generic for all four multi-layer omics datasets. To assess the robustness of DOMSCNet, the model was validated with eight external datasets. Experimental results showed that the SelectKBest-maximum relevancy minimum redundancy-Boruta (SMB) HFS technique outperformed all other HFS techniques. Across the four multi-layer omics datasets and the validation datasets, the proposed DOMSCNet model outperformed existing classifiers as well as the other proposed classifiers.
Affiliation(s)
- Kasmika Borah
  - Department of Computer Science and Information Technology, Cotton University, Hem Baruah Rd, Panbazar, Guwahati, Kamrup Metropolitan district, Assam 781001, India
- Himanish Shekhar Das
  - Department of Computer Science and Information Technology, Cotton University, Hem Baruah Rd, Panbazar, Guwahati, Kamrup Metropolitan district, Assam 781001, India
- Ram Kaji Budhathoki
  - Department of Electrical and Electronics Engineering, School of Engineering, Kathmandu University, Kavrepalanchok district, Dhulikhel 45200, Nepal
- Khursheed Aurangzeb
  - Department of Computer Engineering, College of Computer and Information Sciences, King Saud University, P. O. Box 51178, Riyadh district, 11543, Saudi Arabia
- Saurav Mallik
  - Department of Environmental Health, Harvard T. H. Chan School of Public Health, 665 Huntington Avenue, Boston, MA 02115, United States
  - Department of Pharmacology & Toxicology, University of Arizona, 1295 N Martin Ave, Pima district, Tucson, AZ 85721, United States

2
Klempíř O, Krupička R. Analyzing Wav2Vec 1.0 Embeddings for Cross-Database Parkinson's Disease Detection and Speech Features Extraction. Sensors (Basel, Switzerland) 2024; 24:5520. [PMID: 39275431 PMCID: PMC11398018 DOI: 10.3390/s24175520] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/24/2024] [Revised: 08/22/2024] [Accepted: 08/24/2024] [Indexed: 09/16/2024]
Abstract
Advancements in deep learning speech representations have facilitated the effective use of extensive unlabeled speech datasets for Parkinson's disease (PD) modeling with minimal annotated data. This study employs the non-fine-tuned wav2vec 1.0 architecture to develop machine learning models for PD speech diagnosis tasks, such as cross-database classification and regression to predict demographic and articulation characteristics. The primary aim is to analyze overlapping components within the embeddings on both classification and regression tasks, investigating whether latent speech representations in PD are shared across models, particularly for related tasks. Firstly, evaluation using three multi-language PD datasets showed that wav2vec accurately detected PD based on speech, outperforming feature extraction using mel-frequency cepstral coefficients in the proposed cross-database classification scenarios. In cross-database scenarios using Italian and English-read texts, wav2vec demonstrated performance comparable to intra-dataset evaluations. We also compared our cross-database findings against those of other related studies. Secondly, wav2vec proved effective in regression, modeling various quantitative speech characteristics related to articulation and aging. Ultimately, subsequent analysis of important features examined the presence of significant overlaps between classification and regression models. The feature importance experiments discovered shared features across trained models, with increased sharing for related tasks, further suggesting that wav2vec contributes to improved generalizability. The study proposes wav2vec embeddings as a next promising step toward a speech-based universal model to assist in the evaluation of PD.
Affiliation(s)
- Radim Krupička
  - Department of Biomedical Informatics, Faculty of Biomedical Engineering, Czech Technical University in Prague, 16000 Prague, Czech Republic

3
Borah K, Das HS, Seth S, Mallick K, Rahaman Z, Mallik S. A review on advancements in feature selection and feature extraction for high-dimensional NGS data analysis. Funct Integr Genomics 2024; 24:139. [PMID: 39158621 DOI: 10.1007/s10142-024-01415-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/01/2024] [Revised: 07/30/2024] [Accepted: 08/01/2024] [Indexed: 08/20/2024]
Abstract
Recent advancements in biomedical technologies and the proliferation of high-dimensional Next Generation Sequencing (NGS) datasets have led to significant growth in the bulk and density of data. High-dimensional NGS data, characterized by a large number of genomics, transcriptomics, proteomics, and metagenomics features relative to the number of biological samples, presents significant challenges for reducing feature dimensionality. This high dimensionality also complicates data analysis through increased computational burden, potential overfitting, and difficulty in interpreting results. Feature selection and feature extraction are two pivotal techniques employed to address these challenges by reducing the dimensionality of the data, thereby enhancing model performance, interpretability, and computational efficiency. Feature selection and feature extraction can be categorized into statistical and machine learning methods. The present study conducts a comprehensive and comparative review of various statistical, machine learning, and deep learning-based feature selection and extraction techniques specifically tailored for the interpretation of human NGS and microarray data. A thorough literature search was performed to gather information on these techniques, focusing on array-based and NGS data analysis. Various techniques, including deep learning architectures, machine learning algorithms, and statistical methods, have been explored for the microarray, bulk RNA-Seq, and single-cell RNA-Seq (scRNA-Seq) technology-based datasets surveyed here. The study provides an overview of these techniques, highlighting their applications, advantages, and limitations in the context of high-dimensional NGS data. This review provides better insights for readers to apply feature selection and feature extraction techniques to enhance the performance of predictive models, uncover underlying biological patterns, and gain deeper insights into massive and complex NGS and microarray data.
Affiliation(s)
- Kasmika Borah
  - Department of Computer Science and Information Technology, Cotton University, Panbazar, Guwahati, 781001, Assam, India
- Himanish Shekhar Das
  - Department of Computer Science and Information Technology, Cotton University, Panbazar, Guwahati, 781001, Assam, India
- Soumita Seth
  - Department of Computer Science and Engineering, Future Institute of Engineering and Management, Narendrapur, Kolkata, 700150, West Bengal, India
- Koushik Mallick
  - Department of Computer Science and Engineering, RCC Institute of Information Technology, Canal S Rd, Beleghata, Kolkata, 700015, West Bengal, India
- Saurav Mallik
  - Department of Environmental Health, Harvard T H Chan School of Public Health, Boston, MA, 02115, USA
  - Department of Pharmacology & Toxicology, University of Arizona, Tucson, AZ, 85721, USA

4
Huang M, Wang J, Zhang Z, Zuo X. ZMIZ1 Regulates Proliferation, Autophagy and Apoptosis of Colon Cancer Cells by Mediating Ubiquitin-Proteasome Degradation of SIRT1. Biochem Genet 2024; 62:3245-3259. [PMID: 38214831 PMCID: PMC11289246 DOI: 10.1007/s10528-023-10573-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/03/2023] [Accepted: 10/26/2023] [Indexed: 01/13/2024]
Abstract
There were nearly 1.15 million new cases of colon cancer and 586,858 deaths from colon cancer worldwide in 2020. The aim of this study was to determine whether ZMIZ1 controls the fate of colon cancer cells and the mechanism by which it functions. Specific shRNA transfection was used to knock down the expression of ZMIZ1 in colon cancer cell lines (HCT116 and HT29); cell proliferation was detected using EdU and CCK-8 reagents, apoptosis by flow cytometry, and autophagy by western blot. The interaction of ZMIZ1 and SIRT1 was analyzed. Knockdown of ZMIZ1 significantly inhibited autophagy and proliferation, and induced apoptosis of HCT116 and HT29 cells. The mRNA level of SIRT1 was not affected by ZMIZ1 knockdown, but the protein level of SIRT1 was significantly increased and the protein level of the SIRT1-specific substrate, acetylated FOXO3a, was reduced. Immunoprecipitation assays identified the interaction between SIRT1 and ZMIZ1 in HCT116 and HT29 cells. ZMIZ1 increased intracellular ubiquitination of SIRT1. Knockdown or pharmacological inhibition of SIRT1 neutralized the effects of ZMIZ1 knockdown on proliferation, autophagy and apoptosis in HCT116 and HT29 cells. ZMIZ1 may control the fate of colon cancer cells through the SIRT1/FOXO3a axis. Targeting ZMIZ1 would be beneficial for the treatment of colon cancer.
Affiliation(s)
- Min Huang
  - Department of Gastrointestinal Surgery, The First Affiliated Hospital of Wannan Medical College, No.2 Zheshan West Road, Wuhu, 241000, Anhui, China
- Junfeng Wang
  - Department of Gastrointestinal Surgery, The First Affiliated Hospital of Wannan Medical College, No.2 Zheshan West Road, Wuhu, 241000, Anhui, China
- Zhengrong Zhang
  - Department of Gastrointestinal Surgery, The First Affiliated Hospital of Wannan Medical College, No.2 Zheshan West Road, Wuhu, 241000, Anhui, China
- Xueliang Zuo
  - Department of Gastrointestinal Surgery, The First Affiliated Hospital of Wannan Medical College, No.2 Zheshan West Road, Wuhu, 241000, Anhui, China

5
Verma S, Magazzù G, Eftekhari N, Lou T, Gilhespy A, Occhipinti A, Angione C. Cross-attention enables deep learning on limited omics-imaging-clinical data of 130 lung cancer patients. Cell Reports Methods 2024; 4:100817. [PMID: 38981473 PMCID: PMC11294841 DOI: 10.1016/j.crmeth.2024.100817] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/29/2023] [Revised: 04/18/2024] [Accepted: 06/17/2024] [Indexed: 07/11/2024]
Abstract
Deep-learning tools that extract prognostic factors derived from multi-omics data have recently contributed to individualized predictions of survival outcomes. However, the limited size of integrated omics-imaging-clinical datasets poses challenges. Here, we propose two biologically interpretable and robust deep-learning architectures for survival prediction of non-small cell lung cancer (NSCLC) patients, learning simultaneously from computed tomography (CT) scan images, gene expression data, and clinical information. The proposed models integrate patient-specific clinical, transcriptomic, and imaging data and incorporate Kyoto Encyclopedia of Genes and Genomes (KEGG) and Reactome pathway information, adding biological knowledge within the learning process to extract prognostic gene biomarkers and molecular pathways. While both models accurately stratify patients in high- and low-risk groups when trained on a dataset of only 130 patients, introducing a cross-attention mechanism in a sparse autoencoder significantly improves the performance, highlighting tumor regions and NSCLC-related genes as potential biomarkers and thus offering a significant methodological advancement when learning from small imaging-omics-clinical samples.
Affiliation(s)
- Suraj Verma
  - School of Computing, Engineering and Digital Technologies, Teesside University, Middlesbrough, UK
- Thai Lou
  - Gateshead Health NHS Foundation Trust, Gateshead, UK
- Alex Gilhespy
  - South Tyneside and Sunderland NHS Foundation Trust, Sunderland, UK
- Annalisa Occhipinti
  - School of Computing, Engineering and Digital Technologies, Teesside University, Middlesbrough, UK; Centre for Digital Innovation, Teesside University, Middlesbrough, UK; National Horizons Centre, Teesside University, Darlington, UK
- Claudio Angione
  - School of Computing, Engineering and Digital Technologies, Teesside University, Middlesbrough, UK; Centre for Digital Innovation, Teesside University, Middlesbrough, UK; National Horizons Centre, Teesside University, Darlington, UK

6
Zehetner L, Széliová D, Kraus B, Hernandez Bort JA, Zanghellini J. Logistic PCA explains differences between genome-scale metabolic models in terms of metabolic pathways. PLoS Comput Biol 2024; 20:e1012236. [PMID: 38913731 PMCID: PMC11226097 DOI: 10.1371/journal.pcbi.1012236] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/09/2023] [Revised: 07/05/2024] [Accepted: 06/07/2024] [Indexed: 06/26/2024] Open
Abstract
Genome-scale metabolic models (GSMMs) offer a holistic view of biochemical reaction networks, enabling in-depth analyses of metabolism across species and tissues in multiple conditions. However, comparing GSMMs against each other poses challenges, as current dimensionality reduction algorithms or clustering methods lack mechanistic interpretability and often rely on subjective assumptions. Here, we propose a new approach utilizing logistic principal component analysis (LPCA) that efficiently clusters GSMMs while singling out mechanistic differences in terms of reactions and pathways that drive the categorization. We applied LPCA to multiple diverse datasets, including GSMMs of 222 Escherichia strains, 343 budding yeasts (Saccharomycotina), 80 human tissues, and 2943 Firmicutes strains. Our findings demonstrate LPCA's effectiveness in preserving microbial phylogenetic relationships and discerning human tissue-specific metabolic profiles, exhibiting comparable performance to traditional methods like t-distributed stochastic neighbor embedding (t-SNE) and Jaccard coefficients. Moreover, the subsystems and associated reactions identified by LPCA align with existing knowledge, underscoring its reliability in dissecting GSMMs and uncovering the underlying drivers of separation.
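The Jaccard coefficient named in this abstract as a baseline comparison can be illustrated with a minimal sketch. It assumes each GSMM is reduced to the set of reaction identifiers it contains; the identifiers and the function name `jaccard` are hypothetical and are not taken from the paper.

```python
def jaccard(reactions_a, reactions_b):
    """Jaccard coefficient |A & B| / |A | B| of two reaction sets."""
    a, b = set(reactions_a), set(reactions_b)
    union = a | b
    if not union:
        # Convention assumed here: two empty models count as identical.
        return 1.0
    return len(a & b) / len(union)


# Hypothetical reaction identifiers for two toy models.
model_1 = {"R1", "R2", "R3", "R4"}
model_2 = {"R3", "R4", "R5"}

# Two shared reactions out of five distinct reactions overall.
similarity = jaccard(model_1, model_2)
```

A pairwise matrix of such coefficients only says how similar two models are, not which reactions separate them, which is the interpretability gap the abstract attributes to existing comparison methods.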
Affiliation(s)
- Leopold Zehetner
  - Department of Analytical Chemistry, Faculty of Chemistry, University of Vienna, Vienna, Austria
  - Vienna Doctoral School in Chemistry (DoSChem), University of Vienna, Vienna, Austria
  - Gene Therapy Process Development, Baxalta Innovations GmbH, a Part of Takeda Companies, Orth an der Donau, Austria
- Diana Széliová
  - Department of Analytical Chemistry, Faculty of Chemistry, University of Vienna, Vienna, Austria
- Barbara Kraus
  - Gene Therapy Process Development, Baxalta Innovations GmbH, a Part of Takeda Companies, Orth an der Donau, Austria
- Juan A. Hernandez Bort
  - Gene Therapy Process Development, Baxalta Innovations GmbH, a Part of Takeda Companies, Orth an der Donau, Austria
- Jürgen Zanghellini
  - Department of Analytical Chemistry, Faculty of Chemistry, University of Vienna, Vienna, Austria

7
Madhan S, Kalaiselvan A. Omics data classification using constitutive artificial neural network optimized with single candidate optimizer. Network (Bristol, England) 2024:1-25. [PMID: 38736309 DOI: 10.1080/0954898x.2024.2348726] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/23/2024] [Accepted: 04/23/2024] [Indexed: 05/14/2024]
Abstract
Recent technical advancements enable omics-based biological study of molecules, such as genomics, proteomics, and microbionics, with very high throughput and low cost. To overcome this drawback, Omics Data Classification using Constitutive Artificial Neural Network Optimized with Single Candidate Optimizer (ODC-ZOA-CANN-SCO) is proposed in this manuscript. The input data is pre-processed using Adaptive Variational Bayesian Filtering (AVBF) to replace missing values. The pre-processed data is fed to the Zebra Optimization Algorithm (ZOA) for dimensionality reduction. Then, the Constitutive Artificial Neural Network (CANN) is employed to classify omics data. The weight parameter is optimized by the Single Candidate Optimizer (SCO). The proposed ODC-ZOA-CANN-SCO method attains 25.36%, 21.04%, 22.18%, 26.90%, and 28.12% higher accuracy when compared with existing methods, namely multi-omics data integration utilizing adaptive graph learning and attention mode for patient categorization with biomarker identification (MOD-AGL-AM-PABI), deep learning method depending upon multi-omics data integration to create risk stratification prediction mode for skin cutaneous melanoma (DL-MODI-RSP-SCM), deep belief network-based model for identifying Alzheimer's disease utilizing multi-omics data (DDN-DAD-MOD), hybrid cancer prediction depending upon multi-omics data and reinforcement learning state action reward state action (HCP-MOD-RL-SARSA), and machine learning basis method under omics data including biological knowledge database for cancer clinical endpoint prediction (ML-ODBKD-CCEP), respectively.
Affiliation(s)
- Subramaniam Madhan
  - Department of Computer Science and Engineering, University College of Engineering, Thirukkuvalai (A Constituent College of Anna University Chennai), Nagapattinam, Tamilnadu, India
- Anbarasan Kalaiselvan
  - Department of Science and Humanities, University College of Engineering, Thirukkuvalai (A Constituent College of Anna University Chennai), Nagapattinam, Tamilnadu, India

8
Guo L, Xie Y, He J, Li X, Zhou W, Chen Q. Breast cancer prediction model based on clinical and biochemical characteristics: clinical data from patients with benign and malignant breast tumors from a single center in South China. J Cancer Res Clin Oncol 2023; 149:13257-13269. [PMID: 37480526 DOI: 10.1007/s00432-023-05181-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/24/2023] [Accepted: 07/11/2023] [Indexed: 07/24/2023]
Abstract
OBJECTIVE Breast cancer is the most prevalent cancer and is the second leading cause of death from malignancy among women worldwide. In addition to tumor factors, the host characteristics of tumors have received increasing attention from the medical community. This study aimed to develop a breast cancer prediction model for the Chinese population using clinical and biochemical characteristics. METHODS This is a retrospective study. From 2012 to 2021, we selected 19,751 patients with breast diseases from the Guangdong Hospital of Traditional Chinese Medicine, which included 5660 patients with breast cancer and 14,091 patients with benign breast diseases. Patients were randomly assigned to the training group (75%) and the test group (25%), and a total of 34 clinical and biochemical characteristics were used. Significant clinical signs were investigated, and a logistic regression model with recursive feature elimination (RFE) was used to develop a prediction model for distinguishing benign from malignant breast diseases. The prediction model's accuracy, precision, sensitivity, specificity, and area under the ROC curve (AUC) were calculated. RESULTS Clinical statistics demonstrated that the prediction model comprising 19 clinical characteristics had statistical separability in both the training group and the test group, as well as good sensitivity and prediction. CONCLUSIONS This model based on biochemical parameters demonstrates a significant predictive effect for breast cancer and may be useful as a reference for invasive tissue biopsy in patients undergoing BI-RADS 3 and 4A breast imaging.
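Four of the evaluation measures listed in the METHODS (accuracy, precision, sensitivity, and specificity) follow directly from a confusion matrix. The sketch below is illustrative only: the function name `classification_metrics` is hypothetical, and the counts are invented for the example rather than taken from the study. AUC is omitted because it requires ranked prediction scores, not just counts.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute basic binary-classification metrics from confusion-matrix counts.

    tp/fn: malignant cases predicted malignant/benign.
    tn/fp: benign cases predicted benign/malignant.
    """
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total,
        "precision": tp / (tp + fp),
        "sensitivity": tp / (tp + fn),   # true positive rate (recall)
        "specificity": tn / (tn + fp),   # true negative rate
    }


# Invented counts for illustration; these are NOT results from the paper.
metrics = classification_metrics(tp=80, fp=10, tn=90, fn=20)
```

With a cohort that is roughly 29% malignant (5660 of 19,751), accuracy alone can look good for a model that favors the benign class, which is one reason to report sensitivity and specificity alongside it.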
Affiliation(s)
- Li Guo
  - Department of Breast, The Second Affiliated Hospital of Guangzhou University of Chinese Medicine, No. 111 of Dade Road, Yuexiu District, Guangzhou, 510120, China
- Yanyan Xie
  - School of Medical Information Engineering, Guangzhou University of Chinese Medicine, No. 232 Wide Ring East Road, Panyu District, Guangzhou, 510006, China
- Junhao He
  - School of Medical Information Engineering, Guangzhou University of Chinese Medicine, No. 232 Wide Ring East Road, Panyu District, Guangzhou, 510006, China
- Xian Li
  - School of Medical Information Engineering, Guangzhou University of Chinese Medicine, No. 232 Wide Ring East Road, Panyu District, Guangzhou, 510006, China
- Wu Zhou
  - School of Medical Information Engineering, Guangzhou University of Chinese Medicine, No. 232 Wide Ring East Road, Panyu District, Guangzhou, 510006, China
- Qianjun Chen
  - Department of Breast, The Second Affiliated Hospital of Guangzhou University of Chinese Medicine, No. 111 of Dade Road, Yuexiu District, Guangzhou, 510120, China

9
Tasci E, Jagasia S, Zhuge Y, Camphausen K, Krauze AV. GradWise: A Novel Application of a Rank-Based Weighted Hybrid Filter and Embedded Feature Selection Method for Glioma Grading with Clinical and Molecular Characteristics. Cancers (Basel) 2023; 15:4628. [PMID: 37760597 PMCID: PMC10526509 DOI: 10.3390/cancers15184628] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 07/06/2023] [Revised: 09/01/2023] [Accepted: 09/14/2023] [Indexed: 09/29/2023] Open
Abstract
Glioma grading plays a pivotal role in guiding treatment decisions, predicting patient outcomes, facilitating clinical trial participation and research, and tailoring treatment strategies. Current glioma grading in the clinic is based on tissue acquired at the time of resection, with tumor aggressiveness assessed from tumor morphology and molecular features. The increased emphasis on molecular characteristics as a guide for management and prognosis estimation is driven by the need for accurate and standardized grading systems that integrate molecular and clinical information in the grading process, and it carries the expectation of exposing molecular markers that go beyond prognosis to increase understanding of tumor biology as a means of identifying druggable targets. In this study, we introduce a novel application (GradWise) that combines rank-based weighted hybrid filter (i.e., mRMR) and embedded (i.e., LASSO) feature selection methods to enhance the performance of feature selection and machine learning models for glioma grading using both clinical and molecular predictors. We utilized the publicly available TCGA dataset (from the UCI ML Repository) and the CGGA dataset to identify the most effective scheme that allows for the selection of the minimum number of features with their names. Two popular feature selection methods with a rank-based weighting procedure were employed to conduct comprehensive experiments with the five supervised models. The computational results demonstrate that our proposed method achieves an accuracy rate of 87.007% with 13 features and an accuracy rate of 80.412% with five features on the TCGA and CGGA datasets, respectively. We also obtained four shared biomarkers for glioma grading that emerged in both datasets and can be employed with transferable value to other datasets and data-based outcome analyses. These findings are a significant step toward highlighting the effectiveness of our approach by offering pioneering results with novel markers with prospects for understanding and targeting the biologic mechanisms of glioma progression to improve patient outcomes.
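The idea of merging a filter ranking and an embedded ranking through rank-based weights can be sketched generically. The abstract does not specify the GradWise weighting procedure, so what follows is an assumed weighted rank aggregation (a weighted Borda-style score), not the authors' method; the function name `combine_rankings`, the weights, and the feature names are all hypothetical.

```python
def combine_rankings(rank_a, rank_b, weight_a=0.5, weight_b=0.5):
    """Merge two best-first feature rankings into one weighted ranking.

    Each feature earns (n - position) points per ranking, scaled by that
    ranking's weight. Ties are broken alphabetically for determinism.
    Both rankings are assumed to contain the same features.
    """
    n = len(rank_a)
    score = {}
    for weight, ranking in ((weight_a, rank_a), (weight_b, rank_b)):
        for position, name in enumerate(ranking):
            score[name] = score.get(name, 0.0) + weight * (n - position)
    return sorted(score, key=lambda name: (-score[name], name))


# Hypothetical rankings, e.g. one from a filter method and one from an
# embedded method, each listing features from most to least important.
filter_ranking = ["f1", "f2", "f3", "f4"]
embedded_ranking = ["f2", "f1", "f4", "f3"]

equal_weights = combine_rankings(filter_ranking, embedded_ranking)
embedded_heavy = combine_rankings(filter_ranking, embedded_ranking, 0.3, 0.7)
```

Taking the first k names of the combined list then gives a feature subset of the desired size, which is how a scheme of this kind can trade accuracy against the number of selected features.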
Affiliation(s)
- Andra Valentina Krauze
  - Radiation Oncology Branch, Center for Cancer Research, National Cancer Institute, National Institutes of Health, Building 10, Bethesda, MD 20892, USA

10
Kuzudisli C, Bakir-Gungor B, Bulut N, Qaqish B, Yousef M. Review of feature selection approaches based on grouping of features. PeerJ 2023; 11:e15666. [PMID: 37483989 PMCID: PMC10358338 DOI: 10.7717/peerj.15666] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/14/2022] [Accepted: 06/08/2023] [Indexed: 07/25/2023] Open
Abstract
With the rapid development in technology, large amounts of high-dimensional data have been generated. This high dimensionality, including redundancy and irrelevancy, poses a great challenge in data analysis and decision making. Feature selection (FS) is an effective way to reduce dimensionality by eliminating redundant and irrelevant data. Most traditional FS approaches score and rank each feature individually and then perform FS either by eliminating lower-ranked features or by retaining highly ranked features. In this review, we discuss an emerging approach to FS that is based on initially grouping features, then scoring groups of features rather than scoring individual features. Despite the presence of reviews on clustering and FS algorithms, to the best of our knowledge, this is the first review focusing on FS techniques based on grouping. The typical idea behind FS through grouping is to generate groups of similar features with dissimilarity between groups, then select representative features from each cluster. Approaches under supervised, unsupervised, semi-supervised and integrative frameworks are explored. The comparison of experimental results indicates the effectiveness of sequential, optimization-based (i.e., fuzzy or evolutionary), hybrid and multi-method approaches. When it comes to biological data, the involvement of external biological sources can improve analysis results. We hope this work's findings can guide effective design of new FS approaches using feature grouping.
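The "group similar features, then pick a representative from each group" idea described in this abstract can be illustrated with a small sketch. It is one simple instance under stated assumptions, not a method from the review: features are grouped greedily by absolute Pearson correlation with a group's first member, and the representative is the member most correlated with the class label. All names, the 0.9 threshold, and the toy data are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y) if norm_x and norm_y else 0.0


def group_features(features, threshold=0.9):
    """Greedily place each feature in the first group whose seed it matches."""
    groups = []  # each group is a list of names; element 0 is the seed
    for name, values in features.items():
        for group in groups:
            if abs(pearson(values, features[group[0]])) >= threshold:
                group.append(name)
                break
        else:
            groups.append([name])  # no similar group found: start a new one
    return groups


def select_representatives(features, labels, threshold=0.9):
    """Keep, per group, the feature most correlated with the labels."""
    return [
        max(group, key=lambda name: abs(pearson(features[name], labels)))
        for group in group_features(features, threshold)
    ]


# Toy data: g2 is nearly a multiple of g1, g3 is unrelated to both.
features = {
    "g1": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "g2": [2.1, 3.9, 6.2, 8.0, 9.9, 12.1],
    "g3": [5.0, 1.0, 4.0, 2.0, 6.0, 3.0],
}
labels = [0, 0, 0, 1, 1, 1]

groups = group_features(features)
selected = select_representatives(features, labels)
```

Because the redundant pair collapses into one group, the selection keeps two features instead of three without scoring each feature in isolation.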
Affiliation(s)
- Cihan Kuzudisli
  - Department of Computer Engineering, Hasan Kalyoncu University, Gaziantep, Turkey
  - Department of Electrical and Computer Engineering, Abdullah Gul University, Kayseri, Turkey
- Burcu Bakir-Gungor
  - Department of Computer Engineering, Abdullah Gul University, Kayseri, Turkey
- Nurten Bulut
  - Department of Computer Engineering, Abdullah Gul University, Kayseri, Turkey
- Bahjat Qaqish
  - Department of Biostatistics, University of North Carolina at Chapel Hill, Chapel Hill, North Carolina, United States of America
- Malik Yousef
  - Department of Information Systems, Zefat Academic College, Zefat, Israel
  - Galilee Digital Health Research Center, Zefat Academic College, Zefat, Israel

11
Mallick K, Chakraborty S, Mallik S, Bandyopadhyay S. A scalable unsupervised learning of scRNAseq data detects rare cells through integration of structure-preserving embedding, clustering and outlier detection. Brief Bioinform 2023; 24:bbad125. [PMID: 37185897 DOI: 10.1093/bib/bbad125] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/31/2022] [Revised: 02/06/2023] [Accepted: 02/24/2023] [Indexed: 05/17/2023] Open
Abstract
Single-cell RNA-seq analysis has become a powerful tool to analyse the transcriptomes of individual cells. In turn, it has fostered the possibility of screening thousands of single cells in parallel. Thus, contrary to the traditional bulk measurements that only paint a macroscopic picture, gene measurements at the cell level aid researchers in studying different tissues and organs at various stages. However, accurate clustering methods for such high-dimensional data remain exiguous and a persistent challenge in this domain. Of late, several methods and techniques have been promulgated to address this issue. In this article, we propose a novel framework for clustering large-scale single-cell data and subsequently identifying the rare-cell sub-populations. To handle such sparse, high-dimensional data, we leverage PaCMAP (Pairwise Controlled Manifold Approximation), a feature extraction algorithm that preserves both the local and the global structures of the data, and a Gaussian Mixture Model to cluster single-cell data. Subsequently, we exploit Edited Nearest Neighbours sampling and Isolation Forest/One-class Support Vector Machine to identify rare-cell sub-populations. The performance of the proposed method is validated using publicly available datasets with varying degrees of cell types and rare-cell sub-populations. On several benchmark datasets, the proposed method outperforms the existing state-of-the-art methods. The proposed method successfully identifies cell types that constitute populations ranging from 0.1 to 8% with F1-scores of 0.91 ± 0.09. The source code is available at https://github.com/scrab017/RarPG.
Affiliation(s)
- Koushik Mallick
  - Computer Science and Engineering, RCC Institute of Information Technology, Canal South Road, 700015, West Bengal, India
- Sikim Chakraborty
  - Centre for Economy and Growth, Observer Research Foundation, Rouse Avenue, New Delhi, 110002, Delhi, India
- Saurav Mallik
  - Department of Environmental Health, Harvard T H Chan School of Public Health, 677 Huntington Ave, 02115, MA, USA
- Sanghamitra Bandyopadhyay
  - Machine Intelligence Unit, Indian Statistical Institute, Barrackpore Trunk Rd., 700108, West Bengal, India

12
Khandelwal M, Kumar Rout R, Umer S, Mallik S, Li A. Multifactorial feature extraction and site prognosis model for protein methylation data. Brief Funct Genomics 2023; 22:20-30. [PMID: 36310537 DOI: 10.1093/bfgp/elac034] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Received: 05/09/2022] [Revised: 09/23/2022] [Accepted: 09/28/2022] [Indexed: 01/24/2023] Open
Abstract
Integrated studies (multi-omics studies) comprising genetic, proteomic and epigenetic data analyses have become an emerging topic in biomedical research. Protein methylation is a posttranslational modification that plays an essential role in various cellular activities. The prediction of methylation sites (arginine and lysine) is vital to understand the molecular processes of protein methylation. However, current experimental techniques used for methylation site predictions are tedious and expensive. Hence, computational techniques for predicting methylation sites in proteins are necessary. Various computational methods for predicting methylation sites have been proposed in recent years. Most existing methods require structural and evolutionary information for retrieving features, and acquiring this information is not always convenient. Thus, we proposed a novel method, called multi-factorial feature extraction and site prognosis model (MufeSPM), for the prediction of protein methylation sites based on information theory features (Renyi, Shannon, Havrda-Charvat and Arimoto entropy), amino acid composition and physicochemical properties acquired from protein methylation data. A random forest algorithm was used to predict methylation sites in protein sequences. This paper also studied the impact of different features and classifiers on arginine and lysine methylation data sets. For the R methylation data set, MufeSPM yielded 82.45% (± 3.47) accuracy, and for the K methylation data set, it provided an average accuracy of 71.94% (± 2.12). Additionally, the area under the receiver operating characteristic curve for different classifiers in predicting methylation sites was provided. The experimental results signify that MufeSPM performs better than the state-of-the-art predictors.
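Two of the information-theory features named in this abstract, Shannon and Renyi entropy, can be computed from the residue frequencies of a peptide window. The sketch below is a minimal illustration assuming a non-empty sequence and base-2 logarithms; the function names, the Renyi order `alpha=2.0`, and the example window are hypothetical and are not taken from MufeSPM. Havrda-Charvat and Arimoto entropy are omitted because their normalization conventions vary between sources.

```python
from collections import Counter
from math import log2


def residue_probabilities(sequence):
    """Relative frequency of each distinct residue in the sequence."""
    counts = Counter(sequence)
    total = len(sequence)
    return [count / total for count in counts.values()]


def shannon_entropy(sequence):
    """Shannon entropy H = -sum(p * log2(p)) in bits."""
    return -sum(p * log2(p) for p in residue_probabilities(sequence))


def renyi_entropy(sequence, alpha=2.0):
    """Renyi entropy of order alpha: log2(sum(p ** alpha)) / (1 - alpha).

    The order alpha = 1 is handled as its limit, the Shannon entropy.
    """
    if alpha == 1.0:
        return shannon_entropy(sequence)
    power_sum = sum(p ** alpha for p in residue_probabilities(sequence))
    return log2(power_sum) / (1.0 - alpha)


# Hypothetical peptide window centred on a candidate arginine (R) site.
window = "AARKGGRA"
feature_vector = [shannon_entropy(window), renyi_entropy(window)]
```

Features like these depend only on the sequence itself, which matches the abstract's motivation of avoiding structural and evolutionary information that is not always easy to obtain.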
Affiliation(s)
- Monika Khandelwal
  - Computer Science & Engineering, National Institute of Technology Srinagar, Hazratbal, Srinagar, 190006, Jammu and Kashmir, India
- Ranjeet Kumar Rout
  - Computer Science & Engineering, National Institute of Technology Srinagar, Hazratbal, Srinagar, 190006, Jammu and Kashmir, India
- Saiyed Umer
  - Computer Science & Engineering, Aliah University, Kolkata, 700016, West Bengal, India
- Saurav Mallik
  - Department of Environmental Health, Harvard T H Chan School of Public Health, Huntington Ave, Boston, 02115, MA, USA
- Aimin Li
  - School of Computer Science and Engineering, Xi'an University of Technology, Jinhua S Rd, 710048, Shaanxi, China