1
Mansoor S, Hamid S, Tuan TT, Park JE, Chung YS. Advance computational tools for multiomics data learning. Biotechnol Adv 2024; 77:108447. [PMID: 39251098 DOI: 10.1016/j.biotechadv.2024.108447] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/19/2024] [Revised: 09/01/2024] [Accepted: 09/05/2024] [Indexed: 09/11/2024]
Abstract
The burgeoning field of bioinformatics has seen a surge in computational tools tailored for omics data analysis, driven by the heterogeneous and high-dimensional nature of omics data. In biomedical and plant science research, multi-omics data have become pivotal for predictive analytics in the era of big data, necessitating sophisticated computational methodologies. This review explores a diverse array of computational approaches that play a crucial role in processing, normalizing, integrating, and analyzing omics data. Notable methods such as similarity-based methods, network-based approaches, correlation-based methods, Bayesian methods, fusion-based methods, and multivariate techniques, among others, are discussed in detail, each offering unique functionalities to address the complexities of multi-omics data. Furthermore, this review underscores the significance of computational tools in advancing our understanding of data and their transformative impact on research.
Affiliation(s)
- Sheikh Mansoor
  - Department of Plant Resources and Environment, Jeju National University, 63243, Republic of Korea
- Saira Hamid
  - Watson Crick Centre for Molecular Medicine, Islamic University of Science and Technology, Awantipora, Pulwama, J&K, India
- Thai Thanh Tuan
  - Department of Plant Resources and Environment, Jeju National University, 63243, Republic of Korea
  - Multimedia Communications Laboratory, University of Information Technology, Ho Chi Minh City 70000, Vietnam
  - Multimedia Communications Laboratory, Vietnam National University, Ho Chi Minh City 70000, Vietnam
- Jong-Eun Park
  - Department of Animal Biotechnology, College of Applied Life Science, Jeju National University, Jeju, Jeju-do, Republic of Korea
- Yong Suk Chung
  - Department of Plant Resources and Environment, Jeju National University, 63243, Republic of Korea
2
Bhadra T, Mallik S, Hasan N, Zhao Z. Comparison of five supervised feature selection algorithms leading to top features and gene signatures from multi-omics data in cancer. BMC Bioinformatics 2022; 23:153. [PMID: 35484501 PMCID: PMC9052461 DOI: 10.1186/s12859-022-04678-y] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Received: 04/11/2022] [Accepted: 04/11/2022] [Indexed: 11/24/2022]
Abstract
BACKGROUND As many complex omics data have been generated during the last two decades, dimensionality reduction has become a challenging issue in mining such data. Omics data typically consist of many features, and many feature selection algorithms have accordingly been developed. The performance of those feature selection methods often varies with the specific data, making the discovery and interpretation of results challenging.
METHODS AND RESULTS In this study, we performed a comprehensive comparative study of five widely used supervised feature selection methods (mRMR, INMIFS, DFS, SVM-RFE-CBR and VWMRmR) for multi-omics datasets. Specifically, we used five representative datasets: gene expression (Exp), exon expression (ExpExon), DNA methylation (hMethyl27), copy number variation (Gistic2), and pathway activity (Paradigm IPLs) from a multi-omics study of acute myeloid leukemia (LAML) from The Cancer Genome Atlas (TCGA). The feature subsets selected by the five feature selection algorithms are assessed using three evaluation criteria: (1) classification accuracy (Acc), (2) representation entropy (RE) and (3) redundancy rate (RR). Four different classifiers, viz., C4.5, NaiveBayes, KNN, and AdaBoost, were used to measure the classification accuracy (Acc) for each selected feature subset. The VWMRmR algorithm obtains the best Acc for three datasets (ExpExon, hMethyl27 and Paradigm IPLs). It offers the best RR (obtained using normalized mutual information) for three datasets (Exp, Gistic2 and Paradigm IPLs), while it gives the best RR (obtained using Pearson correlation coefficient) for two datasets (Gistic2 and Paradigm IPLs). It also obtains the best RE for three datasets (Exp, Gistic2 and Paradigm IPLs). Overall, the VWMRmR algorithm yields the best performance on all three evaluation criteria for the majority of the datasets. In addition, we identified signature genes using supervised learning, collected from the overlapping top feature set among the five feature selection methods. We obtained a 7-gene signature (ZMIZ1, ENG, FGFR1, PAWR, KRT17, MPO and LAT2) for Exp, a 9-gene signature for ExpExon, a 7-gene signature for hMethyl27, a single-gene signature (PIK3CG) for Gistic2 and a 3-gene signature for Paradigm IPLs.
CONCLUSION We performed a comprehensive comparison of the performance of five well-known feature selection methods for mining features from various high-dimensional datasets. We identified signature genes using supervised learning for the specific omic data for the disease. The study will help incorporate higher order dependencies among features.
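The abstract names representation entropy (RE) and redundancy rate (RR) but gives no formulas. The sketch below uses definitions that are common in the feature selection literature and are assumed here rather than taken from the paper: RR as the mean absolute pairwise Pearson correlation among the selected features, and RE as the Shannon entropy of the normalized eigenvalues of their covariance matrix. The function names and the toy matrix are hypothetical.

```python
import numpy as np


def redundancy_rate(X):
    """Mean absolute pairwise Pearson correlation among the columns of X.

    Assumed definition; X has shape (n_samples, n_selected_features).
    """
    corr = np.corrcoef(X, rowvar=False)
    upper = np.triu_indices(corr.shape[0], k=1)  # each feature pair once
    return float(np.mean(np.abs(corr[upper])))


def representation_entropy(X):
    """Entropy of the normalized eigenvalues of the feature covariance matrix.

    Assumed definition; larger values mean the variance is spread more
    evenly over directions, i.e. the features are less redundant.
    """
    cov = np.atleast_2d(np.cov(X, rowvar=False))
    eig = np.clip(np.linalg.eigvalsh(cov), 0.0, None)  # drop tiny negatives
    p = eig / eig.sum()
    p = p[p > 0]  # treat 0 * log(0) as 0
    return float(-np.sum(p * np.log(p)))


# Toy subset with two uncorrelated, equal-variance features (illustrative only).
X_toy = np.array([[1.0, 1.0], [1.0, -1.0], [-1.0, 1.0], [-1.0, -1.0]])
rr_toy = redundancy_rate(X_toy)
re_toy = representation_entropy(X_toy)
```

Under these assumed definitions, a less redundant subset has a lower RR and a higher RE. Swapping the Pearson correlation for normalized mutual information, as the abstract also mentions, would only change the pairwise similarity used inside `redundancy_rate`.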
Affiliation(s)
- Tapas Bhadra
  - Department of Computer Science and Engineering, Aliah University, Kolkata, West Bengal, 700160, India
- Saurav Mallik
  - Center for Precision Health, School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX, 77030, USA
- Neaj Hasan
  - Department of Computer Science and Engineering, Aliah University, Kolkata, West Bengal, 700160, India
- Zhongming Zhao
  - Center for Precision Health, School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX, 77030, USA
  - Human Genetics Center, School of Public Health, The University of Texas Health Science Center at Houston, Houston, TX, 77030, USA
3
Munquad S, Si T, Mallik S, Das AB, Zhao Z. A Deep Learning-Based Framework for Supporting Clinical Diagnosis of Glioblastoma Subtypes. Front Genet 2022; 13:855420. [PMID: 35419027 PMCID: PMC9000988 DOI: 10.3389/fgene.2022.855420] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Received: 01/15/2022] [Accepted: 02/17/2022] [Indexed: 12/12/2022]
Abstract
Understanding the molecular features that facilitate aggressive phenotypes in glioblastoma multiforme (GBM) remains a major clinical challenge. Accurate diagnosis of GBM subtypes, namely classical, proneural, and mesenchymal, and identification of specific molecular features are crucial for clinicians planning systematic treatment. We develop a biologically interpretable and highly efficient deep learning framework based on a convolutional neural network for subtype identification. The classifiers were generated from high-throughput data of different molecular levels, i.e., transcriptome and methylome. Furthermore, an integrated subsystem of transcriptome and methylome data was also used to build the biologically relevant model. Our results show that the deep learning model outperforms traditional machine learning algorithms. Furthermore, to evaluate the biological and clinical applicability of the classification, we performed weighted gene correlation network analysis, gene set enrichment, and survival analysis of the feature genes. We identified the genotype-phenotype relationship of GBM subtypes and the subtype-specific predictive biomarkers for potential diagnosis and treatment.
Affiliation(s)
- Sana Munquad
  - Department of Biotechnology, National Institute of Technology Warangal, Warangal, India
- Tapas Si
  - Department of Computer Science and Engineering, Bankura Unnayani Institute of Engineering, Bankura, India
- Saurav Mallik
  - Center for Precision Health, School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX, United States
- Asim Bikas Das
  - Department of Biotechnology, National Institute of Technology Warangal, Warangal, India
- Zhongming Zhao
  - Center for Precision Health, School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX, United States
  - Human Genetics Center, School of Public Health, The University of Texas Health Science Center at Houston, Houston, TX, United States
  - Department of Pathology and Laboratory Medicine, McGovern Medical School, The University of Texas Health Science Center at Houston, Houston, TX, United States
4
Jia D, Chen C, Chen C, Chen F, Zhang N, Yan Z, Lv X. Breast Cancer Case Identification Based on Deep Learning and Bioinformatics Analysis. Front Genet 2021; 12:628136. [PMID: 34079578 PMCID: PMC8165442 DOI: 10.3389/fgene.2021.628136] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Received: 11/11/2020] [Accepted: 04/20/2021] [Indexed: 01/22/2023]
Abstract
Understanding the molecular mechanism of breast cancer (BC) can provide in-depth insight into BC pathology. This study reviewed existing technologies for diagnosing BC, such as mammography, ultrasound, magnetic resonance imaging (MRI), computed tomography (CT), and positron emission tomography (PET), and summarized the disadvantages of existing cancer diagnosis. The purpose of this article is to use gene expression profiles from The Cancer Genome Atlas (TCGA) and Gene Expression Omnibus (GEO) to classify BC samples and normal samples. The proposed method overcomes some of the shortcomings of traditional diagnostic methods and can diagnose BC more rapidly, with high sensitivity and without radiation. This study first selected the genes most relevant to cancer through weighted gene co-expression network analysis (WGCNA) and differential expression analysis (DEA). Then it used the protein-protein interaction (PPI) network to screen 23 hub genes. Finally, it used the support vector machine (SVM), decision tree (DT), Bayesian network (BN), artificial neural network (ANN), and the convolutional neural networks CNN-LeNet and CNN-AlexNet to process the expression levels of the 23 hub genes. For gene expression profiles, the ANN model has the best performance in the classification of cancer samples. Averaged over ten runs, the accuracy is 97.36% (±0.34%), the F1 value is 0.8535 (±0.0260), the sensitivity is 98.32% (±0.32%), the specificity is 89.59% (±3.53%) and the AUC is 0.99. In summary, this method effectively classifies cancer samples and normal samples and provides reasonable new ideas for the early diagnosis of cancer in the future.
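For readers less familiar with the metrics reported above, the following minimal sketch shows how accuracy, sensitivity, specificity and F1 are conventionally derived from a binary confusion matrix. It assumes cancer is the positive class, which the abstract does not state, and the counts are made up for illustration; they are not the study's data. The function name `binary_metrics` is hypothetical.

```python
def binary_metrics(tp, fp, tn, fn):
    """Standard binary classification metrics from confusion-matrix counts.

    tp/fn count positive samples classified correctly/incorrectly,
    tn/fp count negative samples classified correctly/incorrectly.
    """
    sensitivity = tp / (tp + fn)          # recall on the positive class
    specificity = tn / (tn + fp)          # recall on the negative class
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "f1": f1,
    }


# Illustrative counts only (hypothetical, not taken from the paper).
example = binary_metrics(tp=90, fp=20, tn=80, fn=10)
```

Because F1 depends on which class is treated as positive, a reported F1 can differ noticeably from sensitivity and accuracy on imbalanced data.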
Affiliation(s)
- Dongfang Jia
  - College of Information Science and Engineering, Xinjiang University, Urumqi, China
- Cheng Chen
  - College of Information Science and Engineering, Xinjiang University, Urumqi, China
- Chen Chen
  - College of Information Science and Engineering, Xinjiang University, Urumqi, China
- Fangfang Chen
  - College of Information Science and Engineering, Xinjiang University, Urumqi, China
- Ningrui Zhang
  - College of Information Science and Engineering, Xinjiang University, Urumqi, China
- Ziwei Yan
  - College of Information Science and Engineering, Xinjiang University, Urumqi, China
- Xiaoyi Lv
  - College of Information Science and Engineering, Xinjiang University, Urumqi, China
  - Key Laboratory of Signal Detection and Processing, Xinjiang University, Urumqi, China
5
Mandal M, Sahoo SK, Patra P, Mallik S, Zhao Z. In silico ranking of phenolics for therapeutic effectiveness on cancer stem cells. BMC Bioinformatics 2020; 21:499. [PMID: 33371879 PMCID: PMC7768647 DOI: 10.1186/s12859-020-03849-z] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.0] [Received: 10/25/2020] [Accepted: 10/27/2020] [Indexed: 12/22/2022]
Abstract
BACKGROUND Cancer stem cells (CSCs) have features such as the ability to self-renew, differentiate into defined progenies and initiate tumor growth. Treatments of cancer include drugs, chemotherapy and radiotherapy or a combination. However, treatment of cancer by various therapeutic strategies often fails. One possible reason is that the stem-like properties of CSCs make them more dynamic and complex and may cause therapeutic resistance. Another limitation is the side effects associated with chemotherapy or radiotherapy. To explore better or alternative treatment options, the current study aims to investigate natural drug-like molecules that can be used as CSC-targeted therapy. Among various natural products, the anticancer potential of phenolics is well established. We collected 21 phytochemicals from the phenolic group and their interacting CSC genes from publicly available databases. A bipartite graph is then constructed with the collected CSC genes as one node set and their interacting phytochemicals from the phenolic group as the other. The bipartite graph is then transformed into a weighted bipartite graph by considering the interaction strength between the phenolics and the CSC genes. The CSC genes are also weighted by two scores, namely DSI (Disease Specificity Index) and DPI (Disease Pleiotropy Index). For each gene, its DSI score reflects the specific relationship with the disease and its DPI score reflects the association with multiple diseases. Finally, a ranking technique based on the PageRank (PR) algorithm is developed for ranking the phenolics.
RESULTS We collected 21 phytochemicals from the phenolic group and 1118 CSC genes. The top ranked phenolics were evaluated by their molecular and pharmacokinetic properties and disease association networks. We selected the top five ranked phenolics (Resveratrol, Curcumin, Quercetin, Epigallocatechin Gallate, and Genistein) for further examination of their oral bioavailability through molecular properties, drug likeness through pharmacokinetic properties, and associated network with CSC genes.
CONCLUSION Our PR ranking based approach is useful for ranking the phenolics that are associated with CSC genes. Our results suggest that some phenolics are potential molecules for CSC-related cancer treatment.
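The ranking idea described above can be illustrated with a generic weighted PageRank on a small bipartite graph. This is a sketch under stated assumptions, not the authors' method: the abstract does not say how interaction strength, DSI and DPI are combined, so the example uses only edge weights, and the node names, weights, damping factor and function name are all hypothetical.

```python
import numpy as np


def weighted_pagerank(W, damping=0.85, tol=1e-12, max_iter=1000):
    """Power-iteration PageRank on a weighted adjacency matrix W.

    W[i, j] is the weight of the edge from node j to node i
    (symmetric for an undirected graph). Returns ranks that sum to 1.
    """
    W = np.asarray(W, dtype=float)
    n = W.shape[0]
    col_sums = W.sum(axis=0)
    safe = np.where(col_sums > 0, col_sums, 1.0)
    # Column-stochastic transition matrix; nodes without edges jump uniformly.
    P = np.where(col_sums > 0, W / safe, 1.0 / n)
    r = np.full(n, 1.0 / n)
    for _ in range(max_iter):
        r_next = damping * (P @ r) + (1.0 - damping) / n
        if np.abs(r_next - r).sum() < tol:
            return r_next
        r = r_next
    return r


# Hypothetical toy data: 2 phenolics (rows) x 3 CSC genes (columns),
# entries are made-up interaction strengths.
B = np.array([[1.0, 1.0, 1.0],   # phenolic_A interacts with all three genes
              [0.0, 0.0, 1.0]])  # phenolic_B interacts with one gene
n_phen, n_gene = B.shape
# Full adjacency of the bipartite graph: phenolic nodes first, then gene nodes.
W = np.block([[np.zeros((n_phen, n_phen)), B],
              [B.T, np.zeros((n_gene, n_gene))]])
ranks = weighted_pagerank(W)
phenolic_ranks = ranks[:n_phen]  # scores used to order the phenolics
```

In this toy graph `phenolic_A` receives the higher score because it is connected to more genes. Gene-level scores such as DSI and DPI could, for example, be folded into the edge weights or the teleport vector, but that choice is an assumption beyond what the abstract specifies.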
Affiliation(s)
- Monalisa Mandal
  - Department of School of Computer Science and Engineering, Xavier University, Bhubaneswar, Odisha, 752050, India
- Priyadarsan Patra
  - Department of School of Computer Science and Engineering, Xavier University, Bhubaneswar, Odisha, 752050, India
- Saurav Mallik
  - Center for Precision Health, School of Biomedical Informatics, The University of Texas Health Science Center At Houston, Houston, TX, 77030, USA
- Zhongming Zhao
  - Center for Precision Health, School of Biomedical Informatics, The University of Texas Health Science Center At Houston, Houston, TX, 77030, USA
  - Human Genetics Center, School of Public Health, The University of Texas Health Science Center At Houston, Houston, TX, 77030, USA