1
Mishra A, Majumder A, Kommineni D, Anna Joseph C, Chowdhury T, Anumula SK. Role of Generative Artificial Intelligence in Personalized Medicine: A Systematic Review. Cureus 2025; 17:e82310. [PMID: 40376348 PMCID: PMC12081128 DOI: 10.7759/cureus.82310] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Accepted: 04/15/2025] [Indexed: 05/18/2025] Open
Abstract
Precision medicine presents challenges in data collection, cost, and privacy as it tailors treatments to each patient's unique genetic and clinical profile. With its ability to produce realistic and confidential patient data, generative artificial intelligence (AI) offers a promising avenue that could revolutionize patient-centric healthcare. This systematic review aims to assess the role of generative AI in personalized medicine. Following the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines, we searched PubMed, Web of Science, Scopus, CINAHL, and Google Scholar, identifying 549 studies. After removing duplicates and applying eligibility criteria, 27 studies were found relevant and were included in this systematic review. Generative adversarial networks (GANs) were the most commonly used models (16 studies), followed by variational autoencoders (VAEs; seven studies). These models were primarily applied to drug response prediction, treatment effect estimation, biomarker discovery, and patient stratification. Generative AI models have shown significant promise in revolutionizing personalized medicine by enabling precise treatment predictions and patient-specific therapeutic insights. Despite their potential, challenges related to model validation, interpretability, and bias remain. Future research should prioritize large-scale validation studies using diverse datasets to enhance the clinical applicability and reliability of these AI-driven approaches.
Affiliation(s)
- Aashish Mishra
- Computer Science and Information Technology, Eastern Kentucky University, Richmond, USA
- Tanay Chowdhury
- Data Science, Amazon Web Services (AWS) Generative AI Innovation Center, Sammamish, USA
2
He H, Wang L, Ma M. MOGAN for LUAD Subtype Classification by Integrating Three Omics Data Types. Cancer Innovation 2025; 4:e160. [PMID: 40026873 PMCID: PMC11868734 DOI: 10.1002/cai2.160] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/31/2024] [Revised: 10/21/2024] [Accepted: 11/26/2024] [Indexed: 03/05/2025]
Abstract
Background Lung adenocarcinoma (LUAD) is a highly heterogeneous cancer type with a poor prognosis. Accurate subtype identification can help guide its treatment. The traditional subtype identification methods using a single-omics approach make it difficult to comprehensively characterize the molecular features of LUAD. Identification of subtypes through multi-omics association strategies can effectively supplement the shortcomings of single-omics information. Methods In this study, we used the Generative Adversarial Network (GAN) to mine transcriptomic, proteomic, and epigenomic information and generate an integrated data set. The newly integrated data were then used to identify LUAD immune subtypes. In the improved GAN (MOGAN) method, we not only integrated multiple omics datasets but also included the interactions between proteins and genes and between methylation and genes. Thus, we achieved effective complementarity of multi-omics information. Results Two subtypes, MOGANTPM_S1 and MOGANTPM_S2, were identified using immune cell infiltration analysis and the integrated multi-omics data. MOGANTPM_S1 patients displayed higher immune cell infiltration, better prognosis, and sensitivity to immune checkpoint inhibitors (ICIs), while MOGANTPM_S2 had lower immune cell infiltration, poorer prognosis, and were insensitive to ICIs. Therefore, immunotherapy was more suitable for MOGANTPM_S1 patients in clinical practice. In addition, this study developed a LUAD subtype diagnostic model using the transcriptomic and proteomic features of five genes, which can be used to guide clinical subtype diagnosis. Conclusions In summary, the MOGAN method was applied to integrate three omics data types and successfully identify two LUAD immune subtypes with significant survival differences. This classification method may be useful for LUAD treatment decisions.
Affiliation(s)
- Haibin He
- Chongqing Key Laboratory of Big Data for Bio Intelligence, Chongqing University of Posts and Telecommunications, Chongqing, China
- Longxing Wang
- Chongqing Key Laboratory of Big Data for Bio Intelligence, Chongqing University of Posts and Telecommunications, Chongqing, China
- Mingyue Ma
- Chongqing Key Laboratory of Big Data for Bio Intelligence, Chongqing University of Posts and Telecommunications, Chongqing, China
- Institute of Life Sciences, Chongqing Medical University, Chongqing, China
3
Llinas-Bertran A, Butjosa-Espín M, Barberi V, Seoane JA. Multimodal data integration in early-stage breast cancer. Breast 2025; 80:103892. [PMID: 39922065 PMCID: PMC11973824 DOI: 10.1016/j.breast.2025.103892] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/10/2024] [Revised: 12/13/2024] [Accepted: 01/27/2025] [Indexed: 02/10/2025] Open
Abstract
The use of biomarkers in breast cancer has significantly improved patient outcomes through targeted therapies, such as hormone therapy, anti-Her2 therapy, and CDK4/6 or PARP inhibitors. However, existing knowledge does not fully encompass the diverse nature of breast cancer, particularly in triple-negative tumors. The integration of multi-omics and multimodal data has the potential to provide new insights into biological processes, improve breast cancer patient stratification, enhance prognosis and response prediction, and identify new biomarkers. This review presents a comprehensive overview of state-of-the-art multimodal (including molecular and image) data integration algorithms that were developed for, or are applicable to, breast cancer stratification, prognosis, or biomarker identification. We examined the primary challenges and opportunities of these multimodal data integration algorithms, including their advantages, limitations, and critical considerations for future research. We aimed to describe models that are not only academically and preclinically relevant, but also applicable to clinical settings.
Affiliation(s)
- Arnau Llinas-Bertran
- Cancer Computational Biology Group, Vall d'Hebron Institute of Oncology (VHIO), Barcelona, Spain
- Maria Butjosa-Espín
- Cancer Computational Biology Group, Vall d'Hebron Institute of Oncology (VHIO), Barcelona, Spain
- Vittoria Barberi
- Breast Cancer Group, Vall d'Hebron Institute of Oncology (VHIO), Barcelona, Spain
- Jose A Seoane
- Cancer Computational Biology Group, Vall d'Hebron Institute of Oncology (VHIO), Barcelona, Spain.
4
He R, Sarwal V, Qiu X, Zhuang Y, Zhang L, Liu Y, Chiang J. Generative AI Models in Time-Varying Biomedical Data: Scoping Review. J Med Internet Res 2025; 27:e59792. [PMID: 40063929 PMCID: PMC11933772 DOI: 10.2196/59792] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 04/30/2024] [Revised: 08/08/2024] [Accepted: 11/15/2024] [Indexed: 03/28/2025] Open
Abstract
BACKGROUND Trajectory modeling is a long-standing challenge in the application of computational methods to health care. In the age of big data, traditional statistical and machine learning methods do not achieve satisfactory results as they often fail to capture the complex underlying distributions of multimodal health data and long-term dependencies throughout medical histories. Recent advances in generative artificial intelligence (AI) have provided powerful tools to represent complex distributions and patterns with minimal underlying assumptions, with major impact in fields such as finance and environmental sciences, prompting researchers to apply these methods for disease modeling in health care. OBJECTIVE While AI methods have proven powerful, their application in clinical practice remains limited due to their highly complex nature. The proliferation of AI algorithms also poses a significant challenge for nondevelopers to track and incorporate these advances into clinical research and application. In this paper, we introduce basic concepts in generative AI and discuss current algorithms and how they can be applied to health care for practitioners with little background in computer science. METHODS We surveyed peer-reviewed papers on generative AI models with specific applications to time-series health data. Our search included single- and multimodal generative AI models that operated over structured and unstructured data, physiological waveforms, medical imaging, and multi-omics data. We introduce current generative AI methods, review their applications, and discuss their limitations and future directions in each data modality. RESULTS We followed the PRISMA-ScR (Preferred Reporting Items for Systematic Reviews and Meta-Analyses extension for Scoping Reviews) guidelines and reviewed 155 articles on generative AI applications to time-series health care data across modalities. 
Furthermore, we offer a systematic framework for clinicians to easily identify suitable AI methods for their data and task at hand. CONCLUSIONS We reviewed and critiqued existing applications of generative AI to time-series health data with the aim of bridging the gap between computational methods and clinical application. We also identified the shortcomings of existing approaches and highlighted recent advances in generative AI that represent promising directions for health care modeling.
Affiliation(s)
- Rosemary He
- Department of Computer Science, University of California, Los Angeles, Los Angeles, CA, United States
- Department of Computational Medicine, University of California, Los Angeles, Los Angeles, CA, United States
- Varuni Sarwal
- Department of Computer Science, University of California, Los Angeles, Los Angeles, CA, United States
- Department of Computational Medicine, University of California, Los Angeles, Los Angeles, CA, United States
- Xinru Qiu
- Division of Biomedical Sciences, School of Medicine, University of California Riverside, Riverside, CA, United States
- Yongwen Zhuang
- Department of Biostatistics, University of Michigan, Ann Arbor, MI, United States
- Le Zhang
- Institute for Integrative Genome Biology, University of California Riverside, Riverside, CA, United States
- Yue Liu
- Institute for Cellular and Molecular Biology, University of Texas at Austin, Austin, TX, United States
- Jeffrey Chiang
- Department of Computational Medicine, University of California, Los Angeles, Los Angeles, CA, United States
- Department of Neurosurgery, David Geffen School of Medicine, University of California, Los Angeles, Los Angeles, CA, United States
5
Buhari SB, Ghahremani Nezhad N, Normi YM, Mohd Shariff F, Leow TC. Homology modeling and thermostability enhancement of Vibrio palustris PETase via hydrophobic interactions. J Biomol Struct Dyn 2025:1-14. [PMID: 39844700 DOI: 10.1080/07391102.2024.2440646] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/20/2024] [Accepted: 06/21/2024] [Indexed: 01/24/2025]
Abstract
The quest for sustainable solutions to plastic pollution has driven research into plastic-degrading enzymes, offering promising avenues for polymer recycling applications. However, enzymes derived from natural sources often exhibit suboptimal thermostability, hindering their industrial viability. Protein engineering techniques have emerged as a powerful approach to enhance the desired properties of these biocatalysts. This study aims to conduct a comprehensive analysis of the thermostability of Vibrio palustris PETase (VpPETase) through an integrated computational approach encompassing homology modeling, site-specific molecular docking, molecular dynamics (MD) simulations, and comparative evaluation of a single-point mutation (V195F) against the wild-type enzyme. Homology modeling was used to build a VpPETase model from multiple templates. Model quality was rigorously assessed using Ramachandran plot analysis, ProSA, Verify 3D, and ERRAT. Molecular docking elucidated the catalytic region comprising residues His149, Asp117, and Ser71, while highlighting the pivotal roles of His149, Tyr1, and Ser71 in substrate binding affinity. MD simulations at various temperatures revealed higher stability at 313.15 K over a 100 ns trajectory, as evidenced by analyses of root-mean-square deviation (RMSD), radius of gyration (Rg), solvent-accessible surface area (SASA), hydrogen bonding, and root-mean-square fluctuation (RMSF). The V195F mutant exhibited a slight increase in stability compared to the wild type. While this study provides valuable insights into the thermostability of VpPETase, further investigations, including experimental validation of thermostability enhancements and in vitro characterization, are warranted to fully exploit the potential of this enzyme for industrial applications in plastic recycling.
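As an illustration of two of the stability metrics named in the abstract above, the following is a minimal sketch of RMSD and radius of gyration. It is not the authors' analysis pipeline. It assumes the two coordinate sets are already superimposed (MD tools normally align frames first) and uses an unweighted radius of gyration, whereas mass weighting is common in practice. The function names `rmsd` and `radius_of_gyration` are hypothetical.

```python
import math


def rmsd(coords, ref):
    """Root-mean-square deviation between two equally sized lists of (x, y, z) points.

    Assumption: both sets are already superimposed; no alignment is done here.
    """
    if len(coords) != len(ref) or not coords:
        raise ValueError("coordinate sets must be non-empty and equal in length")
    # Sum of squared displacements over all atoms and all three axes.
    sq = sum((a - b) ** 2 for p, q in zip(coords, ref) for a, b in zip(p, q))
    return math.sqrt(sq / len(coords))


def radius_of_gyration(coords):
    """Unweighted radius of gyration: RMS distance of the points from their centroid."""
    if not coords:
        raise ValueError("coordinate set must be non-empty")
    n = len(coords)
    centroid = [sum(p[k] for p in coords) / n for k in range(3)]
    sq = sum((p[k] - centroid[k]) ** 2 for p in coords for k in range(3))
    return math.sqrt(sq / n)
```

In a trajectory analysis, `rmsd` would typically be evaluated for every frame against a reference structure, and a flatter curve over time is read as higher structural stability.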
Affiliation(s)
- Sunusi Bataiya Buhari
- Enzyme and Microbial Technology Research Center, Faculty of Biotechnology and Biomolecular Sciences, Universiti Putra Malaysia, Serdang, Selangor, Malaysia
- Department of Cell and Molecular Biology, Faculty of Biotechnology and Biomolecular Sciences, Universiti Putra Malaysia, Serdang, Selangor, Malaysia
- Nima Ghahremani Nezhad
- Enzyme and Microbial Technology Research Center, Faculty of Biotechnology and Biomolecular Sciences, Universiti Putra Malaysia, Serdang, Selangor, Malaysia
- Department of Cell and Molecular Biology, Faculty of Biotechnology and Biomolecular Sciences, Universiti Putra Malaysia, Serdang, Selangor, Malaysia
- Yahaya M Normi
- Department of Cell and Molecular Biology, Faculty of Biotechnology and Biomolecular Sciences, Universiti Putra Malaysia, Serdang, Selangor, Malaysia
- Fairolniza Mohd Shariff
- Enzyme and Microbial Technology Research Center, Faculty of Biotechnology and Biomolecular Sciences, Universiti Putra Malaysia, Serdang, Selangor, Malaysia
- Department of Microbiology, Faculty of Biotechnology and Biomolecular Sciences, Universiti Putra Malaysia, Serdang, Selangor, Malaysia
- Thean Chor Leow
- Enzyme and Microbial Technology Research Center, Faculty of Biotechnology and Biomolecular Sciences, Universiti Putra Malaysia, Serdang, Selangor, Malaysia
- Department of Cell and Molecular Biology, Faculty of Biotechnology and Biomolecular Sciences, Universiti Putra Malaysia, Serdang, Selangor, Malaysia
- Institute of Bioscience, Universiti Putra Malaysia, Serdang, Selangor, Malaysia
6
Karlberg B, Kirchgaessner R, Lee J, Peterkort M, Beckman L, Goecks J, Ellrott K. SyntheVAEiser: augmenting traditional machine learning methods with VAE-based gene expression sample generation for improved cancer subtype predictions. Genome Biol 2024; 25:309. [PMID: 39696541 DOI: 10.1186/s13059-024-03431-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/11/2023] [Accepted: 10/30/2024] [Indexed: 12/20/2024] Open
Abstract
The accuracy of machine learning methods is often limited by the amount of training data that is available. We proposed to improve machine learning training regimes by augmenting datasets with synthetically generated samples. We present a method for synthesizing gene expression samples and test the system's capabilities for improving the accuracy of categorical prediction of cancer subtypes. We developed SyntheVAEiser, a variational autoencoder-based tool that was trained and tested on over 8000 cancer samples. We have shown that this technique can be used to augment machine learning tasks and increase recognition performance for underrepresented cohorts.
Affiliation(s)
- Brian Karlberg
- Biomedical Engineering, Oregon Health and Science University, 3181 S.W. Sam Jackson Park Road, Portland, OR, 97239-3098, USA
- Raphael Kirchgaessner
- Biomedical Engineering, Oregon Health and Science University, 3181 S.W. Sam Jackson Park Road, Portland, OR, 97239-3098, USA
- Jordan Lee
- Biomedical Engineering, Oregon Health and Science University, 3181 S.W. Sam Jackson Park Road, Portland, OR, 97239-3098, USA
- Matthew Peterkort
- Biomedical Engineering, Oregon Health and Science University, 3181 S.W. Sam Jackson Park Road, Portland, OR, 97239-3098, USA
- Liam Beckman
- Biomedical Engineering, Oregon Health and Science University, 3181 S.W. Sam Jackson Park Road, Portland, OR, 97239-3098, USA
- Jeremy Goecks
- Biomedical Engineering, Oregon Health and Science University, 3181 S.W. Sam Jackson Park Road, Portland, OR, 97239-3098, USA
- Department of Machine Learning, Moffitt Cancer Center, Tampa, USA
- Kyle Ellrott
- Biomedical Engineering, Oregon Health and Science University, 3181 S.W. Sam Jackson Park Road, Portland, OR, 97239-3098, USA.
7
Liu Z, Park T. DMOIT: denoised multi-omics integration approach based on transformer multi-head self-attention mechanism. Front Genet 2024; 15:1488683. [PMID: 39720180 PMCID: PMC11666520 DOI: 10.3389/fgene.2024.1488683] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/30/2024] [Accepted: 11/25/2024] [Indexed: 12/26/2024] Open
Abstract
Multi-omics data integration has become increasingly crucial for a deeper understanding of the complexity of biological systems. However, effectively integrating and analyzing multi-omics data remains challenging due to their heterogeneity and high dimensionality. Existing methods often struggle with noise, redundant features, and the complex interactions between different omics layers, leading to suboptimal performance. Additionally, they face difficulties in adequately capturing intra-omics interactions due to simplistic concatenation techniques, and they risk losing critical inter-omics interaction information when using hierarchical attention layers. To address these challenges, we propose a novel Denoised Multi-Omics Integration approach that leverages the Transformer multi-head self-attention mechanism (DMOIT). DMOIT consists of three key modules: a generative adversarial imputation network for handling missing values, a sampling-based robust feature selection module to reduce noise and redundant features, and a multi-head self-attention (MHSA) based feature extractor with a novel architecture that enhances the capture of intra-omics interactions. We validated model performance using cancer datasets from The Cancer Genome Atlas (TCGA), conducting two tasks: survival time classification across different cancer types and estrogen receptor status classification for breast cancer. Our results show that DMOIT outperforms traditional machine learning methods and the state-of-the-art integration method MoGCN in terms of accuracy and weighted F1 score. Furthermore, we compared DMOIT with various alternative MHSA-based architectures to further validate our approach. Our results show that DMOIT consistently outperforms these models across various cancer types and different omics combinations. The strong performance and robustness of DMOIT demonstrate its potential as a valuable tool for integrating multi-omics data across various applications.
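To make the multi-head self-attention (MHSA) building block mentioned in the abstract above concrete, here is a minimal NumPy sketch of generic scaled dot-product multi-head self-attention. It is not the DMOIT architecture, whose details are not given here. The function name, the weight matrices `Wq`, `Wk`, `Wv`, `Wo`, and the idea of treating each row of `X` as one feature embedding are illustrative assumptions.

```python
import numpy as np


def softmax(z, axis=-1):
    """Numerically stable softmax along the given axis."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def multi_head_self_attention(X, Wq, Wk, Wv, Wo, n_heads):
    """Generic MHSA over X of shape (n_tokens, d_model).

    Returns the output (n_tokens, d_model) and the attention
    weights (n_heads, n_tokens, n_tokens).
    """
    n_tokens, d_model = X.shape
    if d_model % n_heads != 0:
        raise ValueError("d_model must be divisible by n_heads")
    d_head = d_model // n_heads

    def split(A):
        # (n_tokens, d_model) -> (n_heads, n_tokens, d_head)
        return A.reshape(n_tokens, n_heads, d_head).transpose(1, 0, 2)

    Q, K, V = split(X @ Wq), split(X @ Wk), split(X @ Wv)
    # Scaled dot-product scores, computed independently per head.
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)
    attn = softmax(scores, axis=-1)
    heads = attn @ V  # (n_heads, n_tokens, d_head)
    # Concatenate heads back to (n_tokens, d_model), then project.
    concat = heads.transpose(1, 0, 2).reshape(n_tokens, d_model)
    return concat @ Wo, attn
```

Each row of the returned attention weights sums to one, so a row can be read as how strongly one feature attends to every other feature within the same head.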
Affiliation(s)
- Zhe Liu
- Interdisciplinary Program in Bioinformatics, Seoul National University, Seoul, Republic of Korea
- Taesung Park
- Interdisciplinary Program in Bioinformatics, Seoul National University, Seoul, Republic of Korea
- Department of Statistics, Seoul National University, Seoul, Republic of Korea
8
Vidanagamachchi SM, Waidyarathna KMGTR. Opportunities, challenges and future perspectives of using bioinformatics and artificial intelligence techniques on tropical disease identification using omics data. Front Digit Health 2024; 6:1471200. [PMID: 39654982 PMCID: PMC11625773 DOI: 10.3389/fdgth.2024.1471200] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/30/2024] [Accepted: 11/06/2024] [Indexed: 12/12/2024] Open
Abstract
Tropical diseases can be caused by viruses, bacteria, parasites, and fungi, and they can be spread by vectors. Analysis of multiple omics data types can provide comprehensive insights into biological system functions and disease progression. To this end, bioinformatics tools and diverse AI techniques are pivotal in identifying and understanding tropical diseases through the analysis of omics data. In this article, we provide a thorough review of opportunities, challenges, and future directions of utilizing bioinformatics tools and AI-assisted models for tropical disease identification using various omics data types. We conducted the review from 2015 to 2024 considering reliable databases of peer-reviewed journals and conference articles. Several keywords were used for the article search and around 40 articles were reviewed. According to the review, we observed that utilization of omics data with bioinformatics tools like BLAST and Clustal Omega can yield significant outcomes in tropical disease identification. Further, the integration of multiple omics data improves biomarker identification and disease predictions, including disease outbreak predictions. Moreover, AI-assisted models can improve the precision, cost-effectiveness, and efficiency of CRISPR-based gene editing, optimizing gRNA design, and supporting advanced genetic correction. Several AI-assisted models including XAI can be used to identify diseases and repurpose therapeutic targets and biomarkers efficiently. Furthermore, recent advancements, including Transformer-based models such as BERT and GPT-4, have been mainly applied for sequence analysis and functional genomics. Finally, the most recent GeneViT model, utilizing Vision Transformers, and other AI techniques like Generative Adversarial Networks, Federated Learning, Transfer Learning, Reinforcement Learning, Automated ML and Attention Mechanism have shown significant performance in disease classification using omics data.
Affiliation(s)
- S. M. Vidanagamachchi
- Department of Computer Science, Faculty of Science, University of Ruhuna, Matara, Sri Lanka
- K. M. G. T. R. Waidyarathna
- Department of Information Technology, Sri Lanka Institute of Advanced Technological Education, Galle, Sri Lanka
9
Ansari MI, Ahmed KT, Zhang W. Optimizing multi-omics data imputation with NMF and GAN synergy. Bioinformatics 2024; 40:btae674. [PMID: 39546381 PMCID: PMC11639186 DOI: 10.1093/bioinformatics/btae674] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/05/2024] [Revised: 10/21/2024] [Accepted: 11/08/2024] [Indexed: 11/17/2024] Open
Abstract
MOTIVATION Integrating multiple omics datasets can significantly advance our understanding of disease mechanisms, physiology, and treatment responses. However, a major challenge in multi-omics studies is the disparity in sample sizes across different datasets, which can introduce bias and reduce statistical power. To address this issue, we propose a novel framework, OmicsNMF, designed to impute missing omics data and enhance disease phenotype prediction. OmicsNMF integrates Generative Adversarial Networks (GANs) with Non-Negative Matrix Factorization (NMF). NMF is a well-established method for uncovering underlying patterns in omics data, while GANs enhance the imputation process by generating realistic data samples. This synergy aims to more effectively address sample size disparity, thereby improving data integration and prediction accuracy. RESULTS For evaluation, we focused on predicting breast cancer subtypes using the imputed data generated by our proposed framework, OmicsNMF. Our results indicate that OmicsNMF consistently outperforms baseline methods. We further assessed the quality of the imputed data through survival analysis, revealing that the imputed omics profiles provide significant prognostic power for both overall survival and disease-free status. Overall, OmicsNMF effectively leverages GANs and NMF to impute missing samples while preserving key biological features. This approach shows potential for advancing precision oncology by improving data integration and analysis. AVAILABILITY AND IMPLEMENTATION Source code is available at: https://github.com/compbiolabucf/OmicsNMF.
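The abstract above pairs NMF with a GAN for imputation. As a rough illustration of the NMF half only, here is a sketch of masked NMF with multiplicative updates, where missing entries are ignored during fitting and then filled from the low-rank reconstruction. It is not the OmicsNMF implementation (see the authors' repository for that). The function name `masked_nmf_impute`, its parameters, and the choice of multiplicative updates are assumptions made for this example.

```python
import numpy as np


def masked_nmf_impute(X, mask, rank=2, n_iter=500, eps=1e-9, seed=0):
    """Impute missing entries of a non-negative matrix with masked NMF.

    X    : (n, m) array; entries where mask is False are ignored (may be NaN).
    mask : (n, m) boolean array, True where X is observed.
    Returns (X_imputed, W, H) with X approximated by W @ H on observed entries.
    """
    rng = np.random.default_rng(seed)
    n, m = X.shape
    W = rng.random((n, rank)) + eps
    H = rng.random((rank, m)) + eps
    Xo = np.where(mask, X, 0.0)  # zero out missing entries (also drops NaN)
    M = mask.astype(float)
    for _ in range(n_iter):
        # Multiplicative updates restricted to the observed entries.
        W *= (Xo @ H.T) / ((M * (W @ H)) @ H.T + eps)
        H *= (W.T @ Xo) / (W.T @ (M * (W @ H)) + eps)
    X_hat = W @ H
    # Keep observed values; fill only the missing ones from the reconstruction.
    return np.where(mask, X, X_hat), W, H
```

In a multi-omics setting, `X` could hold one omics layer stacked with another so that samples lacking a layer appear as masked blocks. How the GAN then refines such imputed profiles is specific to the paper and is not sketched here.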
Affiliation(s)
- Md Istiaq Ansari
- Department of Computer Science, University of Central Florida, Orlando, FL 32816, United States
- Department of Genomics and Bioinformatics Cluster, University of Central Florida, Orlando, FL 32816, United States
- Khandakar Tanvir Ahmed
- Department of Computer Science, University of Central Florida, Orlando, FL 32816, United States
- Department of Genomics and Bioinformatics Cluster, University of Central Florida, Orlando, FL 32816, United States
- Wei Zhang
- Department of Computer Science, University of Central Florida, Orlando, FL 32816, United States
- Department of Genomics and Bioinformatics Cluster, University of Central Florida, Orlando, FL 32816, United States
10
Liang H, Luo H, Sang Z, Jia M, Jiang X, Wang Z, Cong S, Yao X. GREMI: An Explainable Multi-Omics Integration Framework for Enhanced Disease Prediction and Module Identification. IEEE J Biomed Health Inform 2024; 28:6983-6996. [PMID: 39110558 DOI: 10.1109/jbhi.2024.3439713] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 08/10/2024]
Abstract
Multi-omics integration has demonstrated promising performance in complex disease prediction. However, existing research typically focuses on maximizing prediction accuracy, while often neglecting the essential task of discovering meaningful biomarkers. This issue is particularly important in biomedicine, as molecules often interact rather than function individually to influence disease outcomes. To this end, we propose a two-phase framework named GREMI to assist multi-omics classification and explanation. In the prediction phase, we propose to improve prediction performance by employing a graph attention architecture on sample-wise co-functional networks to incorporate biomolecular interaction information for enhanced feature representation, followed by the integration of a joint-late mixed strategy and the true-class-probability block to adaptively evaluate classification confidence at both feature and omics levels. In the interpretation phase, we propose a multi-view approach to explain disease outcomes from the interaction module perspective, providing a more intuitive understanding and biomedical rationale. We incorporate Monte Carlo tree search (MCTS) to explore local-view subgraphs and pinpoint modules that highly contribute to disease characterization from the global-view. Extensive experiments demonstrate that the proposed framework outperforms state-of-the-art methods in seven different classification tasks, and our model effectively addresses data mutual interference when the number of omics types increases. We further illustrate the functional- and disease-relevance of the identified modules, as well as validate the classification performance of discovered modules using an independent cohort.
11
Ballard JL, Wang Z, Li W, Shen L, Long Q. Deep learning-based approaches for multi-omics data integration and analysis. BioData Min 2024; 17:38. [PMID: 39358793 PMCID: PMC11446004 DOI: 10.1186/s13040-024-00391-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/24/2024] [Accepted: 09/06/2024] [Indexed: 10/04/2024] Open
Abstract
BACKGROUND The rapid growth of deep learning, as well as the vast and ever-growing amount of available data, have provided ample opportunity for advances in fusion and analysis of complex and heterogeneous data types. Different data modalities provide complementary information that can be leveraged to gain a more complete understanding of each subject. In the biomedical domain, multi-omics data includes molecular (genomics, transcriptomics, proteomics, epigenomics, metabolomics, etc.) and imaging (radiomics, pathomics) modalities which, when combined, have the potential to improve performance on prediction, classification, clustering and other tasks. Deep learning encompasses a wide variety of methods, each of which has certain strengths and weaknesses for multi-omics integration. METHOD In this review, we categorize recent deep learning-based approaches by their basic architectures and discuss their unique capabilities in relation to one another. We also discuss some emerging themes advancing the field of multi-omics integration. RESULTS Deep learning-based multi-omics integration methods were categorized broadly into non-generative (feedforward neural networks, graph convolutional neural networks, and autoencoders) and generative (variational methods, generative adversarial models, and a generative pretrained model). Generative methods have the advantage of being able to impose constraints on the shared representations to enforce certain properties or incorporate prior knowledge. They can also be used to generate or impute missing modalities. Recent advances achieved by these methods include the ability to handle incomplete data as well as going beyond the traditional molecular omics data types to integrate other modalities such as imaging data. CONCLUSION We expect to see further growth in methods that can handle missingness, as this is a common challenge in working with complex and heterogeneous data.
Additionally, methods that integrate more data types are expected to improve performance on downstream tasks by capturing a comprehensive view of each sample.
Affiliation(s)
- Jenna L Ballard
- Graduate Group in Genomics and Computational Biology, Perelman School of Medicine, University of Pennsylvania, 3700 Hamilton Walk, Philadelphia, PA, 19104, USA.
- Zexuan Wang
- Graduate Group in Applied Mathematics and Computational Science, University of Pennsylvania, 209 S. 33rd Street, Philadelphia, PA, 19104, USA
- Wenrui Li
- Department of Statistics, University of Connecticut, 215 Glenbrook Road, Storrs, CT, 06269, USA
- Li Shen
- Department of Biostatistics, Epidemiology and Informatics, Perelman School of Medicine, University of Pennsylvania, 423 Guardian Drive, Philadelphia, PA, 19104, USA.
- Qi Long
- Department of Biostatistics, Epidemiology and Informatics, Perelman School of Medicine, University of Pennsylvania, 423 Guardian Drive, Philadelphia, PA, 19104, USA.
| |
|
12
|
Kook L, Lundborg AR. Algorithm-agnostic significance testing in supervised learning with multimodal data. Brief Bioinform 2024; 25:bbae475. [PMID: 39323092 PMCID: PMC11424510 DOI: 10.1093/bib/bbae475] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/30/2024] [Revised: 09/05/2024] [Accepted: 09/10/2024] [Indexed: 09/27/2024] Open
Abstract
MOTIVATION Valid statistical inference is crucial for decision-making but difficult to obtain in supervised learning with multimodal data, e.g. combinations of clinical features, genomic data, and medical images. Multimodal data often warrants the use of black-box algorithms, for instance, random forests or neural networks, which impede the use of traditional variable significance tests. RESULTS We address this problem by proposing the use of COvariance MEasure Tests (COMETs), which are calibrated and powerful tests that can be combined with any sufficiently predictive supervised learning algorithm. We apply COMETs to several high-dimensional, multimodal data sets to illustrate (i) variable significance testing for finding relevant mutations modulating drug-activity, (ii) modality selection for predicting survival in liver cancer patients with multiomics data, and (iii) modality selection with clinical features and medical imaging data. In all applications, COMETs yield results consistent with domain knowledge without requiring data-driven pre-processing, which may invalidate type I error control. These novel applications with high-dimensional multimodal data corroborate prior results on the power and robustness of COMETs for significance testing. AVAILABILITY AND IMPLEMENTATION COMETs are implemented in the comets R package available on CRAN and the pycomets Python library available on GitHub. Source code for reproducing all results is available at https://github.com/LucasKook/comets. All data sets used in this work are openly available.
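The idea of a covariance measure test described in this abstract can be illustrated with a minimal Python sketch. It is not the comets or pycomets implementation: the function names `fit_line` and `gcm_test` are hypothetical, a one-covariate least-squares fit stands in for the "sufficiently predictive" learner, and the statistic follows the generalised-covariance-measure form (scaled mean of residual products, compared with a standard normal).

```python
import math
import random


def fit_line(z, t):
    """Least-squares fit of t on a single covariate z; returns the residuals."""
    n = len(z)
    mz, mt = sum(z) / n, sum(t) / n
    szz = sum((a - mz) ** 2 for a in z)
    slope = sum((a - mz) * (b - mt) for a, b in zip(z, t)) / szz
    return [b - (mt + slope * (a - mz)) for a, b in zip(z, t)]


def gcm_test(x, y, z):
    """Covariance-measure style test of H0: Y is independent of X given Z."""
    rx = fit_line(z, x)  # part of X not explained by Z
    ry = fit_line(z, y)  # part of Y not explained by Z
    prod = [a * b for a, b in zip(rx, ry)]
    n = len(prod)
    mean = sum(prod) / n
    var = sum(p * p for p in prod) / n - mean ** 2
    stat = math.sqrt(n) * mean / math.sqrt(var)
    p_value = math.erfc(abs(stat) / math.sqrt(2))  # two-sided normal p-value
    return stat, p_value


# Simulated data (an assumption for illustration only).
rng = random.Random(0)
z = [rng.gauss(0, 1) for _ in range(500)]
x = [a + rng.gauss(0, 1) for a in z]
y_null = [2 * a + rng.gauss(0, 1) for a in z]                 # depends on Z only
y_alt = [2 * a + b + rng.gauss(0, 1) for a, b in zip(z, x)]   # also depends on X
```

Calling `gcm_test(x, y_alt, z)` should give a very small p-value, while `gcm_test(x, y_null, z)` tests a true null. In practice the linear fit would be replaced by whichever regression method predicts well on the data at hand.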
Affiliation(s)
- Lucas Kook
- Institute for Statistics and Mathematics, Vienna University of Economics and Business, Welthandelsplatz 1, AT-1020 Vienna, Austria
| | - Anton Rask Lundborg
- Department of Mathematical Sciences, University of Copenhagen, Universitetsparken 5, DK-2100 Copenhagen, Denmark
| |
|
13
|
Casotti MC, Meira DD, Zetum ASS, Campanharo CV, da Silva DRC, Giacinti GM, da Silva IM, Moura JAD, Barbosa KRM, Altoé LSC, Mauricio LSR, Góes LSBDB, Alves LNR, Linhares SSG, Ventorim VDP, Guaitolini YM, dos Santos EDVW, Errera FIV, Groisman S, de Carvalho EF, de Paula F, de Sousa MVP, Fechine PBA, Louro ID. Integrating frontiers: a holistic, quantum and evolutionary approach to conquering cancer through systems biology and multidisciplinary synergy. Front Oncol 2024; 14:1419599. [PMID: 39224803 PMCID: PMC11367711 DOI: 10.3389/fonc.2024.1419599] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/30/2024] [Accepted: 07/31/2024] [Indexed: 09/04/2024] Open
Abstract
Cancer therapy is facing increasingly significant challenges, marked by a wide range of techniques and research efforts centered around somatic mutations, precision oncology, and the vast amount of big data. Despite this abundance of information, the quest to cure cancer often seems more elusive, with the "war on cancer" yet to deliver a definitive victory. A particularly pressing issue is the development of tumor treatment resistance, highlighting the urgent need for innovative approaches. Evolutionary, Quantum Biology and System Biology offer a promising framework for advancing experimental cancer research. By integrating theoretical studies, translational methods, and flexible multidisciplinary clinical research, there's potential to enhance current treatment strategies and improve outcomes for cancer patients. Establishing stronger links between evolutionary, quantum, entropy and chaos principles and oncology could lead to more effective treatments that leverage an understanding of the tumor's evolutionary dynamics, paving the way for novel methods to control and mitigate cancer. Achieving these objectives necessitates a commitment to multidisciplinary and interprofessional collaboration at the heart of both research and clinical endeavors in oncology. This entails dismantling silos between disciplines, encouraging open communication and data sharing, and integrating diverse viewpoints and expertise from the outset of research projects. Being receptive to new scientific discoveries and responsive to how patients react to treatments is also crucial. Such strategies are key to keeping the field of oncology at the forefront of effective cancer management, ensuring patients receive the most personalized and effective care. Ultimately, this approach aims to push the boundaries of cancer understanding, treating it as a manageable chronic condition, aiming to extend life expectancy and enhance patient quality of life.
Affiliation(s)
- Matheus Correia Casotti
- Núcleo de Genética Humana e Molecular, Universidade Federal do Espírito Santo (UFES), Vitória, ES, Brazil
| | - Débora Dummer Meira
- Núcleo de Genética Humana e Molecular, Universidade Federal do Espírito Santo (UFES), Vitória, ES, Brazil
| | | | | | | | - Giulia Maria Giacinti
- Núcleo de Genética Humana e Molecular, Universidade Federal do Espírito Santo (UFES), Vitória, ES, Brazil
| | - Iris Moreira da Silva
- Núcleo de Genética Humana e Molecular, Universidade Federal do Espírito Santo (UFES), Vitória, ES, Brazil
| | - João Augusto Diniz Moura
- Laboratório de Oncologia Clínica e Experimental, Universidade Federal do Espírito Santo (UFES), Vitória, ES, Brazil
| | - Karen Ruth Michio Barbosa
- Núcleo de Genética Humana e Molecular, Universidade Federal do Espírito Santo (UFES), Vitória, ES, Brazil
| | - Lorena Souza Castro Altoé
- Núcleo de Genética Humana e Molecular, Universidade Federal do Espírito Santo (UFES), Vitória, ES, Brazil
| | | | | | - Lyvia Neves Rebello Alves
- Núcleo de Genética Humana e Molecular, Universidade Federal do Espírito Santo (UFES), Vitória, ES, Brazil
| | | | - Vinícius do Prado Ventorim
- Núcleo de Genética Humana e Molecular, Universidade Federal do Espírito Santo (UFES), Vitória, ES, Brazil
| | - Yasmin Moreto Guaitolini
- Núcleo de Genética Humana e Molecular, Universidade Federal do Espírito Santo (UFES), Vitória, ES, Brazil
| | | | | | - Sonia Groisman
- Instituto de Biologia Roberto Alcântara Gomes (IBRAG), Universidade do Estado do Rio de Janeiro (UERJ), Rio de Janeiro, RJ, Brazil
| | - Elizeu Fagundes de Carvalho
- Instituto de Biologia Roberto Alcântara Gomes (IBRAG), Universidade do Estado do Rio de Janeiro (UERJ), Rio de Janeiro, RJ, Brazil
| | - Flavia de Paula
- Núcleo de Genética Humana e Molecular, Universidade Federal do Espírito Santo (UFES), Vitória, ES, Brazil
| | | | - Pierre Basílio Almeida Fechine
- Group of Chemistry of Advanced Materials (GQMat), Department of Analytical Chemistry and Physical-Chemistry, Federal University of Ceará (UFC), Fortaleza, CE, Brazil
| | - Iuri Drumond Louro
- Núcleo de Genética Humana e Molecular, Universidade Federal do Espírito Santo (UFES), Vitória, ES, Brazil
| |
|
14
|
Zhao Y, Li X, Zhou C, Peng H, Zheng Z, Chen J, Ding W. A review of cancer data fusion methods based on deep learning. Information Fusion 2024; 108:102361. [DOI: 10.1016/j.inffus.2024.102361] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/04/2025]
|
15
|
Li H, Zhou Y, Zhao N, Wang Y, Lai Y, Zeng F, Yang F. ISMI-VAE: A deep learning model for classifying disease cells using gene expression and SNV data. Comput Biol Med 2024; 175:108485. [PMID: 38653063 DOI: 10.1016/j.compbiomed.2024.108485] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/18/2023] [Revised: 04/03/2024] [Accepted: 04/15/2024] [Indexed: 04/25/2024]
Abstract
Various studies have linked several diseases, including cancer and COVID-19, to single nucleotide variations (SNV). Although single-cell RNA sequencing (scRNA-seq) technology can provide SNV and gene expression data, few studies have integrated and analyzed these multimodal data. To address this issue, we introduce Interpretable Single-cell Multimodal Data Integration Based on Variational Autoencoder (ISMI-VAE). ISMI-VAE leverages latent variable models that utilize the characteristics of SNV and gene expression data to overcome high noise levels and uses deep learning techniques to integrate multimodal information, map them to a low-dimensional space, and classify disease cells. Moreover, ISMI-VAE introduces an attention mechanism to reflect feature importance and analyze genetic features that could potentially cause disease. Experimental results on three cancer data sets and one COVID-19 data set demonstrate that ISMI-VAE surpasses the baseline method in terms of both effectiveness and interpretability and can effectively identify disease-causing gene features.
Affiliation(s)
- Han Li
- Department of Automation, Xiamen University, Xiamen, China; National Institute for Data Science in Health and Medicine, Xiamen University, Xiamen, 361005, China; Xiamen Key Laboratory of Big Data Intelligent Analysis and Decision Making, Xiamen University, Xiamen, 361000, China
| | - Yitao Zhou
- Department of Automation, Xiamen University, Xiamen, China; National Institute for Data Science in Health and Medicine, Xiamen University, Xiamen, 361005, China; Xiamen Key Laboratory of Big Data Intelligent Analysis and Decision Making, Xiamen University, Xiamen, 361000, China
| | - Ningyuan Zhao
- Department of Automation, Xiamen University, Xiamen, China
| | - Ying Wang
- Department of Automation, Xiamen University, Xiamen, China; National Institute for Data Science in Health and Medicine, Xiamen University, Xiamen, 361005, China; Xiamen Key Laboratory of Big Data Intelligent Analysis and Decision Making, Xiamen University, Xiamen, 361000, China
| | - Yongxuan Lai
- School of Informatics, Xiamen University, Xiamen, China
| | - Feng Zeng
- Department of Automation, Xiamen University, Xiamen, China; National Institute for Data Science in Health and Medicine, Xiamen University, Xiamen, 361005, China; Xiamen Key Laboratory of Big Data Intelligent Analysis and Decision Making, Xiamen University, Xiamen, 361000, China; State Key Laboratory of Cellular Stress Biology, School of Life Sciences, Xiamen University, China; Research Unit of Cellular Stress of CAMS, Cancer Research Center, School of Medicine, Xiamen University, China.
| | - Fan Yang
- Department of Automation, Xiamen University, Xiamen, China; National Institute for Data Science in Health and Medicine, Xiamen University, Xiamen, 361005, China; Xiamen Key Laboratory of Big Data Intelligent Analysis and Decision Making, Xiamen University, Xiamen, 361000, China.
| |
|
16
|
Afroz S, Islam N, Habib MA, Reza MS, Ashad Alam M. Multi-omics data integration and drug screening of AML cancer using Generative Adversarial Network. Methods 2024; 226:138-150. [PMID: 38670415 DOI: 10.1016/j.ymeth.2024.04.017] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/18/2023] [Revised: 04/02/2024] [Accepted: 04/20/2024] [Indexed: 04/28/2024] Open
Abstract
In the era of precision medicine, accurate disease phenotype prediction for heterogeneous diseases, such as cancer, is emerging due to advanced technologies that link genotypes and phenotypes. However, it is difficult to integrate different types of biological data because they are so varied. In this study, we focused on predicting the traits of a blood cancer called Acute Myeloid Leukemia (AML) by combining different kinds of biological data. We used a recently developed method called Omics Generative Adversarial Network (GAN) to better classify cancer outcomes. The primary advantages of a GAN include its ability to create synthetic data that is nearly indistinguishable from real data, its high flexibility, and its wide range of applications, including multi-omics data analysis. In addition, the GAN was effective at combining two types of biological data. We created synthetic datasets for gene activity and DNA methylation. Our method was more accurate in predicting disease traits than using the original data alone. The experimental results provided evidence that the creation of synthetic data through interacting multi-omics data analysis using GANs improves the overall prediction quality. Furthermore, we identified the top-ranked significant genes through statistical methods and pinpointed potential candidate drug agents through in-silico studies. The proposed drugs, also supported by other independent studies, might play a crucial role in the treatment of AML cancer. The code is available on GitHub: https://github.com/SabrinAfroz/omicsGAN_codes.
Affiliation(s)
- Sabrin Afroz
- Department of Information and Communication Technology, Mawlana Bhashani Science and Technology University, Bangladesh
| | - Nadira Islam
- Department of Information and Communication Technology, Mawlana Bhashani Science and Technology University, Bangladesh
| | - Md Ahsan Habib
- Department of Information and Communication Technology, Mawlana Bhashani Science and Technology University, Bangladesh; Statistical Learning Group, Bangladesh
| | - Md Selim Reza
- Tulane Center for Biomedical Informatics and Genomics, Deming Department of Medicine, Tulane University, New Orleans, LA 70112, USA; Statistical Learning Group, Bangladesh
| | - Md Ashad Alam
- Ochsner Center for Outcomes Research, Ochsner Research, Ochsner Clinic Foundation, New Orleans, LA 70121, USA; Statistical Learning Group, Bangladesh.
| |
|
17
|
Giansanti V, Giannese F, Botrugno OA, Gandolfi G, Balestrieri C, Antoniotti M, Tonon G, Cittaro D. Scalable integration of multiomic single-cell data using generative adversarial networks. Bioinformatics 2024; 40:btae300. [PMID: 38696763 PMCID: PMC11654621 DOI: 10.1093/bioinformatics/btae300] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/05/2024] [Revised: 03/22/2024] [Accepted: 04/30/2024] [Indexed: 05/04/2024] Open
Abstract
MOTIVATION Single-cell profiling has become a common practice to investigate the complexity of tissues, organs, and organisms. Recent technological advances are expanding our capabilities to profile various molecular layers beyond the transcriptome such as, but not limited to, the genome, the epigenome, and the proteome. Depending on the experimental procedure, these data can be obtained from separate assays or the very same cells. Yet, integration of more than two assays is currently not supported by the majority of the computational frameworks available. RESULTS We here propose a Multi-Omic data integration framework based on Wasserstein Generative Adversarial Networks suitable for the analysis of paired or unpaired data with a high number of modalities (>2). At the core of our strategy is a single network trained on all modalities together, limiting the computational burden when many molecular layers are evaluated. AVAILABILITY AND IMPLEMENTATION Source code of our framework is available at https://github.com/vgiansanti/MOWGAN.
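For readers unfamiliar with the distance that a Wasserstein GAN critic is trained to approximate, the sketch below computes the Wasserstein-1 distance in the one case where it has a simple closed form: two one-dimensional empirical samples of equal size. It is only an illustration of the metric, not part of MOWGAN; the function name `wasserstein_1d` is hypothetical.

```python
def wasserstein_1d(a, b):
    """Wasserstein-1 distance between two equal-size 1-D empirical samples.

    In one dimension the optimal transport plan matches sorted values,
    so the distance is the mean absolute difference of the order statistics.
    """
    if len(a) != len(b):
        raise ValueError("this sketch assumes equal sample sizes")
    return sum(abs(p - q) for p, q in zip(sorted(a), sorted(b))) / len(a)


# Shifting every point by 1 moves the distribution a distance of exactly 1.
shifted = wasserstein_1d([0.0, 1.0, 2.0], [3.0, 1.0, 2.0])
```

In higher dimensions no such shortcut exists, which is one reason a learned critic is used instead.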
Affiliation(s)
- Valentina Giansanti
- Department of Informatics, Systems and Communication, Università degli Studi di Milano-Bicocca, Milan, 20125, Italy
- Center for Omics Sciences, IRCCS San Raffaele Scientific Institute, Milan, 20132, Italy
| | - Francesca Giannese
- Center for Omics Sciences, IRCCS San Raffaele Scientific Institute, Milan, 20132, Italy
| | - Oronza A Botrugno
- Functional Genomics of Cancer Unit, IRCCS San Raffaele Scientific Institute, Milan, 20132, Italy
- Università Vita-Salute San Raffaele, Milan, 20132, Italy
| | - Giorgia Gandolfi
- Center for Omics Sciences, IRCCS San Raffaele Scientific Institute, Milan, 20132, Italy
| | - Chiara Balestrieri
- Center for Omics Sciences, IRCCS San Raffaele Scientific Institute, Milan, 20132, Italy
- Experimental Hematology Unit, IRCCS San Raffaele Scientific Institute, Milan, 20132, Italy
| | - Marco Antoniotti
- Department of Informatics, Systems and Communication, Università degli Studi di Milano-Bicocca, Milan, 20125, Italy
- Bicocca Bioinformatics Biostatistics and Bioimaging Centre-B4, Università degli Studi di Milano-Bicocca, Milan, 20125, Italy
- Istituto di Bioimmagini e Fisiologia Molecolare, Consiglio Nazionale delle Ricerche (CNR), Milan, 20090, Italy
| | - Giovanni Tonon
- Center for Omics Sciences, IRCCS San Raffaele Scientific Institute, Milan, 20132, Italy
- Functional Genomics of Cancer Unit, IRCCS San Raffaele Scientific Institute, Milan, 20132, Italy
- Università Vita-Salute San Raffaele, Milan, 20132, Italy
| | - Davide Cittaro
- Center for Omics Sciences, IRCCS San Raffaele Scientific Institute, Milan, 20132, Italy
| |
|
18
|
Yoon JH, Lee D, Lee C, Cho E, Lee S, Cazenave-Gassiot A, Kim K, Chae S, Dennis EA, Suh PG. Paradigm shift required for translational research on the brain. Exp Mol Med 2024; 56:1043-1054. [PMID: 38689090 PMCID: PMC11148129 DOI: 10.1038/s12276-024-01218-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/13/2023] [Revised: 02/07/2024] [Accepted: 02/20/2024] [Indexed: 05/02/2024] Open
Abstract
Biomedical research on the brain has led to many discoveries and developments, such as understanding human consciousness and the mind and overcoming brain diseases. However, historical biomedical research on the brain has unique characteristics that differ from those of conventional biomedical research. For example, there are different scientific interpretations due to the high complexity of the brain and insufficient intercommunication between researchers of different disciplines owing to the limited conceptual and technical overlap of distinct backgrounds. Therefore, the development of biomedical research on the brain has been slower than that in other areas. Brain biomedical research has recently undergone a paradigm shift, and conducting patient-centered, large-scale brain biomedical research has become possible using emerging high-throughput analysis tools. Neuroimaging, multiomics, and artificial intelligence technology are the main drivers of this new approach, foreshadowing dramatic advances in translational research. In addition, emerging interdisciplinary cooperative studies provide insights into how unresolved questions in biomedicine can be addressed. This review presents the in-depth aspects of conventional biomedical research and discusses the future of biomedical research on the brain.
Affiliation(s)
- Jong Hyuk Yoon
- Neurodegenerative Diseases Research Group, Korea Brain Research Institute, Daegu, 41062, Republic of Korea.
| | - Dongha Lee
- Cognitive Science Research Group, Korea Brain Research Institute, Daegu, 41062, Republic of Korea
| | - Chany Lee
- Cognitive Science Research Group, Korea Brain Research Institute, Daegu, 41062, Republic of Korea
| | - Eunji Cho
- Neurodegenerative Diseases Research Group, Korea Brain Research Institute, Daegu, 41062, Republic of Korea
| | - Seulah Lee
- Neurodegenerative Diseases Research Group, Korea Brain Research Institute, Daegu, 41062, Republic of Korea
| | - Amaury Cazenave-Gassiot
- Department of Biochemistry and Precision Medicine Translational Research Program, Yong Loo Lin School of Medicine, National University of Singapore, Singapore, 119077, Singapore
- Singapore Lipidomics Incubator (SLING), Life Sciences Institute, National University of Singapore, Singapore, 117456, Singapore
| | - Kipom Kim
- Research Strategy Office, Korea Brain Research Institute, Daegu, 41062, Republic of Korea
| | - Sehyun Chae
- Neurovascular Unit Research Group, Korea Brain Research Institute, Daegu, 41062, Republic of Korea
| | - Edward A Dennis
- Department of Pharmacology and Department of Chemistry and Biochemistry, University of California, San Diego, La Jolla, CA, 92093-0601, USA
| | - Pann-Ghill Suh
- Korea Brain Research Institute, Daegu, 41062, Republic of Korea
| |
|
19
|
Vallevik VB, Babic A, Marshall SE, Elvatun S, Brøgger HMB, Alagaratnam S, Edwin B, Veeraragavan NR, Befring AK, Nygård JF. Can I trust my fake data - A comprehensive quality assessment framework for synthetic tabular data in healthcare. Int J Med Inform 2024; 185:105413. [PMID: 38493547 DOI: 10.1016/j.ijmedinf.2024.105413] [Citation(s) in RCA: 8] [Impact Index Per Article: 8.0] [Received: 12/15/2023] [Revised: 02/17/2024] [Accepted: 03/11/2024] [Indexed: 03/19/2024]
Abstract
BACKGROUND Ensuring safe adoption of AI tools in healthcare hinges on access to sufficient data for training, testing and validation. Synthetic data has been suggested in response to privacy concerns and regulatory requirements and can be created by training a generator on real data to produce a dataset with similar statistical properties. Competing metrics with differing taxonomies for quality evaluation have been proposed, resulting in a complex landscape. Optimising quality entails balancing considerations that make the data fit for use, yet relevant dimensions are left out of existing frameworks. METHOD We performed a comprehensive literature review on the use of quality evaluation metrics on synthetic data within the scope of synthetic tabular healthcare data using deep generative methods. Based on this and the collective team experiences, we developed a conceptual framework for quality assurance. The applicability was benchmarked against a practical case from the Dutch National Cancer Registry. CONCLUSION We present a conceptual framework for quality assurance of synthetic data for AI applications in healthcare that aligns diverging taxonomies, expands on common quality dimensions to include the dimensions of Fairness and Carbon footprint, and proposes stages necessary to support real-life applications. Building trust in synthetic data by increasing transparency and reducing the safety risk will accelerate the development and uptake of trustworthy AI tools for the benefit of patients. DISCUSSION Despite the growing emphasis on algorithmic fairness and carbon footprint, these metrics were scarce in the literature review. The overwhelming focus was on statistical similarity using distance metrics while sequential logic detection was scarce. A consensus-backed framework that includes all relevant quality dimensions can provide assurance for safe and responsible real-life applications of synthetic data. As the choice of appropriate metrics is highly context-dependent, further research is needed on validation studies to guide metric choices and support the development of technical standards.
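The "statistical similarity using distance metrics" mentioned in this abstract can be made concrete with a small Python sketch. It computes the total variation distance between the category frequencies of one real column and one synthetic column. The choice of this particular metric and the function name `total_variation` are assumptions for illustration; the abstract does not say which distances the reviewed studies used.

```python
from collections import Counter


def total_variation(real, synthetic):
    """Total variation distance between two categorical columns.

    Returns 0.0 when the category frequencies match exactly and 1.0 when
    the two columns share no categories at all.
    """
    p, q = Counter(real), Counter(synthetic)
    n_p, n_q = len(real), len(synthetic)
    # Counter returns 0 for categories missing from one of the columns.
    return 0.5 * sum(abs(p[k] / n_p - q[k] / n_q) for k in set(p) | set(q))


# Hypothetical column values, for illustration only.
real_col = ["a", "a", "b", "b"]
synth_col = ["a", "b", "b", "b"]
distance = total_variation(real_col, synth_col)
```

A per-column distance like this captures only marginal similarity. It says nothing about fairness, privacy, or relationships between columns, which is the kind of gap the framework in the abstract is concerned with.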
Affiliation(s)
- Vibeke Binz Vallevik
- University of Oslo, Boks 1072 Blindern, NO-0316 Oslo, Norway; DNV AS, Veritasveien 1, 1322 Høvik, Norway.
| | | | | | - Severin Elvatun
- Cancer Registry of Norway, Ullernchausseen 64, 0379 Oslo, Norway
| | - Helga M B Brøgger
- DNV AS, Veritasveien 1, 1322 Høvik, Norway; Oslo University Hospital, Sognsvannsveien 20, 0372 Oslo, Norway
| | | | - Bjørn Edwin
- University of Oslo, Boks 1072 Blindern, NO-0316 Oslo, Norway; The Intervention Centre and Department of HPB Surgery, Oslo University Hospital and Institute of Clinical Medicine, Faculty of Medicine, University of Oslo, Oslo, Norway
| | | | | | - Jan F Nygård
- Cancer Registry of Norway, Ullernchausseen 64, 0379 Oslo, Norway; UiT - The Arctic University of Norway, Tromsø, Norway
| |
|
20
|
Shannon CP, Lee AH, Tebbutt SJ, Singh A. A Commentary on Multi-omics Data Integration in Systems Vaccinology. J Mol Biol 2024; 436:168522. [PMID: 38458605 DOI: 10.1016/j.jmb.2024.168522] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/01/2023] [Revised: 03/04/2024] [Accepted: 03/04/2024] [Indexed: 03/10/2024]
Affiliation(s)
| | - Amy Hy Lee
- Department of Molecular Biology and Biochemistry, Simon Fraser University, Burnaby, Canada
| | - Scott J Tebbutt
- PROOF Centre of Excellence, Vancouver, Canada; Department of Medicine, The University of British Columbia, Vancouver, Canada; Centre for Heart Lung Innovation, Vancouver, Canada
| | - Amrit Singh
- Centre for Heart Lung Innovation, Vancouver, Canada; Department of Anesthesiology, Pharmacology and Therapeutics, The University of British Columbia, Vancouver, Canada.
| |
|
21
|
Seo B, Lee D, Jeon H, Ha J, Suh S. MotGen: a closed-loop bacterial motility control framework using generative adversarial networks. Bioinformatics 2024; 40:btae170. [PMID: 38552318 PMCID: PMC11031359 DOI: 10.1093/bioinformatics/btae170] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/20/2023] [Revised: 03/02/2024] [Accepted: 03/27/2024] [Indexed: 04/21/2024] Open
Abstract
MOTIVATION Many organisms' survival and behavior hinge on their responses to environmental signals. While research on bacteria-directed therapeutic agents has increased, systematic exploration of real-time modulation of bacterial motility remains limited. Current studies often focus on permanent motility changes through genetic alterations, restricting the ability to modulate bacterial motility dynamically on a large scale. To address this gap, we propose a novel real-time control framework for systematically modulating bacterial motility dynamics. RESULTS We introduce MotGen, a deep learning approach leveraging Generative Adversarial Networks to analyze swimming performance statistics of motile bacteria based on live cell imaging data. By tracking objects and optimizing cell trajectory mapping under environmentally altered conditions, we trained MotGen on a comprehensive statistical dataset derived from real image data. Our experimental results demonstrate MotGen's ability to capture motility dynamics from real bacterial populations with low mean absolute error in both simulated and real datasets. MotGen allows us to approach optimal swimming conditions for desired motility statistics in real-time. MotGen's potential extends to practical biomedical applications, including immune response prediction, by providing imputation of bacterial motility patterns based on external environmental conditions. Our short-term, in-situ interventions for controlling motility behavior offer a promising foundation for the development of bacteria-based biomedical applications. AVAILABILITY AND IMPLEMENTATION MotGen is presented as a combination of Matlab image analysis code and a machine learning workflow in Python. Codes are available at https://github.com/bgmseo/MotGen, for cell tracking and implementation of trained models to generate bacterial motility statistics.
Affiliation(s)
- BoGeum Seo
- Department of Mechanical Engineering, Seoul National University, 08826 Seoul, Republic of Korea
| | - DoHee Lee
- Center for Healthcare Robotics, Korea Institute of Science & Technology, 02792 Seoul, Republic of Korea
| | - Heungjin Jeon
- Infection Control Convergence Research Center, Chungnam National University, 34134 Daejeon, Republic of Korea
| | - Junhyoung Ha
- Center for Healthcare Robotics, Korea Institute of Science & Technology, 02792 Seoul, Republic of Korea
| | - SeungBeum Suh
- Center for Healthcare Robotics, Korea Institute of Science & Technology, 02792 Seoul, Republic of Korea
| |
|
22
|
Cusworth S, Gkoutos GV, Acharjee A. A novel generative adversarial networks modelling for the class imbalance problem in high dimensional omics data. BMC Med Inform Decis Mak 2024; 24:90. [PMID: 38549123 PMCID: PMC10979623 DOI: 10.1186/s12911-024-02487-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/02/2023] [Accepted: 03/22/2024] [Indexed: 04/01/2024] Open
Abstract
Class imbalance remains a large problem in high-throughput omics analyses, causing bias towards the over-represented class when training machine learning-based classifiers. Oversampling is a common method used to balance classes, allowing for better generalization of the training data. More naive approaches can introduce other biases into the data, being especially sensitive to inaccuracies in the training data, a problem considering the characteristically noisy data obtained in healthcare. This is especially a problem with high-dimensional data. A generative adversarial network-based method is proposed for creating synthetic samples from small, high-dimensional data, to improve upon other more naive generative approaches. The method was compared with 'synthetic minority over-sampling technique' (SMOTE) and 'random oversampling' (RO). Generative methods were validated by training classifiers on the balanced data.
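The two baseline methods named in this abstract, random oversampling and SMOTE, can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the authors' code: the function names are hypothetical, and the SMOTE variant interpolates towards any other minority row rather than one of the k nearest neighbours as the original technique does.

```python
import random


def random_oversample(minority, target, rng):
    """Duplicate minority rows at random until there are `target` rows."""
    extra = [rng.choice(minority) for _ in range(target - len(minority))]
    return minority + extra


def smote_like(minority, target, rng):
    """Create new rows on the line segment between two random minority rows.

    Simplification (assumption): real SMOTE restricts the second row to the
    k nearest neighbours of the first; here any other minority row may be used.
    """
    out = list(minority)
    while len(out) < target:
        a, b = rng.sample(minority, 2)
        t = rng.random()
        out.append([u + t * (v - u) for u, v in zip(a, b)])
    return out


# Toy minority class with three 2-D samples (illustrative values).
minority_rows = [[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]]
rng = random.Random(1)
duplicated = random_oversample(minority_rows, 6, rng)
interpolated = smote_like(minority_rows, 6, rng)
```

Random oversampling only repeats existing rows, while the interpolating variant produces new points inside the convex hull of the minority class. Neither models the joint distribution of the features, which is the limitation a generative adversarial approach is intended to address.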
Affiliation(s)
- Samuel Cusworth
- Institute of Applied Health Research, University of Birmingham, Birmingham, UK
- NIHR Blood and Transplant Research Unit (BTRU) in Precision Transplant and Cellular Therapeutics, University of Birmingham, Birmingham, UK
| | - Georgios V Gkoutos
- College of Medical and Dental Sciences, Institute of Cancer and Genomic Sciences, University of Birmingham, B15 2TT, Birmingham, UK
- Institute of Translational Medicine, University Hospitals Birmingham NHS Foundation Trust, B15 2TT, Birmingham, UK
- MRC Health Data Research UK (HDR), Midlands Site, UK
- Centre for Health Data Research, University of Birmingham, B15 2TT, Birmingham, UK
- NIHR Experimental Cancer Medicine Centre, B15 2TT, Birmingham, UK
| | - Animesh Acharjee
- College of Medical and Dental Sciences, Institute of Cancer and Genomic Sciences, University of Birmingham, B15 2TT, Birmingham, UK.
- Institute of Translational Medicine, University Hospitals Birmingham NHS Foundation Trust, B15 2TT, Birmingham, UK.
- MRC Health Data Research UK (HDR), Midlands Site, UK.
- Centre for Health Data Research, University of Birmingham, B15 2TT, Birmingham, UK.
| |
|
23
|
Lan W, Liao H, Chen Q, Zhu L, Pan Y, Chen YPP. DeepKEGG: a multi-omics data integration framework with biological insights for cancer recurrence prediction and biomarker discovery. Brief Bioinform 2024; 25:bbae185. [PMID: 38678587 PMCID: PMC11056029 DOI: 10.1093/bib/bbae185] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/13/2024] [Revised: 03/07/2024] [Accepted: 04/09/2024] [Indexed: 05/01/2024] Open
Abstract
Deep learning-based multi-omics data integration methods have the capability to reveal the mechanisms of cancer development, discover cancer biomarkers and identify pathogenic targets. However, current methods ignore the potential correlations between samples in integrating multi-omics data. In addition, providing accurate biological explanations still poses significant challenges due to the complexity of deep learning models. Therefore, there is an urgent need for a deep learning-based multi-omics integration method to explore the potential correlations between samples and provide model interpretability. Herein, we propose a novel interpretable multi-omics data integration method (DeepKEGG) for cancer recurrence prediction and biomarker discovery. In DeepKEGG, a biological hierarchical module is designed for local connections of neuron nodes and model interpretability based on the biological relationship between genes/miRNAs and pathways. In addition, a pathway self-attention module is constructed to explore the correlation between different samples and generate the potential pathway feature representation for enhancing the prediction performance of the model. Lastly, an attribution-based feature importance calculation method is utilized to discover biomarkers related to cancer recurrence and provide a biological interpretation of the model. Experimental results demonstrate that DeepKEGG outperforms other state-of-the-art methods in 5-fold cross validation. Furthermore, case studies also indicate that DeepKEGG serves as an effective tool for biomarker discovery. The code is available at https://github.com/lanbiolab/DeepKEGG.
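The "local connections of neuron nodes" based on gene-to-pathway relationships can be pictured as a layer in which each pathway node receives input only from its member genes. The Python sketch below illustrates that general idea and is not DeepKEGG's implementation: the function name, the ReLU activation, the weights, and the gene and pathway labels are all hypothetical.

```python
def pathway_layer(expression, weights, membership):
    """One locally connected layer: each pathway only sees its member genes.

    expression: dict mapping gene -> measured value
    weights:    dict mapping (gene, pathway) -> weight (hypothetical values)
    membership: dict mapping pathway -> set of member genes
    """
    return {
        pathway: max(0.0, sum(weights[(g, pathway)] * expression[g] for g in genes))
        for pathway, genes in membership.items()
    }  # max(0.0, ...) is a ReLU activation, an assumption of this sketch


# Placeholder identifiers, not real genes or KEGG pathways.
expression = {"g1": 1.0, "g2": 2.0, "g3": 3.0}
membership = {"p1": {"g1", "g2"}, "p2": {"g3"}}
weights = {("g1", "p1"): 0.5, ("g2", "p1"): 0.25, ("g3", "p2"): -1.0}
activations = pathway_layer(expression, weights, membership)
```

Because a weight exists only where a gene belongs to a pathway, each pathway activation can be traced back to a known set of genes, which is what makes this kind of layer easier to interpret than a fully connected one.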
Affiliation(s)
- Wei Lan
- Guangxi Key Laboratory of Multimedia Communications and Network Technology, School of Computer, Electronic and Information, Guangxi University, No. 100 Daxue Road, Xixiangtang District, Nanning 530004, China
- Haibo Liao
- Guangxi Key Laboratory of Multimedia Communications and Network Technology, School of Computer, Electronic and Information, Guangxi University, No. 100 Daxue Road, Xixiangtang District, Nanning 530004, China
- Qingfeng Chen
- Guangxi Key Laboratory of Multimedia Communications and Network Technology, School of Computer, Electronic and Information, Guangxi University, No. 100 Daxue Road, Xixiangtang District, Nanning 530004, China
- Lingzhi Zhu
- School of Computer and Information Science, Hunan Institute of Technology, No. 18 Henghua Road, Zhuhui District, Hengyang 421002, China
- Yi Pan
- School of Computer Science and Control Engineering, Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences, No. 1068 Xueyuan Avenue, Shenzhen University Town, Nanshan District, Shenzhen 518055, China
- Yi-Ping Phoebe Chen
- Department of Computer Science and Information Technology, La Trobe University, Plenty Rd, Bundoora, Melbourne, Victoria 3086, Australia
|
24
|
Díaz-Campos MÁ, Vasquez-Arriaga J, Ochoa S, Hernández-Lemus E. Functional impact of multi-omic interactions in lung cancer. Front Genet 2024; 15:1282241. [PMID: 38389572 PMCID: PMC10881857 DOI: 10.3389/fgene.2024.1282241] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/23/2023] [Accepted: 01/23/2024] [Indexed: 02/24/2024] Open
Abstract
Lung tumors are a leading cause of cancer-related death worldwide. Lung cancers are highly heterogeneous in their phenotypes, at both the cellular and molecular levels. Efforts to better understand the biological origins and outcomes of lung cancer in terms of this enormous variability often require high-throughput experimental techniques paired with advanced data analytics. Anticipated advancements in multi-omic methodologies hold potential to reveal a broader molecular perspective of these tumors. This study introduces a theoretical and computational framework for generating network models depicting regulatory constraints on biological functions in a semi-automated way. The approach successfully identifies enriched functions in analyzed omics data, focusing on Adenocarcinoma (LUAD) and Squamous cell carcinoma (LUSC, a type of NSCLC) in the lung. Valuable information about novel regulatory characteristics, supported by robust biological reasoning, is illustrated, for instance by considering the role of genes, miRNAs and CpG sites associated with NSCLC, both novel and previously reported. Utilizing multi-omic regulatory networks, we constructed robust models elucidating omics data interconnectedness, enabling systematic generation of mechanistic hypotheses. These findings offer insights into complex regulatory mechanisms underlying these cancer types, paving the way for further exploring their molecular complexity.
Affiliation(s)
- Jorge Vasquez-Arriaga
- Computational Genomics Division, National Institute of Genomic Medicine, Mexico City, Mexico
- Soledad Ochoa
- Computational Genomics Division, National Institute of Genomic Medicine, Mexico City, Mexico
- Department of Obstetrics and Gynecology, Cedars-Sinai Medical Center, Los Angeles, CA, United States
- Enrique Hernández-Lemus
- Computational Genomics Division, National Institute of Genomic Medicine, Mexico City, Mexico
- Center for Complexity Sciences, Universidad Nacional Autónoma de México, Mexico City, Mexico
|
25
|
Wang J, Liao N, Du X, Chen Q, Wei B. A semi-supervised approach for the integration of multi-omics data based on transformer multi-head self-attention mechanism and graph convolutional networks. BMC Genomics 2024; 25:86. [PMID: 38254021 PMCID: PMC10802018 DOI: 10.1186/s12864-024-09985-7] [Citation(s) in RCA: 11] [Impact Index Per Article: 11.0] [Received: 11/03/2023] [Accepted: 01/07/2024] [Indexed: 01/24/2024] Open
Abstract
BACKGROUND AND OBJECTIVES Comprehensive analysis of multi-omics data is crucial for accurately formulating effective treatment plans for complex diseases. Supervised ensemble methods have gained popularity in recent years for multi-omics data analysis. However, existing research based on supervised learning algorithms often fails to fully harness the information from unlabeled nodes and overlooks the latent features within and among different omics, as well as the various associations among features. Here, we present a novel multi-omics integrative method, MOSEGCN, based on the Transformer multi-head self-attention mechanism and Graph Convolutional Networks (GCN), with the aim of enhancing the accuracy of complex disease classification. MOSEGCN first employs the Transformer multi-head self-attention mechanism and Similarity Network Fusion (SNF) to separately learn the inherent correlations of latent features within and among different omics, constructing a comprehensive view of diseases. Subsequently, it feeds the learned crucial information into a self-ensembling Graph Convolutional Network (SEGCN) built upon semi-supervised learning methods for training and testing, facilitating a better analysis and utilization of information from multi-omics data to achieve precise classification of disease subtypes. RESULTS The experimental results show that MOSEGCN outperforms several state-of-the-art multi-omics integrative analysis approaches on three types of omics data: mRNA expression data, microRNA expression data, and DNA methylation data, with accuracy rates of 83.0% for Alzheimer's disease and 86.7% for breast cancer subtyping. Furthermore, MOSEGCN exhibits strong generalizability on the GBM dataset, enabling the identification of important biomarkers for related diseases.
CONCLUSION MOSEGCN explores the significant relationship information among different omics and within each omics' latent features, effectively leveraging labeled and unlabeled information to further enhance the accuracy of complex disease classification. It also provides a promising approach for identifying reliable biomarkers, paving the way for personalized medicine.
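The self-attention mechanism named in this abstract can be illustrated with a single-head, scaled dot-product sketch. This is a simplification under stated assumptions: queries, keys, and values are the raw feature rows (identity projections), whereas a Transformer multi-head layer would apply several learned projections and concatenate the heads. The feature values and function names are hypothetical.

```python
import math

def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(x):
    """Single-head scaled dot-product self-attention with Q = K = V = x."""
    d = len(x[0])
    # Pairwise similarity of every row with every other row, scaled by sqrt(d).
    scores = [[sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in x]
              for q in x]
    weights = [softmax(r) for r in scores]
    # Each output row is the attention-weighted average of all input rows.
    out = [[sum(w * v[j] for w, v in zip(w_row, x)) for j in range(d)]
           for w_row in weights]
    return weights, out

features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # hypothetical latent features
attn, mixed = self_attention(features)
```

Each row of `attn` is a probability distribution over the inputs, so `mixed` re-expresses every feature vector as a blend of the vectors it correlates with most.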
Affiliation(s)
- Jiahui Wang
- School of Computer and Information Security, Guilin University of Electronic Technology, No. 1 Jinji Road, Guilin City, 541004, Guangxi Zhuang Autonomous Region, China
- Nanqing Liao
- School of Medical, Guangxi University, No. 100 East University Road, Nanning, 530004, Guangxi, China
- Xiaofei Du
- School of Computer and Information Security, Guilin University of Electronic Technology, No. 1 Jinji Road, Guilin City, 541004, Guangxi Zhuang Autonomous Region, China
- Qingfeng Chen
- School of Computer, Electronics and Information, Guangxi University, No. 100 East University Road, Nanning, 530004, Guangxi, China.
- Bizhong Wei
- School of Computer and Information Security, Guilin University of Electronic Technology, No. 1 Jinji Road, Guilin City, 541004, Guangxi Zhuang Autonomous Region, China.
|
26
|
Li R, Wu J, Li G, Liu J, Xuan J, Zhu Q. MDWGAN-GP: data augmentation for gene expression data based on multiple discriminator WGAN-GP. BMC Bioinformatics 2023; 24:427. [PMID: 37957576 PMCID: PMC10644641 DOI: 10.1186/s12859-023-05558-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/04/2023] [Accepted: 11/06/2023] [Indexed: 11/15/2023] Open
Abstract
BACKGROUND Although gene expression data play significant roles in biological and medical studies, their applications are hampered by the difficulty and high expense of gathering them through biological experiments. Generating high-quality gene expression data with computational methods is therefore an urgent problem. WGAN-GP, a generative adversarial network-based method, has been successfully applied in augmenting gene expression data. However, mode collapse or over-fitting may take place for small training samples because the method adopts only one discriminator. RESULTS In this study, an improved data augmentation approach, MDWGAN-GP, a generative adversarial network model with multiple discriminators, is proposed. In addition, a novel method is devised for enriching training samples based on a linear graph convolutional network. Extensive experiments were implemented on real biological data. CONCLUSIONS The experimental results have demonstrated that, compared with other state-of-the-art methods, the MDWGAN-GP method can produce higher quality generated gene expression data in most cases.
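For orientation, the standard single-discriminator WGAN-GP critic objective that this abstract builds on can be written as follows. The symbols are the conventional ones and are assumptions of this note, not notation from the abstract: D is the critic, P_r and P_g are the real and generated data distributions, and λ is the gradient-penalty weight. How MDWGAN-GP combines its multiple discriminators is not stated in the abstract and is not shown here.

```latex
L_D = \mathbb{E}_{\tilde{x} \sim P_g}\big[D(\tilde{x})\big]
    - \mathbb{E}_{x \sim P_r}\big[D(x)\big]
    + \lambda \, \mathbb{E}_{\hat{x} \sim P_{\hat{x}}}
      \Big[\big(\lVert \nabla_{\hat{x}} D(\hat{x}) \rVert_2 - 1\big)^2\Big],
\qquad
\hat{x} = \epsilon x + (1 - \epsilon)\tilde{x}, \quad \epsilon \sim U[0, 1]
```

The penalty term pushes the critic's gradient norm toward 1 on points interpolated between real and generated samples, which is what replaces weight clipping in the original WGAN.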
Affiliation(s)
- Rongyuan Li
- College of Computer Science and Engineering, Guangxi Normal University, Guilin, China
- Jingli Wu
- Key Lab of Education Blockchain and Intelligent Technology, Ministry of Education, Guangxi Normal University, Guilin, China.
- Gaoshi Li
- Guangxi Key Lab of Multi-source Information Mining & Security, Guangxi Normal University, Guilin, China
- Jiafei Liu
- Key Lab of Education Blockchain and Intelligent Technology, Ministry of Education, Guangxi Normal University, Guilin, China
- Junbo Xuan
- Key Lab of Education Blockchain and Intelligent Technology, Ministry of Education, Guangxi Normal University, Guilin, China
- Qi Zhu
- College of Computer Science and Engineering, Guangxi Normal University, Guilin, China
|
27
|
Chung Y, Lee H. Joint triplet loss with semi-hard constraint for data augmentation and disease prediction using gene expression data. Sci Rep 2023; 13:18178. [PMID: 37875602 PMCID: PMC10598120 DOI: 10.1038/s41598-023-45467-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/24/2023] [Accepted: 10/19/2023] [Indexed: 10/26/2023] Open
Abstract
The accurate prediction of patients with complex diseases, such as Alzheimer's disease (AD), as well as disease stages, including early- and late-stage cancer, is challenging owing to substantial variability among patients and limited availability of clinical data. Deep metric learning has emerged as a promising approach for addressing these challenges by improving data representation. In this study, we propose a joint triplet loss model with a semi-hard constraint (JTSC) to represent data in a small number of samples. JTSC strictly selects semi-hard samples by switching anchors and positive samples during the learning process in triplet embedding and combines a triplet loss function with an angular loss function. Our results indicate that JTSC significantly improves the number of appropriately represented samples during training when applied to the gene expression data of AD and to cancer stage prediction tasks. Furthermore, we demonstrate that using an embedding vector from JTSC as an input to the classifiers for AD and cancer stage prediction significantly improves classification performance by extracting more accurate features. In conclusion, we show that feature embedding through JTSC can aid in classification when there are a small number of samples compared to a larger number of features.
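The triplet loss and the semi-hard condition mentioned in this abstract can be sketched with their conventional definitions from the metric-learning literature. This is not the JTSC model itself: the anchor/positive switching and the angular loss term are omitted, and the points, margin, and function names are hypothetical.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def triplet_loss(anchor, positive, negative, margin):
    """Hinge loss: zero once the negative is at least `margin` farther than the positive."""
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)

def is_semi_hard(anchor, positive, negative, margin):
    """True when the negative is farther than the positive but still inside the margin."""
    d_ap = euclidean(anchor, positive)
    d_an = euclidean(anchor, negative)
    return d_ap < d_an < d_ap + margin

anchor = [0.0, 0.0]
positive = [3.0, 4.0]          # distance 5.0 from the anchor
candidates = [[0.0, 4.0],      # distance 4.0: closer than the positive (hard)
              [0.0, 5.5],      # distance 5.5: inside the margin (semi-hard)
              [0.0, 9.0]]      # distance 9.0: beyond the margin (easy)
margin = 1.0

semi_hard = [n for n in candidates if is_semi_hard(anchor, positive, n, margin)]
losses = [triplet_loss(anchor, positive, n, margin) for n in candidates]
```

Semi-hard negatives still produce a non-zero loss, so they carry a training signal, but they are less extreme than hard negatives that sit closer to the anchor than the positive does.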
Affiliation(s)
- Yeonwoo Chung
- School of Electrical Engineering and Computer Science, Gwangju Institute of Science and Technology, Gwangju, 61005, Republic of Korea
- Hyunju Lee
- School of Electrical Engineering and Computer Science, Gwangju Institute of Science and Technology, Gwangju, 61005, Republic of Korea.
- Artificial Intelligence Graduate School, Gwangju Institute of Science and Technology, Gwangju, 61005, Republic of Korea.
|
28
|
Shi M, Li X, Li M, Si Y. Attention-based generative adversarial networks improve prognostic outcome prediction of cancer from multimodal data. Brief Bioinform 2023; 24:bbad329. [PMID: 37756592 DOI: 10.1093/bib/bbad329] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/20/2023] [Revised: 08/20/2023] [Accepted: 08/28/2023] [Indexed: 09/29/2023] Open
Abstract
The prediction of prognostic outcome is critical for the development of efficient cancer therapeutics and potential personalized medicine. However, due to the heterogeneity and diversity of multimodal data of cancer, data integration and feature selection remain a challenge for prognostic outcome prediction. We proposed a deep learning method with generative adversarial network based on sequential channel-spatial attention modules (CSAM-GAN), a multimodal data integration and feature selection approach, for accomplishing prognostic stratification tasks in cancer. Sequential channel-spatial attention modules equipped with an encoder-decoder are applied to the input features of multimodal data to accurately refine selected features. A discriminator network was proposed so that the generator and discriminator learn in an adversarial way to accurately describe the complex heterogeneous information of multiple modal data. We conducted extensive experiments with various feature selection and classification methods and confirmed that the CSAM-GAN via the multilayer deep neural network (DNN) classifier outperformed these baseline methods on two different multimodal data sets with miRNA expression, mRNA expression and histopathological image data: lower-grade glioma and kidney renal clear cell carcinoma. The CSAM-GAN via the multilayer DNN classifier bridges the gap between heterogeneous multimodal data and prognostic outcome prediction.
Affiliation(s)
- Mingguang Shi
- School of Electrical Engineering and Automation, Hefei University of Technology, Hefei, Anhui 230009, China
- Xuefeng Li
- School of Electrical Engineering and Automation, Hefei University of Technology, Hefei, Anhui 230009, China
- Mingna Li
- School of Electrical Engineering and Automation, Hefei University of Technology, Hefei, Anhui 230009, China
- Yichong Si
- School of Electrical Engineering and Automation, Hefei University of Technology, Hefei, Anhui 230009, China
|
29
|
Zhang C, Yang Y, Tang S, Aihara K, Zhang C, Chen L. Contrastively generative self-expression model for single-cell and spatial multimodal data. Brief Bioinform 2023; 24:bbad265. [PMID: 37507114 DOI: 10.1093/bib/bbad265] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/16/2023] [Revised: 05/27/2023] [Accepted: 07/03/2023] [Indexed: 07/30/2023] Open
Abstract
Advances in single-cell multi-omics technology provide an unprecedented opportunity to fully understand cellular heterogeneity. However, integrating omics data from multiple modalities is challenging due to the individual characteristics of each measurement. Here, to solve this problem, we propose a contrastive and generative deep self-expression model, called single-cell multimodal self-expressive integration (scMSI), which integrates the heterogeneous multimodal data into a unified manifold space. Specifically, scMSI first learns each omics-specific latent representation and self-expression relationship, using a deep self-expressive generative model to account for the characteristics of different omics data. Then, scMSI combines these omics-specific self-expression relations through contrastive learning. In this way, scMSI provides a paradigm for integrating multiple omics data even when they are only weakly related, which effectively brings representation learning and data integration into a unified framework. We demonstrate that scMSI provides a cohesive solution for a variety of analysis tasks, such as integration analysis, data denoising, batch correction and spatial domain detection. We have applied scMSI to various single-cell and spatial multimodal datasets to validate its high effectiveness and robustness in diverse data types and application scenarios.
Affiliation(s)
- Chengming Zhang
- Key Laboratory of Systems Biology, Shanghai Institute of Biochemistry and Cell Biology, Center for Excellence in Molecular Cell Science, Chinese Academy of Sciences, Shanghai 200031, China
- International Research Center for Neurointelligence, The University of Tokyo Institutes for Advanced Study, The University of Tokyo, Tokyo 113-0033, Japan
- Yiwen Yang
- Key Laboratory of Systems Biology, Shanghai Institute of Biochemistry and Cell Biology, Center for Excellence in Molecular Cell Science, Chinese Academy of Sciences, Shanghai 200031, China
- School of Life Science and Technology, ShanghaiTech University, Shanghai 201210, China
- Shijie Tang
- Key Laboratory of Systems Biology, Shanghai Institute of Biochemistry and Cell Biology, Center for Excellence in Molecular Cell Science, Chinese Academy of Sciences, Shanghai 200031, China
- Kazuyuki Aihara
- International Research Center for Neurointelligence, The University of Tokyo Institutes for Advanced Study, The University of Tokyo, Tokyo 113-0033, Japan
- Chuanchao Zhang
- Key Laboratory of Systems Health Science of Zhejiang Province, Hangzhou Institute for Advanced Study, University of Chinese Academy of Sciences, Chinese Academy of Sciences, Hangzhou 310024, China
- Guangdong Institute of Intelligence Science and Technology, Hengqin, Zhuhai, Guangdong 519031, China
- Luonan Chen
- Key Laboratory of Systems Biology, Shanghai Institute of Biochemistry and Cell Biology, Center for Excellence in Molecular Cell Science, Chinese Academy of Sciences, Shanghai 200031, China
- School of Life Science and Technology, ShanghaiTech University, Shanghai 201210, China
- Key Laboratory of Systems Health Science of Zhejiang Province, Hangzhou Institute for Advanced Study, University of Chinese Academy of Sciences, Chinese Academy of Sciences, Hangzhou 310024, China
- Guangdong Institute of Intelligence Science and Technology, Hengqin, Zhuhai, Guangdong 519031, China
|
30
|
Jacobs F, D'Amico S, Benvenuti C, Gaudio M, Saltalamacchia G, Miggiano C, De Sanctis R, Della Porta MG, Santoro A, Zambelli A. Opportunities and Challenges of Synthetic Data Generation in Oncology. JCO Clin Cancer Inform 2023; 7:e2300045. [PMID: 37535875 DOI: 10.1200/cci.23.00045] [Citation(s) in RCA: 8] [Impact Index Per Article: 4.0] [Received: 03/23/2023] [Revised: 05/05/2023] [Accepted: 05/25/2023] [Indexed: 08/05/2023] Open
Abstract
Widespread interest in artificial intelligence (AI) in health care has focused mainly on deductive systems that analyze available real-world data to discover patterns not otherwise visible. Generative adversarial network, a new type of inductive AI, has recently evolved to generate high-fidelity virtual synthetic data (SD) trained on relatively limited real-world information. The AI system is fed with a collection of real data, and it learns to generate new augmented data while maintaining the general characteristics of the original data set. The use of SD to enhance clinical research and protect patient privacy has drawn a lot of interest in medicine and in the complex field of oncology. This article summarizes the main characteristics of this innovative technology and critically discusses how it can be used to accelerate data access for secondary purposes, providing an overview of the opportunities and challenges of SD generation for clinical cancer research and health care.
Affiliation(s)
- Flavia Jacobs
- Department of Biomedical Sciences, Humanitas University, Milan, Italy
- IRCCS Istituto Clinico Humanitas, Milan, Italy
- Chiara Benvenuti
- Department of Biomedical Sciences, Humanitas University, Milan, Italy
- IRCCS Istituto Clinico Humanitas, Milan, Italy
- Mariangela Gaudio
- Department of Biomedical Sciences, Humanitas University, Milan, Italy
- IRCCS Istituto Clinico Humanitas, Milan, Italy
- Chiara Miggiano
- Department of Biomedical Sciences, Humanitas University, Milan, Italy
- IRCCS Istituto Clinico Humanitas, Milan, Italy
- Rita De Sanctis
- Department of Biomedical Sciences, Humanitas University, Milan, Italy
- IRCCS Istituto Clinico Humanitas, Milan, Italy
- Matteo Giovanni Della Porta
- Department of Biomedical Sciences, Humanitas University, Milan, Italy
- IRCCS Istituto Clinico Humanitas, Milan, Italy
- Armando Santoro
- Department of Biomedical Sciences, Humanitas University, Milan, Italy
- IRCCS Istituto Clinico Humanitas, Milan, Italy
- Alberto Zambelli
- Department of Biomedical Sciences, Humanitas University, Milan, Italy
- IRCCS Istituto Clinico Humanitas, Milan, Italy
|
31
|
Zhong Y, Peng Y, Lin Y, Chen D, Zhang H, Zheng W, Chen Y, Wu C. MODILM: towards better complex diseases classification using a novel multi-omics data integration learning model. BMC Med Inform Decis Mak 2023; 23:82. [PMID: 37147619 PMCID: PMC10161645 DOI: 10.1186/s12911-023-02173-9] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 12/31/2022] [Accepted: 04/11/2023] [Indexed: 05/07/2023] Open
Abstract
BACKGROUND Accurately classifying complex diseases is crucial for diagnosis and personalized treatment. Integrating multi-omics data has been demonstrated to enhance the accuracy of analyzing and classifying complex diseases. This can be attributed to the highly correlated nature of the data with various diseases, as well as the comprehensive and complementary information it provides. However, integrating multi-omics data for complex diseases is challenged by data characteristics such as high imbalance, scale variation, heterogeneity, and noise interference. These challenges further emphasize the importance of developing effective methods for multi-omics data integration. RESULTS We proposed a novel multi-omics data learning model called MODILM, which integrates multiple omics data to improve the classification accuracy of complex diseases by obtaining more significant and complementary information from different single-omics data. Our approach includes four key steps: 1) constructing a similarity network for each omics data using the cosine similarity measure, 2) leveraging Graph Attention Networks to learn sample-specific and intra-association features from similarity networks for single-omics data, 3) using Multilayer Perceptron networks to map learned features to a new feature space, thereby strengthening and extracting high-level omics-specific features, and 4) fusing these high-level features using a View Correlation Discovery Network to learn cross-omics features in the label space, which results in unique class-level distinctiveness for complex diseases. To demonstrate the effectiveness of MODILM, we conducted experiments on six benchmark datasets consisting of miRNA expression, mRNA, and DNA methylation data. Our results show that MODILM outperforms state-of-the-art methods, effectively improving the accuracy of complex disease classification. 
CONCLUSIONS Our MODILM provides a more competitive way to extract and integrate important and complementary information from multiple omics data, providing a very promising tool for supporting decision-making for clinical diagnosis.
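Step 1 of the approach in this abstract, building a per-omics similarity network from cosine similarity, can be sketched as below. The abstract does not say how edges are selected, so the fixed similarity threshold used here is an assumption, as are the sample values and function names.

```python
import math

def cosine(a, b):
    """Cosine similarity of two feature vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def similarity_network(samples, threshold):
    """Adjacency matrix with an edge i-j when cosine similarity >= threshold (no self-loops)."""
    n = len(samples)
    adj = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if cosine(samples[i], samples[j]) >= threshold:
                adj[i][j] = adj[j][i] = 1
    return adj

# Three hypothetical samples measured on two features of one omics type.
omics = [[1.0, 0.0], [2.0, 0.0], [0.0, 3.0]]
adjacency = similarity_network(omics, threshold=0.9)
```

The first two samples point in the same direction and are linked despite their different magnitudes, which is why cosine similarity is a common choice when omics measurements vary in scale. A graph model such as the Graph Attention Network in step 2 would then take this adjacency matrix together with the sample features.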
Affiliation(s)
- Yating Zhong
- Guangxi Key Lab of Human-Machine Interaction and Intelligent Decision, Nanning Normal University, Nanning, 530001, China
- Yuzhong Peng
- Guangxi Key Lab of Human-Machine Interaction and Intelligent Decision, Nanning Normal University, Nanning, 530001, China.
- Yanmei Lin
- School of Environment and Life Science, Nanning Normal University, Nanning, 530001, China.
- Dingjia Chen
- Guangxi Key Lab of Human-Machine Interaction and Intelligent Decision, Nanning Normal University, Nanning, 530001, China
- Hao Zhang
- School of Computer Science, Fudan University, Shanghai, 200433, China
- School of Computer, Guangdong University of Petrochemical Technology, Maoming, 525000, China
- Wen Zheng
- Guangxi Key Lab of Human-Machine Interaction and Intelligent Decision, Nanning Normal University, Nanning, 530001, China
- Yuanyuan Chen
- Guangxi Key Lab of Human-Machine Interaction and Intelligent Decision, Nanning Normal University, Nanning, 530001, China
- Changliang Wu
- Department of Spleen, Stomach and Liver Diseases, Guangxi International Zhuang Medical Hospital, Nanning, 530201, China
|
32
|
Gong P, Cheng L, Zhang Z, Meng A, Li E, Chen J, Zhang L. Multi-omics integration method based on attention deep learning network for biomedical data classification. Comput Methods Programs Biomed 2023; 231:107377. [PMID: 36739624 DOI: 10.1016/j.cmpb.2023.107377] [Citation(s) in RCA: 20] [Impact Index Per Article: 10.0] [Received: 05/31/2022] [Revised: 01/06/2023] [Accepted: 01/25/2023] [Indexed: 06/18/2023]
Abstract
BACKGROUND AND OBJECTIVE Integrating multi-omics data for the comprehensive analysis of the biological processes in human diseases has become one of the most challenging tasks of bioinformatics. Deep learning (DL) algorithms have recently become one of the most promising multi-omics data integration analysis methods. However, existing DL-based studies mostly integrate the multi-omics data by concatenation in the input data space or the learned feature space, ignoring the correlations between patients and omics. METHODS We propose a novel multi-omics integration method, called Multi-omics Attention Deep Learning Network (MOADLN), which is used for biomedical data classification. Firstly, for each type of omics data, we use three fully-connected layers and the self-attention mechanism to reduce dimensionality and to construct the correlations between patients, respectively. Then, we apply the feature vector learned from self-attention to generate the initial category labels. Secondly, for the initial predicted label of each omics data type, we use an effective Multi-Omics Correlation Discovery Network (MOCDN) to learn the cross-omic correlations in the label space. Finally, we use the softmax classifier for label prediction. RESULTS We demonstrate that our method outperforms several state-of-the-art methods on two datasets with mRNA expression data, DNA methylation data, and miRNA expression data. In addition, we identified essential biomarkers of relevant diseases by MOADLN, and the generality of MOADLN is also demonstrated in the KIRP and KIRC datasets. CONCLUSIONS MOADLN jointly explores correlations between patients in intra-omics and correlations of cross-omics in label space, making it an effective DL-based method for classifying biomedical data.
Affiliation(s)
- Ping Gong
- School of Medical Imaging, Xuzhou Medical University, Xuzhou, CN, China.
- Lei Cheng
- School of Medical Imaging, Xuzhou Medical University, Xuzhou, CN, China
- Zhiyuan Zhang
- School of Medical Imaging, Xuzhou Medical University, Xuzhou, CN, China
- Ao Meng
- School of Medical Imaging, Xuzhou Medical University, Xuzhou, CN, China
- Enshuo Li
- School of Medical Imaging, Xuzhou Medical University, Xuzhou, CN, China
- Jie Chen
- Department of Radiation Oncology, Affiliated Hospital of Xuzhou Medical University, Xuzhou, CN, China
- Longzhen Zhang
- Department of Radiation Oncology, Affiliated Hospital of Xuzhou Medical University, Xuzhou, CN, China
|
33
|
Bao J, Chang C, Zhang Q, Saykin AJ, Shen L, Long Q, for the Alzheimer’s Disease Neuroimaging Initiative. Integrative analysis of multi-omics and imaging data with incorporation of biological information via structural Bayesian factor analysis. Brief Bioinform 2023; 24:bbad073. [PMID: 36882008 PMCID: PMC10387302 DOI: 10.1093/bib/bbad073] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/31/2022] [Revised: 01/14/2023] [Accepted: 02/10/2023] [Indexed: 03/09/2023] Open
Abstract
MOTIVATION With the rapid development of modern technologies, massive data are available for the systematic study of Alzheimer's disease (AD). Though many existing AD studies mainly focus on single-modality omics data, multi-omics datasets can provide a more comprehensive understanding of AD. To bridge this gap, we proposed a novel structural Bayesian factor analysis framework (SBFA) to extract the information shared by multi-omics data through the aggregation of genotyping data, gene expression data, neuroimaging phenotypes and prior biological network knowledge. Our approach can extract common information shared by different modalities and encourage biologically related features to be selected, guiding future AD research in a biologically meaningful way. METHOD Our SBFA model decomposes the mean parameters of the data into a sparse factor loading matrix and a factor matrix, where the factor matrix represents the common information extracted from multi-omics and imaging data. Our framework is designed to incorporate prior biological network information. Our simulation study demonstrated that our proposed SBFA framework could achieve the best performance compared with the other state-of-the-art factor-analysis-based integrative analysis methods. RESULTS We apply our proposed SBFA model together with several state-of-the-art factor analysis models to extract the latent common information from genotyping, gene expression and brain imaging data simultaneously from the ADNI biobank database. The latent information is then used to predict the functional activities questionnaire score, an important measurement for diagnosis of AD quantifying subjects' abilities in daily life. Our SBFA model shows the best prediction performance compared with the other factor analysis models. AVAILABILITY Code is publicly available at https://github.com/JingxuanBao/SBFA. CONTACT qlong@upenn.edu.
Affiliation(s)
- Jingxuan Bao
- Department of Biostatistics, Epidemiology and Informatics, University of Pennsylvania Perelman School of Medicine, Philadelphia, 19104, PA, USA
- Changgee Chang
- Department of Biostatistics, Epidemiology and Informatics, University of Pennsylvania Perelman School of Medicine, Philadelphia, 19104, PA, USA
- Qiyiwen Zhang
- Department of Biostatistics, Epidemiology and Informatics, University of Pennsylvania Perelman School of Medicine, Philadelphia, 19104, PA, USA
- Andrew J Saykin
- Department of Radiology and Imaging Sciences, Indiana University, Indianapolis, 46202, IN, USA
- Li Shen
- Department of Biostatistics, Epidemiology and Informatics, University of Pennsylvania Perelman School of Medicine, Philadelphia, 19104, PA, USA
- Qi Long
- Department of Biostatistics, Epidemiology and Informatics, University of Pennsylvania Perelman School of Medicine, Philadelphia, 19104, PA, USA
|
34
|
Kim H, Kim Y, Lee CY, Kim DG, Cheon M. Investigation of early molecular alterations in tauopathy with generative adversarial networks. Sci Rep 2023; 13:732. [PMID: 36639689 PMCID: PMC9839697 DOI: 10.1038/s41598-023-28081-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/06/2022] [Accepted: 01/12/2023] [Indexed: 01/15/2023] Open
Abstract
The recent advances in deep learning-based approaches hold great promise for unravelling biological mechanisms, discovering biomarkers, and predicting gene function. Here, we deployed a deep generative model for simulating the molecular progression of tauopathy and dissecting its early features. We applied generative adversarial networks (GANs) for bulk RNA-seq analysis in a mouse model of tauopathy (TPR50-P301S). The union set of differentially expressed genes from four comparisons (two phenotypes with two time points) was used as input training data. We devised four-way transition curves for a virtual simulation of disease progression, clustered and grouped the curves by patterns, and identified eight distinct pattern groups showing different biological features from Gene Ontology enrichment analyses. Genes that were upregulated in early tauopathy were associated with vasculature development, and these changes preceded immune responses. We confirmed significant disease-associated differences in the public human data for the genes of the different pattern groups. Validation with weighted gene co-expression network analysis suggested that our GAN-based approach can be used to detect distinct patterns of early molecular changes during disease progression, which may be extremely difficult in in vivo experiments. The generative model is a valid systematic approach for exploring the sequential cascades of mechanisms and targeting early molecular events related to dementia.
Affiliation(s)
- Hyerin Kim
- Dementia Research Group, Korea Brain Research Institute (KBRI), Daegu, 41062, Republic of Korea
- Yongjin Kim
- Dementia Research Group, Korea Brain Research Institute (KBRI), Daegu, 41062, Republic of Korea
- Chung-Yeol Lee
- Dementia Research Group, Korea Brain Research Institute (KBRI), Daegu, 41062, Republic of Korea
- Do-Geun Kim
- Dementia Research Group, Korea Brain Research Institute (KBRI), Daegu, 41062, Republic of Korea
- Mookyung Cheon
- Dementia Research Group, Korea Brain Research Institute (KBRI), Daegu, 41062, Republic of Korea
|
35
|
Tanvir Ahmed K, Cheng S, Li Q, Yong J, Zhang W. Incomplete time-series gene expression in integrative study for islet autoimmunity prediction. Brief Bioinform 2022; 24:6895461. [PMID: 36513375 PMCID: PMC9851333 DOI: 10.1093/bib/bbac537] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 08/05/2022] [Revised: 10/27/2022] [Accepted: 11/08/2022] [Indexed: 12/15/2022]
Abstract
Type 1 diabetes (T1D) outcome prediction plays a vital role in identifying novel risk factors, ensuring early patient care and designing cohort studies. TEDDY is a longitudinal cohort study that collects a vast amount of multi-omics and clinical data from its participants to explore the progression and markers of T1D. However, missing data in the omics profiles make outcome prediction a difficult task. TEDDY collected time-series gene expression for less than 6% of enrolled participants. Additionally, for the participants whose gene expression was collected, 79% of time steps are missing. This study introduces an advanced bioinformatics framework for gene expression imputation and islet autoimmunity (IA) prediction. The imputation model generates synthetic data for participants with partially or entirely missing gene expression. The prediction model integrates the synthetic gene expression with other risk factors to achieve better predictive performance. Comprehensive experiments on TEDDY datasets show that: (1) Our pipeline can effectively integrate synthetic gene expression with family history, HLA genotype and SNPs to better predict IA status at 2 years (sensitivity 0.622, AUC 0.715) compared with the individual datasets and state-of-the-art results in the literature (AUC 0.682). (2) The synthetic gene expression contains predictive signals as strong as the true gene expression, reducing reliance on expensive and long-term longitudinal data collection. (3) Time-series gene expression is crucial to the proposed improvement and shows significantly better predictive ability than cross-sectional gene expression. (4) Our pipeline is robust to limited data availability. Availability: Code is available at https://github.com/compbiolabucf/TEDDY.
Affiliation(s)
- Sze Cheng
- Department of Biochemistry, Molecular Biology and Biophysics, University of Minnesota Twin Cities, Minneapolis, MN 55455, USA
- Qian Li
- Department of Biostatistics, St. Jude Children’s Research Hospital, Memphis, TN 38105, USA
- Jeongsik Yong
- Department of Biochemistry, Molecular Biology and Biophysics, University of Minnesota Twin Cities, Minneapolis, MN 55455, USA
- Wei Zhang
- Corresponding author. Wei Zhang, Computer Science Department, University of Central Florida. Tel.: 407-823-2763;
|
36
|
Zhang Y, Kiryu H. MODEC: an unsupervised clustering method integrating omics data for identifying cancer subtypes. Brief Bioinform 2022; 23:6696139. [PMID: 36094092 DOI: 10.1093/bib/bbac372] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/20/2022] [Revised: 07/16/2022] [Accepted: 08/08/2022] [Indexed: 12/14/2022]
Abstract
The identification of cancer subtypes can help researchers understand hidden genomic mechanisms, enhance diagnostic accuracy and improve clinical treatments. With the development of high-throughput techniques, researchers can access large amounts of data from multiple sources. Because of the high dimensionality and complexity of multiomics and clinical data, research into the integration of multiomics data is needed, and developing effective tools for such purposes remains a challenge for researchers. In this work, we proposed an entirely unsupervised clustering method without harnessing any prior knowledge (MODEC). We used manifold optimization and deep-learning techniques to integrate multiomics data for the identification of cancer subtypes and the analysis of significant clinical variables. Since there is nonlinearity in the gene-level datasets, we used manifold optimization methodology to extract essential information from the original omics data to obtain a low-dimensional latent subspace. Then, MODEC uses a deep learning-based clustering module to iteratively define cluster centroids and assign cluster labels to each sample by minimizing the Kullback-Leibler divergence loss. MODEC was applied to six public cancer datasets from The Cancer Genome Atlas database and outperformed eight competing methods in terms of the accuracy and reliability of the subtyping results. MODEC was extremely competitive in the identification of survival patterns and significant clinical features, which could help doctors monitor disease progression and provide more suitable treatment strategies.
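The clustering step described above (soft cluster assignment refined by minimizing a Kullback-Leibler divergence) can be sketched as follows. This is a minimal sketch that assumes a Deep Embedded Clustering style formulation with a Student's t kernel and a sharpened target distribution; the abstract does not state that MODEC uses exactly these formulas, and the function names and toy latent points are hypothetical.

```python
import math

# Assumption: DEC-style soft assignment and target distribution.
# This is an illustration of KL-based clustering, not the MODEC code.

def soft_assign(points, centroids):
    """Student's t kernel (one degree of freedom) similarity, row-normalised."""
    q = []
    for x in points:
        w = [1.0 / (1.0 + sum((xi - ci) ** 2 for xi, ci in zip(x, c)))
             for c in centroids]
        total = sum(w)
        q.append([v / total for v in w])
    return q

def target_distribution(q):
    """Sharpen q: square each entry, divide by cluster frequency, renormalise."""
    n_clusters = len(q[0])
    freq = [sum(row[j] for row in q) for j in range(n_clusters)]
    p = []
    for row in q:
        w = [row[j] ** 2 / freq[j] for j in range(n_clusters)]
        total = sum(w)
        p.append([v / total for v in w])
    return p

def kl_divergence(p, q):
    """KL(P || Q) summed over samples and clusters."""
    return sum(pij * math.log(pij / qij)
               for p_row, q_row in zip(p, q)
               for pij, qij in zip(p_row, q_row))

# Hypothetical 2-D latent points (e.g. the output of a dimensionality-reduction step).
latent = [[0.0, 0.1], [0.2, 0.0], [3.0, 3.1], [2.9, 3.0]]
centroids = [[0.0, 0.0], [3.0, 3.0]]

q = soft_assign(latent, centroids)
p = target_distribution(q)
labels = [row.index(max(row)) for row in q]  # hard label = most probable cluster
loss = kl_divergence(p, q)
print(labels, loss)
```

In a full training loop the loss would be minimized by gradient descent with respect to the centroids (and, typically, the encoder weights), with the target distribution recomputed periodically; only a single evaluation is shown here.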
Affiliation(s)
- Yanting Zhang
- Department of Computational Biology and Medical Sciences, Graduate School of Frontier Sciences, The University of Tokyo, 7-3-1 Hongo, Bunkyo-ku, 113-0033, Tokyo, Japan
- Hisanori Kiryu
- Department of Computational Biology and Medical Sciences, Graduate School of Frontier Sciences, The University of Tokyo, 7-3-1 Hongo, Bunkyo-ku, 113-0033, Tokyo, Japan
|
37
|
Wang X, Yu G, Wang J, Zain AM, Guo W. Lung cancer subtype diagnosis using weakly-paired multi-omics data. Bioinformatics 2022; 38:5092-5099. [PMID: 36130063 DOI: 10.1093/bioinformatics/btac643] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 07/14/2022] [Revised: 08/30/2022] [Accepted: 09/19/2022] [Indexed: 12/24/2022]
Abstract
MOTIVATION Cancer subtype diagnosis is crucial for precise treatment, because different subtypes need different therapies. Although the diagnosis can be greatly improved by fusing multiomics data, most fusion solutions depend on paired omics data, which are actually weakly paired, with different omics views missing for different samples. Incomplete multiview learning-based solutions can alleviate this issue but are still far from satisfactory because they: (i) mainly focus on shared information while ignoring the important individuality of multiomics data and (ii) cannot pick out interpretable features for precise diagnosis. RESULTS We introduce an interpretable and flexible solution (LungDWM) for Lung cancer subtype Diagnosis using Weakly paired Multiomics data. LungDWM first builds an attention-based encoder for each omics to pick out important diagnostic features and extract shared and complementary information across omics. Next, it proposes an individual loss to jointly extract the specific information of each omics and performs generative adversarial learning to impute missing omics of samples using extracted features. After that, it fuses the extracted and imputed features to diagnose cancer subtypes. Experiments on benchmark datasets show that LungDWM achieves better performance than recent competitive methods, and has high authenticity and good interpretability. AVAILABILITY AND IMPLEMENTATION The code is available at http://www.sdu-idea.cn/codes.php?name=LungDWM. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Xingze Wang
- School of Software, Shandong University, Ji'nan 250100, China
- SDU-NTU Joint Centre for AI Research, Shandong University, Ji'nan 250100, China
- Guoxian Yu
- School of Software, Shandong University, Ji'nan 250100, China
- SDU-NTU Joint Centre for AI Research, Shandong University, Ji'nan 250100, China
- Jun Wang
- SDU-NTU Joint Centre for AI Research, Shandong University, Ji'nan 250100, China
- Azlan Mohd Zain
- Big Data Centre, University Teknologi Malaysia, Skudai 81310, Malaysia
- Wei Guo
- School of Software, Shandong University, Ji'nan 250100, China
- SDU-NTU Joint Centre for AI Research, Shandong University, Ji'nan 250100, China
|
38
|
Baul S, Ahmed KT, Filipek J, Zhang W. omicsGAT: Graph Attention Network for Cancer Subtype Analyses. Int J Mol Sci 2022; 23:10220. [PMID: 36142140 PMCID: PMC9499656 DOI: 10.3390/ijms231810220] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/19/2022] [Revised: 08/14/2022] [Accepted: 08/31/2022] [Indexed: 12/01/2022]
Abstract
The use of high-throughput omics technologies is becoming increasingly popular in all facets of biomedical science. The mRNA sequencing (RNA-seq) method reports quantitative measures of more than tens of thousands of biological features. It provides a more comprehensive molecular perspective of studied cancer mechanisms compared to traditional approaches. Graph-based learning models have been proposed to learn important hidden representations from gene expression data and network structure to improve cancer outcome prediction, patient stratification, and cell clustering. However, these graph-based methods cannot rank the importance of the different neighbors for a particular sample in the downstream cancer subtype analyses. In this study, we introduce omicsGAT, a graph attention network (GAT) model to integrate graph-based learning with an attention mechanism for RNA-seq data analysis. The multi-head attention mechanism in omicsGAT can more effectively secure information of a particular sample by assigning different attention coefficients to its neighbors. Comprehensive experiments on The Cancer Genome Atlas (TCGA) breast cancer and bladder cancer bulk RNA-seq data and two single-cell RNA-seq datasets validate that (1) the proposed model can effectively integrate neighborhood information of a sample and learn an embedding vector to improve disease phenotype prediction, cancer patient stratification, and cell clustering of the sample and (2) the attention matrix generated from the multi-head attention coefficients provides more useful information compared to the sample correlation-based adjacency matrix. From the results, we can conclude that some neighbors play a more important role than others in cancer subtype analyses of a particular sample based on the attention coefficient.
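To make the idea of "assigning different attention coefficients to its neighbors" concrete, here is a minimal single-head sketch in the standard graph attention formulation: a LeakyReLU score for each sample-neighbor pair, normalised with a softmax over the neighborhood. It is not the omicsGAT implementation; the feature vectors, the attention vector halves `a_src` and `a_dst`, and the fully connected toy neighborhood are all hypothetical, and the learned linear transform and multi-head averaging are omitted.

```python
import math

# Single-head graph attention on toy data; illustrative only, not omicsGAT code.

def leaky_relu(x, slope=0.2):
    return x if x > 0 else slope * x

def attention_coefficients(h, neighbors, a_src, a_dst):
    """Softmax-normalised attention of each sample over its neighbors.

    h: feature vectors, assumed already linearly transformed (one list per sample)
    neighbors: neighbors[i] lists the indices sample i attends to (non-empty)
    a_src, a_dst: the two halves of the attention vector
    """
    alpha = {}
    for i, nbrs in enumerate(neighbors):
        s_i = sum(a * x for a, x in zip(a_src, h[i]))
        scores = [leaky_relu(s_i + sum(a * x for a, x in zip(a_dst, h[j])))
                  for j in nbrs]
        m = max(scores)  # subtract the max for numerical stability
        exp_scores = [math.exp(s - m) for s in scores]
        total = sum(exp_scores)
        alpha[i] = {j: e / total for j, e in zip(nbrs, exp_scores)}
    return alpha

def aggregate(h, alpha):
    """Embedding of each sample = attention-weighted sum of its neighbors' features."""
    dim = len(h[0])
    return [[sum(w * h[j][d] for j, w in alpha[i].items()) for d in range(dim)]
            for i in range(len(h))]

# Three hypothetical samples; samples 0 and 1 are similar, sample 2 is different.
h = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
neighbors = [[0, 1, 2], [0, 1, 2], [0, 1, 2]]
a_src = [1.0, 0.0]
a_dst = [1.0, 0.0]

alpha = attention_coefficients(h, neighbors, a_src, a_dst)
embeddings = aggregate(h, alpha)
print(alpha[0])
print(embeddings[0])
```

With these toy values, sample 0 gives more weight to sample 1 than to sample 2, which is the behavior a plain correlation-based adjacency matrix with equal edge weights would not capture.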
Affiliation(s)
- Sudipto Baul
- Department of Computer Science, University of Central Florida, Orlando, FL 32816, USA
- Genomics and Bioinformatics Cluster, University of Central Florida, Orlando, FL 32816, USA
- Khandakar Tanvir Ahmed
- Department of Computer Science, University of Central Florida, Orlando, FL 32816, USA
- Genomics and Bioinformatics Cluster, University of Central Florida, Orlando, FL 32816, USA
- Joseph Filipek
- Department of Computer Science, University of Central Florida, Orlando, FL 32816, USA
- Genomics and Bioinformatics Cluster, University of Central Florida, Orlando, FL 32816, USA
- Wei Zhang
- Department of Computer Science, University of Central Florida, Orlando, FL 32816, USA
- Genomics and Bioinformatics Cluster, University of Central Florida, Orlando, FL 32816, USA
|
39
|
Leng D, Zheng L, Wen Y, Zhang Y, Wu L, Wang J, Wang M, Zhang Z, He S, Bo X. A benchmark study of deep learning-based multi-omics data fusion methods for cancer. Genome Biol 2022; 23:171. [PMID: 35945544 PMCID: PMC9361561 DOI: 10.1186/s13059-022-02739-2] [Citation(s) in RCA: 44] [Impact Index Per Article: 14.7] [Received: 02/22/2022] [Accepted: 07/26/2022] [Indexed: 11/10/2022]
Abstract
BACKGROUND A fused method using a combination of multi-omics data enables a comprehensive study of complex biological processes and highlights the interrelationship of relevant biomolecules and their functions. Driven by high-throughput sequencing technologies, several promising deep learning methods have been proposed for fusing multi-omics data generated from a large number of samples. RESULTS In this study, 16 representative deep learning methods are comprehensively evaluated on simulated, single-cell, and cancer multi-omics datasets. For each of the datasets, two tasks are designed: classification and clustering. The classification performance is evaluated by using three benchmarking metrics: accuracy, F1 macro, and F1 weighted. Meanwhile, the clustering performance is evaluated by using four benchmarking metrics: the Jaccard index (JI), C-index, silhouette score, and Davies Bouldin score. For the cancer multi-omics datasets, the methods' strength in capturing the association of multi-omics dimensionality reduction results with survival and clinical annotations is further evaluated. The benchmarking results indicate that moGAT achieves the best classification performance. Meanwhile, efmmdVAE, efVAE, and lfmmdVAE show the most promising performance across all complementary contexts in clustering tasks. CONCLUSIONS Our benchmarking results not only provide a reference for biomedical researchers to choose appropriate deep learning-based multi-omics data fusion methods, but also suggest the future directions for the development of more effective multi-omics data fusion methods. The deep learning frameworks are available at https://github.com/zhenglinyi/DL-mo.
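For readers unfamiliar with the three classification metrics named above, the sketch below computes accuracy, macro-averaged F1 and support-weighted F1 from scratch. It follows the usual definitions of these metrics and is not taken from the benchmark's code; the function name and the toy labels are hypothetical.

```python
# Accuracy, macro F1, and weighted F1 from first principles (illustrative only).

def classification_metrics(y_true, y_pred):
    labels = sorted(set(y_true) | set(y_pred))
    n = len(y_true)
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / n

    f1_scores, supports = [], []
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never appears.
        f1_scores.append(2 * tp / denom if denom else 0.0)
        supports.append(tp + fn)  # number of true samples of class c

    f1_macro = sum(f1_scores) / len(labels)  # every class counts equally
    f1_weighted = sum(f * s for f, s in zip(f1_scores, supports)) / n  # weighted by class size
    return accuracy, f1_macro, f1_weighted

# Hypothetical imbalanced example: three samples of subtype A, one of subtype B.
y_true = ["A", "A", "A", "B"]
y_pred = ["A", "A", "B", "B"]
accuracy, f1_macro, f1_weighted = classification_metrics(y_true, y_pred)
print(accuracy, f1_macro, f1_weighted)
```

Macro F1 treats the rare subtype B as equally important as subtype A, while weighted F1 leans toward the majority class, so reporting both gives a fuller picture on imbalanced subtype data.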
Affiliation(s)
- Dongjin Leng
- Institute of Health Service and Transfusion Medicine, Beijing, People’s Republic of China
- Linyi Zheng
- School of Informatics, Xiamen University, Xiamen, People’s Republic of China
- Yuqi Wen
- Institute of Health Service and Transfusion Medicine, Beijing, People’s Republic of China
- Yunhao Zhang
- School of Informatics, Xiamen University, Xiamen, People’s Republic of China
- Lianlian Wu
- Academy of Medical Engineering and Translational Medicine, Tianjin University, Tianjin, People’s Republic of China
- Jing Wang
- School of Medicine, Tsinghua University, Beijing, People’s Republic of China
- Meihong Wang
- School of Informatics, Xiamen University, Xiamen, People’s Republic of China
- Zhongnan Zhang
- School of Informatics, Xiamen University, Xiamen, People’s Republic of China
- Song He
- Institute of Health Service and Transfusion Medicine, Beijing, People’s Republic of China
- Xiaochen Bo
- Institute of Health Service and Transfusion Medicine, Beijing, People’s Republic of China
|