1
Vinken M, Grimm D, Baatout S, Baselet B, Beheshti A, Braun M, Carstens AC, Casaletto JA, Cools B, Costes SV, De Meulemeester P, Doruk B, Eyal S, Ferreira MJS, Miranda S, Hahn C, Helvacıoğlu Akyüz S, Herbert S, Krepkiy D, Lichterfeld Y, Liemersdorf C, Krüger M, Marchal S, Ritz J, Schmakeit T, Stenuit H, Tabury K, Trittel T, Wehland M, Zhang YS, Putt KS, Zhang ZY, Tagle DA. Taking the 3Rs to a higher level: replacement and reduction of animal testing in life sciences in space research. Biotechnol Adv 2025; 81:108574. [PMID: 40180136 PMCID: PMC12048243 DOI: 10.1016/j.biotechadv.2025.108574] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/02/2025] [Revised: 03/28/2025] [Accepted: 03/29/2025] [Indexed: 04/05/2025]
Abstract
Human settlements on the Moon, crewed missions to Mars and space tourism will become a reality in the next few decades. Human presence in space, especially for extended periods of time, will therefore steeply increase. However, despite more than 60 years of spaceflight, the mechanisms underlying the effects of the space environment on human physiology are still not fully understood. Animals, ranging in complexity from flies to monkeys, have played a pioneering role in understanding the (patho)physiological outcome of critical environmental factors in space, in particular altered gravity and cosmic radiation. The use of animals in biomedical research is increasingly being criticized because of ethical reasons and limited human relevance. Driven by the 3Rs concept, calling for replacement, reduction and refinement of animal experimentation, major efforts have been focused in the past decades on the development of alternative methods that fully bypass animal testing or so-called new approach methodologies. These new approach methodologies range from simple monolayer cultures of individual primary or stem cells all up to bioprinted 3D organoids and microfluidic chips that recapitulate the complex cellular architecture of organs. Other approaches applied in life sciences in space research contribute to the reduction of animal experimentation. These include methods to mimic space conditions on Earth, such as microgravity and radiation simulators, as well as tools to support the processing, analysis or application of testing results obtained in life sciences in space research, including systems biology, live-cell, high-content and real-time analysis, high-throughput analysis, artificial intelligence and digital twins. The present paper provides an in-depth overview of such methods to replace or reduce animal testing in life sciences in space research.
Affiliation(s)
- Mathieu Vinken
  - Department of Pharmaceutical and Pharmacological Sciences, Vrije Universiteit Brussel, Brussels, Belgium
- Daniela Grimm
  - Department of Microgravity and Translational Regenerative Medicine, Otto-von-Guericke-University, Magdeburg, Germany; Department of Biomedicine, Aarhus University, Aarhus, Denmark
- Sarah Baatout
  - Nuclear Medical Applications Institute, Belgian Nuclear Research Centre, Mol, Belgium; Department of Molecular Biotechnology, Gent University, Gent, Belgium
- Bjorn Baselet
  - Nuclear Medical Applications Institute, Belgian Nuclear Research Centre, Mol, Belgium
- Afshin Beheshti
  - Center of Space Biomedicine, McGowan Institute for Regenerative Medicine, and Department of Surgery, University of Pittsburgh, Pittsburgh, PA, USA; Stanley Center for Psychiatric Research, Broad Institute of MIT and Harvard, Cambridge, MA, USA
- Markus Braun
  - German Space Agency, German Aerospace Center, Bonn, Germany
- James A Casaletto
  - Blue Marble Space Institute of Science, Space Biosciences Division, NASA Ames Research Center, Moffett Field, CA, USA
- Ben Cools
  - Department of Pharmaceutical and Pharmacological Sciences, Vrije Universiteit Brussel, Brussels, Belgium; Nuclear Medical Applications Institute, Belgian Nuclear Research Centre, Mol, Belgium
- Sylvain V Costes
  - Blue Marble Space Institute of Science, Space Biosciences Division, NASA Ames Research Center, Moffett Field, CA, USA; Space Biosciences Division, NASA Ames Research Center, Moffett Field, CA, USA
- Phoebe De Meulemeester
  - Department of Pharmaceutical and Pharmacological Sciences, Vrije Universiteit Brussel, Brussels, Belgium
- Bartu Doruk
  - Space Applications Services NV/SA, Sint-Stevens-Woluwe, Belgium; Department of Biosystems Science and Engineering, ETH Zurich, Basel, Switzerland
- Sara Eyal
  - Institute for Drug Research, School of Pharmacy, Faculty of Medicine, The Hebrew University of Jerusalem, Jerusalem, Israel
- Silvana Miranda
  - Nuclear Medical Applications Institute, Belgian Nuclear Research Centre, Mol, Belgium; Department of Molecular Biotechnology, Gent University, Gent, Belgium
- Christiane Hahn
  - European Space Agency, Human and Robotic Exploration Programmes, Human Exploration Science team, Noordwijk, the Netherlands
- Sinem Helvacıoğlu Akyüz
  - Department of Pharmaceutical and Pharmacological Sciences, Vrije Universiteit Brussel, Brussels, Belgium
- Stefan Herbert
  - Space Systems, Airbus Defence and Space, Immenstaad am Bodensee, Germany
- Dmitriy Krepkiy
  - Office of Special Initiatives, National Center for Advancing Translational Sciences, National Institutes of Health, Bethesda, MD, USA
- Yannick Lichterfeld
  - Department of Applied Aerospace Biology, Institute of Aerospace Medicine, German Aerospace Center, Cologne, Germany
- Christian Liemersdorf
  - Department of Applied Aerospace Biology, Institute of Aerospace Medicine, German Aerospace Center, Cologne, Germany
- Marcus Krüger
  - Department of Microgravity and Translational Regenerative Medicine, Otto-von-Guericke-University, Magdeburg, Germany
- Shannon Marchal
  - Department of Microgravity and Translational Regenerative Medicine, Otto-von-Guericke-University, Magdeburg, Germany
- Jette Ritz
  - Department of Pharmaceutical and Pharmacological Sciences, Vrije Universiteit Brussel, Brussels, Belgium
- Theresa Schmakeit
  - Department of Applied Aerospace Biology, Institute of Aerospace Medicine, German Aerospace Center, Cologne, Germany
- Hilde Stenuit
  - Space Applications Services NV/SA, Sint-Stevens-Woluwe, Belgium
- Kevin Tabury
  - Nuclear Medical Applications Institute, Belgian Nuclear Research Centre, Mol, Belgium
- Torsten Trittel
  - Department of Microgravity and Translational Regenerative Medicine, Otto-von-Guericke-University, Magdeburg, Germany; Department of Engineering, Brandenburg University of Applied Sciences, Brandenburg an der Havel, Germany
- Markus Wehland
  - Department of Microgravity and Translational Regenerative Medicine, Otto-von-Guericke-University, Magdeburg, Germany
- Yu Shrike Zhang
  - Division of Engineering, Department of Medicine, Brigham and Women's Hospital, Harvard Medical School, Boston, MA, USA; Harvard Stem Cell Institute, Harvard University, Cambridge, MA, USA; Broad Institute of MIT and Harvard, Cambridge, MA, USA
- Karson S Putt
  - Institute for Drug Discovery, Purdue University, West Lafayette, IN, USA
- Zhong-Yin Zhang
  - Institute for Drug Discovery, Purdue University, West Lafayette, IN, USA; Borch Department of Medicinal Chemistry and Molecular Pharmacology, Purdue University, West Lafayette, IN, USA
- Danilo A Tagle
  - Office of Special Initiatives, National Center for Advancing Translational Sciences, National Institutes of Health, Bethesda, MD, USA
2
Ai X, Smith MC, Feltus FA. GEMDiff: a diffusion workflow bridges between normal and tumor gene expression states: a breast cancer case study. Brief Bioinform 2025; 26:bbaf093. [PMID: 40067113 PMCID: PMC11894803 DOI: 10.1093/bib/bbaf093] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/11/2024] [Revised: 01/17/2025] [Accepted: 02/19/2025] [Indexed: 03/15/2025] Open
Abstract
Breast cancer remains a significant global health challenge due to its complexity, which arises from multiple genetic and epigenetic mutations that originate in normal breast tissue. Traditional machine learning models often fall short in addressing the intricate gene interactions that complicate drug design and treatment strategies. In contrast, our study introduces GEMDiff, a novel computational workflow leveraging a diffusion model to bridge the gene expression states between normal and tumor conditions. GEMDiff augments RNAseq data and simulates perturbation transformations between normal and tumor gene states, enhancing biomarker identification. GEMDiff can handle large-scale gene expression data without succumbing to the scalability and stability issues that plague other generative models. By avoiding the need for task-specific hyper-parameter tuning and specific loss functions, GEMDiff can be generalized across various tasks, making it a robust tool for gene expression analysis. The model's ability to augment RNA-seq data and simulate gene perturbations provides a valuable tool for researchers. This capability can be used to generate synthetic data for training other machine learning models, thereby addressing the issue of limited biological data and enhancing the performance of predictive models. The effectiveness of GEMDiff is demonstrated through a case study using breast mRNA gene expression data, identifying 307 core genes involved in the transition from a breast tumor to a normal gene expression state. GEMDiff is open source and available at https://github.com/xai990/GEMDiff.git under the MIT license.
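The abstract does not describe GEMDiff's noise schedule or network, so the following is only a minimal sketch of the forward (noising) step that diffusion models in general are built on, written with the Python standard library. The linear beta schedule, the 1000-step horizon, and the four-value `x0` vector are illustrative assumptions and are not taken from the paper.

```python
import math
import random


def linear_betas(num_steps, beta_start=1e-4, beta_end=0.02):
    """Per-step noise variances on a linear schedule (an assumed, common default)."""
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def cumulative_alphas(betas):
    """alpha_bar[t] = product of (1 - beta_s) for all s <= t."""
    out, prod = [], 1.0
    for b in betas:
        prod *= 1.0 - b
        out.append(prod)
    return out


def forward_noise(x0, t, alpha_bar, rng):
    """Draw x_t from q(x_t | x_0) = N(sqrt(alpha_bar_t) * x0, (1 - alpha_bar_t) * I)."""
    signal = math.sqrt(alpha_bar[t])
    noise = math.sqrt(1.0 - alpha_bar[t])
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    x_t = [signal * x + noise * e for x, e in zip(x0, eps)]
    return x_t, eps


rng = random.Random(0)
alpha_bar = cumulative_alphas(linear_betas(1000))
x0 = [0.8, -1.2, 0.1, 2.0]  # hypothetical scaled expression values for four genes
x_mid, eps_mid = forward_noise(x0, 500, alpha_bar, rng)
```

A denoising network would be trained to predict `eps_mid` from `x_mid` and the step index; that part is model-specific and is left out here.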
Affiliation(s)
- Xusheng Ai
  - Department of Electrical and Computer Engineering, Clemson University, Clemson, SC 29634, United States
- Melissa C Smith
  - Department of Electrical and Computer Engineering, Clemson University, Clemson, SC 29634, United States
- F Alex Feltus
  - Department of Genetics and Biochemistry, Clemson University, Clemson, SC 29634, United States
  - Biomedical Data Science and Informatics Program, Clemson University, Clemson, SC 29634, United States
  - Center for Human Genetics, Clemson University, Clemson, SC 29634, United States
3
Carrillo-Perez F, Pizurica M, Zheng Y, Nandi TN, Madduri R, Shen J, Gevaert O. Generation of synthetic whole-slide image tiles of tumours from RNA-sequencing data via cascaded diffusion models. Nat Biomed Eng 2025; 9:320-332. [PMID: 38514775 DOI: 10.1038/s41551-024-01193-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/27/2022] [Accepted: 02/29/2024] [Indexed: 03/23/2024]
Abstract
Training machine-learning models with synthetically generated data can alleviate the problem of data scarcity when acquiring diverse and sufficiently large datasets is costly and challenging. Here we show that cascaded diffusion models can be used to synthesize realistic whole-slide image tiles from latent representations of RNA-sequencing data from human tumours. Alterations in gene expression affected the composition of cell types in the generated synthetic image tiles, which accurately preserved the distribution of cell types and maintained the cell fraction observed in bulk RNA-sequencing data, as we show for lung adenocarcinoma, kidney renal papillary cell carcinoma, cervical squamous cell carcinoma, colon adenocarcinoma and glioblastoma. Machine-learning models pretrained with the generated synthetic data performed better than models trained from scratch. Synthetic data may accelerate the development of machine-learning models in scarce-data settings and allow for the imputation of missing data modalities.
Affiliation(s)
- Francisco Carrillo-Perez
  - Stanford Center for Biomedical Informatics Research (BMIR), Stanford University, School of Medicine, Stanford, CA, USA
- Marija Pizurica
  - Stanford Center for Biomedical Informatics Research (BMIR), Stanford University, School of Medicine, Stanford, CA, USA
  - Internet technology and Data science Lab (IDLab), Ghent University, Ghent, Belgium
- Yuanning Zheng
  - Stanford Center for Biomedical Informatics Research (BMIR), Stanford University, School of Medicine, Stanford, CA, USA
- Tarak Nath Nandi
  - Data Science and Learning Division, Argonne National Laboratory, Lemont, IL, USA
- Ravi Madduri
  - Data Science and Learning Division, Argonne National Laboratory, Lemont, IL, USA
- Jeanne Shen
  - Department of Pathology, Stanford University, School of Medicine, Palo Alto, CA, USA
- Olivier Gevaert
  - Stanford Center for Biomedical Informatics Research (BMIR), Stanford University, School of Medicine, Stanford, CA, USA
  - Department of Biomedical Data Science, Stanford University, School of Medicine, Stanford, CA, USA
4
Gangwal A, Ansari A, Ahmad I, Azad AK, Wan Sulaiman WMA. Current strategies to address data scarcity in artificial intelligence-based drug discovery: A comprehensive review. Comput Biol Med 2024; 179:108734. [PMID: 38964243 DOI: 10.1016/j.compbiomed.2024.108734] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/07/2024] [Revised: 06/01/2024] [Accepted: 06/08/2024] [Indexed: 07/06/2024]
Abstract
Artificial intelligence (AI) has played a vital role in computer-aided drug design (CADD). This development has been further accelerated with the increasing use of machine learning (ML), mainly deep learning (DL), and computing hardware and software advancements. As a result, initial doubts about the application of AI in drug discovery have been dispelled, leading to significant benefits in medicinal chemistry. At the same time, it is crucial to recognize that AI is still in its infancy and faces a few limitations that need to be addressed to harness its full potential in drug discovery. Some notable limitations are insufficient, unlabeled, and non-uniform data, the resemblance of some AI-generated molecules to existing molecules, unavailability of adequate benchmarks, intellectual property rights (IPRs) related hurdles in data sharing, poor understanding of biology, focus on proxy data and ligands, lack of holistic methods to represent input (molecular structures) to prevent pre-processing of input molecules (feature engineering), etc. The major component in AI infrastructure is input data, as most of the successes of AI-driven efforts to improve drug discovery depend on the quality and quantity of the data used to train and test AI algorithms, besides a few other factors. Additionally, data-gulping DL approaches, without sufficient data, may fail to live up to their promise. Current literature suggests a few methods that, to a certain extent, effectively handle low data for better output from AI models in the context of drug discovery. These are transfer learning (TL), active learning (AL), single or one-shot learning (OSL), multi-task learning (MTL), data augmentation (DA), data synthesis (DS), etc. One different method, which enables sharing of proprietary data on a common platform (without compromising data privacy) to train ML models, is federated learning (FL). In this review, we compare and discuss these methods, their recent applications, and limitations while modeling small molecule data to get improved output from AI methods in drug discovery. The article also sums up some other novel methods to handle inadequate data.
Affiliation(s)
- Amit Gangwal
  - Department of Natural Product Chemistry, Shri Vile Parle Kelavani Mandal's Institute of Pharmacy, Dhule, 424001, Maharashtra, India
- Azim Ansari
  - Computer Aided Drug Design Center, Shri Vile Parle Kelavani Mandal's Institute of Pharmacy, Dhule, 424001, Maharashtra, India
- Iqrar Ahmad
  - Department of Pharmaceutical Chemistry, Prof. Ravindra Nikam College of Pharmacy, Gondur, Dhule, 424002, Maharashtra, India
- Abul Kalam Azad
  - Faculty of Pharmacy, University College of MAIWP International, Batu Caves, 68100, Kuala Lumpur, Malaysia
5
Knudsen JE, Rich JM, Ma R. Artificial Intelligence in Pathomics and Genomics of Renal Cell Carcinoma. Urol Clin North Am 2024; 51:47-62. [PMID: 37945102 DOI: 10.1016/j.ucl.2023.06.002] [Citation(s) in RCA: 4] [Impact Index Per Article: 4.0] [Indexed: 11/12/2023]
Abstract
The integration of artificial intelligence (AI) with histopathology images and gene expression patterns has led to the emergence of the dynamic fields of pathomics and genomics. These fields have revolutionized renal cell carcinoma (RCC) diagnosis and subtyping and improved survival prediction models. Machine learning has identified unique gene patterns across RCC subtypes and grades, providing insights into RCC origins and potential treatments, such as targeted therapies. The combination of pathomics and genomics using AI opens new avenues in RCC research, promising future breakthroughs and innovations that patients and physicians can anticipate.
Affiliation(s)
- J Everett Knudsen
  - Catherine & Joseph Aresty Department of Urology, USC Institute of Urology, Center for Robotic Simulation & Education, University of Southern California, Los Angeles, CA, USA
- Joseph M Rich
  - Catherine & Joseph Aresty Department of Urology, USC Institute of Urology, Center for Robotic Simulation & Education, University of Southern California, Los Angeles, CA, USA
- Runzhuo Ma
  - Catherine & Joseph Aresty Department of Urology, USC Institute of Urology, Center for Robotic Simulation & Education, University of Southern California, Los Angeles, CA, USA
6
Wang Y, Chen Q, Shao H, Zhang R, Shen H. Generating bulk RNA-Seq gene expression data based on generative deep learning models and utilizing it for data augmentation. Comput Biol Med 2024; 169:107828. [PMID: 38101117 DOI: 10.1016/j.compbiomed.2023.107828] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/19/2023] [Revised: 11/22/2023] [Accepted: 12/04/2023] [Indexed: 12/17/2023]
Abstract
Large-scale high-throughput transcriptome sequencing data holds significant value in biomedical research. However, practical challenges such as difficulty in sample acquisition often limit the availability of large sample sizes, leading to decreased reliability of the analysis results. In practice, generative deep learning models, such as Generative Adversarial Networks (GANs) and Diffusion Models (DMs), have been proven to generate realistic data and may be used to solve this problem. In this study, we utilized bulk RNA-Seq gene expression data to construct different generative models with two data preprocessing methods: Min-Max-GAN, Z-Score-GAN, Min-Max-DM, and Z-Score-DM. We demonstrated that the generated data from the Min-Max-GAN model exhibited high similarity to real data, surpassing the performance of the other models significantly. Furthermore, we trained the models on the largest dataset available to date, achieving MMD (Maximum Mean Discrepancy) of 0.030 and 0.033 on the training and independent datasets, respectively. Through SHAP (SHapley Additive exPlanations) explanations of our generative model, we also enhanced our model's credibility. Finally, we applied the generated data to data augmentation and observed a significant improvement in the performance of classification models. In summary, this study establishes a GAN-based approach for generating bulk RNA-Seq gene expression data, which contributes to enhancing the performance and reliability of downstream tasks in high-throughput transcriptome analysis.
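The two preprocessing schemes and the MMD score named in this abstract can be illustrated with a small standard-library sketch. The abstract does not state which kernel, bandwidth, or estimator the authors used, so the RBF kernel, the `gamma` value, the biased estimator, and the toy profiles below are all assumptions made for illustration.

```python
import math


def min_max_scale(values):
    """Rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    span = hi - lo
    return [(v - lo) / span if span else 0.0 for v in values]


def z_score_scale(values):
    """Centre values on zero and divide by the population standard deviation."""
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return [(v - mean) / std if std else 0.0 for v in values]


def rbf_kernel(a, b, gamma):
    """Gaussian (RBF) kernel between two equal-length vectors."""
    return math.exp(-gamma * sum((x - y) ** 2 for x, y in zip(a, b)))


def mmd_squared(xs, ys, gamma=1.0):
    """Biased estimate of the squared MMD between two samples of vectors."""
    def mean_kernel(p, q):
        return sum(rbf_kernel(a, b, gamma) for a in p for b in q) / (len(p) * len(q))
    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2.0 * mean_kernel(xs, ys)


# Hypothetical two-gene profiles, already scaled to [0, 1].
real = [[0.10, 0.90], [0.20, 0.80], [0.15, 0.85]]
close = [[0.12, 0.88], [0.18, 0.82], [0.16, 0.86]]
far = [[0.90, 0.10], [0.80, 0.20], [0.85, 0.15]]

mmd_close = mmd_squared(real, close)
mmd_far = mmd_squared(real, far)
```

A generated sample that resembles the real one gives a value near zero (`mmd_close`), while a sample from a different distribution gives a clearly larger one (`mmd_far`), which is the sense in which a lower MMD indicates more realistic synthetic data.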
Affiliation(s)
- Yinglun Wang
  - School of Life Sciences and Biopharmaceutics, Guangdong Pharmaceutical University, Guangzhou, 51006, PR China
- Qiurui Chen
  - School of Life Sciences and Biopharmaceutics, Guangdong Pharmaceutical University, Guangzhou, 51006, PR China
- Hongwei Shao
  - School of Life Sciences and Biopharmaceutics, Guangdong Pharmaceutical University, Guangzhou, 51006, PR China
- Rongxin Zhang
  - School of Life Sciences and Biopharmaceutics, Guangdong Pharmaceutical University, Guangzhou, 51006, PR China
- Han Shen
  - School of Life Sciences and Biopharmaceutics, Guangdong Pharmaceutical University, Guangzhou, 51006, PR China
7
Ravaee H, Manshaei MH, Safayani M, Sartakhti JS. Intelligent phenotype-detection and gene expression profile generation with generative adversarial networks. J Theor Biol 2024; 577:111636. [PMID: 37944593 DOI: 10.1016/j.jtbi.2023.111636] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/16/2023] [Revised: 08/11/2023] [Accepted: 10/05/2023] [Indexed: 11/12/2023]
Abstract
Gene expression analysis is valuable for cancer type classification and identifying diverse cancer phenotypes. The latest high-throughput RNA sequencing devices have enabled access to large volumes of gene expression data. However, we face several challenges, such as data security and privacy, when we develop machine learning-based classifiers for categorizing cancer types with these datasets. To address these issues, we propose IP3G (Intelligent Phenotype-detection and Gene expression profile Generation with Generative adversarial network), a model based on Generative Adversarial Networks. IP3G tackles two major problems: augmenting gene expression data and unsupervised phenotype discovery. By converting gene expression profiles into 2-Dimensional images and leveraging IP3G, we generate new profiles for specific phenotypes. IP3G learns disentangled representations of gene expression patterns and identifies phenotypes without labeled data. We improve the objective function of the GAN used in IP3G by employing the earth mover distance and a novel mutual information function. IP3G outperforms clustering methods like k-Means, DBSCAN, and GMM in unsupervised phenotype discovery, while also surpassing SVM and CNN classification accuracy by up to 6% through gene expression profile augmentation. The source code for the developed IP3G is accessible to the public on GitHub.
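The earth mover (Wasserstein-1) distance mentioned in this abstract has a simple closed form for one-dimensional samples, which the sketch below computes. This is not the IP3G objective: inside a GAN the distance is approximated by a trained critic over high-dimensional data. The function name, the equal-sample-size restriction, and the example values are assumptions made only for illustration.

```python
def earth_mover_distance_1d(xs, ys):
    """Wasserstein-1 distance between two equal-size 1-D empirical samples."""
    if len(xs) != len(ys):
        raise ValueError("this sketch assumes equal sample sizes")
    # With equal weights, the optimal transport plan pairs sorted values in order.
    paired = zip(sorted(xs), sorted(ys))
    return sum(abs(a - b) for a, b in paired) / len(xs)


real_expr = [1.0, 2.0, 3.0]        # hypothetical expression values of one gene
generated_expr = [3.0, 4.0, 5.0]   # hypothetical generated values, shifted by 2
distance = earth_mover_distance_1d(real_expr, generated_expr)
```

Because the distance grows smoothly with how far the generated values sit from the real ones, it can still give a useful training signal when the two distributions barely overlap, which is one commonly cited reason for preferring it in GAN objectives.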
Affiliation(s)
- Hamid Ravaee
  - Department of Electrical and Computer Engineering, Isfahan University of Technology, Isfahan, 84156-83111, Iran
- Mohammad Hossein Manshaei
  - Department of Electrical and Computer Engineering, Isfahan University of Technology, Isfahan, 84156-83111, Iran
- Mehran Safayani
  - Department of Electrical and Computer Engineering, Isfahan University of Technology, Isfahan, 84156-83111, Iran
8
Sen SK, Green ED, Hutter CM, Craven M, Ideker T, Di Francesco V. Opportunities for basic, clinical, and bioethics research at the intersection of machine learning and genomics. Cell Genomics 2024; 4:100466. [PMID: 38190108 PMCID: PMC10794834 DOI: 10.1016/j.xgen.2023.100466] [Citation(s) in RCA: 4] [Impact Index Per Article: 4.0] [Received: 11/04/2022] [Revised: 07/14/2023] [Accepted: 11/20/2023] [Indexed: 01/09/2024]
Abstract
The data-intensive fields of genomics and machine learning (ML) are in an early stage of convergence. Genomics researchers increasingly seek to harness the power of ML methods to extract knowledge from their data; conversely, ML scientists recognize that genomics offers a wealth of large, complex, and well-annotated datasets that can be used as a substrate for developing biologically relevant algorithms and applications. The National Human Genome Research Institute (NHGRI) inquired with researchers working in these two fields to identify common challenges and receive recommendations to better support genomic research efforts using ML approaches. Those included increasing the amount and variety of training datasets by integrating genomic with multiomics, context-specific (e.g., by cell type), and social determinants of health datasets; reducing the inherent biases of training datasets; prioritizing transparency and interpretability of ML methods; and developing privacy-preserving technologies for research participants' data.
Affiliation(s)
- Shurjo K Sen
  - National Human Genome Research Institute, National Institutes of Health, Bethesda, MD 20892, USA
- Eric D Green
  - National Human Genome Research Institute, National Institutes of Health, Bethesda, MD 20892, USA
- Carolyn M Hutter
  - National Human Genome Research Institute, National Institutes of Health, Bethesda, MD 20892, USA
- Mark Craven
  - Department of Computer Sciences, University of Wisconsin-Madison, Madison, WI 53792, USA; Department of Biostatistics and Medical Informatics, University of Wisconsin-Madison, Madison, WI 53792, USA
- Trey Ideker
  - Division of Genetics, Department of Medicine, University of California San Diego, La Jolla, CA 92093, USA
- Valentina Di Francesco
  - National Human Genome Research Institute, National Institutes of Health, Bethesda, MD 20892, USA
9
Li R, Wu J, Li G, Liu J, Xuan J, Zhu Q. MDWGAN-GP: data augmentation for gene expression data based on multiple discriminator WGAN-GP. BMC Bioinformatics 2023; 24:427. [PMID: 37957576 PMCID: PMC10644641 DOI: 10.1186/s12859-023-05558-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/04/2023] [Accepted: 11/06/2023] [Indexed: 11/15/2023] Open
Abstract
BACKGROUND Although gene expression data play significant roles in biological and medical studies, their applications are hampered by the difficulty and high expense of gathering them through biological experiments. Generating high-quality gene expression data with computational methods is therefore an urgent problem. WGAN-GP, a generative adversarial network-based method, has been successfully applied in augmenting gene expression data. However, mode collapse or over-fitting may take place for small training samples because just one discriminator is adopted in the method. RESULTS In this study, an improved data augmentation approach, MDWGAN-GP, a generative adversarial network model with multiple discriminators, is proposed. In addition, a novel method is devised for enriching training samples based on a linear graph convolutional network. Extensive experiments were implemented on real biological data. CONCLUSIONS The experimental results have demonstrated that, compared with other state-of-the-art methods, the MDWGAN-GP method can produce higher-quality generated gene expression data in most cases.
Affiliation(s)
- Rongyuan Li
  - College of Computer Science and Engineering, Guangxi Normal University, Guilin, China
- Jingli Wu
  - Key Lab of Education Blockchain and Intelligent Technology, Ministry of Education, Guangxi Normal University, Guilin, China
- Gaoshi Li
  - Guangxi Key Lab of Multi-source Information Mining & Security, Guangxi Normal University, Guilin, China
- Jiafei Liu
  - Key Lab of Education Blockchain and Intelligent Technology, Ministry of Education, Guangxi Normal University, Guilin, China
- Junbo Xuan
  - Key Lab of Education Blockchain and Intelligent Technology, Ministry of Education, Guangxi Normal University, Guilin, China
- Qi Zhu
  - College of Computer Science and Engineering, Guangxi Normal University, Guilin, China
10
Chung Y, Lee H. Joint triplet loss with semi-hard constraint for data augmentation and disease prediction using gene expression data. Sci Rep 2023; 13:18178. [PMID: 37875602 PMCID: PMC10598120 DOI: 10.1038/s41598-023-45467-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/24/2023] [Accepted: 10/19/2023] [Indexed: 10/26/2023] Open
Abstract
The accurate prediction of patients with complex diseases, such as Alzheimer's disease (AD), as well as disease stages, including early- and late-stage cancer, is challenging owing to substantial variability among patients and limited availability of clinical data. Deep metric learning has emerged as a promising approach for addressing these challenges by improving data representation. In this study, we propose a joint triplet loss model with a semi-hard constraint (JTSC) to represent data in a small number of samples. JTSC strictly selects semi-hard samples by switching anchors and positive samples during the learning process in triplet embedding and combines a triplet loss function with an angular loss function. Our results indicate that JTSC significantly improves the number of appropriately represented samples during training when applied to the gene expression data of AD and to cancer stage prediction tasks. Furthermore, we demonstrate that using an embedding vector from JTSC as an input to the classifiers for AD and cancer stage prediction significantly improves classification performance by extracting more accurate features. In conclusion, we show that feature embedding through JTSC can aid in classification when there are a small number of samples compared to a larger number of features.
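As background for the semi-hard constraint named in this abstract, the sketch below shows the standard triplet loss and the usual definition of a semi-hard negative (farther from the anchor than the positive, but still inside the margin). It does not reproduce JTSC itself: the anchor/positive switching and the angular loss term described by the authors are omitted, and the margin, function names, and two-dimensional embeddings are illustrative assumptions.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triplet_loss(anchor, positive, negative, margin):
    """Standard triplet loss: max(0, d(a, p) - d(a, n) + margin)."""
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)


def is_semi_hard(anchor, positive, negative, margin):
    """True when d(a, p) < d(a, n) < d(a, p) + margin."""
    d_ap = euclidean(anchor, positive)
    d_an = euclidean(anchor, negative)
    return d_ap < d_an < d_ap + margin


def semi_hard_negatives(anchor, positive, candidates, margin):
    """Keep only the candidate negatives that satisfy the semi-hard condition."""
    return [n for n in candidates if is_semi_hard(anchor, positive, n, margin)]


anchor, positive = [0.0, 0.0], [1.0, 0.0]           # hypothetical embeddings
candidates = [[0.5, 0.0], [1.2, 0.0], [3.0, 0.0]]   # hard, semi-hard, easy
picked = semi_hard_negatives(anchor, positive, candidates, margin=0.5)
```

Only the middle candidate is selected: the first is closer to the anchor than the positive (a hard negative), and the last is already beyond the margin, so its loss is zero and it contributes no gradient.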
Affiliation(s)
- Yeonwoo Chung
  - School of Electrical Engineering and Computer Science, Gwangju Institute of Science and Technology, Gwangju, 61005, Republic of Korea
- Hyunju Lee
  - School of Electrical Engineering and Computer Science, Gwangju Institute of Science and Technology, Gwangju, 61005, Republic of Korea
  - Artificial Intelligence Graduate School, Gwangju Institute of Science and Technology, Gwangju, 61005, Republic of Korea
11
Pun FW, Ozerov IV, Zhavoronkov A. AI-powered therapeutic target discovery. Trends Pharmacol Sci 2023; 44:561-572. [PMID: 37479540 DOI: 10.1016/j.tips.2023.06.010] [Citation(s) in RCA: 88] [Impact Index Per Article: 44.0] [Received: 05/25/2023] [Revised: 06/20/2023] [Accepted: 06/23/2023] [Indexed: 07/23/2023]
Abstract
Disease modeling and target identification are the most crucial initial steps in drug discovery, and influence the probability of success at every step of drug development. Traditional target identification is a time-consuming process that takes years to decades and usually starts in an academic setting. Given its advantages of analyzing large datasets and intricate biological networks, artificial intelligence (AI) is playing a growing role in modern drug target identification. We review recent advances in target discovery, focusing on breakthroughs in AI-driven therapeutic target exploration. We also discuss the importance of striking a balance between novelty and confidence in target selection. An increasing number of AI-identified targets are being validated through experiments and several AI-derived drugs are entering clinical trials; we highlight current limitations and potential pathways for moving forward.
Affiliation(s)
- Frank W Pun
  - Insilico Medicine Hong Kong Ltd., Hong Kong Science and Technology Park, New Territories, Hong Kong
- Ivan V Ozerov
  - Insilico Medicine Hong Kong Ltd., Hong Kong Science and Technology Park, New Territories, Hong Kong
- Alex Zhavoronkov
  - Insilico Medicine Hong Kong Ltd., Hong Kong Science and Technology Park, New Territories, Hong Kong; Insilico Medicine MENA, 6F IRENA Building, Abu Dhabi, United Arab Emirates; Buck Institute for Research on Aging, Novato, CA, USA
| |
Collapse
|
12
|
Carrillo-Perez F, Pizurica M, Ozawa MG, Vogel H, West RB, Kong CS, Herrera LJ, Shen J, Gevaert O. Synthetic whole-slide image tile generation with gene expression profile-infused deep generative models. Cell Reports Methods 2023; 3:100534. [PMID: 37671024 PMCID: PMC10475789 DOI: 10.1016/j.crmeth.2023.100534] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 11/17/2022] [Revised: 03/10/2023] [Accepted: 06/22/2023] [Indexed: 09/07/2023]
Abstract
In this work, we propose an approach to generate whole-slide image (WSI) tiles by using deep generative models infused with matched gene expression profiles. First, we train a variational autoencoder (VAE) that learns a latent, lower-dimensional representation of multi-tissue gene expression profiles. Then, we use this representation to infuse generative adversarial networks (GANs) that generate lung and brain cortex tissue tiles, resulting in a new model that we call RNA-GAN. Tiles generated by RNA-GAN were preferred by expert pathologists compared with tiles generated using traditional GANs, and in addition, RNA-GAN needs fewer training epochs to generate high-quality tiles. Finally, RNA-GAN was able to generalize to gene expression profiles outside of the training set, showing imputation capabilities. A web-based quiz is available for users to play a game distinguishing real and synthetic tiles: https://rna-gan.stanford.edu/, and the code for RNA-GAN is available here: https://github.com/gevaertlab/RNA-GAN.
Affiliation(s)
- Francisco Carrillo-Perez
  - Stanford Center for Biomedical Informatics Research (BMIR), Stanford University, School of Medicine, 1265 Welch Road, Stanford, CA 94305-547, USA
  - Computer Engineering, Automatics and Robotics Department, University of Granada, C. Periodista Daniel Saucedo Aranda, s/n, Granada, 18014 Granada, Spain
- Marija Pizurica
  - Stanford Center for Biomedical Informatics Research (BMIR), Stanford University, School of Medicine, 1265 Welch Road, Stanford, CA 94305-547, USA
  - Internet Technology and Data Science Lab (IDLab), Ghent University, Technologiepark-Zwijnaarde 126, Gent, 9052 Gent, Belgium
- Michael G. Ozawa
  - Department of Pathology, Stanford University School of Medicine, 300 Pasteur Dr, Palo Alto, CA 94304, USA
- Hannes Vogel
  - Department of Pathology, Stanford University School of Medicine, 300 Pasteur Dr, Palo Alto, CA 94304, USA
- Robert B. West
  - Department of Pathology, Stanford University School of Medicine, 300 Pasteur Dr, Palo Alto, CA 94304, USA
- Christina S. Kong
  - Department of Pathology, Stanford University School of Medicine, 300 Pasteur Dr, Palo Alto, CA 94304, USA
- Luis Javier Herrera
  - Computer Engineering, Automatics and Robotics Department, University of Granada, C. Periodista Daniel Saucedo Aranda, s/n, Granada, 18014 Granada, Spain
- Jeanne Shen
  - Department of Pathology, Stanford University School of Medicine, 300 Pasteur Dr, Palo Alto, CA 94304, USA
- Olivier Gevaert
  - Stanford Center for Biomedical Informatics Research (BMIR), Stanford University, School of Medicine, 1265 Welch Road, Stanford, CA 94305-547, USA
  - Department of Biomedical Data Science, Stanford University, School of Medicine, Medical School Office Building (MSOB), 1265 Welch Road, Stanford, CA 94305-547, USA
13
Jahanyar B, Tabatabaee H, Rowhanimanesh A. MS-ACGAN: A modified auxiliary classifier generative adversarial network for schizophrenia's samples augmentation based on microarray gene expression data. Comput Biol Med 2023; 162:107024. [PMID: 37263150 DOI: 10.1016/j.compbiomed.2023.107024] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Received: 12/17/2022] [Revised: 05/01/2023] [Accepted: 05/09/2023] [Indexed: 06/03/2023]
Abstract
Artificial intelligence-based models and robust computational methods have expedited the data-to-knowledge trajectory in precision medicine. Although machine learning models have been widely applied in medical data analysis, some barriers remain, such as the shortage of available biosamples, prohibitive costs, rare diseases, and ethical considerations. Transcriptomics, an omics approach that studies gene activity and provides gene expression data such as microarray and RNA-Seq data, faces difficulties in biospecimen collection, particularly for mental disorders, as some psychiatric patients avoid medical care. Microarray data suffer from a low number of available samples, making it challenging to apply machine learning models. However, the generative adversarial network (GAN), a widely used paradigm in deep learning, has created considerable momentum in data augmentation and can efficiently expand datasets. This paper proposes a novel model termed MS-ACGAN, where the generator feeds on a bordered Gaussian distribution. In machine learning, calibration gives insight into model uncertainty and is considered a crucial step toward improving the robustness and reliability of models. Therefore, we apply calibration techniques to classifiers and focus on estimating their probabilities as accurately as possible. Additionally, we report confidence intervals, which address the limitations of point estimates by giving a range of expected values for performance metrics. Both concepts statistically describe the reliability of the model implemented in this study. Furthermore, we employ two quantitative measures, GAN-train and GAN-test, to demonstrate that the artificial data generated by our approach closely resemble the characteristics of the original data.
Affiliation(s)
- Bahareh Jahanyar
  - Department of Computer Engineering, Mashhad Branch, Islamic Azad University, Mashhad, Iran
- Hamid Tabatabaee
  - Department of Computer Engineering, Mashhad Branch, Islamic Azad University, Mashhad, Iran
14
Carrillo-Perez F, Pizurica M, Zheng Y, Nandi TN, Madduri R, Shen J, Gevaert O. RNA-to-image multi-cancer synthesis using cascaded diffusion models. bioRxiv 2023:2023.01.13.523899. [PMID: 36711711 PMCID: PMC9882105 DOI: 10.1101/2023.01.13.523899] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/18/2023]
Abstract
Data scarcity presents a significant obstacle in the field of biomedicine, where acquiring diverse and sufficient datasets can be costly and challenging. Synthetic data generation offers a potential solution to this problem by expanding dataset sizes, thereby enabling the training of more robust and generalizable machine learning models. Although previous studies have explored synthetic data generation for cancer diagnosis, they have predominantly focused on single modality settings, such as whole-slide image tiles or RNA-Seq data. To bridge this gap, we propose a novel approach, RNA-Cascaded-Diffusion-Model or RNA-CDM, for performing RNA-to-image synthesis in a multi-cancer context, drawing inspiration from successful text-to-image synthesis models used in natural images. In our approach, we employ a variational auto-encoder to reduce the dimensionality of a patient's gene expression profile, effectively distinguishing between different types of cancer. Subsequently, we employ a cascaded diffusion model to synthesize realistic whole-slide image tiles using the latent representation derived from the patient's RNA-Seq data. Our results demonstrate that the generated tiles accurately preserve the distribution of cell types observed in real-world data, with state-of-the-art cell identification models successfully detecting important cell types in the synthetic samples. Furthermore, we illustrate that the synthetic tiles maintain the cell fraction observed in bulk RNA-Seq data and that modifications in gene expression affect the composition of cell types in the synthetic tiles. Next, we utilize the synthetic data generated by RNA-CDM to pretrain machine learning models and observe improved performance compared to training from scratch. 
Our study emphasizes the potential usefulness of synthetic data in developing machine learning models in scarce-data settings, while also highlighting the possibility of imputing missing data modalities by leveraging the available information. In conclusion, our proposed RNA-CDM approach for synthetic data generation in biomedicine, particularly in the context of cancer diagnosis, offers a novel and promising solution to address data scarcity. By generating synthetic data that aligns with real-world distributions and leveraging it to pretrain machine learning models, we contribute to the development of robust clinical decision support systems and potential advancements in precision medicine.
15
Lacan A, Sebag M, Hanczar B. GAN-based data augmentation for transcriptomics: survey and comparative assessment. Bioinformatics 2023; 39:i111-i120. [PMID: 37387181 DOI: 10.1093/bioinformatics/btad239] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Indexed: 07/01/2023]
Abstract
MOTIVATION: Transcriptomics data are becoming more accessible due to high-throughput and less costly sequencing methods. However, data scarcity prevents exploiting the full predictive power of deep learning models for phenotype prediction. Artificially enhancing the training sets, namely data augmentation, is suggested as a regularization strategy. Data augmentation corresponds to label-invariant transformations of the training set (e.g. geometric transformations on images and syntax parsing on text data). Such transformations are, unfortunately, unknown in the transcriptomic field. Therefore, deep generative models such as generative adversarial networks (GANs) have been proposed to generate additional samples. In this article, we analyze GAN-based data augmentation strategies with respect to performance indicators and the classification of cancer phenotypes. RESULTS: This work highlights a significant boost in binary and multiclass classification performance due to augmentation strategies. Without augmentation, training a classifier on only 50 RNA-seq samples yields an accuracy of 94% for binary classification and 70% for tissue classification. In comparison, we achieved accuracies of 98% and 94%, respectively, when adding 1000 augmented samples. Richer architectures and more expensive training of the GAN yield better augmentation performance and overall generated data quality. Further analysis of the generated data shows that several performance indicators are needed to assess its quality correctly. AVAILABILITY AND IMPLEMENTATION: All data used for this research are publicly available and come from The Cancer Genome Atlas. Reproducible code is available on the GitLab repository: https://forge.ibisc.univ-evry.fr/alacan/GANs-for-transcriptomics.
Affiliation(s)
- Alice Lacan
  - IBISC, University Paris-Saclay (Univ. Evry), Evry 91000, France
- Michèle Sebag
  - TAU, CNRS-INRIA-LISN, University Paris-Saclay, Gif-sur-Yvette 91190, France
- Blaise Hanczar
  - IBISC, University Paris-Saclay (Univ. Evry), Evry 91000, France
16
Sanders LM, Scott RT, Yang JH, Qutub AA, Garcia Martin H, Berrios DC, Hastings JJA, Rask J, Mackintosh G, Hoarfrost AL, Chalk S, Kalantari J, Khezeli K, Antonsen EL, Babdor J, Barker R, Baranzini SE, Beheshti A, Delgado-Aparicio GM, Glicksberg BS, Greene CS, Haendel M, Hamid AA, Heller P, Jamieson D, Jarvis KJ, Komarova SV, Komorowski M, Kothiyal P, Mahabal A, Manor U, Mason CE, Matar M, Mias GI, Miller J, Myers JG, Nelson C, Oribello J, Park SM, Parsons-Wingerter P, Prabhu RK, Reynolds RJ, Saravia-Butler A, Saria S, Sawyer A, Singh NK, Snyder M, Soboczenski F, Soman K, Theriot CA, Van Valen D, Venkateswaran K, Warren L, Worthey L, Zitnik M, Costes SV. Biological research and self-driving labs in deep space supported by artificial intelligence. Nat Mach Intell 2023. [DOI: 10.1038/s42256-023-00618-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 03/28/2023]
17
Kircher M, Chludzinski E, Krepel J, Saremi B, Beineke A, Jung K. Augmentation of Transcriptomic Data for Improved Classification of Patients with Respiratory Diseases of Viral Origin. Int J Mol Sci 2022; 23:ijms23052481. [PMID: 35269624 PMCID: PMC8910329 DOI: 10.3390/ijms23052481] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/25/2021] [Revised: 02/17/2022] [Accepted: 02/21/2022] [Indexed: 02/01/2023]
Abstract
To better understand the molecular basis of respiratory diseases of viral origin, high-throughput gene-expression data are frequently taken by means of DNA microarray or RNA-seq technology. Such data can also be useful to classify infected individuals by molecular signatures in the form of machine-learning models with genes as predictor variables. Early diagnosis of patients by molecular signatures could also contribute to better treatments. An approach that has rarely been considered for machine-learning models in the context of transcriptomics is data augmentation. For other data types it has been shown that augmentation can improve classification accuracy and prevent overfitting. Here, we compare three strategies for data augmentation of DNA microarray and RNA-seq data from two selected studies on respiratory diseases of viral origin. The first study involves samples of patients with either viral or bacterial origin of the respiratory disease, the second study involves patients with either SARS-CoV-2 or another respiratory virus as disease origin. Specifically, we reanalyze these public datasets to study whether patient classification by transcriptomic signatures can be improved when adding artificial data for training of the machine-learning models. Our comparison reveals that augmentation of transcriptomic data can improve the classification accuracy and that fewer genes are necessary as explanatory variables in the final models. We also report genes from our signatures that overlap with signatures presented in the original publications of our example data. Due to strict selection criteria, the molecular role of these genes in the context of respiratory infectious diseases is underlined.
Affiliation(s)
- Magdalena Kircher
  - Institute for Animal Breeding and Genetics, University of Veterinary Medicine Hannover, Buenteweg 17p, 30559 Hannover, Germany; (M.K.); (J.K.); (B.S.)
- Elisa Chludzinski
  - Department of Pathology, University of Veterinary Medicine Hannover, Buenteweg 17, 30559 Hannover, Germany; (E.C.); (A.B.)
- Jessica Krepel
  - Institute for Animal Breeding and Genetics, University of Veterinary Medicine Hannover, Buenteweg 17p, 30559 Hannover, Germany; (M.K.); (J.K.); (B.S.)
- Babak Saremi
  - Institute for Animal Breeding and Genetics, University of Veterinary Medicine Hannover, Buenteweg 17p, 30559 Hannover, Germany; (M.K.); (J.K.); (B.S.)
- Andreas Beineke
  - Department of Pathology, University of Veterinary Medicine Hannover, Buenteweg 17, 30559 Hannover, Germany; (E.C.); (A.B.)
- Klaus Jung
  - Institute for Animal Breeding and Genetics, University of Veterinary Medicine Hannover, Buenteweg 17p, 30559 Hannover, Germany; (M.K.); (J.K.); (B.S.)
  - Correspondence: Tel.: +49-511-953-8878
18
Barbiero P, Viñas Torné R, Lió P. Graph Representation Forecasting of Patient's Medical Conditions: Toward a Digital Twin. Front Genet 2021; 12:652907. [PMID: 34603366 PMCID: PMC8481902 DOI: 10.3389/fgene.2021.652907] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Received: 01/13/2021] [Accepted: 06/24/2021] [Indexed: 01/05/2023]
Abstract
Objective: Modern medicine needs to shift from a wait and react, curative discipline to a preventative, interdisciplinary science aiming at providing personalized, systemic, and precise treatment plans to patients. To this purpose, we propose a "digital twin" of patients modeling the human body as a whole and providing a panoramic view over individuals' conditions. Methods: We propose a general framework that composes advanced artificial intelligence (AI) approaches and integrates mathematical modeling in order to provide a panoramic view over current and future pathophysiological conditions. Our modular architecture is based on a graph neural network (GNN) forecasting clinically relevant endpoints (such as blood pressure) and a generative adversarial network (GAN) providing a proof of concept of transcriptomic integrability. Results: We tested our digital twin model on two simulated clinical case studies combining information at organ, tissue, and cellular level. We provided a panoramic overview over current and future patient's conditions by monitoring and forecasting clinically relevant endpoints representing the evolution of patient's vital parameters using the GNN model. We showed how to use the GAN to generate multi-tissue expression data for blood and lung to find associations between cytokines conditioned on the expression of genes in the renin-angiotensin pathway. Our approach was to detect inflammatory cytokines, which are known to have effects on blood pressure and have previously been associated with SARS-CoV-2 infection (e.g., CXCR6, XCL1, and others). Significance: The graph representation of a computational patient has potential to solve important technological challenges in integrating multiscale computational modeling with AI. We believe that this work represents a step forward toward next-generation devices for precision and predictive medicine.
19
Shu H, Zhou J, Lian Q, Li H, Zhao D, Zeng J, Ma J. Modeling gene regulatory networks using neural network architectures. Nature Computational Science 2021; 1:491-501. [PMID: 38217125 DOI: 10.1038/s43588-021-00099-8] [Citation(s) in RCA: 74] [Impact Index Per Article: 18.5] [Received: 03/25/2021] [Accepted: 06/15/2021] [Indexed: 01/15/2024]
Abstract
Gene regulatory networks (GRNs) encode the complex molecular interactions that govern cell identity. Here we propose DeepSEM, a deep generative model that can jointly infer GRNs and biologically meaningful representation of single-cell RNA sequencing (scRNA-seq) data. In particular, we developed a neural network version of the structural equation model (SEM) to explicitly model the regulatory relationships among genes. Benchmark results show that DeepSEM achieves comparable or better performance on a variety of single-cell computational tasks, such as GRN inference, scRNA-seq data visualization, clustering and simulation, compared with the state-of-the-art methods. In addition, the gene regulations predicted by DeepSEM on cell-type marker genes in the mouse cortex can be validated by epigenetic data, which further demonstrates the accuracy and efficiency of our method. DeepSEM can provide a useful and powerful tool to analyze scRNA-seq data and infer a GRN.
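For readers unfamiliar with structural equation models, the classical linear SEM that the abstract above says DeepSEM generalizes can be sketched as follows. This is an illustrative assumption rather than the DeepSEM implementation: expression X (cells by genes) is modeled as X = XW + Z, where W is a hypothetical gene-by-gene regulatory weight matrix and Z is noise, so that X = Z(I - W)^{-1} whenever I - W is invertible.

```python
import numpy as np

def simulate_linear_sem(W, Z):
    """Solve X = X @ W + Z for X, i.e. X = Z @ inv(I - W).

    W[i, j] is the (assumed) regulatory effect of gene i on gene j;
    Z holds one row of noise per cell.
    """
    n_genes = W.shape[0]
    return Z @ np.linalg.inv(np.eye(n_genes) - W)

def sem_residual(X, W):
    """Part of the expression not explained by the regulatory weights."""
    return X - X @ W

# Hypothetical 3-gene chain: gene 0 activates gene 1, gene 1 represses gene 2.
W = np.array([[0.0, 0.5, 0.0],
              [0.0, 0.0, -0.3],
              [0.0, 0.0, 0.0]])
Z = np.array([[1.0, 2.0, 3.0]])  # noise for a single cell (illustrative values)
X = simulate_linear_sem(W, Z)
```

A strictly upper-triangular W, as used here, corresponds to an acyclic network and guarantees that I - W is invertible. Inferring W from observed X is the hard part that GRN inference methods address, and it is not attempted in this sketch.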
Affiliation(s)
- Hantao Shu
  - Institute for Interdisciplinary Information Sciences, Tsinghua University, Beijing, China
- Jingtian Zhou
  - Genomic Analysis Laboratory, The Salk Institute for Biological Studies, La Jolla, CA, USA
  - Bioinformatics and Systems Biology Program, University of California, San Diego, La Jolla, CA, USA
- Qiuyu Lian
  - UM-SJTU Joint Institute, Shanghai Jiao Tong University, Shanghai, China
  - Department of Automation, Shanghai Jiao Tong University, Shanghai, China
- Han Li
  - Institute for Interdisciplinary Information Sciences, Tsinghua University, Beijing, China
- Dan Zhao
  - Institute for Interdisciplinary Information Sciences, Tsinghua University, Beijing, China
- Jianyang Zeng
  - Institute for Interdisciplinary Information Sciences, Tsinghua University, Beijing, China
- Jianzhu Ma
  - Institute for Artificial Intelligence, Peking University, Beijing, China
20
Viñas R, Azevedo T, Gamazon ER, Liò P. Deep Learning Enables Fast and Accurate Imputation of Gene Expression. Front Genet 2021; 12:624128. [PMID: 33927746 PMCID: PMC8076954 DOI: 10.3389/fgene.2021.624128] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Received: 10/30/2020] [Accepted: 03/12/2021] [Indexed: 11/26/2022]
Abstract
A question of fundamental biological significance is to what extent the expression of a subset of genes can be used to recover the full transcriptome, with important implications for biological discovery and clinical application. To address this challenge, we propose two novel deep learning methods, PMI and GAIN-GTEx, for gene expression imputation. In order to increase the applicability of our approach, we leverage data from GTEx v8, a reference resource that has generated a comprehensive collection of transcriptomes from a diverse set of human tissues. We show that our approaches compare favorably to several standard and state-of-the-art imputation methods in terms of predictive performance and runtime in two case studies and two imputation scenarios. In comparison conducted on the protein-coding genes, PMI attains the highest performance in inductive imputation whereas GAIN-GTEx outperforms the other methods in in-place imputation. Furthermore, our results indicate strong generalization on RNA-Seq data from 3 cancer types across varying levels of missingness. Our work can facilitate a cost-effective integration of large-scale RNA biorepositories into genomic studies of disease, with high applicability across diverse tissue types.
Affiliation(s)
- Ramon Viñas
  - Department of Computer Science and Technology, University of Cambridge, Cambridge, United Kingdom
- Tiago Azevedo
  - Department of Computer Science and Technology, University of Cambridge, Cambridge, United Kingdom
- Eric R Gamazon
  - Vanderbilt Genetics Institute and Data Science Institute, VUMC, Nashville, TN, United States; MRC Epidemiology Unit, University of Cambridge, Cambridge, United Kingdom; Clare Hall, University of Cambridge, Cambridge, United Kingdom
- Pietro Liò
  - Department of Computer Science and Technology, University of Cambridge, Cambridge, United Kingdom