1. Mowlaei ME, Li C, Jamialahmadi O, Dias R, Chen J, Jamialahmadi B, Rebbeck TR, Carnevale V, Kumar S, Shi X. STICI: Split-Transformer with integrated convolutions for genotype imputation. Nat Commun 2025; 16:1218. [PMID: 39890780 PMCID: PMC11785734 DOI: 10.1038/s41467-025-56273-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/23/2024] [Accepted: 01/08/2025] [Indexed: 02/03/2025]
Abstract
Despite advances in sequencing technologies, genome-scale datasets often contain missing bases and genomic segments, hindering downstream analyses. Genotype imputation addresses this issue and has been a cornerstone pre-processing step in genetic and genomic studies. Although various methods have been widely adopted for genotype imputation, it remains challenging to impute certain genomic regions and large structural variants. Here, we present a transformer-based framework, named STICI, for accurate genotype imputation. STICI models automatically learn genome-wide patterns of linkage disequilibrium, evidenced by much higher imputation accuracy in regions with highly linked variants. Our imputation results on the human 1000 Genomes Project and non-human genomes show that STICI can achieve high imputation accuracy comparable to the state-of-the-art genotype imputation methods, with the additional capability to impute multi-allelic variants and various types of genetic variants. STICI can be trained for any collection of genomes automatically using self-supervision. Moreover, STICI shows excellent performance without needing any special presuppositions about the underlying patterns in collections of non-human genomes, pointing to adaptability and applications of STICI to impute missing genotypes in any species.
Affiliation(s)
- Mohammad Erfan Mowlaei
  - Computer & Information Sciences, College of Science and Technology, Temple University, Philadelphia, PA, USA
- Chong Li
  - Computer & Information Sciences, College of Science and Technology, Temple University, Philadelphia, PA, USA
- Oveis Jamialahmadi
  - Department of Molecular and Clinical Medicine, Institute of Medicine, Sahlgrenska Academy, Wallenberg Laboratory, University of Gothenburg, Gothenburg, Sweden
- Raquel Dias
  - Department of Microbiology and Cell Science, University of Florida, Gainesville, FL, USA
- Junjie Chen
  - School of Computer Science and Technology, Harbin Institute of Technology, Shenzhen, Guangdong, China
- Benyamin Jamialahmadi
  - David R. Cheriton School of Computer Science, University of Waterloo, Waterloo, ON, Canada
- Timothy Richard Rebbeck
  - Division of Population Sciences, Dana-Farber Cancer Institute, Boston, MA, USA
  - Department of Epidemiology, Harvard T. H. Chan School of Public Health, Boston, MA, USA
- Vincenzo Carnevale
  - Institute for Genomics and Evolutionary Medicine, Temple University, Philadelphia, PA, USA
  - Institute for Computational Molecular Science, Temple University, Philadelphia, PA, USA
- Sudhir Kumar
  - Computer & Information Sciences, College of Science and Technology, Temple University, Philadelphia, PA, USA
  - Institute for Genomics and Evolutionary Medicine, Temple University, Philadelphia, PA, USA
  - Department of Biology, Temple University, Philadelphia, PA, USA
- Xinghua Shi
  - Computer & Information Sciences, College of Science and Technology, Temple University, Philadelphia, PA, USA
  - Institute for Genomics and Evolutionary Medicine, Temple University, Philadelphia, PA, USA
2. Naito T, Okada Y. Genotype imputation methods for whole and complex genomic regions utilizing deep learning technology. J Hum Genet 2024; 69:481-486. [PMID: 38225263 PMCID: PMC11422162 DOI: 10.1038/s10038-023-01213-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/13/2023] [Revised: 11/23/2023] [Accepted: 12/04/2023] [Indexed: 01/17/2024]
Abstract
The imputation of unmeasured genotypes is essential in human genetic research, particularly in enhancing the power of genome-wide association studies and conducting subsequent fine-mapping. Recently, several deep learning-based genotype imputation methods for genome-wide variants with the capability of learning complex linkage disequilibrium patterns have been developed. Additionally, deep learning-based imputation has been applied to a distinct genomic region known as the major histocompatibility complex, referred to as HLA imputation. Despite their various advantages, the current deep learning-based genotype imputation methods do have certain limitations and have not yet become standard. These limitations include the modest accuracy improvement over statistical and conventional machine learning-based methods. However, their benefits include other aspects, such as their "reference-free" nature, which ensures complete privacy protection, and their higher computational efficiency. Furthermore, the continuing evolution of deep learning technologies is expected to contribute to further improvements in prediction accuracy and usability in the future.
Affiliation(s)
- Tatsuhiko Naito
  - Department of Statistical Genetics, Osaka University Graduate School of Medicine, 2-2, Yamadaoka, Suita-shi, Osaka, 565-0871, Japan
  - Laboratory for Systems Genetics, RIKEN Center for Integrative Medical Sciences, 1-7-22, Suehiro-cho, Tsurumi-ku, Yokohama City, Kanagawa, 230-0045, Japan
- Yukinori Okada
  - Department of Statistical Genetics, Osaka University Graduate School of Medicine, 2-2, Yamadaoka, Suita-shi, Osaka, 565-0871, Japan
  - Laboratory for Systems Genetics, RIKEN Center for Integrative Medical Sciences, 1-7-22, Suehiro-cho, Tsurumi-ku, Yokohama City, Kanagawa, 230-0045, Japan
  - Department of Genome Informatics, Graduate School of Medicine, the University of Tokyo, 7-3-1, Hongo, Bunkyo-ku, Tokyo, 113-8655, Japan
  - Integrated Frontier Research for Medical Science Division, Institute for Open and Transdisciplinary Research Initiatives, Osaka University, 2-2, Yamadaoka, Suita-shi, Osaka, 565-0871, Japan
  - Premium Research Institute for Human Metaverse Medicine (WPI-PRIMe), Osaka University, 2-2, Yamadaoka, Suita-shi, Osaka, 565-0871, Japan
3. Kojima K, Tadaka S, Okamura Y, Kinoshita K. Two-stage strategy using denoising autoencoders for robust reference-free genotype imputation with missing input genotypes. J Hum Genet 2024; 69:511-518. [PMID: 38918526 PMCID: PMC11422160 DOI: 10.1038/s10038-024-01261-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/07/2023] [Revised: 04/16/2024] [Accepted: 05/13/2024] [Indexed: 06/27/2024]
Abstract
Widely used genotype imputation methods are based on the Li and Stephens model, which assumes that new haplotypes can be represented by modifying existing haplotypes in a reference panel through mutations and recombinations. These methods use genotypes from SNP arrays as inputs to estimate haplotypes that align with the input genotypes by analyzing recombination patterns within a reference panel, and then infer unobserved variants. While these methods require reference panels in an identifiable form, their public use is limited due to privacy and consent concerns. One strategy to overcome these limitations is to use de-identified haplotype information, such as summary statistics or model parameters. Advances in deep learning (DL) offer the potential to develop imputation methods that use haplotype information in a reference-free manner by handling it as model parameters, while maintaining comparable imputation accuracy to methods based on the Li and Stephens model. Here, we provide a brief introduction to DL-based reference-free genotype imputation methods, including RNN-IMP, developed by our research group. We then evaluate the performance of RNN-IMP against widely-used Li and Stephens model-based imputation methods in terms of accuracy (R2), using the 1000 Genomes Project Phase 3 dataset and corresponding simulated Omni2.5 SNP genotype data. Although RNN-IMP is sensitive to missing values in input genotypes, we propose a two-stage imputation strategy: missing genotypes are first imputed using denoising autoencoders; RNN-IMP then processes these imputed genotypes. This approach restores the imputation accuracy that is degraded by missing values, enhancing the practical use of RNN-IMP.
Affiliation(s)
- Kaname Kojima
  - Tohoku Medical Megabank Organization, Tohoku University, 2-1 Seiryo-machi, Aoba-ku, Sendai, Miyagi, 980-8573, Japan
- Shu Tadaka
  - Tohoku Medical Megabank Organization, Tohoku University, 2-1 Seiryo-machi, Aoba-ku, Sendai, Miyagi, 980-8573, Japan
- Yasunobu Okamura
  - Tohoku Medical Megabank Organization, Tohoku University, 2-1 Seiryo-machi, Aoba-ku, Sendai, Miyagi, 980-8573, Japan
  - Advanced Research Center for Innovations in Next-Generation Medicine, Tohoku University, 2-1 Seiryo-machi, Aoba-ku, Sendai, Miyagi, 980-0873, Japan
- Kengo Kinoshita
  - Tohoku Medical Megabank Organization, Tohoku University, 2-1 Seiryo-machi, Aoba-ku, Sendai, Miyagi, 980-8573, Japan
  - Advanced Research Center for Innovations in Next-Generation Medicine, Tohoku University, 2-1 Seiryo-machi, Aoba-ku, Sendai, Miyagi, 980-0873, Japan
  - Graduate School of Information Sciences, Tohoku University, 6-3-09 Aza-Aoba, Aramaki, Aoba-ku, Sendai, Miyagi, 980-8579, Japan
  - Institute of Development, Aging and Cancer, Tohoku University, 4-1 Seiryo-machi, Aoba-ku, Sendai, Miyagi, 980-8575, Japan
4. Sharma A, Lysenko A, Jia S, Boroevich KA, Tsunoda T. Advances in AI and machine learning for predictive medicine. J Hum Genet 2024; 69:487-497. [PMID: 38424184 PMCID: PMC11422165 DOI: 10.1038/s10038-024-01231-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/31/2023] [Revised: 02/04/2024] [Accepted: 02/12/2024] [Indexed: 03/02/2024]
Abstract
The field of omics, driven by advances in high-throughput sequencing, faces a data explosion. This abundance of data offers unprecedented opportunities for predictive modeling in precision medicine, but also presents formidable challenges in data analysis and interpretation. Traditional machine learning (ML) techniques have been partly successful in generating predictive models for omics analysis but exhibit limitations in handling potential relationships within the data for more accurate prediction. This review explores a revolutionary shift in predictive modeling through the application of deep learning (DL), specifically convolutional neural networks (CNNs). Using transformation methods such as DeepInsight, omics data with independent variables in tabular (table-like, including vector) form can be turned into image-like representations, enabling CNNs to capture latent features effectively. This approach not only enhances predictive power but also leverages transfer learning, reducing computational time, and improving performance. However, integrating CNNs in predictive omics data analysis is not without challenges, including issues related to model interpretability, data heterogeneity, and data size. Addressing these challenges requires a multidisciplinary approach, involving collaborations between ML experts, bioinformatics researchers, biologists, and medical doctors. This review illuminates these complexities and charts a course for future research to unlock the full predictive potential of CNNs in omics data analysis and related fields.
Affiliation(s)
- Alok Sharma
  - Laboratory for Medical Science Mathematics, Department of Biological Sciences, School of Science, The University of Tokyo, Tokyo, Japan
  - Laboratory for Medical Science Mathematics, RIKEN Center for Integrative Medical Sciences, Yokohama, Japan
  - Institute for Integrated and Intelligent Systems, Griffith University, Queensland, Australia
- Artem Lysenko
  - Laboratory for Medical Science Mathematics, Department of Biological Sciences, School of Science, The University of Tokyo, Tokyo, Japan
  - Laboratory for Medical Science Mathematics, RIKEN Center for Integrative Medical Sciences, Yokohama, Japan
- Shangru Jia
  - Laboratory for Medical Science Mathematics, Department of Computational Biology and Medical Sciences, Graduate School of Frontier Sciences, The University of Tokyo, Tokyo, Japan
- Keith A Boroevich
  - Laboratory for Medical Science Mathematics, RIKEN Center for Integrative Medical Sciences, Yokohama, Japan
- Tatsuhiko Tsunoda
  - Laboratory for Medical Science Mathematics, Department of Biological Sciences, School of Science, The University of Tokyo, Tokyo, Japan
  - Laboratory for Medical Science Mathematics, RIKEN Center for Integrative Medical Sciences, Yokohama, Japan
  - Laboratory for Medical Science Mathematics, Department of Computational Biology and Medical Sciences, Graduate School of Frontier Sciences, The University of Tokyo, Tokyo, Japan
5. Liu A, Tian B, Qiu C, Su KJ, Jiang L, Zhao C, Song M, Liu Y, Qu G, Zhou Z, Zhang X, Gnanesh SSM, Thumbigere-Math V, Luo Z, Tian Q, Zhang LS, Wu C, Ding Z, Shen H, Deng HW. Multi-View Integrative Approach For Imputing Short-Chain Fatty Acids and Identifying Key factors predicting Blood SCFA. bioRxiv 2024:2024.09.25.614767. [PMID: 39386638 PMCID: PMC11463355 DOI: 10.1101/2024.09.25.614767] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/12/2024]
Abstract
Short-chain fatty acids (SCFAs) are the main metabolites produced by bacterial fermentation of dietary fiber within the gastrointestinal tract. SCFAs produced by gut microbiotas (GMs) are absorbed by the host, reach the bloodstream, and are distributed to different organs, thus influencing host physiology. However, due to limited budgets or the poor sensitivity of instruments, most studies on GMs have incomplete blood SCFA data, limiting our understanding of the metabolic processes within the host. To address this gap, we developed an innovative multi-task multi-view integrative approach (M2AE, Multi-task Multi-View Attentive Encoders) to impute blood SCFA levels using gut metagenomic sequencing (MGS) data, while taking into account the intricate interplay among the gut microbiome, dietary features, and host characteristics, as well as the nuanced nature of SCFA dynamics within the body. Here, each view represents a distinct type of data input (i.e., gut microbiome compositions, dietary features, or host characteristics). Our method jointly explores both view-specific representations and cross-view correlations for effective predictions of SCFAs. We applied M2AE to two in-house datasets, both of which include MGS and blood SCFA profiles, host characteristics, and dietary features, from 964 subjects and 171 subjects, respectively. Results from both datasets demonstrated that M2AE outperforms traditional regression-based and neural-network-based approaches in imputing blood SCFAs. Furthermore, a series of gut bacterial species (e.g., Bacteroides thetaiotaomicron and Clostridium asparagiforme), host characteristics (e.g., race, gender), as well as dietary features (e.g., intake of fruits, pickles) were shown to contribute greatly to imputation of blood SCFAs. These findings demonstrated that GMs, dietary features, and host characteristics might contribute to the complex biological processes involved in blood SCFA production. These might pave the way for a deeper and more nuanced comprehension of how these factors impact human health.
Affiliation(s)
- Anqi Liu
  - Tulane Center for Biomedical Informatics and Genomics, Deming Department of Medicine, School of Medicine, Tulane University, New Orleans, LA, USA
- Bo Tian
  - Center for System Biology, Data Sciences, and Reproductive Health, School of Basic Medical Science, Central South University, Yuelu, Changsha, P.R. China
- Chuan Qiu
  - Tulane Center for Biomedical Informatics and Genomics, Deming Department of Medicine, School of Medicine, Tulane University, New Orleans, LA, USA
- Kuan-Jui Su
  - Tulane Center for Biomedical Informatics and Genomics, Deming Department of Medicine, School of Medicine, Tulane University, New Orleans, LA, USA
- Lindong Jiang
  - Tulane Center for Biomedical Informatics and Genomics, Deming Department of Medicine, School of Medicine, Tulane University, New Orleans, LA, USA
- Chen Zhao
  - College of Computing and Software Engineering, Kennesaw State University, GA, USA
- Meng Song
  - College of Science, Xi'an Shiyou University, Xi'an, P.R. China
- Yong Liu
  - Center for System Biology, Data Sciences, and Reproductive Health, School of Basic Medical Science, Central South University, Yuelu, Changsha, P.R. China
- Gang Qu
  - Department of Biomedical Engineering, Tulane University, New Orleans, Louisiana, USA
- Ziyu Zhou
  - School of Science and Engineering, Tulane University, New Orleans, LA, USA
- Xiao Zhang
  - Tulane Center for Biomedical Informatics and Genomics, Deming Department of Medicine, School of Medicine, Tulane University, New Orleans, LA, USA
- Shashank Sajjan Mungasavalli Gnanesh
  - Tulane Center for Biomedical Informatics and Genomics, Deming Department of Medicine, School of Medicine, Tulane University, New Orleans, LA, USA
- Vivek Thumbigere-Math
  - Division of Periodontics, University of Maryland Baltimore School of Dentistry, Baltimore, USA
- Zhe Luo
  - Tulane Center for Biomedical Informatics and Genomics, Deming Department of Medicine, School of Medicine, Tulane University, New Orleans, LA, USA
- Qing Tian
  - Tulane Center for Biomedical Informatics and Genomics, Deming Department of Medicine, School of Medicine, Tulane University, New Orleans, LA, USA
- Li-Shu Zhang
  - School of Physical Science and Engineering, College of Life Sciences and Bioengineering, Beijing Jiaotong University, Beijing, China
- Chong Wu
  - Department of Biostatistics, The University of Texas MD Anderson Cancer Center, USA
- Zhengming Ding
  - School of Science and Engineering, Tulane University, New Orleans, LA, USA
- Hui Shen
  - Tulane Center for Biomedical Informatics and Genomics, Deming Department of Medicine, School of Medicine, Tulane University, New Orleans, LA, USA
- Hong-Wen Deng
  - Tulane Center for Biomedical Informatics and Genomics, Deming Department of Medicine, School of Medicine, Tulane University, New Orleans, LA, USA
6. Huang L, Song M, Shen H, Hong H, Gong P, Deng HW, Zhang C. Deep Learning Methods for Omics Data Imputation. Biology 2023; 12:1313. [PMID: 37887023 PMCID: PMC10604785 DOI: 10.3390/biology12101313] [Citation(s) in RCA: 6] [Impact Index Per Article: 3.0] [Received: 08/08/2023] [Revised: 09/28/2023] [Accepted: 10/02/2023] [Indexed: 10/28/2023]
Abstract
One common problem in omics data analysis is missing values, which can arise due to various reasons, such as poor tissue quality and insufficient sample volumes. Instead of discarding missing values and related data, imputation approaches offer an alternative means of handling missing data. However, the imputation of missing omics data is a non-trivial task. Difficulties mainly come from high dimensionality, non-linear or non-monotonic relationships within features, technical variations introduced by sampling methods, sample heterogeneity, and the non-random missingness mechanism. Several advanced imputation methods, including deep learning-based methods, have been proposed to address these challenges. Due to its capability of modeling complex patterns and relationships in large and high-dimensional datasets, many researchers have adopted deep learning models to impute missing omics data. This review provides a comprehensive overview of the currently available deep learning-based methods for omics imputation from the perspective of deep generative model architectures such as autoencoder, variational autoencoder, generative adversarial networks, and Transformer, with an emphasis on multi-omics data imputation. In addition, this review also discusses the opportunities that deep learning brings and the challenges that it might face in this field.
Affiliation(s)
- Lei Huang
  - School of Computing Sciences and Computer Engineering, University of Southern Mississippi, Hattiesburg, MS 39406, USA
- Meng Song
  - School of Computing Sciences and Computer Engineering, University of Southern Mississippi, Hattiesburg, MS 39406, USA
- Hui Shen
  - Center for Biomedical Informatics and Genomics, School of Medicine, Tulane University, New Orleans, LA 70112, USA
- Huixiao Hong
  - Division of Bioinformatics and Biostatistics, National Center for Toxicological Research, U.S. Food and Drug Administration, Jefferson, AR 72079, USA
- Ping Gong
  - Environmental Laboratory, U.S. Army Engineer Research and Development Center, Vicksburg, MS 39180, USA
- Hong-Wen Deng
  - Center for Biomedical Informatics and Genomics, School of Medicine, Tulane University, New Orleans, LA 70112, USA
- Chaoyang Zhang
  - School of Computing Sciences and Computer Engineering, University of Southern Mississippi, Hattiesburg, MS 39406, USA