1
|
Tahir M, Norouzi M, Khan SS, Davie JR, Yamanaka S, Ashraf A. Artificial intelligence and deep learning algorithms for epigenetic sequence analysis: A review for epigeneticists and AI experts. Comput Biol Med 2024; 183:109302. [PMID: 39500240 DOI: 10.1016/j.compbiomed.2024.109302] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/20/2024] [Revised: 09/22/2024] [Accepted: 10/17/2024] [Indexed: 11/20/2024]
Abstract
Epigenetics encompasses mechanisms that can alter the expression of genes without changing the underlying genetic sequence. The epigenetic regulation of gene expression is initiated and sustained by several mechanisms such as DNA methylation, histone modifications, chromatin conformation, and non-coding RNA. The changes in gene regulation and expression can manifest in the form of various diseases and disorders such as cancer and congenital deformities. Over the last few decades, high-throughput experimental approaches have been used to identify and understand epigenetic changes, but these laboratory experimental approaches and biochemical processes are time-consuming and expensive. To overcome these challenges, machine learning and artificial intelligence (AI) approaches have been extensively used for mapping epigenetic modifications to their phenotypic manifestations. In this paper we provide a narrative review of published research on AI models trained on epigenomic data to address a variety of problems such as prediction of disease markers, gene expression, enhancer-promoter interaction, and chromatin states. The purpose of this review is twofold as it is addressed to both AI experts and epigeneticists. For AI researchers, we provided a taxonomy of epigenetics research problems that can benefit from an AI-based approach. For epigeneticists, given each of the above problems we provide a list of candidate AI solutions in the literature. We have also identified several gaps in the literature, research challenges, and recommendations to address these challenges.
Collapse
Affiliation(s)
- Muhammad Tahir
- Department of Electrical and Computer Engineering, University of Manitoba, Winnipeg, R3T 5V6, MB, Canada
| | - Mahboobeh Norouzi
- Department of Electrical and Computer Engineering, University of Manitoba, Winnipeg, R3T 5V6, MB, Canada
| | - Shehroz S Khan
- College of Engineering and Technology, American University of the Middle East, Kuwait
| | - James R Davie
- Department of Biochemistry and Medical Genetics, Max Rady College of Medicine, Rady Faculty of Health Sciences, University of Manitoba, Winnipeg, MB, Canada
| | - Soichiro Yamanaka
- Graduate School of Science, Department of Biophysics and Biochemistry, University of Tokyo, Japan
| | - Ahmed Ashraf
- Department of Electrical and Computer Engineering, University of Manitoba, Winnipeg, R3T 5V6, MB, Canada.
| |
Collapse
|
2
|
Vidanagamachchi SM, Waidyarathna KMGTR. Opportunities, challenges and future perspectives of using bioinformatics and artificial intelligence techniques on tropical disease identification using omics data. Front Digit Health 2024; 6:1471200. [PMID: 39654982 PMCID: PMC11625773 DOI: 10.3389/fdgth.2024.1471200] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/30/2024] [Accepted: 11/06/2024] [Indexed: 12/12/2024] Open
Abstract
Tropical diseases can often be caused by viruses, bacteria, parasites, and fungi. They can be spread over vectors. Analysis of multiple omics data types can be utilized in providing comprehensive insights into biological system functions and disease progression. To this end, bioinformatics tools and diverse AI techniques are pivotal in identifying and understanding tropical diseases through the analysis of omics data. In this article, we provide a thorough review of opportunities, challenges, and future directions of utilizing Bioinformatics tools and AI-assisted models on tropical disease identification using various omics data types. We conducted the review from 2015 to 2024 considering reliable databases of peer-reviewed journals and conference articles. Several keywords were taken for the article searching and around 40 articles were reviewed. According to the review, we observed that utilization of omics data with Bioinformatics tools like BLAST, and Clustal Omega can make significant outcomes in tropical disease identification. Further, the integration of multiple omics data improves biomarker identification, and disease predictions including disease outbreak predictions. Moreover, AI-assisted models can improve the precision, cost-effectiveness, and efficiency of CRISPR-based gene editing, optimizing gRNA design, and supporting advanced genetic correction. Several AI-assisted models including XAI can be used to identify diseases and repurpose therapeutic targets and biomarkers efficiently. Furthermore, recent advancements including Transformer-based models such as BERT and GPT-4, have been mainly applied for sequence analysis and functional genomics. Finally, the most recent GeneViT model, utilizing Vision Transformers, and other AI techniques like Generative Adversarial Networks, Federated Learning, Transfer Learning, Reinforcement Learning, Automated ML and Attention Mechanism have shown significant performance in disease classification using omics data.
Collapse
Affiliation(s)
- S. M. Vidanagamachchi
- Department of Computer Science, Faculty of Science, University of Ruhuna, Matara, Sri Lanka
| | - K. M. G. T. R. Waidyarathna
- Department of Information Technology, Sri Lanka Institute of Advanced Technological Education, Galle, Sri Lanka
| |
Collapse
|
3
|
Lu Q, Xu J, Zhang R, Liu H, Wang M, Liu X, Yue Z, Gao Y. RiceSNP-ABST: a deep learning approach to identify abiotic stress-associated single nucleotide polymorphisms in rice. Brief Bioinform 2024; 26:bbae702. [PMID: 39757606 PMCID: PMC11962596 DOI: 10.1093/bib/bbae702] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/02/2024] [Revised: 11/16/2024] [Accepted: 12/23/2024] [Indexed: 01/07/2025] Open
Abstract
Given the adverse effects faced by rice due to abiotic stresses, the precise and rapid identification of single nucleotide polymorphisms (SNPs) associated with abiotic stress traits (ABST-SNPs) in rice is crucial for developing resistant rice varieties. The scarcity of high-quality data related to abiotic stress in rice has hindered the development of computational models and constrained research efforts aimed at rice improvement and breeding. Genome-wide association studies provide a better statistical power to consider ABST-SNPs in rice. Meanwhile, deep learning methods have shown their capability in predicting disease- or phenotype-associated loci, but have primarily focused on human species. Therefore, developing predictive models for identifying ABST-SNPs in rice is both urgent and valuable. In this paper, a model called RiceSNP-ABST is proposed for predicting ABST-SNPs in rice. Firstly, six training datasets were generated using a novel strategy for negative sample construction. Secondly, four feature encoding methods were proposed based on DNA sequence fragments, followed by feature selection. Finally, convolutional neural networks with residual connections were used to determine whether the sequences contained rice ABST-SNPs. RiceSNP-ABST outperformed traditional machine learning and state-of-the-art methods on the benchmark dataset and demonstrated consistent generalization on an independent dataset and cross-species datasets. Notably, multi-granularity causal structure learning was employed to elucidate the relationships among DNA structural features, aiming to identify key genetic variants more effectively. The web-based tool for the RiceSNP-ABST can be accessed at http://rice-snp-abst.aielab.cc.
Collapse
Affiliation(s)
- Quan Lu
- School of Information and Artificial Intelligence, Anhui Provincial Engineering Research Center for Beidou Precision Agriculture Information, Anhui Agricultural University, 130, Changjiang West Road, Hefei, Anhui Province 230036, China
| | - Jiajun Xu
- School of Information and Artificial Intelligence, Anhui Provincial Engineering Research Center for Beidou Precision Agriculture Information, Anhui Agricultural University, 130, Changjiang West Road, Hefei, Anhui Province 230036, China
| | - Renyi Zhang
- School of Information and Artificial Intelligence, Anhui Provincial Engineering Research Center for Beidou Precision Agriculture Information, Anhui Agricultural University, 130, Changjiang West Road, Hefei, Anhui Province 230036, China
| | - Hangcheng Liu
- School of Information and Artificial Intelligence, Anhui Provincial Engineering Research Center for Beidou Precision Agriculture Information, Anhui Agricultural University, 130, Changjiang West Road, Hefei, Anhui Province 230036, China
| | - Meng Wang
- School of Information and Artificial Intelligence, Anhui Provincial Engineering Research Center for Beidou Precision Agriculture Information, Anhui Agricultural University, 130, Changjiang West Road, Hefei, Anhui Province 230036, China
| | - Xiaoshuang Liu
- Research Center for Biological Breeding Technology, Advance Academy, Anhui Agricultural University, 130, Changjiang West Road, Hefei, Anhui Province 230036, China
| | - Zhenyu Yue
- School of Information and Artificial Intelligence, Anhui Provincial Engineering Research Center for Beidou Precision Agriculture Information, Anhui Agricultural University, 130, Changjiang West Road, Hefei, Anhui Province 230036, China
- Research Center for Biological Breeding Technology, Advance Academy, Anhui Agricultural University, 130, Changjiang West Road, Hefei, Anhui Province 230036, China
| | - Yujia Gao
- School of Information and Artificial Intelligence, Anhui Provincial Engineering Research Center for Beidou Precision Agriculture Information, Anhui Agricultural University, 130, Changjiang West Road, Hefei, Anhui Province 230036, China
- Research Center for Biological Breeding Technology, Advance Academy, Anhui Agricultural University, 130, Changjiang West Road, Hefei, Anhui Province 230036, China
| |
Collapse
|
4
|
Xu J, Gao Y, Lu Q, Zhang R, Gui J, Liu X, Yue Z. RiceSNP-BST: a deep learning framework for predicting biotic stress-associated SNPs in rice. Brief Bioinform 2024; 25:bbae599. [PMID: 39562160 PMCID: PMC11576077 DOI: 10.1093/bib/bbae599] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/08/2024] [Revised: 10/07/2024] [Accepted: 11/04/2024] [Indexed: 11/21/2024] Open
Abstract
Rice consistently faces significant threats from biotic stresses, such as fungi, bacteria, pests, and viruses. Consequently, accurately and rapidly identifying previously unknown single-nucleotide polymorphisms (SNPs) in the rice genome is a critical challenge for rice research and the development of resistant varieties. However, the limited availability of high-quality rice genotype data has hindered this research. Deep learning has transformed biological research by facilitating the prediction and analysis of SNPs in biological sequence data. Convolutional neural networks are especially effective in extracting structural and local features from DNA sequences, leading to significant advancements in genomics. Nevertheless, the expanding catalog of genome-wide association studies provides valuable biological insights for rice research. Expanding on this idea, we introduce RiceSNP-BST, an automatic architecture search framework designed to predict SNPs associated with rice biotic stress traits (BST-associated SNPs) by integrating multidimensional features. Notably, the model successfully innovates the datasets, offering more precision than state-of-the-art methods while demonstrating good performance on an independent test set and cross-species datasets. Additionally, we extracted features from the original DNA sequences and employed causal inference to enhance the biological interpretability of the model. This study highlights the potential of RiceSNP-BST in advancing genome prediction in rice. Furthermore, a user-friendly web server for RiceSNP-BST (http://rice-snp-bst.aielab.cc) has been developed to support broader genome research.
Collapse
Affiliation(s)
- Jiajun Xu
- School of Information and Artificial Intelligence, Anhui Provincial Engineering Research Center for Beidou Precision Agriculture Information, Anhui Agricultural University, 130, Changjiang West Road, Hefei, Anhui Province 230036, China
| | - Yujia Gao
- School of Information and Artificial Intelligence, Anhui Provincial Engineering Research Center for Beidou Precision Agriculture Information, Anhui Agricultural University, 130, Changjiang West Road, Hefei, Anhui Province 230036, China
| | - Quan Lu
- School of Information and Artificial Intelligence, Anhui Provincial Engineering Research Center for Beidou Precision Agriculture Information, Anhui Agricultural University, 130, Changjiang West Road, Hefei, Anhui Province 230036, China
| | - Renyi Zhang
- School of Information and Artificial Intelligence, Anhui Provincial Engineering Research Center for Beidou Precision Agriculture Information, Anhui Agricultural University, 130, Changjiang West Road, Hefei, Anhui Province 230036, China
| | - Jianfeng Gui
- School of Information and Artificial Intelligence, Anhui Provincial Engineering Research Center for Beidou Precision Agriculture Information, Anhui Agricultural University, 130, Changjiang West Road, Hefei, Anhui Province 230036, China
| | - Xiaoshuang Liu
- Research Center for Biological Breeding Technology, Advance Academy, Anhui Agricultural University, 130, Changjiang West Road, Hefei, Anhui Province 230036, China
| | - Zhenyu Yue
- School of Information and Artificial Intelligence, Anhui Provincial Engineering Research Center for Beidou Precision Agriculture Information, Anhui Agricultural University, 130, Changjiang West Road, Hefei, Anhui Province 230036, China
- Research Center for Biological Breeding Technology, Advance Academy, Anhui Agricultural University, 130, Changjiang West Road, Hefei, Anhui Province 230036, China
| |
Collapse
|
5
|
Li J, Zhang D, Yang F, Zhang Q, Pan S, Zhao X, Zhang Q, Han Y, Yang J, Wang K, Zhao C. TrG2P: A transfer-learning-based tool integrating multi-trait data for accurate prediction of crop yield. PLANT COMMUNICATIONS 2024; 5:100975. [PMID: 38751121 PMCID: PMC11287160 DOI: 10.1016/j.xplc.2024.100975] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Received: 12/04/2023] [Revised: 04/14/2024] [Accepted: 05/11/2024] [Indexed: 06/24/2024]
Abstract
Yield prediction is the primary goal of genomic selection (GS)-assisted crop breeding. Because yield is a complex quantitative trait, making predictions from genotypic data is challenging. Transfer learning can produce an effective model for a target task by leveraging knowledge from a different, but related, source domain and is considered a great potential method for improving yield prediction by integrating multi-trait data. However, it has not previously been applied to genotype-to-phenotype prediction owing to the lack of an efficient implementation framework. We therefore developed TrG2P, a transfer-learning-based framework. TrG2P first employs convolutional neural networks (CNN) to train models using non-yield-trait phenotypic and genotypic data, thus obtaining pre-trained models. Subsequently, the convolutional layer parameters from these pre-trained models are transferred to the yield prediction task, and the fully connected layers are retrained, thus obtaining fine-tuned models. Finally, the convolutional layer and the first fully connected layer of the fine-tuned models are fused, and the last fully connected layer is trained to enhance prediction performance. We applied TrG2P to five sets of genotypic and phenotypic data from maize (Zea mays), rice (Oryza sativa), and wheat (Triticum aestivum) and compared its model precision to that of seven other popular GS tools: ridge regression best linear unbiased prediction (rrBLUP), random forest, support vector regression, light gradient boosting machine (LightGBM), CNN, DeepGS, and deep neural network for genomic prediction (DNNGP). TrG2P improved the accuracy of yield prediction by 39.9%, 6.8%, and 1.8% in rice, maize, and wheat, respectively, compared with predictions generated by the best-performing comparison model. Our work therefore demonstrates that transfer learning is an effective strategy for improving yield prediction by integrating information from non-yield-trait data. We attribute its enhanced prediction accuracy to the valuable information available from traits associated with yield and to training dataset augmentation. The Python implementation of TrG2P is available at https://github.com/lijinlong1991/TrG2P. The web-based tool is available at http://trg2p.ebreed.cn:81.
Collapse
Affiliation(s)
- Jinlong Li
- Information Technology Research Center, Beijing Academy of Agriculture and Forestry Sciences, Beijing 100097, China; National Engineering Research Center for Information Technology in Agriculture, Beijing 100097, China
| | - Dongfeng Zhang
- Information Technology Research Center, Beijing Academy of Agriculture and Forestry Sciences, Beijing 100097, China; National Engineering Research Center for Information Technology in Agriculture, Beijing 100097, China
| | - Feng Yang
- Information Technology Research Center, Beijing Academy of Agriculture and Forestry Sciences, Beijing 100097, China; National Engineering Research Center for Information Technology in Agriculture, Beijing 100097, China
| | - Qiusi Zhang
- Information Technology Research Center, Beijing Academy of Agriculture and Forestry Sciences, Beijing 100097, China; National Engineering Research Center for Information Technology in Agriculture, Beijing 100097, China
| | - Shouhui Pan
- Information Technology Research Center, Beijing Academy of Agriculture and Forestry Sciences, Beijing 100097, China; National Engineering Research Center for Information Technology in Agriculture, Beijing 100097, China
| | - Xiangyu Zhao
- Information Technology Research Center, Beijing Academy of Agriculture and Forestry Sciences, Beijing 100097, China; National Engineering Research Center for Information Technology in Agriculture, Beijing 100097, China
| | - Qi Zhang
- Information Technology Research Center, Beijing Academy of Agriculture and Forestry Sciences, Beijing 100097, China; National Engineering Research Center for Information Technology in Agriculture, Beijing 100097, China
| | - Yanyun Han
- Information Technology Research Center, Beijing Academy of Agriculture and Forestry Sciences, Beijing 100097, China; National Engineering Research Center for Information Technology in Agriculture, Beijing 100097, China
| | - Jinliang Yang
- Department of Agronomy and Horticulture, University of Nebraska-Lincoln, Lincoln, NE 68583, USA; Center for Plant Science Innovation, University of Nebraska-Lincoln, Lincoln, NE 68583, USA
| | - Kaiyi Wang
- Information Technology Research Center, Beijing Academy of Agriculture and Forestry Sciences, Beijing 100097, China; National Engineering Research Center for Information Technology in Agriculture, Beijing 100097, China.
| | - Chunjiang Zhao
- Information Technology Research Center, Beijing Academy of Agriculture and Forestry Sciences, Beijing 100097, China; National Engineering Research Center for Information Technology in Agriculture, Beijing 100097, China.
| |
Collapse
|
6
|
Jin W, Xia Y, Thela SR, Liu Y, Chen L. In silico generation and augmentation of regulatory variants from massively parallel reporter assay using conditional variational autoencoder. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2024:2024.06.25.600715. [PMID: 38979263 PMCID: PMC11230389 DOI: 10.1101/2024.06.25.600715] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 07/10/2024]
Abstract
Predicting the functional consequences of genetic variants in non-coding regions is a challenging problem. Massively parallel reporter assays (MPRAs), which are an in vitro high-throughput method, can simultaneously test thousands of variants by evaluating the existence of allele specific regulatory activity. Nevertheless, the identified labelled variants by MPRAs, which shows differential allelic regulatory effects on the gene expression are usually limited to the scale of hundreds, limiting their potential to be used as the training set for achieving a robust genome-wide prediction. To address the limitation, we propose a deep generative model, MpraVAE, to in silico generate and augment the training sample size of labelled variants. By benchmarking on several MPRA datasets, we demonstrate that MpraVAE significantly improves the prediction performance for MPRA regulatory variants compared to the baseline method, conventional data augmentation approaches as well as existing variant scoring methods. Taking autoimmune diseases as one example, we apply MpraVAE to perform a genome-wide prediction of regulatory variants and find that predicted regulatory variants are more enriched than background variants in enhancers, active histone marks, open chromatin regions in immune-related cell types, and chromatin states associated with promoter, enhancer activity and binding sites of cMyC and Pol II that regulate gene expression. Importantly, predicted regulatory variants are found to link immune-related genes by leveraging chromatin loop and accessible chromatin, demonstrating the importance of MpraVAE in genetic and gene discovery for complex traits.
Collapse
Affiliation(s)
- Weijia Jin
- Department of Biostatistics, University of Florida, Gainesville, FL, 32603, USA
| | - Yi Xia
- Department of Biostatistics, University of Florida, Gainesville, FL, 32603, USA
| | - Sai Ritesh Thela
- Department of Biostatistics, University of Florida, Gainesville, FL, 32603, USA
| | - Yunlong Liu
- Department of Medical and Molecular Genetics, Indiana University School of Medicine, Indianapolis, IN 46202, USA
| | - Li Chen
- Department of Biostatistics, University of Florida, Gainesville, FL, 32603, USA
| |
Collapse
|
7
|
Baumgarten N, Rumpf L, Kessler T, Schulz MH. A statistical approach for identifying single nucleotide variants that affect transcription factor binding. iScience 2024; 27:109765. [PMID: 38736546 PMCID: PMC11088338 DOI: 10.1016/j.isci.2024.109765] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/14/2023] [Revised: 01/30/2024] [Accepted: 04/15/2024] [Indexed: 05/14/2024] Open
Abstract
Non-coding variants located within regulatory elements may alter gene expression by modifying transcription factor (TF) binding sites, thereby leading to functional consequences. Different TF models are being used to assess the effect of DNA sequence variants, such as single nucleotide variants (SNVs). Often existing methods are slow and do not assess statistical significance of results. We investigated the distribution of absolute maximal differential TF binding scores for general computational models that affect TF binding. We find that a modified Laplace distribution can adequately approximate the empirical distributions. A benchmark on in vitro and in vivo datasets showed that our approach improves upon an existing method in terms of performance and speed. Applications on eQTLs and on a genome-wide association study illustrate the usefulness of our statistics by highlighting cell type-specific regulators and target genes. An implementation of our approach is freely available on GitHub and as bioconda package.
Collapse
Affiliation(s)
- Nina Baumgarten
- Institute of Cardiovascular Regeneration, Goethe University, 60590 Frankfurt am Main, Germany
- Institute for Computational Genomic Medicine, Goethe University, 60590 Frankfurt am Main, Germany
- Institute for Computer Science, Goethe University, 60590 Frankfurt am Main, Germany
- German Center for Cardiovascular Research, Partner Site Rhein-Main, 60590 Frankfurt am Main, Germany
| | - Laura Rumpf
- Institute of Cardiovascular Regeneration, Goethe University, 60590 Frankfurt am Main, Germany
- Institute for Computational Genomic Medicine, Goethe University, 60590 Frankfurt am Main, Germany
- Institute for Computer Science, Goethe University, 60590 Frankfurt am Main, Germany
- German Center for Cardiovascular Research, Partner Site Rhein-Main, 60590 Frankfurt am Main, Germany
| | - Thorsten Kessler
- German Heart Centre Munich, Department of Cardiology, School of Medicine and Health, Technical University of Munich, 80636 Munich, Germany
- German Centre for Cardiovascular Research, Partner Site Munich Heart Alliance, 80636 Munich, Germany
| | - Marcel H. Schulz
- Institute of Cardiovascular Regeneration, Goethe University, 60590 Frankfurt am Main, Germany
- Institute for Computational Genomic Medicine, Goethe University, 60590 Frankfurt am Main, Germany
- Institute for Computer Science, Goethe University, 60590 Frankfurt am Main, Germany
- German Center for Cardiovascular Research, Partner Site Rhein-Main, 60590 Frankfurt am Main, Germany
| |
Collapse
|
8
|
V JP, S AAV, P GK, N K K. A novel attention-based cross-modal transfer learning framework for predicting cardiovascular disease. Comput Biol Med 2024; 170:107977. [PMID: 38217974 DOI: 10.1016/j.compbiomed.2024.107977] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/03/2023] [Revised: 12/19/2023] [Accepted: 01/08/2024] [Indexed: 01/15/2024]
Abstract
Cardiovascular disease (CVD) remains a leading cause of death globally, presenting significant challenges in early detection and treatment. The complexity of CVD arises from its multifaceted nature, influenced by a combination of genetic, environmental, and lifestyle factors. Traditional diagnostic approaches often struggle to effectively integrate and interpret the heterogeneous data associated with CVD. Addressing this challenge, we introduce a novel Attention-Based Cross-Modal (ABCM) transfer learning framework. This framework innovatively merges diverse data types, including clinical records, medical imagery, and genetic information, through an attention-driven mechanism. This mechanism adeptly identifies and focuses on the most pertinent attributes from each data source, thereby enhancing the model's ability to discern intricate interrelationships among various data types. Our extensive testing and validation demonstrate that the ABCM framework significantly surpasses traditional single-source models and other advanced multi-source methods in predicting CVD. Specifically, our approach achieves an accuracy of 93.5%, precision of 92.0%, recall of 94.5%, and an impressive area under the curve (AUC) of 97.2%. These results not only underscore the superior predictive capability of our model but also highlight its potential in offering more accurate and early detection of CVD. The integration of cross-modal data through attention-based mechanisms provides a deeper understanding of the disease, paving the way for more informed clinical decision-making and personalized patient care.
Collapse
Affiliation(s)
- Jothi Prakash V
- Karpagam College of Engineering, Myleripalayam Village, Coimbatore, 641032, Tamil Nadu, India.
| | - Arul Antran Vijay S
- Karpagam College of Engineering, Myleripalayam Village, Coimbatore, 641032, Tamil Nadu, India.
| | - Ganesh Kumar P
- College of Engineering, Guindy, Anna University, Chennai, 600025, Tamil Nadu, India.
| | - Karthikeyan N K
- Coimbatore Institute of Technology, Peelamedu, Coimbatore, 641014, Tamil Nadu, India.
| |
Collapse
|
9
|
Zhu C, Baumgarten N, Wu M, Wang Y, Das AP, Kaur J, Ardakani FB, Duong TT, Pham MD, Duda M, Dimmeler S, Yuan T, Schulz MH, Krishnan J. CVD-associated SNPs with regulatory potential reveal novel non-coding disease genes. Hum Genomics 2023; 17:69. [PMID: 37491351 PMCID: PMC10369730 DOI: 10.1186/s40246-023-00513-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/26/2023] [Accepted: 07/12/2023] [Indexed: 07/27/2023] Open
Abstract
BACKGROUND Cardiovascular diseases (CVDs) are the leading cause of death worldwide. Genome-wide association studies (GWAS) have identified many single nucleotide polymorphisms (SNPs) appearing in non-coding genomic regions in CVDs. The SNPs may alter gene expression by modifying transcription factor (TF) binding sites and lead to functional consequences in cardiovascular traits or diseases. To understand the underlying molecular mechanisms, it is crucial to identify which variations are involved and how they affect TF binding. METHODS The SNEEP (SNP exploration and analysis using epigenomics data) pipeline was used to identify regulatory SNPs, which alter the binding behavior of TFs and link GWAS SNPs to their potential target genes for six CVDs. The human-induced pluripotent stem cells derived cardiomyocytes (hiPSC-CMs), monoculture cardiac organoids (MCOs) and self-organized cardiac organoids (SCOs) were used in the study. Gene expression, cardiomyocyte size and cardiac contractility were assessed. RESULTS By using our integrative computational pipeline, we identified 1905 regulatory SNPs in CVD GWAS data. These were associated with hundreds of genes, half of them non-coding RNAs (ncRNAs), suggesting novel CVD genes. We experimentally tested 40 CVD-associated non-coding RNAs, among them RP11-98F14.11, RPL23AP92, IGBP1P1, and CTD-2383I20.1, which were upregulated in hiPSC-CMs, MCOs and SCOs under hypoxic conditions. Further experiments showed that IGBP1P1 depletion rescued expression of hypertrophic marker genes, reduced hypoxia-induced cardiomyocyte size and improved hypoxia-reduced cardiac contractility in hiPSC-CMs and MCOs. CONCLUSIONS IGBP1P1 is a novel ncRNA with key regulatory functions in modulating cardiomyocyte size and cardiac function in our disease models. Our data suggest ncRNA IGBP1P1 as a potential therapeutic target to improve cardiac function in CVDs.
Collapse
Affiliation(s)
- Chaonan Zhu
- Institute for Cardiovascular Regeneration, Goethe University, 60590, Frankfurt Am Main, Germany
- Cardio-Pulmonary Institute, Goethe University Hospital, 60590, Frankfurt Am Main, Germany
| | - Nina Baumgarten
- Institute for Cardiovascular Regeneration, Goethe University, 60590, Frankfurt Am Main, Germany
- German Center for Cardiovascular Research, Partner Site Rhein-Main, 60590, Frankfurt Am Main, Germany
- Cardio-Pulmonary Institute, Goethe University Hospital, 60590, Frankfurt Am Main, Germany
| | - Meiqian Wu
- Institute for Cardiovascular Regeneration, Goethe University, 60590, Frankfurt Am Main, Germany
| | - Yue Wang
- Institute for Cardiovascular Regeneration, Goethe University, 60590, Frankfurt Am Main, Germany
| | - Arka Provo Das
- Institute for Cardiovascular Regeneration, Goethe University, 60590, Frankfurt Am Main, Germany
- Cardio-Pulmonary Institute, Goethe University Hospital, 60590, Frankfurt Am Main, Germany
| | - Jaskiran Kaur
- Institute for Cardiovascular Regeneration, Goethe University, 60590, Frankfurt Am Main, Germany
| | - Fatemeh Behjati Ardakani
- Institute for Cardiovascular Regeneration, Goethe University, 60590, Frankfurt Am Main, Germany
- German Center for Cardiovascular Research, Partner Site Rhein-Main, 60590, Frankfurt Am Main, Germany
- Cardio-Pulmonary Institute, Goethe University Hospital, 60590, Frankfurt Am Main, Germany
| | - Thanh Thuy Duong
- Genome Biologics, Theodor-Stern-Kai 7, 60590, Frankfurt Am Main, Germany
| | - Minh Duc Pham
- Institute for Cardiovascular Regeneration, Goethe University, 60590, Frankfurt Am Main, Germany
- Cardio-Pulmonary Institute, Goethe University Hospital, 60590, Frankfurt Am Main, Germany
- Department of Medicine III, Cardiology/Angiology/ Nephrology, Goethe University Hospital, Frankfurt, Germany
- Genome Biologics, Theodor-Stern-Kai 7, 60590, Frankfurt Am Main, Germany
| | - Maria Duda
- Genome Biologics, Theodor-Stern-Kai 7, 60590, Frankfurt Am Main, Germany
| | - Stefanie Dimmeler
- Institute for Cardiovascular Regeneration, Goethe University, 60590, Frankfurt Am Main, Germany
- German Center for Cardiovascular Research, Partner Site Rhein-Main, 60590, Frankfurt Am Main, Germany
- Cardio-Pulmonary Institute, Goethe University Hospital, 60590, Frankfurt Am Main, Germany
| | - Ting Yuan
- Institute for Cardiovascular Regeneration, Goethe University, 60590, Frankfurt Am Main, Germany.
- Cardio-Pulmonary Institute, Goethe University Hospital, 60590, Frankfurt Am Main, Germany.
- Department of Medicine III, Cardiology/Angiology/ Nephrology, Goethe University Hospital, Frankfurt, Germany.
| | - Marcel H Schulz
- Institute for Cardiovascular Regeneration, Goethe University, 60590, Frankfurt Am Main, Germany.
- German Center for Cardiovascular Research, Partner Site Rhein-Main, 60590, Frankfurt Am Main, Germany.
- Cardio-Pulmonary Institute, Goethe University Hospital, 60590, Frankfurt Am Main, Germany.
| | - Jaya Krishnan
- Institute for Cardiovascular Regeneration, Goethe University, 60590, Frankfurt Am Main, Germany.
- German Center for Cardiovascular Research, Partner Site Rhein-Main, 60590, Frankfurt Am Main, Germany.
- Cardio-Pulmonary Institute, Goethe University Hospital, 60590, Frankfurt Am Main, Germany.
- Department of Medicine III, Cardiology/Angiology/ Nephrology, Goethe University Hospital, Frankfurt, Germany.
| |
Collapse
|
10
|
Ranson JM, Bucholc M, Lyall D, Newby D, Winchester L, Oxtoby NP, Veldsman M, Rittman T, Marzi S, Skene N, Al Khleifat A, Foote IF, Orgeta V, Kormilitzin A, Lourida I, Llewellyn DJ. Harnessing the potential of machine learning and artificial intelligence for dementia research. Brain Inform 2023; 10:6. [PMID: 36829050 PMCID: PMC9958222 DOI: 10.1186/s40708-022-00183-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/31/2022] [Accepted: 12/26/2022] [Indexed: 02/26/2023] Open
Abstract
Progress in dementia research has been limited, with substantial gaps in our knowledge of targets for prevention, mechanisms for disease progression, and disease-modifying treatments. The growing availability of multimodal data sets opens possibilities for the application of machine learning and artificial intelligence (AI) to help answer key questions in the field. We provide an overview of the state of the science, highlighting current challenges and opportunities for utilisation of AI approaches to move the field forward in the areas of genetics, experimental medicine, drug discovery and trials optimisation, imaging, and prevention. Machine learning methods can enhance results of genetic studies, help determine biological effects and facilitate the identification of drug targets based on genetic and transcriptomic information. The use of unsupervised learning for understanding disease mechanisms for drug discovery is promising, while analysis of multimodal data sets to characterise and quantify disease severity and subtype are also beginning to contribute to optimisation of clinical trial recruitment. Data-driven experimental medicine is needed to analyse data across modalities and develop novel algorithms to translate insights from animal models to human disease biology. AI methods in neuroimaging outperform traditional approaches for diagnostic classification, and although challenges around validation and translation remain, there is optimism for their meaningful integration to clinical practice in the near future. AI-based models can also clarify our understanding of the causality and commonality of dementia risk factors, informing and improving risk prediction models along with the development of preventative interventions. The complexity and heterogeneity of dementia requires an alternative approach beyond traditional design and analytical approaches. Although not yet widely used in dementia research, machine learning and AI have the potential to unlock current challenges and advance precision dementia medicine.
Collapse
Affiliation(s)
- Janice M Ranson
- University of Exeter Medical School, College House, St Luke's Campus, Heavitree Road, Exeter, EX1 2LU, UK.
| | - Magda Bucholc
- Cognitive Analytics Research Lab, School of Computing, Engineering & Intelligent Systems, Ulster University, Derry, UK
| | - Donald Lyall
- Institute of Health and Wellbeing, University of Glasgow, Glasgow, UK
| | - Danielle Newby
- Department of Psychiatry, University of Oxford, Oxford, UK
| | | | - Neil P Oxtoby
- Department of Computer Science, UCL Centre for Medical Image Computing, University College London, London, UK
| | | | - Timothy Rittman
- Department of Clinical Neurosciences, University of Cambridge, Cambridge, UK
| | - Sarah Marzi
- UK Dementia Research Institute, Imperial College London, London, UK
- Department of Brain Sciences, Imperial College London, London, UK
| | - Nathan Skene
- UK Dementia Research Institute, Imperial College London, London, UK
- Department of Brain Sciences, Imperial College London, London, UK
| | - Ahmad Al Khleifat
- Department of Basic and Clinical Neuroscience, King's College London, London, UK
| | | | - Vasiliki Orgeta
- Division of Psychiatry, University College London, London, UK
| | | | - Ilianna Lourida
- University of Exeter Medical School, College House, St Luke's Campus, Heavitree Road, Exeter, EX1 2LU, UK
| | - David J Llewellyn
- University of Exeter Medical School, College House, St Luke's Campus, Heavitree Road, Exeter, EX1 2LU, UK
- The Alan Turing Institute, London, UK
| |
Collapse
|
11
|
Agarwal A, Zhao F, Jiang Y, Chen L. TIVAN-indel: a computational framework for annotating and predicting non-coding regulatory small insertions and deletions. Bioinformatics 2023; 39:btad060. [PMID: 36707993 PMCID: PMC9900211 DOI: 10.1093/bioinformatics/btad060] [Citation(s) in RCA: 6] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/17/2022] [Revised: 01/20/2023] [Accepted: 01/25/2023] [Indexed: 01/29/2023] Open
Abstract
MOTIVATION Small insertion and deletion (sindel) of human genome has an important implication for human disease. One important mechanism for non-coding sindel (nc-sindel) to have an impact on human diseases and phenotypes is through the regulation of gene expression. Nevertheless, current sequencing experiments may lack statistical power and resolution to pinpoint the functional sindel due to lower minor allele frequency or small effect size. As an alternative strategy, a supervised machine learning method can identify the otherwise masked functional sindels by predicting their regulatory potential directly. However, computational methods for annotating and predicting the regulatory sindels, especially in the non-coding regions, are underdeveloped. RESULTS By leveraging labeled nc-sindels identified by cis-expression quantitative trait loci analyses across 44 tissues in Genotype-Tissue Expression (GTEx), and a compilation of both generic functional annotations and large-scale epigenomic profiles, we develop TIssue-specific Variant Annotation for Non-coding indel (TIVAN-indel), which is a supervised computational framework for predicting non-coding regulatory sindels. As a result, we demonstrate that TIVAN-indel achieves the best prediction performance in both with-tissue prediction and cross-tissue prediction. As an independent evaluation, we train TIVAN-indel from the 'Whole Blood' tissue in GTEx and test the model using 15 immune cell types from an independent study named Database of Immune Cell Expression. Lastly, we perform an enrichment analysis for both true and predicted sindels in key regulatory regions such as chromatin interactions, open chromatin regions and histone modification sites, and find biologically meaningful enrichment patterns. AVAILABILITY AND IMPLEMENTATION https://github.com/lichen-lab/TIVAN-indel. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
- Aman Agarwal
- Department of Computer Science, Indiana University, Bloomington, IN 47405, USA
| | - Fengdi Zhao
- Department of Biostatistics, University of Florida, Gainesville, FL 32603, USA
| | - Yuchao Jiang
- Department of Biostatistics, University of North Carolina, Chapel Hill, NC 27516, USA
| | - Li Chen
- Department of Biostatistics, University of Florida, Gainesville, FL 32603, USA
| |
Collapse
|
12
|
Agarwal A, Chen L. DeepPHiC: predicting promoter-centered chromatin interactions using a novel deep learning approach. Bioinformatics 2023; 39:6887158. [PMID: 36495179 PMCID: PMC9825766 DOI: 10.1093/bioinformatics/btac801] [Citation(s) in RCA: 6] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/24/2022] [Revised: 11/23/2022] [Accepted: 12/09/2022] [Indexed: 12/14/2022] Open
Abstract
MOTIVATION Promoter-centered chromatin interactions, which include promoter-enhancer (PE) and promoter-promoter (PP) interactions, are important to decipher gene regulation and disease mechanisms. The development of next-generation sequencing technologies such as promoter capture Hi-C (pcHi-C) leads to the discovery of promoter-centered chromatin interactions. However, pcHi-C experiments are expensive and thus may be unavailable for tissues/cell types of interest. In addition, these experiments may be underpowered due to insufficient sequencing depth or various artifacts, which results in a limited finding of interactions. Most existing computational methods for predicting chromatin interactions are based on in situ Hi-C and can detect chromatin interactions across the entire genome. However, they may not be optimal for predicting promoter-centered chromatin interactions. RESULTS We develop a supervised multi-modal deep learning model, which utilizes a comprehensive set of features such as genomic sequence, epigenetic signal, anchor distance, evolutionary features and DNA structural features to predict tissue/cell type-specific PE and PP interactions. We further extend the deep learning model in a multi-task learning and a transfer learning framework and demonstrate that the proposed approach outperforms state-of-the-art deep learning methods. Moreover, the proposed approach can achieve comparable prediction performance using predefined biologically relevant tissues/cell types compared to using all tissues/cell types in the pretraining especially for predicting PE interactions. The prediction performance can be further improved by using computationally inferred biologically relevant tissues/cell types in the pretraining, which are defined based on the common genes in the proximity of two anchors in the chromatin interactions. AVAILABILITY AND IMPLEMENTATION https://github.com/lichen-lab/DeepPHiC. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
- Aman Agarwal
- Department of Computer Science, Indiana University, Bloomington, IN 47405, USA
| | - Li Chen
- To whom correspondence should be addressed.
| |
Collapse
|