1
Le TT, Dang XT. Predicting TF-Target Gene Association Using a Heterogeneous Network and Enhanced Negative Sampling. Bioinform Biol Insights 2025; 19:11779322251316130. [PMID: 40012937 PMCID: PMC11863233 DOI: 10.1177/11779322251316130] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/14/2024] [Accepted: 01/10/2025] [Indexed: 02/28/2025] Open
Abstract
Identifying interactions between transcription factors (TFs) and target genes is crucial for understanding the molecular mechanisms involved in biological processes and diseases. Traditional biological experiments used to determine these interactions are often time-consuming, costly, and limited in scale. Current computational methods mainly predict binding sites rather than direct interactions. Although recent studies have achieved high performance in predicting TF-target gene associations, they still face a significant challenge related to constructing a robust dataset of positive and negative samples. Currently, methods do not adequately focus on selecting negative samples, resulting in incomplete coverage of potential TF-target gene relationships. This article proposes a method to select enhanced negative samples to improve the prediction performance of TF-target gene interactions. Experimental results show that the proposed method achieves an average area under the curve (AUC) value of 0.9024 ± 0.0008 through 5-fold cross-validation. These results demonstrate the model's high efficiency and accuracy, confirming its potential application in predicting TF-target gene interactions across various datasets and paving the way for large-scale biomedical research.
Affiliation(s)
- Thanh Tuoi Le
- Faculty of Information Technology, Hanoi National University of Education, Hanoi, Vietnam
- Faculty of Information Technology, Vinh University of Technology Education, Vinh, Vietnam
- Xuan Tho Dang
- Faculty of Digital Economics, Academy of Policy and Development, Hanoi, Vietnam
2
Huang P, Charton F, Schmelzle JNM, Darnell SS, Prins P, Garrison E, Suh GE. Pangenome-Informed Language Models for Privacy-Preserving Synthetic Genome Sequence Generation. bioRxiv 2024:2024.09.18.612131. [PMID: 39386557 PMCID: PMC11463672 DOI: 10.1101/2024.09.18.612131] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/12/2024]
Abstract
The public availability of genome datasets, such as the Human Genome Project (HGP), the 1000 Genomes Project, The Cancer Genome Atlas, and the International HapMap Project, has significantly advanced scientific research and medical understanding. Here our goal is to share such genomic information for downstream analysis while protecting the privacy of individuals through Differential Privacy (DP). We introduce synthetic DNA data generation based on pangenomes in combination with Pretrained-Language Models (PTLMs). We introduce two novel tokenization schemes based on pangenome graphs to enhance the modeling of DNA. We evaluated these tokenization methods and compared them with classical single-nucleotide and k-mer tokenizations. We find that our tokenization schemes boost the model's performance consistency with a long effective context length (covering longer sequences with the same number of tokens). Additionally, we propose a method to utilize the pangenome graph and make it comply with DP privacy standards. We assess the effect of DP training on the quality of generated sequences and discuss the trade-offs between privacy and model accuracy. The source code for our work will be published under a free and open source license soon.
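As a rough illustration of the two classical baselines named in this abstract, the following Python sketch tokenizes a toy sequence at the single-nucleotide level and as non-overlapping k-mers. The function names and the non-overlapping scheme are assumptions made for illustration; the pangenome-graph tokenizers proposed by the authors are not reproduced here.

```python
def nucleotide_tokens(seq):
    """Split a DNA string into single-nucleotide tokens."""
    return list(seq)


def kmer_tokens(seq, k):
    """Split a DNA string into non-overlapping k-mers (the last token may be shorter)."""
    return [seq[i:i + k] for i in range(0, len(seq), k)]


seq = "ACGTACGTACGT"  # toy 12-base sequence, not real data
single = nucleotide_tokens(seq)
kmers = kmer_tokens(seq, 3)

# Each k-mer token covers k bases, so a fixed token budget spans a
# k-times longer stretch of sequence than single-nucleotide tokens do.
print(len(single), len(kmers))
```

This is the sense in which a coarser tokenization gives a longer effective context length: the same number of tokens covers more bases.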
Affiliation(s)
- Pjotr Prins
- University of Tennessee Health Science Center
3
Elkayam S, Tziony I, Orenstein Y. DeepCRISTL: deep transfer learning to predict CRISPR/Cas9 on-target editing efficiency in specific cellular contexts. Bioinformatics 2024; 40:btae481. [PMID: 39073893 PMCID: PMC11319645 DOI: 10.1093/bioinformatics/btae481] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/14/2023] [Revised: 05/28/2024] [Accepted: 07/27/2024] [Indexed: 07/31/2024] Open
Abstract
MOTIVATION CRISPR/Cas9 technology has been revolutionizing the field of gene editing. Guide RNAs (gRNAs) enable Cas9 proteins to target specific genomic loci for editing. However, editing efficiency varies between gRNAs and so computational methods were developed to predict editing efficiency for any gRNA of interest. High-throughput datasets of Cas9 editing efficiencies were produced to train machine-learning models to predict editing efficiency. However, these high-throughput datasets have a low correlation with functional and endogenous datasets, which are too small to train accurate machine-learning models on. RESULTS We developed DeepCRISTL, a deep-learning model to predict the editing efficiency in a specific cellular context. DeepCRISTL takes advantage of high-throughput datasets to learn general patterns of gRNA editing efficiency and then fine-tunes the model on functional or endogenous data to fit a specific cellular context. We tested two state-of-the-art models trained on high-throughput datasets for editing efficiency prediction, our newly improved DeepHF and CRISPRon, combined with various transfer-learning approaches. The combination of CRISPRon and fine-tuning all model weights was the overall best performer. DeepCRISTL outperformed state-of-the-art methods in predicting editing efficiency in a specific cellular context on functional and endogenous datasets. Using saliency maps, we identified and compared the important features learned by DeepCRISTL across cellular contexts. We believe DeepCRISTL will improve prediction performance in many other CRISPR/Cas9 editing contexts by leveraging transfer learning to utilize both high-throughput datasets and smaller and more biologically relevant datasets. AVAILABILITY AND IMPLEMENTATION DeepCRISTL is available via https://github.com/OrensteinLab/DeepCRISTL.
Affiliation(s)
- Shai Elkayam
- School of Electrical and Computer Engineering, Ben-Gurion University of the Negev, Beer-Sheva 8410501, Israel
- Ido Tziony
- Department of Computer Science, Bar-Ilan University, Ramat Gan 5290002, Israel
- Yaron Orenstein
- Department of Computer Science, Bar-Ilan University, Ramat Gan 5290002, Israel
- The Mina and Everard Goodman Faculty of Life Sciences, Bar-Ilan University, Ramat Gan 5290002, Israel
4
Park JH, Prasad V, Newsom S, Najar F, Rajan R. IdMotif: An Interactive Motif Identification in Protein Sequences. IEEE Computer Graphics and Applications 2024; 44:114-125. [PMID: 38127603 DOI: 10.1109/mcg.2023.3345742] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/23/2023]
Abstract
This article presents a visual analytics framework, idMotif, to support domain experts in identifying motifs in protein sequences. A motif is a short sequence of amino acids usually associated with distinct functions of a protein, and identifying similar motifs in protein sequences helps us to predict certain types of disease or infection. idMotif can be used to explore, analyze, and visualize such motifs in protein sequences. We introduce a deep-learning-based method for grouping protein sequences and allow users to discover motif candidates of protein groups based on local explanations of the decision of a deep-learning model. idMotif provides several interactive linked views for between and within protein cluster/group and sequence analysis. Through a case study and experts' feedback, we demonstrate how the framework helps domain experts analyze protein sequences and identify motifs.
5
Mahmoud MAB. Classification of DNA Sequence Based on a Non-gradient Algorithm: Pseudoinverse Learners. Methods Mol Biol 2024; 2744:359-373. [PMID: 38683331 DOI: 10.1007/978-1-0716-3581-0_23] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/01/2024]
Abstract
This chapter proposes a prototype-based classification approach for analyzing DNA barcodes that uses a spectral representation of DNA sequences and a non-gradient neural network. Biological sequences can be viewed as data components with higher non-fixed dimensions, which correspond to the length of the sequences. Numerical encoding, through computational procedures such as one-hot encoding (OHE), plays an important role in DNA sequence evaluation. However, the OHE method has some disadvantages: (1) it does not add any details that could result in an additional predictive variable, and (2) if the variable has many classes, OHE significantly expands the feature space. To address these shortcomings, this chapter proposes a computationally efficient framework for classifying DNA sequences of living organisms in the image domain. A multilayer perceptron trained by a pseudoinverse learning autoencoder (PILAE) algorithm is used in the proposed strategy. The learning control parameters and the number of hidden layers do not have to be specified during the PILAE training process. As a result, the PILAE classifier outperforms other deep neural network (DNN) strategies such as the VGG-16 and Xception models.
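For readers unfamiliar with one-hot encoding, the short Python sketch below shows how it maps a DNA string to a binary matrix and why the feature space grows with the number of classes. The four-letter alphabet, the all-zero handling of unknown symbols, and the function name are assumptions for illustration, not part of the chapter's method.

```python
ALPHABET = "ACGT"  # assumed four-letter DNA alphabet


def one_hot(seq, alphabet=ALPHABET):
    """Encode each base as a binary vector with a single 1 at the base's index."""
    index = {base: i for i, base in enumerate(alphabet)}
    rows = []
    for base in seq:
        row = [0] * len(alphabet)
        if base in index:  # unknown symbols such as N stay all-zero (a design choice)
            row[index[base]] = 1
        rows.append(row)
    return rows


encoded = one_hot("ACGN")

# The encoded size is len(seq) * len(alphabet): every extra class adds one
# column per position, which is the feature-space expansion noted above.
n_features = len(encoded) * len(ALPHABET)
print(encoded, n_features)
```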
Affiliation(s)
- Mohammed A B Mahmoud
- Faculty of Computer Science, October University for Modern Sciences and Arts, Cairo, Egypt.
6
Yue T, Wang Y, Zhang L, Gu C, Xue H, Wang W, Lyu Q, Dun Y. Deep Learning for Genomics: From Early Neural Nets to Modern Large Language Models. Int J Mol Sci 2023; 24:15858. [PMID: 37958843 PMCID: PMC10649223 DOI: 10.3390/ijms242115858] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 09/30/2023] [Revised: 10/24/2023] [Accepted: 10/30/2023] [Indexed: 11/15/2023] Open
Abstract
The data explosion driven by advancements in genomic research, such as high-throughput sequencing techniques, is constantly challenging conventional methods used in genomics. In parallel with the urgent demand for robust algorithms, deep learning has succeeded in various fields such as vision, speech, and text processing. Yet genomics entails unique challenges to deep learning, since we expect a superhuman intelligence that explores beyond our knowledge to interpret the genome from deep learning. A powerful deep learning model should rely on the insightful utilization of task-specific knowledge. In this paper, we briefly discuss the strengths of different deep learning models from a genomic perspective so as to fit each particular task with proper deep learning-based architecture, and we remark on practical considerations of developing deep learning architectures for genomics. We also provide a concise review of deep learning applications in various aspects of genomic research and point out current challenges and potential research directions for future genomics applications. We believe the collaborative use of ever-growing diverse data and the fast iteration of deep learning models will continue to contribute to the future of genomics.
Affiliation(s)
- Tianwei Yue
- School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213, USA
- Yuanxin Wang
- School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213, USA
- Longxiang Zhang
- School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213, USA
- Chunming Gu
- Department of Biomedical Engineering, School of Medicine, Johns Hopkins University, Baltimore, MD 21218, USA
- Haoru Xue
- The Robotics Institute, Carnegie Mellon University, Pittsburgh, PA 15213, USA
- Wenping Wang
- School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213, USA
- Qi Lyu
- Department of Computational Mathematics, Science, and Engineering, Michigan State University, East Lansing, MI 48824, USA
- Yujie Dun
- School of Information and Communications Engineering, Xi’an Jiaotong University, Xi’an 710049, China
7
Huang J, Zhou Y, Zhang H, Wu Y. A neural network model to screen feature genes for pancreatic cancer. BMC Bioinformatics 2023; 24:193. [PMID: 37170188 PMCID: PMC10176951 DOI: 10.1186/s12859-023-05322-z] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 12/09/2022] [Accepted: 05/05/2023] [Indexed: 05/13/2023] Open
Abstract
Pancreatic cancer remains a worldwide problem because of its high degree of malignancy and high mortality. Neural network model analysis is an efficient machine learning method that can quickly and accurately predict disease feature genes. The aim of our research was to build a neural network model that would help screen out feature genes for pancreatic cancer diagnosis and prediction of prognosis. Our study confirmed that the neural network model is a reliable way to predict feature genes of pancreatic cancer, and that infiltrating immune cells, especially neutrophils, play an essential role in the development of pancreatic cancer. ANO1, AHNAK2, and ADAM9 were eventually identified as feature genes of pancreatic cancer, helping to diagnose the disease and predict prognosis. Neural network model analysis provides a new approach to finding intervention targets for pancreatic cancer.
Affiliation(s)
- Jing Huang
- Department of Gastroenterology, First Hospital of Jiaxing, Jiaxing, 314001, Zhejiang, China
- Yuting Zhou
- Department of Respiratory, The 904th Hospital of Joint Logistic Support Force of PLA, Affiliated Hospital of Jiangnan University, Wuxi, 214000, Jiangsu, China
- Haoran Zhang
- Department of Gastroenterology, First Hospital of Jiaxing, Jiaxing, 314001, Zhejiang, China
- Yiming Wu
- Department of Gastroenterology, First Hospital of Jiaxing, Jiaxing, 314001, Zhejiang, China.
8
Lee J, Lee SM, Ahn JM, Lee TR, Kim W, Cho EH, Ki CS. Development and performance evaluation of an artificial intelligence algorithm using cell-free DNA fragment distance for non-invasive prenatal testing (aiD-NIPT). Front Genet 2022; 13:999587. [DOI: 10.3389/fgene.2022.999587] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/21/2022] [Accepted: 11/09/2022] [Indexed: 11/30/2022] Open
Abstract
With advances in next-generation sequencing technology, non-invasive prenatal testing (NIPT) has been widely implemented to detect fetal aneuploidies, including trisomy 21, 18, and 13 (T21, T18, and T13). Most NIPT methods use cell-free DNA (cfDNA) fragment count (FC) in maternal blood. In this study, we developed a novel NIPT method using cfDNA fragment distance (FD) and a convolutional neural network-based artificial intelligence algorithm (aiD-NIPT). Four types of aiD-NIPT algorithm (mean, median, interquartile range, and their ensemble) were developed using 2,215 samples. In an analysis of 17,678 clinical samples, all algorithms showed >99.40% accuracy for T21/T18/T13, and the ensemble algorithm showed the best performance (sensitivity: 99.07%, positive predictive value (PPV): 88.43%); the FC-based conventional Z-score and normalized chromosomal value showed 98.15% sensitivity, with 40.77% and 36.81% PPV, respectively. In conclusion, FD-based aiD-NIPT was successfully developed, and it showed better performance than FC-based NIPT methods.
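The two headline metrics in this abstract follow directly from confusion-matrix counts. The Python sketch below states the standard definitions; the counts are hypothetical, chosen only so that the ratios round to the ensemble figures quoted above, and are not taken from the paper.

```python
def sensitivity(tp, fn):
    """Fraction of affected cases the test flags: TP / (TP + FN)."""
    return tp / (tp + fn)


def positive_predictive_value(tp, fp):
    """Fraction of positive calls that are truly affected: TP / (TP + FP)."""
    return tp / (tp + fp)


# Hypothetical counts for illustration only (not reported in the abstract).
tp, fn, fp = 107, 1, 14

sens_pct = round(100 * sensitivity(tp, fn), 2)
ppv_pct = round(100 * positive_predictive_value(tp, fp), 2)
print(sens_pct, ppv_pct)
```

Because aneuploidies are rare in a screening population, a handful of extra false positives lowers PPV sharply while sensitivity barely moves, which is why the abstract reports similar sensitivities but very different PPVs across methods.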
9
Bai X, Ren J, Sun F. MLR-OOD: A Markov Chain Based Likelihood Ratio Method for Out-Of-Distribution Detection of Genomic Sequences. J Mol Biol 2022; 434:167586. [PMID: 35427634 PMCID: PMC10433695 DOI: 10.1016/j.jmb.2022.167586] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 01/16/2022] [Revised: 04/05/2022] [Accepted: 04/05/2022] [Indexed: 12/23/2022]
Abstract
Machine learning or deep learning models have been widely used for taxonomic classification of metagenomic sequences and many studies reported high classification accuracy. Such models are usually trained based on sequences in several training classes in hope of accurately classifying unknown sequences into these classes. However, when deploying the classification models on real testing data sets, sequences that do not belong to any of the training classes may be present and are falsely assigned to one of the training classes with high confidence. Such sequences are referred to as out-of-distribution (OOD) sequences and are ubiquitous in metagenomic studies. To address this problem, we develop a deep generative model-based method, MLR-OOD, that measures the probability of a testing sequence belonging to OOD by the likelihood ratio of the maximum of the in-distribution (ID) class conditional likelihoods and the Markov chain likelihood of the testing sequence measuring the sequence complexity. We compose three different microbial data sets consisting of bacterial, viral, and plasmid sequences for comprehensively benchmarking OOD detection methods. We show that MLR-OOD achieves state-of-the-art performance, demonstrating the generality of MLR-OOD to various types of microbial data sets. It is also shown that MLR-OOD is robust to the GC content, which is a major confounding effect for OOD detection of genomic sequences. In conclusion, MLR-OOD will greatly reduce false positives caused by OOD sequences in metagenomic sequence classification.
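The likelihood-ratio idea described in this abstract can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the authors' implementation: the class-conditional log-likelihoods are passed in as plain numbers (in the paper they come from deep generative models, which are not reproduced here), and the Markov chain is assumed to be first-order, fitted to the test sequence itself with add-one smoothing. The function names are hypothetical.

```python
import math
from collections import Counter


def markov_log_likelihood(seq, order=1, alphabet="ACGT"):
    """Log-likelihood of seq under a Markov chain fitted to seq itself.

    Add-one smoothing is an illustrative choice. Low-complexity sequences
    receive a higher value than diverse ones of the same length.
    """
    context_counts = Counter()
    transition_counts = Counter()
    for i in range(order, len(seq)):
        context = seq[i - order:i]
        context_counts[context] += 1
        transition_counts[(context, seq[i])] += 1

    log_lik = 0.0
    for i in range(order, len(seq)):
        context = seq[i - order:i]
        p = (transition_counts[(context, seq[i])] + 1) / (
            context_counts[context] + len(alphabet)
        )
        log_lik += math.log(p)
    return log_lik


def mlr_score(class_log_likelihoods, seq):
    """Log likelihood ratio: best in-distribution class versus the Markov chain."""
    return max(class_log_likelihoods) - markov_log_likelihood(seq)


# Hypothetical class log-likelihoods for one test sequence.
score = mlr_score([-12.0, -9.5], "ACGTTGCA")
print(score)
```

Subtracting the Markov chain term corrects the raw class likelihood for sequence complexity; how the resulting score is thresholded to call a sequence OOD is left out of this sketch.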
Affiliation(s)
- Xin Bai
- Department of Quantitative and Computational Biology, University of Southern California, Los Angeles, CA 90089, USA
- Jie Ren
- Google Research, Brain Team, USA
- Fengzhu Sun
- Department of Quantitative and Computational Biology, University of Southern California, Los Angeles, CA 90089, USA.
10
Harikrishnan NB, Pranay SY, Nagaraj N. Classification of SARS-CoV-2 viral genome sequences using Neurochaos Learning. Med Biol Eng Comput 2022; 60:2245-2255. [PMID: 35668230 PMCID: PMC9170350 DOI: 10.1007/s11517-022-02591-3] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 10/27/2021] [Accepted: 04/28/2022] [Indexed: 12/01/2022]
Abstract
The high spread rate of SARS-CoV-2 virus has put the researchers all over the world in a demanding situation. The need of the hour is to develop novel learning algorithms that can effectively learn a general pattern by training with fewer genome sequences of coronavirus. Learning from very few training samples is necessary and important during the beginning of a disease outbreak when sequencing data is limited. This is because a successful detection and isolation of patients can curb the spread of the virus. However, this poses a huge challenge for machine learning and deep learning algorithms as they require huge amounts of training data to learn the pattern and distinguish from other closely related viruses. In this paper, we propose a new paradigm – Neurochaos Learning (NL) for classification of coronavirus genome sequence that addresses this specific problem. NL is inspired from the empirical evidence of chaos and non-linearity at the level of neurons in biological neural networks. The average sensitivity, specificity and accuracy for NL are 0.998, 0.999 and 0.998 respectively for the multiclass classification problem (SARS-CoV-2, Coronaviridae, Metapneumovirus, Rhinovirus and Influenza) using leave one out crossvalidation. With just one training sample per class for 1000 independent random trials of training, we report an average macro F1-score > 0.99 for the classification of SARS-CoV-2 from SARS-CoV-1 genome sequences. We compare the performance of NL with K-nearest neighbours (KNN), logistic regression, random forest, SVM, and naïve Bayes classifiers. We foresee promising future applications in genome classification using NL with novel combinations of chaotic feature engineering and other machine learning algorithms.
Supplementary Information The online version contains supplementary material available at 10.1007/s11517-022-02591-3.
Affiliation(s)
- N. B. Harikrishnan
- The University of Trans-Disciplinary Health Sciences and Technology, Bengaluru, 560064 Karnataka India
- Consciousness Studies Programme, National Institute of Advanced Studies, Indian Institute of Science Campus, Bengaluru, 560012 Karnataka India
- S. Y. Pranay
- Consciousness Studies Programme, National Institute of Advanced Studies, Indian Institute of Science Campus, Bengaluru, 560012 Karnataka India
- Nithin Nagaraj
- Consciousness Studies Programme, National Institute of Advanced Studies, Indian Institute of Science Campus, Bengaluru, 560012 Karnataka India
11
Alharbi WS, Rashid M. A review of deep learning applications in human genomics using next-generation sequencing data. Hum Genomics 2022; 16:26. [PMID: 35879805 PMCID: PMC9317091 DOI: 10.1186/s40246-022-00396-x] [Citation(s) in RCA: 31] [Impact Index Per Article: 10.3] [Received: 11/24/2021] [Accepted: 07/12/2022] [Indexed: 12/02/2022] Open
Abstract
Genomics is advancing towards data-driven science. With the advent of high-throughput data-generating technologies in human genomics, we are overwhelmed by the volume of genomic data. Artificial intelligence, especially deep learning methods, has been instrumental in extracting knowledge and patterns from these data. In the current review, we address the development and application of deep learning methods and models in different subareas of human genomics. We assess which areas of genomics have been over- and under-charted by deep learning techniques. The deep learning algorithms underlying the genomic tools are discussed briefly in the later part of this review. Finally, we briefly discuss recent applications of deep learning tools in genomics. In conclusion, this review is timely for biotechnology and genomics scientists, guiding them on why, when, and how to use deep learning methods to analyse human genomic data.
Affiliation(s)
- Wardah S Alharbi
- Department of AI and Bioinformatics, King Abdullah International Medical Research Center (KAIMRC), King Saud Bin Abdulaziz University for Health Sciences (KSAU-HS), King Abdulaziz Medical City, Ministry of National Guard Health Affairs, P.O. Box 22490, Riyadh, 11426, Saudi Arabia
- Mamoon Rashid
- Department of AI and Bioinformatics, King Abdullah International Medical Research Center (KAIMRC), King Saud Bin Abdulaziz University for Health Sciences (KSAU-HS), King Abdulaziz Medical City, Ministry of National Guard Health Affairs, P.O. Box 22490, Riyadh, 11426, Saudi Arabia.
12
Ng JWK, Ong EHQ, Tucker-Kellogg L, Tucker-Kellogg G. Deep learning for de-convolution of Smad2 versus Smad3 binding sites. BMC Genomics 2022; 23:525. [PMID: 35858839 PMCID: PMC9297549 DOI: 10.1186/s12864-022-08565-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/18/2022] [Accepted: 04/19/2022] [Indexed: 11/10/2022] Open
Abstract
Background The transforming growth factor beta-1 (TGF β-1) cytokine exerts both pro-tumor and anti-tumor effects in carcinogenesis. An increasing body of literature suggests that TGF β-1 signaling outcome is partially dependent on the regulatory targets of downstream receptor-regulated Smad (R-Smad) proteins Smad2 and Smad3. However, the lack of Smad-specific antibodies for ChIP-seq hinders convenient identification of Smad-specific binding sites. Results In this study, we use localization and affinity purification (LAP) tags to identify Smad-specific binding sites in a cancer cell line. Using ChIP-seq data obtained from LAP-tagged Smad proteins, we develop a convolutional neural network with long short-term memory (CNN-LSTM) as a deep learning approach to classify a pool of Smad-bound sites as being Smad2- or Smad3-bound. Our data showed that this approach is able to accurately classify Smad2- versus Smad3-bound sites. We use our model to dissect the role of each R-Smad in the progression of breast cancer using a previously published dataset. Conclusions Our results suggest that deep learning approaches can be used to dissect binding site specificity of closely related transcription factors. Supplementary Information The online version contains supplementary material available at 10.1186/s12864-022-08565-x.
Affiliation(s)
- Jeremy W K Ng
- Department of Biological Sciences, National University of Singapore, Singapore, Singapore
- Esther H Q Ong
- Department of Biological Sciences, National University of Singapore, Singapore, Singapore
- Lisa Tucker-Kellogg
- Cancer and Stem Cell Biology, and Centre for Computational Biology, Duke-NUS Medical School, Singapore, Singapore.
- Greg Tucker-Kellogg
- Department of Biological Sciences, National University of Singapore, Singapore, Singapore; Computational Biology Programme, Faculty of Science, National University of Singapore, Singapore, Singapore.
13
Cao L, Liu P, Chen J, Deng L. Prediction of Transcription Factor Binding Sites Using a Combined Deep Learning Approach. Front Oncol 2022; 12:893520. [PMID: 35719916 PMCID: PMC9204005 DOI: 10.3389/fonc.2022.893520] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/10/2022] [Accepted: 04/11/2022] [Indexed: 12/04/2022] Open
Abstract
In the process of regulating gene expression and evolution, such as DNA replication and mRNA transcription, the binding of transcription factors (TFs) to TF binding sites (TFBS) plays a vital role. Precisely modeling the specificity of genes and searching for TFBS are helpful to explore the mechanism of cell expression. In recent years, computational and deep learning methods searching for TFBS have become an active field of research. However, existing methods generally cannot meet high performance and interpretability simultaneously. Here, we develop an accurate and interpretable attention-based hybrid approach, DeepARC, that combines a convolutional neural network (CNN) and recurrent neural network (RNN) to predict TFBS. DeepARC employs a positional embedding method to extract the hidden embedding from DNA sequences, including the positional information from OneHot encoding and the distributed embedding from DNA2Vec. DeepARC feeds the positional embedding of the DNA sequence into a CNN-BiLSTM-Attention-based framework to complete the task of finding the motif. Taking advantage of the attention mechanism, DeepARC can gain greater access to valuable information about the motif and bring interpretability to the work of searching for motifs through the attention weight graph. Moreover, DeepARC achieves promising performances with an average area under the receiver operating characteristic curve (AUC) score of 0.908 on five cell lines (A549, GM12878, Hep-G2, H1-hESC, and Hela) in the benchmark dataset. We also compare the positional embedding with OneHot and DNA2Vec and gain a competitive advantage.
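Several entries in this list report performance as the area under the ROC curve (AUC). As a reminder of what that number measures, here is a minimal Python sketch that computes AUC as the probability that a randomly chosen positive example scores higher than a randomly chosen negative one (ties count one half). The labels and scores are toy values, and the function name is an assumption; none of this is drawn from any of the papers.

```python
def roc_auc(labels, scores):
    """AUC via pairwise comparison of positive and negative scores."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties contribute half a win
    return wins / (len(pos) * len(neg))


# Toy example: binding site (1) versus background (0) with made-up scores.
labels = [1, 0, 1, 0]
scores = [0.9, 0.8, 0.4, 0.3]
auc = roc_auc(labels, scores)
print(auc)
```

The pairwise loop is quadratic in the number of examples, which is fine for a sketch; a rank-based computation would be the usual choice for large test sets.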
Affiliation(s)
- Linan Cao
- School of Computer Science and Engineering, Central South University, Changsha, China
- Pei Liu
- School of Computer Science and Engineering, Central South University, Changsha, China
- Jialong Chen
- School of Computer Science and Engineering, Central South University, Changsha, China
- Lei Deng
- School of Computer Science and Engineering, Central South University, Changsha, China
14
Du ZH, Wu YH, Huang YA, Chen J, Pan GQ, Hu L, You ZH, Li JQ. GraphTGI: an attention-based graph embedding model for predicting TF-target gene interactions. Brief Bioinform 2022; 23:6576453. [PMID: 35511108 DOI: 10.1093/bib/bbac148] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 12/28/2021] [Revised: 03/25/2022] [Accepted: 03/31/2022] [Indexed: 12/26/2022] Open
Abstract
MOTIVATION Interactions between transcription factors (TFs) and their target genes form the knowledge foundation for biological research in transcriptional regulation; however, the number of known interactions is still limited by biological techniques. Existing computational methods relevant to the prediction of TF-target interactions are mostly proposed for predicting binding sites, rather than directly predicting the interactions. To this end, we propose here a graph attention-based autoencoder model to predict TF-target gene interactions using the information of the known TF-target gene interaction network combined with two gene features (sequential and chemical), considering that the unobserved interactions between transcription factors and target genes can be predicted by learning the pattern of the known ones. To the best of our knowledge, the proposed model is the first attempt to solve this problem by learning patterns from the known TF-target gene interaction network. RESULTS In this paper, we formulate the prediction task of TF-target gene interactions as a link prediction problem on a complex knowledge graph and propose a deep learning model called GraphTGI, which is composed of a graph attention-based encoder and a bilinear decoder. We evaluated the prediction performance of the proposed method on a real dataset, and the experimental results show that the proposed model yields outstanding performance with an average AUC value of 0.8864 +/- 0.0057 in the 5-fold cross-validation. It is anticipated that the GraphTGI model can effectively and efficiently predict TF-target gene interactions on a large scale. AVAILABILITY Python code and the datasets used in our studies are made available at https://github.com/YanghanWu/GraphTGI.
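A bilinear decoder for link prediction, as named in this abstract, scores a candidate TF-gene pair from their embeddings. The Python sketch below shows the generic form, sigmoid(z_tf^T W z_gene), with toy two-dimensional embeddings and an identity weight matrix. It is a sketch under stated assumptions rather than the GraphTGI code: in the actual model the embeddings would come from the graph attention encoder and W would be learned, and the function name is hypothetical.

```python
import math


def bilinear_score(z_tf, weight, z_gene):
    """Interaction probability sigmoid(z_tf^T W z_gene) for one TF-gene pair."""
    # W @ z_gene, computed row by row
    projected = [sum(w * g for w, g in zip(row, z_gene)) for row in weight]
    # z_tf . (W @ z_gene)
    logit = sum(t * p for t, p in zip(z_tf, projected))
    return 1.0 / (1.0 + math.exp(-logit))


# Toy embeddings and an identity weight matrix (illustrative values only).
identity = [[1.0, 0.0], [0.0, 1.0]]
aligned = bilinear_score([1.0, 0.0], identity, [1.0, 0.0])
orthogonal = bilinear_score([1.0, 0.0], identity, [0.0, 1.0])
print(aligned, orthogonal)
```

With an identity W the score reduces to a sigmoid of the dot product; a learned W lets the decoder weight interactions between different embedding dimensions.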
Affiliation(s)
- Zhi-Hua Du
- College of Computer Science and Software Engineering, ShenZhen University, 3688 Nanhai Avenue, Shenzhen, China
- Yang-Han Wu
- College of Computer Science and Software Engineering, ShenZhen University, 3688 Nanhai Avenue, Shenzhen, China
- Yu-An Huang
- College of Computer Science and Software Engineering, ShenZhen University, 3688 Nanhai Avenue, Shenzhen, China
- Jie Chen
- College of Computer Science and Software Engineering, ShenZhen University, 3688 Nanhai Avenue, Shenzhen, China
- Gui-Qing Pan
- College of Computer Science and Software Engineering, ShenZhen University, 3688 Nanhai Avenue, Shenzhen, China
- Lun Hu
- College of Computer Science and Software Engineering, ShenZhen University, 3688 Nanhai Avenue, Shenzhen, China
- Zhu-Hong You
- School of Computer Science, Northwestern Polytechnical University, Xi'an, China
- Jian-Qiang Li
- College of Computer Science and Software Engineering, ShenZhen University, 3688 Nanhai Avenue, Shenzhen, China
|
15
|
Kaplow IM, Banerjee A, Foo CS. Neural network modeling of differential binding between wild-type and mutant CTCF reveals putative binding preferences for zinc fingers 1-2. BMC Genomics 2022; 23:295. [PMID: 35410161 PMCID: PMC9004084 DOI: 10.1186/s12864-022-08486-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/23/2021] [Accepted: 03/21/2022] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Many transcription factors (TFs), such as multi zinc-finger (ZF) TFs, have multiple DNA binding domains (DBDs), and deciphering the DNA binding motifs of individual DBDs is a major challenge. One example of such a TF is CCCTC-binding factor (CTCF), a TF with eleven ZFs that plays a variety of roles in transcriptional regulation, most notably anchoring DNA loops. Previous studies found that CTCF ZFs 3-7 bind CTCF's core motif and ZFs 9-11 bind a specific upstream motif, but the motifs of ZFs 1-2 have yet to be identified. RESULTS We developed a new approach to identifying the binding motifs of individual DBDs of a TF through analyzing chromatin immunoprecipitation sequencing (ChIP-seq) experiments in which a single DBD is mutated: we train a deep convolutional neural network to predict whether wild-type TF binding sites are preserved in the mutant TF dataset and interpret the model. We applied this approach to mouse CTCF ChIP-seq data and identified the known binding preferences of CTCF ZFs 3-11 as well as a putative GAG binding motif for ZF 1. We analyzed other CTCF datasets to provide additional evidence that ZF 1 is associated with binding at the motif we identified, and we found that the presence of the motif for ZF 1 is associated with CTCF ChIP-seq peak strength. CONCLUSIONS Our approach can be applied to any TF for which in vivo binding data from both the wild-type and mutated versions of the TF are available, and our findings provide new potential insights into the binding preferences of CTCF's DBDs.
Affiliation(s)
- Irene M Kaplow
- Departments of Computer Science, Stanford University, 240 Pasteur Drive, Stanford, California, 94305, USA. Present address: Department of Computational Biology, Carnegie Mellon University, 5000 Forbes Avenue, Gates-Hillman Building Room 7703, Pittsburgh, PA, 15213, USA.
- Abhimanyu Banerjee
- Departments of Physics, Stanford University, 240 Pasteur Drive, Stanford, California, 94305, USA
- Chuan Sheng Foo
- Departments of Computer Science, Stanford University, 240 Pasteur Drive, Stanford, California, 94305, USA. Present address: Machine Intellection Department, Institute for Infocomm Research, 1 Fusionopolis Way, #21-01 Connexis South Tower, Singapore, 138632, Singapore.
|
16
|
Interpreting neural networks for biological sequences by learning stochastic masks. Nat Mach Intell 2022; 4:41-54. [DOI: 10.1038/s42256-021-00428-6] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Indexed: 12/22/2022]
|
17
|
Niu R, Peng J, Zhang Z, Shang X. R-CRISPR: A Deep Learning Network to Predict Off-Target Activities with Mismatch, Insertion and Deletion in CRISPR-Cas9 System. Genes (Basel) 2021; 12:1878. [PMID: 34946828 PMCID: PMC8702036 DOI: 10.3390/genes12121878] [Citation(s) in RCA: 13] [Impact Index Per Article: 3.3] [Received: 10/22/2021] [Revised: 11/19/2021] [Accepted: 11/22/2021] [Indexed: 12/26/2022] Open
Abstract
The Clustered Regularly Interspaced Short Palindromic Repeats (CRISPR)-associated protein 9 (Cas9) system is a groundbreaking gene-editing tool, which has been widely adopted in biomedical research. However, the guide RNAs in the CRISPR-Cas9 system may induce unwanted off-target activities and further affect the practical application of the technique. Most existing in silico prediction methods that focus on off-target activities have limited predictive precision and remain to be improved. Hence, it is necessary to propose a new in silico prediction method to address this problem. In this work, a deep learning framework named R-CRISPR is presented, which devises an encoding scheme to encode gRNA-target sequences into binary matrices, a convolutional neural network as feature extractor, and a recurrent neural network to predict off-target activities with mismatch, insertion, or deletion. It is demonstrated that R-CRISPR surpasses six mainstream prediction methods with a significant improvement on mismatch-only datasets verified by GUIDE-seq. Compared with the state-of-the-art prediction methods, R-CRISPR also achieves competitive performance on datasets with mismatch, insertion, and deletion. Furthermore, experiments show that data concatenation can influence the quality of training data, and we investigate the optimal combination of datasets.
Affiliation(s)
- Xuequn Shang
- School of Computer Science, Northwestern Polytechnical University, Xi’an 710072, China; (R.N.); (J.P.); (Z.Z.)
|
18
|
Zhang Y, Mitelut C, Silasi G, Bolanos F, Swindale N, Murphy T, Saxena S. Uncovering the effect of different brain regions on behavioral classification using recurrent neural networks. Annu Int Conf IEEE Eng Med Biol Soc 2021; 2021:6602-6607. [PMID: 34892622 DOI: 10.1109/embc46164.2021.9629776] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 06/14/2023]
Abstract
As our ability to record neural activity from a larger number of brain areas increases, we need to develop tools to understand how this activity is related to ongoing behavior. Recurrent neural networks (RNNs) have been shown to perform successful classification for sequence data. However, they are black box models: once trained, it is difficult to uncover the mechanisms that they are using to classify. In this study, we analyze the effect of RNNs on classifying behavior using a simulated dataset and a widefield neural activity dataset as mice perform a self-initiated behavior. We show that RNNs are comparable to, or outperform, traditional classification methods such as Support Vector Machine (SVM), and can also lead to accurate prediction of behavior. Using dimensionality reduction, we visualize the activity of the RNNs to better understand the classification mechanisms of the RNNs. Finally, we are able to accurately pinpoint the effect of different regions on behavioral classification. This study highlights the utility and interpretability of RNNs while classifying behavior using neural activity from different regions.
|
19
|
Zhao Z, Woloszynek S, Agbavor F, Mell JC, Sokhansanj BA, Rosen GL. Learning, visualizing and exploring 16S rRNA structure using an attention-based deep neural network. PLoS Comput Biol 2021; 17:e1009345. [PMID: 34550967 PMCID: PMC8496832 DOI: 10.1371/journal.pcbi.1009345] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Received: 09/14/2020] [Revised: 10/07/2021] [Accepted: 08/12/2021] [Indexed: 01/04/2023] Open
Abstract
Recurrent neural networks with memory and attention mechanisms are widely used in natural language processing because they can capture short and long term sequential information for diverse tasks. We propose an integrated deep learning model for microbial DNA sequence data, which exploits convolutional neural networks, recurrent neural networks, and attention mechanisms to predict taxonomic classifications and sample-associated attributes, such as the relationship between the microbiome and host phenotype, on the read/sequence level. In this paper, we develop this novel deep learning approach and evaluate its application to amplicon sequences. We apply our approach to short DNA reads and full sequences of 16S ribosomal RNA (rRNA) marker genes, which identify the heterogeneity of a microbial community sample. We demonstrate that our implementation of a novel attention-based deep network architecture, Read2Pheno, achieves read-level phenotypic prediction. Training Read2Pheno models will encode sequences (reads) into dense, meaningful representations: learned embedded vectors output from the intermediate layer of the network model, which can provide biological insight when visualized. The attention layer of Read2Pheno models can also automatically identify nucleotide regions in reads/sequences which are particularly informative for classification. As such, this novel approach can avoid pre/post-processing and manual interpretation required with conventional approaches to microbiome sequence classification. We further show, as proof-of-concept, that aggregating read-level information can robustly predict microbial community properties, host phenotype, and taxonomic classification, with performance at least comparable to conventional approaches. An implementation of the attention-based deep learning network is available at https://github.com/EESI/sequence_attention (a python package) and https://github.com/EESI/seq2att (a command line tool).
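The idea that an attention layer can flag informative positions in a read can be sketched with a generic attention-pooling step. This is an illustrative assumption, not the Read2Pheno implementation: the function name `attention_pool`, the dot-product scoring, and the random inputs are all hypothetical.

```python
import numpy as np

def attention_pool(hidden, query):
    """Weight each sequence position by softmax(hidden @ query).

    hidden: (positions, features) per-position representations.
    query:  (features,) scoring vector (learned in a real model).
    """
    scores = hidden @ query
    scores = scores - scores.max()                  # numerical stability
    weights = np.exp(scores) / np.exp(scores).sum() # softmax over positions
    summary = weights @ hidden                      # weighted read embedding
    return weights, summary

# Hypothetical inputs standing in for the output of earlier network layers.
rng = np.random.default_rng(0)
hidden = rng.normal(size=(10, 8))   # 10 positions, 8 features each
query = rng.normal(size=8)

weights, summary = attention_pool(hidden, query)
top_position = int(np.argmax(weights))  # position with the largest weight
```

Positions with large weights are the ones such a layer treats as most informative, and `summary` plays the role of a dense read-level representation.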
Affiliation(s)
- Zhengqiao Zhao
- Ecological and Evolutionary Signal-Processing and Informatics Laboratory, Department of Electrical and Computer Engineering, College of Engineering, Drexel University, Philadelphia, Pennsylvania, United States of America
- Stephen Woloszynek
- Beth Israel Deaconess Medical Center, Boston, Massachusetts, United States of America
- Harvard Medical School, Boston, Massachusetts, United States of America
- Felix Agbavor
- School of Biomedical Engineering, Science and Health Systems, Drexel University, Philadelphia, Pennsylvania, United States of America
- Joshua Chang Mell
- College of Medicine, Drexel University, Philadelphia, Pennsylvania, United States of America
- Bahrad A. Sokhansanj
- Ecological and Evolutionary Signal-Processing and Informatics Laboratory, Department of Electrical and Computer Engineering, College of Engineering, Drexel University, Philadelphia, Pennsylvania, United States of America
- Gail L. Rosen
- Ecological and Evolutionary Signal-Processing and Informatics Laboratory, Department of Electrical and Computer Engineering, College of Engineering, Drexel University, Philadelphia, Pennsylvania, United States of America
|
20
|
Misra J, Surampudi SG, Venkatesh M, Limbachia C, Jaja J, Pessoa L. Learning brain dynamics for decoding and predicting individual differences. PLoS Comput Biol 2021; 17:e1008943. [PMID: 34478442 PMCID: PMC8445454 DOI: 10.1371/journal.pcbi.1008943] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 04/08/2021] [Revised: 09/16/2021] [Accepted: 08/19/2021] [Indexed: 12/04/2022] Open
Abstract
Insights from functional Magnetic Resonance Imaging (fMRI), as well as recordings of large numbers of neurons, reveal that many cognitive, emotional, and motor functions depend on the multivariate interactions of brain signals. To decode brain dynamics, we propose an architecture based on recurrent neural networks to uncover distributed spatiotemporal signatures. We demonstrate the potential of the approach using human fMRI data acquired during movie watching and during a continuous experimental paradigm. The model was able to learn spatiotemporal patterns that supported 15-way movie-clip classification (∼90%) at the level of brain regions, and binary classification of experimental conditions (∼60%) at the level of voxels. The model was also able to learn individual differences in measures of fluid intelligence and verbal IQ at levels comparable to that of existing techniques. We propose a dimensionality reduction approach that uncovers low-dimensional trajectories and captures essential informational (i.e., classification related) properties of brain dynamics. Finally, saliency maps and lesion analysis were employed to characterize brain-region/voxel importance, and uncovered how dynamic but consistent changes in fMRI activation influenced decoding performance. When applied at the level of voxels, our framework implements a dynamic version of multivariate pattern analysis. Our approach provides a framework for visualizing, analyzing, and discovering dynamic spatially distributed brain representations during naturalistic conditions.
Affiliation(s)
- Joyneel Misra
- Department of Electrical and Computer Engineering, University of Maryland, College Park, Maryland, United States of America
- Srinivas Govinda Surampudi
- Department of Electrical and Computer Engineering, University of Maryland, College Park, Maryland, United States of America
- Manasij Venkatesh
- Department of Electrical and Computer Engineering, University of Maryland, College Park, Maryland, United States of America
- Chirag Limbachia
- Department of Psychology and Maryland Neuroimaging Center, University of Maryland, College Park, Maryland, United States of America
- Joseph Jaja
- Department of Electrical and Computer Engineering, University of Maryland, College Park, Maryland, United States of America
- Luiz Pessoa
- Department of Electrical and Computer Engineering, University of Maryland, College Park, Maryland, United States of America
- Department of Psychology and Maryland Neuroimaging Center, University of Maryland, College Park, Maryland, United States of America
|
21
|
Zeng X, Park SJ, Nakai K. Characterizing Promoter and Enhancer Sequences by a Deep Learning Method. Front Genet 2021; 12:681259. [PMID: 34211503 PMCID: PMC8239401 DOI: 10.3389/fgene.2021.681259] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 03/16/2021] [Accepted: 05/20/2021] [Indexed: 11/21/2022] Open
Abstract
Promoters and enhancers are well-known regulatory elements modulating gene expression. As confirmed by high-throughput sequencing technologies, these regulatory elements are bidirectionally transcribed. That is, promoters produce stable mRNA in the sense direction and unstable RNA in the antisense direction, while enhancers transcribe unstable RNA in both directions. Although it is thought that enhancers and promoters share a similar architecture of transcription start sites (TSSs), how the transcriptional machinery distinctly uses these genomic regions as promoters or enhancers remains unclear. To address this issue, we developed a deep learning (DL) method by utilizing a convolutional neural network (CNN) and the saliency algorithm. In comparison with other classifiers, our CNN presented higher predictive performance, suggesting the overarching importance of the high-order sequence features, captured by the CNN. Moreover, our method revealed that there are substantial sequence differences between the enhancers and promoters. Remarkably, the 20-120 bp downstream regions from the center of bidirectional TSSs seemed to contribute to the RNA stability. These regions in promoters tend to have a larger number of guanines and cytosines compared to those in enhancers, and this feature contributed to the classification of the regulatory elements. Our CNN-based method can capture the complex TSS architectures. We found that the genomic regions around TSSs for promoters and enhancers contribute to RNA stability and show GC-biased characteristics as a critical determinant for promoter TSSs.
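The GC-content feature described for the 20-120 bp downstream region can be computed with a short sketch. The function name `gc_fraction`, the 0-based `center` index, and the half-open window [center + 20, center + 120) are assumptions made for illustration; the paper's exact coordinate convention is not stated in this abstract.

```python
def gc_fraction(seq, center, start=20, end=120):
    """Fraction of G/C bases in the window [center+start, center+end)."""
    window = seq[center + start:center + end].upper()
    if not window:
        raise ValueError("window lies outside the sequence")
    return sum(base in "GC" for base in window) / len(window)

# Hypothetical toy sequences: one GC-rich and one AT-only downstream region.
gc_rich = "A" * 20 + "GC" * 50 + "A" * 10
at_only = "A" * 130

rich_value = gc_fraction(gc_rich, center=0)
poor_value = gc_fraction(at_only, center=0)
```

Comparing this value between promoter and enhancer regions is one simple way to inspect the GC bias that the abstract reports.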
Affiliation(s)
- Xin Zeng
- Department of Computational Biology and Medical Science, The University of Tokyo, Kashiwa, Japan
- Sung-Joon Park
- Human Genome Center, The Institute of Medical Science, The University of Tokyo, Tokyo, Japan
- Kenta Nakai
- Department of Computational Biology and Medical Science, The University of Tokyo, Kashiwa, Japan
- Human Genome Center, The Institute of Medical Science, The University of Tokyo, Tokyo, Japan
|
22
|
Neural Network Methodology for the Identification and Classification of Lipopeptides Based on SMILES Annotation. Computers 2021. [DOI: 10.3390/computers10060074] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 11/16/2022]
Abstract
Artificial Neural Networks (ANNs) can be applied to the identification and classification of prospective drug candidates such as complex compounds, including lipopeptides, based on their SMILES string representation. The neural networks are trained on SMILES strings, which are predictive of structural identification; the ANNs are capable of correctly classifying all compounds, substructures and their analogues, distinguishing the drugs based upon atomic organization to support lead optimization in drug discovery. The proficiency of the trained ANN models in recognizing and classifying analogous compounds was tested on similar compounds that had not been used for training, and the models achieved correct classification in the validation set. The best result was achieved with 10 hidden layers. The R2 value for training is 0.90586; the R2 value for testing is 0.99508; the R2 value after validation is 0.94151; the final value of R2 for total sets is 0.89456. Mean square error (MSE) is plotted against 21 epochs to report the performance of the model. A gradient value of 798.1735 was obtained after 21 iterations and 6 validation checks. A successful model was developed for the identification and classification of lipopeptides from their SMILES annotation that efficiently classifies similar compounds and supports decision making for analogue-based drug discovery. This will help in appropriate lead optimization studies for the prediction of potential anticancer and antimicrobial lipopeptide-based therapeutics.
|
23
|
Zrimec J, Buric F, Kokina M, Garcia V, Zelezniak A. Learning the Regulatory Code of Gene Expression. Front Mol Biosci 2021; 8:673363. [PMID: 34179082 PMCID: PMC8223075 DOI: 10.3389/fmolb.2021.673363] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Received: 02/27/2021] [Accepted: 05/24/2021] [Indexed: 11/13/2022] Open
Abstract
Data-driven machine learning is the method of choice for predicting molecular phenotypes from nucleotide sequence, modeling gene expression events including protein-DNA binding, chromatin states as well as mRNA and protein levels. Deep neural networks automatically learn informative sequence representations and interpreting them enables us to improve our understanding of the regulatory code governing gene expression. Here, we review the latest developments that apply shallow or deep learning to quantify molecular phenotypes and decode the cis-regulatory grammar from prokaryotic and eukaryotic sequencing data. Our approach is to build from the ground up, first focusing on the initiating protein-DNA interactions, then specific coding and non-coding regions, and finally on advances that combine multiple parts of the gene and mRNA regulatory structures, achieving unprecedented performance. We thus provide a quantitative view of gene expression regulation from nucleotide sequence, concluding with an information-centric overview of the central dogma of molecular biology.
Affiliation(s)
- Jan Zrimec
- Department of Biology and Biological Engineering, Chalmers University of Technology, Gothenburg, Sweden
- Filip Buric
- Department of Biology and Biological Engineering, Chalmers University of Technology, Gothenburg, Sweden
- Mariia Kokina
- Department of Biology and Biological Engineering, Chalmers University of Technology, Gothenburg, Sweden
- Novo Nordisk Foundation Center for Biosustainability, Technical University of Denmark, Kongens Lyngby, Denmark
- Victor Garcia
- School of Life Sciences and Facility Management, Zurich University of Applied Sciences, Wädenswil, Switzerland
- Aleksej Zelezniak
- Department of Biology and Biological Engineering, Chalmers University of Technology, Gothenburg, Sweden
- Science for Life Laboratory, Stockholm, Sweden
|
24
|
Abbasi K, Razzaghi P, Poso A, Ghanbari-Ara S, Masoudi-Nejad A. Deep Learning in Drug Target Interaction Prediction: Current and Future Perspectives. Curr Med Chem 2021; 28:2100-2113. [PMID: 32895036 DOI: 10.2174/0929867327666200907141016] [Citation(s) in RCA: 50] [Impact Index Per Article: 12.5] [Received: 05/14/2020] [Revised: 07/30/2020] [Accepted: 07/30/2020] [Indexed: 11/22/2022]
Abstract
Drug-target Interactions (DTIs) prediction plays a central role in drug discovery. Computational methods in DTIs prediction have gained more attention because carrying out in vitro and in vivo experiments on a large scale is costly and time-consuming. Machine learning methods, especially deep learning, are widely applied to DTIs prediction. In this study, the main goal is to provide a comprehensive overview of deep learning-based DTIs prediction approaches. Here, we investigate the existing approaches from multiple perspectives. We explore these approaches to find out which deep network architectures are utilized to extract features from drug compound and protein sequences. Also, the advantages and limitations of each architecture are analyzed and compared. Moreover, we explore the process of how to combine descriptors for drug and protein features. Likewise, a list of datasets that are commonly used in DTIs prediction is investigated. Finally, current challenges are discussed and a short future outlook of deep learning in DTI prediction is given.
Affiliation(s)
- Karim Abbasi
- Laboratory of Systems Biology and Bioinformatics (LBB), Institute of Biochemistry and Biophysics, University of Tehran, Tehran 1417614411, Iran
- Parvin Razzaghi
- Department of Computer Science and Information Technology, Institute for Advanced Studies in Basic Sciences (IASBS), Zanjan, Iran
- Antti Poso
- School of Pharmacy, Faculty of Health Sciences, University of Eastern Finland, Kuopio 80100, Finland
- Saber Ghanbari-Ara
- Laboratory of Systems Biology and Bioinformatics (LBB), Institute of Biochemistry and Biophysics, University of Tehran, Tehran 1417614411, Iran
- Ali Masoudi-Nejad
- Laboratory of Systems Biology and Bioinformatics (LBB), Institute of Biochemistry and Biophysics, University of Tehran, Tehran 1417614411, Iran
|
25
|
Talukder A, Barham C, Li X, Hu H. Interpretation of deep learning in genomics and epigenomics. Brief Bioinform 2021; 22:bbaa177. [PMID: 34020542 PMCID: PMC8138893 DOI: 10.1093/bib/bbaa177] [Citation(s) in RCA: 49] [Impact Index Per Article: 12.3] [Received: 05/25/2020] [Revised: 06/26/2020] [Accepted: 07/10/2020] [Indexed: 12/17/2022] Open
Abstract
Machine learning methods have been widely applied to big data analysis in genomics and epigenomics research. Although accuracy and efficiency are common goals in many modeling tasks, model interpretability is especially important in these studies for understanding the underlying molecular and cellular mechanisms. Deep neural networks (DNNs) have recently gained popularity in various types of genomic and epigenomic studies due to their capabilities in utilizing large-scale high-throughput bioinformatics data and achieving high accuracy in predictions and classifications. However, DNNs are often limited in their ability to explain their predictions due to their black-box nature. In this review, we present current developments in the model interpretation of DNNs, focusing on their applications in genomics and epigenomics. We first describe state-of-the-art DNN interpretation methods in representative machine learning fields. We then summarize the DNN interpretation methods in recent studies on genomics and epigenomics, focusing on current data- and computing-intensive topics such as sequence motif identification, genetic variations, gene expression, chromatin interactions and non-coding RNAs. We also present the biological discoveries that resulted from these interpretation methods. We finally discuss the advantages and limitations of current interpretation approaches in the context of genomic and epigenomic studies. Contact: xiaoman@mail.ucf.edu, haihu@cs.ucf.edu.
Affiliation(s)
- Amlan Talukder
- Computer Science, University of Central Florida, Orlando, FL 32816, USA
- Clayton Barham
- Computer Science, University of Central Florida, Orlando, FL 32816, USA
- Xiaoman Li
- Burnett School of Biomedical Science, University of Central Florida, Orlando, FL 32816, USA
- Haiyan Hu
- Computer Science, University of Central Florida, Orlando, FL 32816, USA
|
26
|
Bukhari SAS, Razzaq A, Jabeen J, Khan S, Khan Z. Deep-BSC: Predicting Raw DNA Binding Pattern in Arabidopsis Thaliana. Curr Bioinform 2021. [DOI: 10.2174/1574893615999200707142852] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Indexed: 11/22/2022]
Abstract
BACKGROUND With the rapid development of sequencing methods in recent years, binding sites have been systematically identified in such projects as Nested-MICA and MEME. Predicting DNA motifs with higher accuracy and precision has been a very important task for bioinformaticians. Nevertheless, experimental approaches are still time-consuming for large data sets, making computational identification of binding sites indispensable. OBJECTIVE To facilitate the identification of binding sites, we propose a deep learning architecture, named Deep-BSC (Deep-Learning Binary Search Classification), to predict binding sites in a raw DNA sequence with greater precision and accuracy. METHODS Our proposed architecture relies purely on the raw DNA sequence to predict protein binding sites by using a convolutional neural network (CNN). We trained our deep learning model on binding sites at the nucleotide level. The DNA sequence of A. thaliana is used in this study because it is a model plant. RESULTS The results demonstrate the effectiveness and efficiency of our method in classifying binding sites against random sequences using deep learning. We construct CNNs with different layers and filters to show the usefulness of the max-pooling technique in the proposed method. To make our approach interpretable, we further visualized binding sites in a saliency map and successfully identified similar motifs in the raw sequence. The proposed computational framework is time and resource efficient. CONCLUSION Deep-BSC enables the identification of binding sites in DNA sequences via a highly accurate CNN. The proposed computational framework can also be applied to problems such as operators, repeats in the genome, DNA markers, and recognition sites for enzymes, thereby promoting the use of the Deep-BSC method in life sciences.
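A CNN that works on raw DNA needs the sequence turned into numbers first. The sketch below shows one-hot encoding, a common choice for this step; it is an assumption that Deep-BSC uses this exact encoding, and the function name `one_hot_dna` and the A, C, G, T column order are hypothetical.

```python
import numpy as np

BASES = "ACGT"  # assumed column order; any fixed order works

def one_hot_dna(seq):
    """Encode a DNA string as a (length, 4) matrix of 0/1 values."""
    index = {base: i for i, base in enumerate(BASES)}
    matrix = np.zeros((len(seq), len(BASES)), dtype=np.float32)
    for pos, base in enumerate(seq.upper()):
        col = index.get(base)
        if col is not None:          # ambiguous bases such as N stay all-zero
            matrix[pos, col] = 1.0
    return matrix

encoded = one_hot_dna("ACGTN")
```

Each row marks which nucleotide occupies that position, so convolutional filters sliding over the rows can act as motif detectors.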
Affiliation(s)
- Syed Adnan Shah Bukhari
- Department of Computer Science, Faculty of Social Science and Humanities, Muhammad Nawaz Sharif University of Agriculture, Multan, Pakistan
- Abdul Razzaq
- Department of Computer Science, Faculty of Social Science and Humanities, Muhammad Nawaz Sharif University of Agriculture, Multan, Pakistan
- Javeria Jabeen
- Department of Computer Science, Faculty of Social Science and Humanities, Muhammad Nawaz Sharif University of Agriculture, Multan, Pakistan
- Shaheer Khan
- Department of Computer Science, Faculty of Social Science and Humanities, Muhammad Nawaz Sharif University of Agriculture, Multan, Pakistan
- Zulqurnain Khan
- Department of Biotechnology, Institute of Plant Breeding and Biotechnology, Muhammad Nawaz Sharif University of Agriculture, Multan, Pakistan
|
27
|
Xiong Y, He X, Zhao D, Tian T, Hong L, Jiang T, Zeng J. Modeling multi-species RNA modification through multi-task curriculum learning. Nucleic Acids Res 2021; 49:3719-3734. [PMID: 33744973 PMCID: PMC8053129 DOI: 10.1093/nar/gkab124] [Citation(s) in RCA: 26] [Impact Index Per Article: 6.5] [Received: 01/03/2021] [Accepted: 02/12/2021] [Indexed: 01/01/2023] Open
Abstract
N6-methyladenosine (m6A) is the most pervasive modification in eukaryotic mRNAs. Numerous biological processes are regulated by this critical post-transcriptional mark, such as gene expression, RNA stability, RNA structure and translation. Recently, various experimental techniques and computational methods have been developed to characterize the transcriptome-wide landscapes of m6A modification for understanding its underlying mechanisms and functions in mRNA regulation. However, the experimental techniques are generally costly and time-consuming, while the existing computational models are usually designed only for m6A site prediction in a single species and have significant limitations in accuracy, interpretability and generalizability. Here, we propose a highly interpretable computational framework, called MASS, based on a multi-task curriculum learning strategy to capture m6A features across multiple species simultaneously. Extensive computational experiments demonstrate the superior performance of MASS when compared to the state-of-the-art prediction methods. Furthermore, the contextual sequence features of m6A captured by MASS can be explained by the known critical binding motifs of the related RNA-binding proteins, which also help elucidate the similarities and differences among m6A features across species. In addition, based on the predicted m6A profiles, we further delineate the relationships between m6A and various properties of gene regulation, including gene expression, RNA stability, translation, RNA structure and histone modification. In summary, MASS may serve as a useful tool for characterizing m6A modification and studying its regulatory code. The source code of MASS can be downloaded from https://github.com/mlcb-thu/MASS.
Affiliation(s)
- Yuanpeng Xiong
- Bioinformatics Division, BNRIST/Department of Computer Science and Technology, Tsinghua University, Beijing 100084, China
- Xuan He
- Institute for Interdisciplinary Information Sciences, Tsinghua University, Beijing 100084, China
- Dan Zhao
- Institute for Interdisciplinary Information Sciences, Tsinghua University, Beijing 100084, China
- Tingzhong Tian
- Institute for Interdisciplinary Information Sciences, Tsinghua University, Beijing 100084, China
- Lixiang Hong
- Institute for Interdisciplinary Information Sciences, Tsinghua University, Beijing 100084, China
- Tao Jiang
- Department of Computer Science and Engineering, University of California, Riverside, CA 92521, USA
- Bioinformatics Division, BNRIST/Department of Computer Science and Technology, Tsinghua University, Beijing 100084, China
- Jianyang Zeng
- Institute for Interdisciplinary Information Sciences, Tsinghua University, Beijing 100084, China
|
28
|
Bartoszewicz JM, Seidel A, Renard BY. Interpretable detection of novel human viruses from genome sequencing data. NAR Genom Bioinform 2021; 3:lqab004. [PMID: 33554119 PMCID: PMC7849996 DOI: 10.1093/nargab/lqab004] [Citation(s) in RCA: 17] [Impact Index Per Article: 4.3] [Received: 10/12/2020] [Revised: 01/04/2021] [Accepted: 01/15/2021] [Indexed: 01/21/2023] Open
Abstract
Viruses evolve extremely quickly, so reliable methods for viral host prediction are necessary to safeguard biosecurity and biosafety alike. Novel human-infecting viruses are difficult to detect with standard bioinformatics workflows. Here, we predict whether a virus can infect humans directly from next-generation sequencing reads. We show that deep neural architectures significantly outperform both shallow machine learning and standard, homology-based algorithms, cutting the error rates in half and generalizing to taxonomic units distant from those presented during training. Further, we develop a suite of interpretability tools and show that it can be applied also to other models beyond the host prediction task. We propose a new approach for convolutional filter visualization to disentangle the information content of each nucleotide from its contribution to the final classification decision. Nucleotide-resolution maps of the learned associations between pathogen genomes and the infectious phenotype can be used to detect regions of interest in novel agents, for example, the SARS-CoV-2 coronavirus, unknown before it caused a COVID-19 pandemic in 2020. All methods presented here are implemented as easy-to-install packages not only enabling analysis of NGS datasets without requiring any deep learning skills, but also allowing advanced users to easily train and explain new models for genomics.
Affiliation(s)
- Jakub M Bartoszewicz
  - Bioinformatics (MF1), Department of Methodology and Research Infrastructure, Robert Koch Institute, 13353 Berlin, Germany
  - Department of Mathematics and Computer Science, Free University of Berlin, 14195 Berlin, Germany
  - Data Analytics and Computational Statistics, Hasso Plattner Institute for Digital Engineering, 14482 Potsdam, Brandenburg, Germany
  - Digital Engineering Faculty, University of Potsdam, 14482 Potsdam, Brandenburg, Germany
- Anja Seidel
  - Bioinformatics (MF1), Department of Methodology and Research Infrastructure, Robert Koch Institute, 13353 Berlin, Germany
  - Department of Mathematics and Computer Science, Free University of Berlin, 14195 Berlin, Germany
- Bernhard Y Renard
  - Bioinformatics (MF1), Department of Methodology and Research Infrastructure, Robert Koch Institute, 13353 Berlin, Germany
  - Data Analytics and Computational Statistics, Hasso Plattner Institute for Digital Engineering, 14482 Potsdam, Brandenburg, Germany
  - Digital Engineering Faculty, University of Potsdam, 14482 Potsdam, Brandenburg, Germany
|
29
|
Avsec Ž, Weilert M, Shrikumar A, Krueger S, Alexandari A, Dalal K, Fropf R, McAnany C, Gagneur J, Kundaje A, Zeitlinger J. Base-resolution models of transcription-factor binding reveal soft motif syntax. Nat Genet 2021; 53:354-366. [PMID: 33603233 PMCID: PMC8812996 DOI: 10.1038/s41588-021-00782-6] [Citation(s) in RCA: 317] [Impact Index Per Article: 79.3] [Received: 07/19/2020] [Accepted: 01/07/2021] [Indexed: 01/30/2023]
Abstract
The arrangement (syntax) of transcription factor (TF) binding motifs is an important part of the cis-regulatory code, yet remains elusive. We introduce a deep learning model, BPNet, that uses DNA sequence to predict base-resolution chromatin immunoprecipitation (ChIP)-nexus binding profiles of pluripotency TFs. We develop interpretation tools to learn predictive motif representations and identify soft syntax rules for cooperative TF binding interactions. Strikingly, Nanog preferentially binds with helical periodicity, and TFs often cooperate in a directional manner, which we validate using clustered regularly interspaced short palindromic repeat (CRISPR)-induced point mutations. Our model represents a powerful general approach to uncover the motifs and syntax of cis-regulatory sequences in genomics data.
Affiliation(s)
- Žiga Avsec
  - Department of Informatics, Technical University of Munich, Garching, Germany
  - Graduate School of Quantitative Biosciences (QBM), Ludwig-Maximilians-Universität München, Munich, Germany
  - Currently at DeepMind, London, UK
- Melanie Weilert
  - Stowers Institute for Medical Research, Kansas City, MO, USA
- Avanti Shrikumar
  - Department of Computer Science, Stanford University, Stanford, CA, USA
- Sabrina Krueger
  - Stowers Institute for Medical Research, Kansas City, MO, USA
- Amr Alexandari
  - Department of Computer Science, Stanford University, Stanford, CA, USA
- Khyati Dalal
  - Stowers Institute for Medical Research, Kansas City, MO, USA
  - The University of Kansas Medical Center, Kansas City, KS, USA
- Robin Fropf
  - Stowers Institute for Medical Research, Kansas City, MO, USA
- Charles McAnany
  - Stowers Institute for Medical Research, Kansas City, MO, USA
- Julien Gagneur
  - Department of Informatics, Technical University of Munich, Garching, Germany
- Anshul Kundaje
  - Department of Computer Science, Stanford University, Stanford, CA, USA
  - Department of Genetics, Stanford University, Stanford, CA, USA
  - Corresponding author
- Julia Zeitlinger
  - Stowers Institute for Medical Research, Kansas City, MO, USA
  - The University of Kansas Medical Center, Kansas City, KS, USA
  - Corresponding author
|
30
|
Gu T, Zhao X, Barbazuk WB, Lee JH. miTAR: a hybrid deep learning-based approach for predicting miRNA targets. BMC Bioinformatics 2021; 22:96. [PMID: 33639834 PMCID: PMC7912887 DOI: 10.1186/s12859-021-04026-6] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Received: 10/17/2020] [Accepted: 02/14/2021] [Indexed: 02/08/2023] Open
Abstract
BACKGROUND microRNAs (miRNAs) have been shown to play essential roles in a wide range of biological processes. Many computational methods have been developed to identify targets of miRNAs. However, the majority of these methods depend on pre-defined features that require considerable efforts and resources to compute and often prove suboptimal at predicting miRNA targets. RESULTS We developed a novel hybrid deep learning-based (DL-based) approach that is capable of predicting miRNA targets at a higher accuracy. This approach integrates convolutional neural networks (CNNs) that excel in learning spatial features and recurrent neural networks (RNNs) that discern sequential features. Therefore, our approach has the advantages of learning both the intrinsic spatial and sequential features of miRNA:target. The inputs for our approach are raw sequences of miRNAs and genes that can be obtained effortlessly. We applied our approach to two human datasets from recent miRNA target prediction studies and trained two models. We demonstrated that the two models consistently outperform the previous methods according to evaluation metrics on test datasets. Comparing our approach with currently available alternatives on independent datasets shows that our approach delivers substantial improvements in performance. We also show with multiple lines of evidence that our approach is more robust than other methods on small datasets. Our study is the first to perform comparisons across multiple existing DL-based approaches on miRNA target prediction. Furthermore, we examined the contribution of a max pooling layer between the CNN and RNN and demonstrated that it improves the performance of all our models. Finally, a unified model was developed that is robust in fitting different input datasets. CONCLUSIONS We present a new DL-based approach for predicting miRNA targets and demonstrate that our approach outperforms the current alternatives. We supplied an easy-to-use tool, miTAR, at https://github.com/tjgu/miTAR. Furthermore, our analysis results support that max pooling generally benefits the hybrid models and potentially prevents overfitting for hybrid models.
Affiliation(s)
- Tongjun Gu
  - Bioinformatics, Interdisciplinary Center for Biotechnology Research, University of Florida, Gainesville, FL, USA
  - Division of Quantitative Sciences, University of Florida Health Cancer Center, University of Florida, Gainesville, FL, USA
- Xiwu Zhao
  - Department of Ophthalmology and Visual Sciences, University of Michigan, Ann Arbor, MI, USA
- William Bradley Barbazuk
  - Bioinformatics, Interdisciplinary Center for Biotechnology Research, University of Florida, Gainesville, FL, USA
  - Department of Biology, University of Florida, Gainesville, FL, USA
  - Genetics Institute, University of Florida, Gainesville, FL, USA
- Ji-Hyun Lee
  - Division of Quantitative Sciences, University of Florida Health Cancer Center, University of Florida, Gainesville, FL, USA
  - Department of Biostatistics, University of Florida, Gainesville, FL, USA
|
31
|
Zheng A, Lamkin M, Zhao H, Wu C, Su H, Gymrek M. Deep neural networks identify sequence context features predictive of transcription factor binding. NAT MACH INTELL 2021; 3:172-180. [PMID: 33796819 PMCID: PMC8009085 DOI: 10.1038/s42256-020-00282-y] [Citation(s) in RCA: 44] [Impact Index Per Article: 11.0] [Received: 12/17/2019] [Accepted: 12/10/2020] [Indexed: 12/11/2022]
Abstract
Transcription factors (TFs) bind DNA by recognizing specific sequence motifs, typically of length 6-12bp. A motif can occur many thousands of times in the human genome, but only a subset of those sites are actually bound. Here we present a machine learning framework leveraging existing convolutional neural network architectures and model interpretation techniques to identify and interpret sequence context features most important for predicting whether a particular motif instance will be bound. We apply our framework to predict binding at motifs for 38 TFs in a lymphoblastoid cell line, score the importance of context sequences at base-pair resolution, and characterize context features most predictive of binding. We find that the choice of training data heavily influences classification accuracy and the relative importance of features such as open chromatin. Overall, our framework enables novel insights into features predictive of TF binding and is likely to inform future deep learning applications to interpret non-coding genetic variants.
Affiliation(s)
- An Zheng
  - Department of Computer Science and Engineering, University of California San Diego, La Jolla, CA, USA
- Michael Lamkin
  - Department of Bioengineering, University of California San Diego, La Jolla, CA, USA
- Hanqing Zhao
  - Department of Biology, University of California San Diego, La Jolla, CA, USA
- Cynthia Wu
  - Bioinformatics and Systems Biology Program, University of California San Diego, La Jolla, CA, USA
- Hao Su
  - Department of Computer Science and Engineering, University of California San Diego, La Jolla, CA, USA
- Melissa Gymrek
  - Department of Computer Science and Engineering, University of California San Diego, La Jolla, CA, USA
  - Department of Medicine, University of California San Diego, La Jolla, CA, USA
|
32
|
So W, Bogucka EP, Scepanovic S, Joglekar S, Zhou K, Quercia D. Humane Visual AI: Telling the Stories Behind a Medical Condition. IEEE TRANSACTIONS ON VISUALIZATION AND COMPUTER GRAPHICS 2021; 27:678-688. [PMID: 33048711 DOI: 10.1109/tvcg.2020.3030391] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 06/11/2023]
Abstract
A biological understanding is key for managing medical conditions, yet psychological and social aspects matter too. The main problem is that these two aspects are hard to quantify and inherently difficult to communicate. To quantify psychological aspects, this work mined around half a million Reddit posts in the sub-communities specialised in 14 medical conditions, and it did so with a new deep-learning framework. In so doing, it was able to associate mentions of medical conditions with those of emotions. To then quantify social aspects, this work designed a probabilistic approach that mines open prescription data from the National Health Service in England to compute the prevalence of drug prescriptions, and to relate such a prevalence to census data. To finally visually communicate each medical condition's biological, psychological, and social aspects through storytelling, we designed a narrative-style layered Martini Glass visualization. In a user study involving 52 participants, after interacting with our visualization, a considerable number of them changed their mind on previously held opinions: 10% gave more importance to the psychological aspects of medical conditions, and 27% were more favourable to the use of social media data in healthcare, suggesting the importance of persuasive elements in interactive visualizations.
|
33
|
Wahab A, Tayara H, Xuan Z, Chong KT. DNA sequences performs as natural language processing by exploiting deep learning algorithm for the identification of N4-methylcytosine. Sci Rep 2021; 11:212. [PMID: 33420191 PMCID: PMC7794489 DOI: 10.1038/s41598-020-80430-x] [Citation(s) in RCA: 15] [Impact Index Per Article: 3.8] [Received: 10/01/2020] [Accepted: 12/14/2020] [Indexed: 12/17/2022] Open
Abstract
N4-methylcytosine (4mC) is a biochemical alteration of DNA that affects genetic operations such as gene expression, genomic imprinting, chromosome stability, and cell development without modifying the DNA nucleotides. In the proposed work, a computational model, 4mCNLP-Deep, uses a word-embedding approach as the vector formulation and a deep learning-based CNN algorithm to predict 4mC and non-4mC sites on the C. elegans genome dataset. A range of experimental settings, such as the corpus k-mer size and the number of cross-validation folds, was explored to find the best-performing configuration. 4mCNLP-Deep outperforms the state-of-the-art predictor on five evaluation metrics: accuracy (ACC) of 0.9354, Matthews correlation coefficient (MCC) of 0.8608, specificity (Sp) of 0.8996, sensitivity (Sn) of 0.9563, and area under the curve (AUC) of 0.9731, obtained with a 3-mer corpus, word2vec, and 3-fold cross-validation, with increments of 1.1%, 0.6%, 0.58%, 0.77%, and 4.89%, respectively. Finally, we developed an online webserver, http://nsclbio.jbnu.ac.kr/tools/4mCNLP-Deep/, so that experimental researchers can obtain results easily.
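The abstract mentions a 3-mer corpus fed to word2vec but does not spell out the tokenization. The sketch below shows one common way to turn a DNA sequence into k-mer "words", assuming overlapping windows with a stride of 1; the function name `kmer_tokens` and the example sequences are hypothetical and are not taken from the paper.

```python
def kmer_tokens(sequence: str, k: int = 3, stride: int = 1) -> list[str]:
    """Split a DNA sequence into k-mers, treated as 'words' of a sentence."""
    seq = sequence.upper()
    # Slide a window of width k over the sequence; stop at the last full window.
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, stride)]


# Each sequence becomes one 'sentence' of 3-mer tokens. A corpus in this form
# is the usual input for training a word-embedding model.
corpus = [kmer_tokens(s) for s in ["ACGTAC", "ggca"]]
print(corpus)
```

A stride of 1 gives overlapping k-mers; setting `stride=k` would give non-overlapping ones, which is the other convention in common use.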
Affiliation(s)
- Abdul Wahab
  - Department of Electronics and Information Engineering, Jeonbuk National University, Jeonju, 54896, South Korea
- Hilal Tayara
  - School of International Engineering and Science, Jeonbuk National University, Jeonju, 54896, South Korea
- Zhenyu Xuan
  - Department of Biological Sciences, The University of Texas at Dallas, Richardson, 75080, USA
- Kil To Chong
  - Department of Electronics and Information Engineering, Jeonbuk National University, Jeonju, 54896, South Korea
  - Advanced Electronics and Information Research Center, Jeonbuk National University, Jeonju, 54896, South Korea
|
34
|
"Omics" in traumatic brain injury: novel approaches to a complex disease. Acta Neurochir (Wien) 2021; 163:2581-2594. [PMID: 34273044 PMCID: PMC8357753 DOI: 10.1007/s00701-021-04928-7] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.8] [Received: 03/15/2021] [Accepted: 06/23/2021] [Indexed: 11/12/2022]
Abstract
BACKGROUND To date, there is neither any pharmacological treatment with efficacy in traumatic brain injury (TBI) nor any method to halt the disease progress. This is due to an incomplete understanding of the vast complexity of the biological cascades and failure to appreciate the diversity of secondary injury mechanisms in TBI. In recent years, techniques for high-throughput characterization and quantification of biological molecules, including genomics, proteomics, and metabolomics, have evolved and are collectively referred to as omics. METHODS In this narrative review, we highlight how omics technology can be applied to potentiate diagnostics and prognostication as well as to advance our understanding of injury mechanisms in TBI. RESULTS The omics platforms provide possibilities to study function, dynamics, and alterations of molecular pathways of normal and TBI disease states. Through advanced bioinformatics, large datasets of molecular information from small biological samples can be analyzed in detail and provide valuable knowledge of pathophysiological mechanisms, to include in prognostic modeling when connected to clinically relevant data. In such a complex disease as TBI, omics enables broad categories of studies from gene compositions associated with susceptibility to secondary injury or poor outcome, to potential alterations in metabolites following TBI. CONCLUSION The field of omics in TBI research is rapidly evolving. The recent data and novel methods reviewed herein may form the basis for improved precision medicine approaches, development of pharmacological approaches, and individualization of therapeutic efforts by implementing mathematical "big data" predictive modeling in the near future.
|
35
|
|
36
|
Chen L, Capra JA. Learning and interpreting the gene regulatory grammar in a deep learning framework. PLoS Comput Biol 2020; 16:e1008334. [PMID: 33137083 PMCID: PMC7660921 DOI: 10.1371/journal.pcbi.1008334] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.0] [Received: 12/16/2019] [Revised: 11/12/2020] [Accepted: 09/12/2020] [Indexed: 12/12/2022] Open
Abstract
Deep neural networks (DNNs) have achieved state-of-the-art performance in identifying gene regulatory sequences, but they have provided limited insight into the biology of regulatory elements due to the difficulty of interpreting the complex features they learn. Several models of how combinatorial binding of transcription factors, i.e. the regulatory grammar, drives enhancer activity have been proposed, ranging from the flexible TF billboard model to the stringent enhanceosome model. However, there is limited knowledge of the prevalence of these (or other) sequence architectures across enhancers. Here we perform several hypothesis-driven analyses to explore the ability of DNNs to learn the regulatory grammar of enhancers. We created synthetic datasets based on existing hypotheses about combinatorial transcription factor binding site (TFBS) patterns, including homotypic clusters, heterotypic clusters, and enhanceosomes, from real TF binding motifs from diverse TF families. We then trained deep residual neural networks (ResNets) to model the sequences under a range of scenarios that reflect real-world multi-label regulatory sequence prediction tasks. We developed a gradient-based unsupervised clustering method to extract the patterns learned by the ResNet models. We demonstrated that simulated regulatory grammars are best learned in the penultimate layer of the ResNets, and the proposed method can accurately retrieve the regulatory grammar even when there is heterogeneity in the enhancer categories and a large fraction of TFBS outside of the regulatory grammar. However, we also identify common scenarios where ResNets fail to learn simulated regulatory grammars. Finally, we applied the proposed method to mouse developmental enhancers and were able to identify the components of a known heterotypic TF cluster. Our results provide a framework for interpreting the regulatory rules learned by ResNets, and they demonstrate that the ability and efficiency of ResNets in learning the regulatory grammar depend on the nature of the prediction task.
Affiliation(s)
- Ling Chen
  - Department of Biological Sciences, Vanderbilt University, Nashville, TN, United States of America
- John A. Capra
  - Department of Biological Sciences, Vanderbilt University, Nashville, TN, United States of America
  - Vanderbilt Genetics Institute and Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN, United States of America
  - Department of Computer Science, Vanderbilt University, Nashville, TN, United States of America
|
37
|
Application of deep learning in genomics. SCIENCE CHINA-LIFE SCIENCES 2020; 63:1860-1878. [PMID: 33051704 DOI: 10.1007/s11427-020-1804-5] [Citation(s) in RCA: 18] [Impact Index Per Article: 3.6] [Received: 06/13/2020] [Accepted: 08/15/2020] [Indexed: 12/19/2022]
Abstract
In recent years, deep learning has been widely used in diverse fields of research, such as speech recognition, image classification, autonomous driving and natural language processing. Deep learning has showcased dramatically improved performance in complex classification and regression problems, where the intricate structure in the high-dimensional data is difficult to discover using conventional machine learning algorithms. In biology, applications of deep learning are gaining increasing popularity in predicting the structure and function of genomic elements, such as promoters, enhancers, or gene expression levels. In this review paper, we described the basic concepts in machine learning and artificial neural network, followed by elaboration on the workflow of using convolutional neural network in genomics. Then we provided a concise introduction of deep learning applications in genomics and synthetic biology at the levels of DNA, RNA and protein. Finally, we discussed the current challenges and future perspectives of deep learning in genomics.
|
38
|
He Y, Shen Z, Zhang Q, Wang S, Huang DS. A survey on deep learning in DNA/RNA motif mining. Brief Bioinform 2020; 22:5916939. [PMID: 33005921 PMCID: PMC8293829 DOI: 10.1093/bib/bbaa229] [Citation(s) in RCA: 44] [Impact Index Per Article: 8.8] [Received: 07/18/2020] [Revised: 08/19/2020] [Accepted: 08/24/2020] [Indexed: 01/18/2023] Open
Abstract
DNA/RNA motif mining is the foundation of gene function research. It plays an extremely important role in identifying DNA- or RNA-protein binding sites, which helps to understand the mechanism of gene regulation and management. For the past few decades, researchers have been working on designing new efficient and accurate algorithms for motif mining. These algorithms can be roughly divided into two categories: the enumeration approach and the probabilistic method. In recent years, machine learning methods have made great progress, and deep learning algorithms in particular have achieved good performance. Existing deep learning methods in motif mining can be roughly divided into three types of models: convolutional neural network (CNN) based models, recurrent neural network (RNN) based models, and hybrid CNN–RNN based models. We introduce the application of deep learning in the field of motif mining in terms of data preprocessing and the features of existing deep learning architectures, and we compare the differences between the basic deep learning models. Through the analysis and comparison of existing deep learning methods, we found that more complex models tend to perform better than simple ones when data are sufficient, and that current methods are relatively simple compared with those in other fields such as computer vision, natural language processing (NLP), and computer games. Therefore, it is necessary to summarize deep learning in motif mining, which can help researchers understand this field.
Affiliation(s)
- Ying He
  - computer science and technology at Tongji University, China
- Zhen Shen
  - computer science and technology at Tongji University, China
- Qinhu Zhang
  - computer science and technology at Tongji University, China
- Siguo Wang
  - computer science and technology at Tongji University, China
- De-Shuang Huang
  - Institute of Machines Learning and Systems Biology, Tongji University
|
39
|
Liu L, Hou Y, He J, Lungu J, Dong R. An Energy-Efficient Fall Detection Method Based on FD-DNN for Elderly People. SENSORS 2020; 20:s20154192. [PMID: 32731465 PMCID: PMC7435651 DOI: 10.3390/s20154192] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.0] [Received: 05/20/2020] [Revised: 07/22/2020] [Accepted: 07/25/2020] [Indexed: 11/16/2022]
Abstract
A fall detection module is an important component of community-based care for the elderly to reduce their health risk. It must detect falls accurately while also saving energy. To meet these requirements, an energy-efficient sensor integrated with a sensing module was developed that can sense and cache human activity data in sleep mode, and an interrupt-driven algorithm is proposed to transmit the data to a server integrated with ZigBee. In addition, a deep neural network for fall detection (FD-DNN) running on the server is carefully designed to detect falls accurately. FD-DNN, which combines convolutional neural networks (CNN) with long short-term memory (LSTM) algorithms, was tested on both online and offline datasets. The experimental results show that it takes advantage of both CNN and LSTM and achieves 99.17% fall detection accuracy, with specificity and sensitivity of 99.94% and 94.09%, respectively. It also has low power consumption.
Affiliation(s)
- Leyuan Liu
  - Faculty of Information Technology, Beijing University of Technology, Beijing 100124, China
- Yibin Hou
  - Faculty of Information Technology, Beijing University of Technology, Beijing 100124, China
  - Beijing Engineering Research Center for IOT Software and Systems, Beijing University of Technology, Beijing 100124, China
- Jian He
  - Faculty of Information Technology, Beijing University of Technology, Beijing 100124, China
  - Beijing Engineering Research Center for IOT Software and Systems, Beijing University of Technology, Beijing 100124, China
  - Correspondence: Tel.: +86-106-739-6653
- Jonathan Lungu
  - Faculty of Information Technology, Beijing University of Technology, Beijing 100124, China
- Ruihai Dong
  - Insight Centre for Data Analytics, University College Dublin, Dublin 4, Ireland
|
40
|
Wahab A, Mahmoudi O, Kim J, Chong KT. DNC4mC-Deep: Identification and Analysis of DNA N4-Methylcytosine Sites Based on Different Encoding Schemes By Using Deep Learning. Cells 2020; 9:E1756. [PMID: 32707969 PMCID: PMC7465362 DOI: 10.3390/cells9081756] [Citation(s) in RCA: 18] [Impact Index Per Article: 3.6] [Received: 06/26/2020] [Revised: 07/17/2020] [Accepted: 07/17/2020] [Indexed: 11/24/2022] Open
Abstract
N4-methylcytosine (4mC) is a DNA modification with a critical role in altering genetic performance, including protein interactions, DNA conformation and stability, the regulation of gene expression, cell development, and genomic imprinting. Several 4mC site identifiers have been proposed for various species. Herein, we propose a computational model, DNC4mC-Deep, which combines six encoding techniques with a deep learning model to predict 4mC sites in the genomes of F. vesca and R. chinensis and in a cross-species dataset. A 10-fold cross-validation test demonstrated its superior performance. DNC4mC-Deep obtained MCC values of 0.829 and 0.929 on the F. vesca and R. chinensis training datasets, respectively, and 0.814 on the cross-species dataset. The proposed method thus outperforms the state-of-the-art predictors by at least 0.284 and 0.265 on the F. vesca and R. chinensis training datasets, respectively. Furthermore, DNC4mC-Deep achieved MCC values of 0.635 and 0.565 on the F. vesca and R. chinensis independent datasets, respectively, and 0.562 on the cross-species dataset, which shows that it predicts 4mC sites better than the state-of-the-art predictor.
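Several entries in this list report MCC alongside accuracy, sensitivity, and specificity. As a reminder of how these relate, the sketch below computes all four from the counts of a binary confusion matrix using the standard definitions. The function name `binary_metrics` and the example counts are hypothetical; they are not results from any of the papers listed here.

```python
import math


def binary_metrics(tp: int, tn: int, fp: int, fn: int) -> dict[str, float]:
    """Standard binary classification metrics from confusion-matrix counts."""
    total = tp + tn + fp + fn
    acc = (tp + tn) / total if total else 0.0
    sn = tp / (tp + fn) if (tp + fn) else 0.0   # sensitivity (recall)
    sp = tn / (tn + fp) if (tn + fp) else 0.0   # specificity
    # Matthews correlation coefficient; defined as 0 when any margin is empty.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"acc": acc, "sn": sn, "sp": sp, "mcc": mcc}


# Illustrative counts only.
metrics = binary_metrics(tp=90, tn=80, fp=20, fn=10)
print({name: round(value, 4) for name, value in metrics.items()})
```

Unlike accuracy, MCC uses all four cells of the confusion matrix, which is why it is often preferred when the positive and negative classes are imbalanced.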
Affiliation(s)
- Abdul Wahab
  - Department of Electronics and Information Engineering, Jeonbuk National University, Jeonju 54896, Korea
- Omid Mahmoudi
  - Department of Electronics and Information Engineering, Jeonbuk National University, Jeonju 54896, Korea
- Jeehong Kim
  - Department of New & Renewable Energy, VISION College of Jeonju, Jeonju 55069, Korea
- Kil To Chong
  - Department of Electronics Engineering, Jeonbuk National University, Jeonju 54896, Korea
  - Advance Electronics & Information Research Center, Jeonbuk National University, Jeonju 54896, Korea
|
41
|
Zaroug A, Lai DTH, Mudie K, Begg R. Lower Limb Kinematics Trajectory Prediction Using Long Short-Term Memory Neural Networks. Front Bioeng Biotechnol 2020; 8:362. [PMID: 32457881 PMCID: PMC7227385 DOI: 10.3389/fbioe.2020.00362] [Citation(s) in RCA: 21] [Impact Index Per Article: 4.2] [Received: 01/17/2020] [Accepted: 03/31/2020] [Indexed: 12/03/2022] Open
Abstract
This study determined whether the kinematics of lower limb trajectories during walking could be extrapolated using long short-term memory (LSTM) neural networks. It was hypothesised that LSTM auto encoders could reliably forecast multiple time-step trajectories of the lower limb kinematics, specifically linear acceleration (LA) and angular velocity (AV). Using 3D motion capture, lower limb position-time coordinates were sampled (100 Hz) from six male participants (age 22 ± 2 years, height 1.77 ± 0.02 m, body mass 82 ± 4 kg) who walked for 10 min at 5 km/h on a 0% gradient motor-driven treadmill. These data were fed into an LSTM model with a sliding window of four kinematic variables with 25 samples or time steps: LA and AV for thigh and shank. The LSTM was tested to forecast five samples (i.e., time steps) of the four kinematic input variables. To attain generalisation, the model was trained on a dataset of 2,665 strides from five participants and evaluated on a test set of 1 stride from a sixth participant. The LSTM model learned the lower limb kinematic trajectories using the training samples and tested for generalisation across participants. The forecasting horizon suggested higher model reliability in predicting earlier future trajectories. The mean absolute error (MAE) was evaluated on each variable across the single tested stride, and for the five-sample forecast, it obtained 0.047 m/s2 thigh LA, 0.047 m/s2 shank LA, 0.028 deg/s thigh AV and 0.024 deg/s shank AV. All predicted trajectories were highly correlated with the measured trajectories, with correlation coefficients greater than 0.98. The motion prediction model may have a wide range of applications, such as mitigating the risk of falls or balance loss and improving the human-machine interface for wearable assistive devices.
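The abstract describes a sliding window of 25 time steps over four kinematic variables, used to forecast the next five time steps. A minimal sketch of how such input/target pairs could be cut from a recording is given below. The function name `make_windows`, the step of one sample between windows, and the synthetic data are assumptions made for illustration; the paper's actual preprocessing is not described in the abstract.

```python
def make_windows(series, n_in=25, n_out=5):
    """Cut a multivariate time series into (input, target) pairs.

    series: list of time steps, each a list of variable values
            (e.g. thigh LA, shank LA, thigh AV, shank AV).
    Returns a list of (x, y) where x holds n_in consecutive steps and
    y holds the n_out steps that immediately follow.
    """
    pairs = []
    last_start = len(series) - n_in - n_out
    for start in range(last_start + 1):
        x = series[start:start + n_in]
        y = series[start + n_in:start + n_in + n_out]
        pairs.append((x, y))
    return pairs


# Synthetic stand-in for a recording: 40 time steps, 4 variables each.
recording = [[float(t)] * 4 for t in range(40)]
pairs = make_windows(recording)
print(len(pairs), len(pairs[0][0]), len(pairs[0][1]))
```

Each pair can then be fed to a sequence model, with `x` as the encoder input and `y` as the forecast target.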
Affiliation(s)
- Abdelrahman Zaroug
  - Institute for Health and Sport, Victoria University, Melbourne, VIC, Australia
- Daniel T. H. Lai
  - Institute for Health and Sport, Victoria University, Melbourne, VIC, Australia
  - College of Engineering and Science, Victoria University, Melbourne, VIC, Australia
- Kurt Mudie
  - Defence Science and Technology Group, Melbourne, VIC, Australia
- Rezaul Begg
  - Institute for Health and Sport, Victoria University, Melbourne, VIC, Australia
|
42
|
Ghanbari M, Ohler U. Deep neural networks for interpreting RNA-binding protein target preferences. Genome Res 2020; 30:214-226. [PMID: 31992613 PMCID: PMC7050519 DOI: 10.1101/gr.247494.118] [Citation(s) in RCA: 53] [Impact Index Per Article: 10.6] [Received: 12/18/2018] [Accepted: 01/07/2020] [Indexed: 11/29/2022]
Abstract
Deep learning has become a powerful paradigm to analyze the binding sites of regulatory factors including RNA-binding proteins (RBPs), owing to its strength to learn complex features from possibly multiple sources of raw data. However, the interpretability of these models, which is crucial to improve our understanding of RBP binding preferences and functions, has not yet been investigated in significant detail. We have designed a multitask and multimodal deep neural network for characterizing in vivo RBP targets. The model incorporates not only the sequence but also the region type of the binding sites as input, which helps the model to boost the prediction performance. To interpret the model, we quantified the contribution of the input features to the predictive score of each RBP. Learning across multiple RBPs at once, we are able to avoid experimental biases and to identify the RNA sequence motifs and transcript context patterns that are the most important for the predictions of each individual RBP. Our findings are consistent with known motifs and binding behaviors and can provide new insights about the regulatory functions of RBPs.
Affiliation(s)
- Mahsa Ghanbari
  - The Berlin Institute for Medical Systems Biology, Max Delbrück Center for Molecular Medicine, 10115 Berlin, Germany
- Uwe Ohler
  - The Berlin Institute for Medical Systems Biology, Max Delbrück Center for Molecular Medicine, 10115 Berlin, Germany
  - Department of Biology, Humboldt Universität zu Berlin, 10117 Berlin, Germany
  - Department of Computer Science, Humboldt Universität zu Berlin, 10117 Berlin, Germany
|
43
|
Celik N, O'Brien F, Brennan S, Rainbow RD, Dart C, Zheng Y, Coenen F, Barrett-Jolley R. Deep-Channel uses deep neural networks to detect single-molecule events from patch-clamp data. Commun Biol 2020; 3:3. [PMID: 31925311 PMCID: PMC6946689 DOI: 10.1038/s42003-019-0729-3] [Citation(s) in RCA: 19] [Impact Index Per Article: 3.8] [Received: 08/06/2019] [Accepted: 12/06/2019] [Indexed: 12/05/2022] Open
Abstract
Single-molecule research techniques such as patch-clamp electrophysiology deliver unique biological insight by capturing the movement of individual proteins in real time, unobscured by whole-cell ensemble averaging. The critical first step in analysis is event detection, so called "idealisation", where noisy raw data are turned into discrete records of protein movement. To date there have been practical limitations in patch-clamp data idealisation; high quality idealisation is typically laborious and becomes infeasible and subjective with complex biological data containing many distinct native single-ion channel proteins gating simultaneously. Here, we show a deep learning model based on convolutional neural networks and long short-term memory architecture can automatically idealise complex single molecule activity more accurately and faster than traditional methods. There are no parameters to set; baseline, channel amplitude or numbers of channels for example. We believe this approach could revolutionise the unsupervised automatic detection of single-molecule transition events in the future.
Affiliation(s)
- Numan Celik
  - Faculty of Health and Life Science, University of Liverpool, Liverpool, UK
- Fiona O'Brien
  - Faculty of Health and Life Science, University of Liverpool, Liverpool, UK
- Sean Brennan
  - Faculty of Health and Life Science, University of Liverpool, Liverpool, UK
- Richard D Rainbow
  - Faculty of Health and Life Science, University of Liverpool, Liverpool, UK
- Caroline Dart
  - Faculty of Health and Life Science, University of Liverpool, Liverpool, UK
- Yalin Zheng
  - Faculty of Health and Life Science, University of Liverpool, Liverpool, UK
- Frans Coenen
  - Department of Computer Science, University of Liverpool, Liverpool, UK
|
44
|
Liu G, Zeng H, Gifford DK. Visualizing complex feature interactions and feature sharing in genomic deep neural networks. BMC Bioinformatics 2019; 20:401. [PMID: 31324140 PMCID: PMC6642501 DOI: 10.1186/s12859-019-2957-4] [Citation(s) in RCA: 13] [Impact Index Per Article: 2.2] [Received: 11/14/2018] [Accepted: 06/18/2019] [Indexed: 12/31/2022] Open
Abstract
BACKGROUND Visualization tools for deep learning models typically focus on discovering key input features without considering how such low level features are combined in intermediate layers to make decisions. Moreover, many of these methods examine a network's response to specific input examples that may be insufficient to reveal the complexity of model decision making. RESULTS We present DeepResolve, an analysis framework for deep convolutional models of genome function that visualizes how input features contribute individually and combinatorially to network decisions. Unlike other methods, DeepResolve does not depend upon the analysis of a predefined set of inputs. Rather, it uses gradient ascent to stochastically explore intermediate feature maps to 1) discover important features, 2) visualize their contribution and interaction patterns, and 3) analyze feature sharing across tasks that suggests shared biological mechanism. We demonstrate the visualization of decision making using our proposed method on deep neural networks trained on both experimental and synthetic data. DeepResolve is competitive with existing visualization tools in discovering key sequence features, and identifies certain negative features and non-additive feature interactions that are not easily observed with existing tools. It also recovers similarities between poorly correlated classes which are not observed by traditional methods. DeepResolve reveals that DeepSEA's learned decision structure is shared across genome annotations including histone marks, DNase hypersensitivity, and transcription factor binding. We identify groups of TFs that suggest known shared biological mechanism, and recover correlation between DNA hypersensitivities and TF/Chromatin marks. CONCLUSIONS DeepResolve is capable of visualizing complex feature contribution patterns and feature interactions that contribute to decision making in genomic deep convolutional networks. It also recovers feature sharing and class similarities which suggest interesting biological mechanisms. DeepResolve is compatible with existing visualization tools and provides complementary insights.
Affiliation(s)
- Ge Liu
  - Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, Massachusetts, USA
- Haoyang Zeng
  - Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, Massachusetts, USA
- David K Gifford
  - Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, Massachusetts, USA
|
45
|
Chen H, Shaw D, Zeng J, Bu D, Jiang T. DIFFUSE: predicting isoform functions from sequences and expression profiles via deep learning. Bioinformatics 2019; 35:i284-i294. [PMID: 31510699 PMCID: PMC6612874 DOI: 10.1093/bioinformatics/btz367] [Citation(s) in RCA: 28] [Impact Index Per Article: 4.7] [Indexed: 01/25/2023] Open
Abstract
MOTIVATION Alternative splicing generates multiple isoforms from a single gene, greatly increasing the functional diversity of a genome. Although gene functions have been well studied, little is known about the specific functions of isoforms, making accurate prediction of isoform functions highly desirable. However, the existing approaches to predicting isoform functions are far from satisfactory due to at least two reasons: (i) unlike genes, isoform-level functional annotations are scarce. (ii) The information of isoform functions is concealed in various types of data including isoform sequences, co-expression relationship among isoforms, etc. RESULTS In this study, we present a novel approach, DIFFUSE (Deep learning-based prediction of IsoForm FUnctions from Sequences and Expression), to predict isoform functions. To integrate various types of data, our approach adopts a hybrid framework by first using a deep neural network (DNN) to predict the functions of isoforms from their genomic sequences and then refining the prediction using a conditional random field (CRF) based on co-expression relationship. To overcome the lack of isoform-level ground truth labels, we further propose an iterative semi-supervised learning algorithm to train both the DNN and CRF together. Our extensive computational experiments demonstrate that DIFFUSE could effectively predict the functions of isoforms and genes. It achieves an average area under the receiver operating characteristics curve of 0.840 and area under the precision-recall curve of 0.581 over 4184 GO functional categories, which are significantly higher than the state-of-the-art methods. We further validate the prediction results by analyzing the correlation between functional similarity, sequence similarity, expression similarity and structural similarity, as well as the consistency between the predicted functions and some well-studied functional features of isoform sequences. 
AVAILABILITY AND IMPLEMENTATION https://github.com/haochenucr/DIFFUSE. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Hao Chen
  - Department of Computer Science and Engineering, University of California, Riverside, CA, USA
- Dipan Shaw
  - Department of Computer Science and Engineering, University of California, Riverside, CA, USA
- Jianyang Zeng
  - Institute for Interdisciplinary Information Sciences, Tsinghua University, Beijing, China
- Dongbo Bu
  - Key Lab of Intelligent Information Process, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China
  - University of Chinese Academy of Sciences, Beijing, China
- Tao Jiang
  - Department of Computer Science and Engineering, University of California, Riverside, CA, USA
  - Bioinformatics Division, BNRIST/Department of Computer Science and Technology, Tsinghua University, Beijing, China
|
46
|
Nusrat S, Harbig T, Gehlenborg N. Tasks, Techniques, and Tools for Genomic Data Visualization. Comput Graph Forum 2019; 38:781-805. [PMID: 31768085 PMCID: PMC6876635 DOI: 10.1111/cgf.13727] [Citation(s) in RCA: 32] [Impact Index Per Article: 5.3] [Indexed: 05/06/2023]
Abstract
Genomic data visualization is essential for interpretation and hypothesis generation as well as a valuable aid in communicating discoveries. Visual tools bridge the gap between algorithmic approaches and the cognitive skills of investigators. Addressing this need has become crucial in genomics, as biomedical research is increasingly data-driven and many studies lack well-defined hypotheses. A key challenge in data-driven research is to discover unexpected patterns and to formulate hypotheses in an unbiased manner in vast amounts of genomic and other associated data. Over the past two decades, this has driven the development of numerous data visualization techniques and tools for visualizing genomic data. Based on a comprehensive literature survey, we propose taxonomies for data, visualization, and tasks involved in genomic data visualization. Furthermore, we provide a comprehensive review of published genomic visualization tools in the context of the proposed taxonomies.
Affiliation(s)
- S Nusrat
  - Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA
- T Harbig
  - Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA
- N Gehlenborg
  - Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA
|
47
|
Luo G. A roadmap for semi-automatically extracting predictive and clinically meaningful temporal features from medical data for predictive modeling. Global Transitions 2019; 1:61-82. [PMID: 31032483 PMCID: PMC6482973 DOI: 10.1016/j.glt.2018.11.001] [Citation(s) in RCA: 13] [Impact Index Per Article: 2.2] [Indexed: 05/27/2023]
Abstract
Predictive modeling based on machine learning with medical data has great potential to improve healthcare and reduce costs. However, two hurdles, among others, impede its widespread adoption in healthcare. First, medical data are by nature longitudinal. Pre-processing them, particularly for feature engineering, is labor intensive and often takes 50-80% of the model building effort. Predictive temporal features are the basis of building accurate models, but are difficult to identify. This is problematic. Healthcare systems have limited resources for model building, while inaccurate models produce sub-optimal outcomes and are often useless. Second, most machine learning models provide no explanation of their prediction results. However, offering such explanations is essential for a model to be used in usual clinical practice. To address these two hurdles, this paper outlines: 1) a data-driven method for semi-automatically extracting predictive and clinically meaningful temporal features from medical data for predictive modeling; and 2) a method of using these features to automatically explain machine learning prediction results and suggest tailored interventions. This provides a roadmap for future research.
Affiliation(s)
- Gang Luo
  - Department of Biomedical Informatics and Medical Education, University of Washington, UW Medicine South Lake Union, 850 Republican Street, Building C, Box 358047, Seattle, WA, 98109, USA
|
48
|
Woloszynek S, Zhao Z, Chen J, Rosen GL. 16S rRNA sequence embeddings: Meaningful numeric feature representations of nucleotide sequences that are convenient for downstream analyses. PLoS Comput Biol 2019; 15:e1006721. [PMID: 30807567 PMCID: PMC6407789 DOI: 10.1371/journal.pcbi.1006721] [Citation(s) in RCA: 21] [Impact Index Per Article: 3.5] [Received: 05/04/2018] [Revised: 03/08/2019] [Accepted: 12/17/2018] [Indexed: 12/26/2022] Open
Abstract
Advances in high-throughput sequencing have increased the availability of microbiome sequencing data that can be exploited to characterize microbiome community structure in situ. We explore using word and sentence embedding approaches for nucleotide sequences since they may be a suitable numerical representation for downstream machine learning applications (especially deep learning). This work involves first encoding ("embedding") each sequence into a dense, low-dimensional, numeric vector space. Here, we use Skip-Gram word2vec to embed k-mers, obtained from 16S rRNA amplicon surveys, and then leverage an existing sentence embedding technique to embed all sequences belonging to specific body sites or samples. We demonstrate that these representations are meaningful, and hence the embedding space can be exploited as a form of feature extraction for exploratory analysis. We show that sequence embeddings preserve relevant information about the sequencing data such as k-mer context, sequence taxonomy, and sample class. Specifically, the sequence embedding space resolved differences among phyla, as well as differences among genera within the same family. Distances between sequence embeddings had similar qualities to distances between alignment identities, and embedding multiple sequences can be thought of as generating a consensus sequence. In addition, embeddings are versatile features that can be used for many downstream tasks, such as taxonomic and sample classification. Using sample embeddings for body site classification resulted in negligible performance loss compared to using OTU abundance data, and clustering embeddings yielded high fidelity species clusters. Lastly, the k-mer embedding space captured distinct k-mer profiles that mapped to specific regions of the 16S rRNA gene and corresponded with particular body sites. 
Together, our results show that embedding sequences results in meaningful representations that can be used for exploratory analyses or for downstream machine learning applications that require numeric data. Moreover, because the embeddings are trained in an unsupervised manner, unlabeled data can be embedded and used to bolster supervised machine learning tasks.
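The k-mer tokenization step behind these embeddings can be sketched as follows. This is a simplified illustration, not the authors' pipeline: the embedding table is a hand-written toy dictionary rather than a trained Skip-Gram word2vec model, and plain averaging stands in for the sentence embedding technique mentioned in the abstract. Function names and the default k are assumptions.

```python
def kmers(sequence, k=6):
    """Return all overlapping k-mers of a nucleotide sequence, in order."""
    sequence = sequence.upper()
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def mean_embedding(tokens, table):
    """Average the vectors of the tokens that appear in the embedding table.

    Averaging is an assumed simplification; tokens missing from the table
    are skipped.
    """
    vectors = [table[token] for token in tokens if token in table]
    if not vectors:
        raise ValueError("no token has an embedding")
    dim = len(vectors[0])
    return [sum(vec[d] for vec in vectors) / len(vectors) for d in range(dim)]


# Toy usage with a hypothetical 2-dimensional embedding table for 3-mers.
toy_table = {"ACG": [1.0, 0.0], "CGT": [0.0, 1.0], "GTA": [1.0, 1.0]}
tokens = kmers("acgta", k=3)
sequence_vector = mean_embedding(tokens, toy_table)
```

In practice the table would come from a model trained on a large k-mer corpus; the sketch only shows how a sequence becomes tokens and how token vectors can be pooled into one fixed-length vector.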
Affiliation(s)
- Stephen Woloszynek
  - Department of Electrical and Computer Engineering, Drexel University, Philadelphia, Pennsylvania, United States of America
- Zhengqiao Zhao
  - Department of Electrical and Computer Engineering, Drexel University, Philadelphia, Pennsylvania, United States of America
- Jian Chen
  - Department of Computer Science and Engineering, State University of New York at Buffalo, Buffalo, New York, United States of America
- Gail L. Rosen
  - Department of Electrical and Computer Engineering, Drexel University, Philadelphia, Pennsylvania, United States of America
|
49
|
Zhang Y, Hamada M. DeepM6ASeq: prediction and characterization of m6A-containing sequences using deep learning. BMC Bioinformatics 2018; 19:524. [PMID: 30598068 PMCID: PMC6311933 DOI: 10.1186/s12859-018-2516-4] [Citation(s) in RCA: 94] [Impact Index Per Article: 13.4] [Indexed: 11/22/2022] Open
Abstract
Background N6-methyladenosine (m6A) is a common and abundant RNA methylation modification found in various species. As a type of post-transcriptional methylation, m6A plays an important role in diverse RNA activities such as alternative splicing, an interplay with microRNAs and translation efficiency. Although existing tools can predict m6A at single-base resolution, it is still challenging to extract the biological information surrounding m6A sites. Results We implemented a deep learning framework, named DeepM6ASeq, to predict m6A-containing sequences and characterize surrounding biological features based on miCLIP-Seq data, which detects m6A sites at single-base resolution. DeepM6ASeq showed better performance as compared to other machine learning classifiers. Moreover, an independent test on m6A-Seq data, which identifies m6A-containing genomic regions, revealed that our model is competitive in predicting m6A-containing sequences. The learned motifs from DeepM6ASeq correspond to known m6A readers. Notably, DeepM6ASeq also identifies a newly recognized m6A reader: FMR1. Besides, we found that a saliency map in the deep learning model could be utilized to visualize locations of m6A sites. Conclusion We developed a deep-learning-based framework to predict and characterize m6A-containing sequences and hope to help investigators to gain more insights for m6A research. The source code is available at https://github.com/rreybeyb/DeepM6ASeq. Electronic supplementary material The online version of this article (10.1186/s12859-018-2516-4) contains supplementary material, which is available to authorized users.
Affiliation(s)
- Yiqian Zhang
  - Department of Electrical Engineering and Bioscience, Faculty of Science and Engineering, Waseda University, 55N-06-10, 3-4-1 Okubo Shinjuku-ku, Tokyo, 169-8555, Japan
  - AIST-Waseda University Computational Bio Big-Data Open Innovation Laboratory (CBBD-OIL), 3-4-1, Okubo Shinjuku-ku, Tokyo, 169-8555, Japan
- Michiaki Hamada
  - Department of Electrical Engineering and Bioscience, Faculty of Science and Engineering, Waseda University, 55N-06-10, 3-4-1 Okubo Shinjuku-ku, Tokyo, 169-8555, Japan
  - AIST-Waseda University Computational Bio Big-Data Open Innovation Laboratory (CBBD-OIL), 3-4-1, Okubo Shinjuku-ku, Tokyo, 169-8555, Japan
  - Artificial Intelligence Research Center, National Institute of Advanced Industrial Science and Technology (AIST), 2-41-6 Aomi, Koto-ku, Tokyo, 135-0064, Japan
  - Institute for Medical-oriented Structural Biology, Waseda University, 2-2, Wakamatsu-cho Shinjuku-ku, Tokyo, 162-8480, Japan
  - Graduate School of Medicine, Nippon Medical School, 1-1-5, Sendagi, Bunkyo-ku, Tokyo, 113-8602, Japan
|
50
|
Hirohara M, Saito Y, Koda Y, Sato K, Sakakibara Y. Convolutional neural network based on SMILES representation of compounds for detecting chemical motif. BMC Bioinformatics 2018; 19:526. [PMID: 30598075 PMCID: PMC6311897 DOI: 10.1186/s12859-018-2523-5] [Citation(s) in RCA: 87] [Impact Index Per Article: 12.4] [Indexed: 01/06/2023] Open
Abstract
Background Previous studies have suggested deep learning to be a highly effective approach for screening lead compounds for new drugs. Several deep learning models have been developed by addressing the use of various kinds of fingerprints and graph convolution architectures. However, these methods are either advantageous or disadvantageous depending on whether they (1) can distinguish structural differences including chirality of compounds, and (2) can automatically discover effective features. Results We developed another deep learning model for compound classification. In this method, we constructed a distributed representation of compounds based on the SMILES notation, which linearly represents a compound structure, and applied the SMILES-based representation to a convolutional neural network (CNN). The use of SMILES allows us to process all types of compounds while incorporating a broad range of structure information, and representation learning by CNN automatically acquires a low-dimensional representation of input features. In a benchmark experiment using the TOX 21 dataset, our method outperformed conventional fingerprint methods, and performed comparably against the winning model of the TOX 21 Challenge. Multivariate analysis confirmed that the chemical space consisting of the features learned by SMILES-based representation learning adequately expressed a richer feature space that enabled the accurate discrimination of compounds. Using motif detection with the learned filters, not only important known structures (motifs) such as protein-binding sites but also structures of unknown functional groups were detected. Conclusions The source code of our SMILES-based convolutional neural network software in the deep learning framework Chainer is available at http://www.dna.bio.keio.ac.jp/smiles/, and the dataset used for performance evaluation in this work is available at the same URL.
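To make the idea of feeding a SMILES string to a CNN concrete, the sketch below turns a SMILES string into a fixed-size matrix with one row per position. It is an assumed simplification: it uses character-level one-hot encoding, whereas the abstract describes a richer distributed representation, and the alphabet, padding length and function name are hypothetical.

```python
def one_hot_smiles(smiles, alphabet, max_len):
    """Encode a SMILES string as a max_len x len(alphabet) 0/1 matrix.

    Strings longer than max_len are truncated; shorter ones are padded
    with all-zero rows. Characters outside the alphabet raise ValueError.
    Note: character-level tokens split two-letter atoms such as "Cl".
    """
    index = {symbol: i for i, symbol in enumerate(alphabet)}
    matrix = [[0] * len(alphabet) for _ in range(max_len)]
    for position, symbol in enumerate(smiles[:max_len]):
        if symbol not in index:
            raise ValueError(f"symbol {symbol!r} is not in the alphabet")
        matrix[position][index[symbol]] = 1
    return matrix


# Toy usage: ethanol ("CCO") with a small illustrative alphabet.
toy_alphabet = "CNO()=1"
ethanol = one_hot_smiles("CCO", toy_alphabet, max_len=8)
```

A one-dimensional convolution over the rows of such a matrix can then respond to short substructure patterns, which is the general intuition behind motif detection with learned filters.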
Affiliation(s)
- Maya Hirohara
  - Department of Biosciences and Informatics, Keio University, Yokohama, 223-8522, Japan
- Yutaka Saito
  - Artificial Intelligence Research Center, National Institute of Advanced Industrial Science and Technology (AIST), Tokyo, 135-0064, Japan
  - Computational Bio Big-Data Open Innovation Laboratory (CBBD-OIL), National Institute of Advanced Industrial Science and Technology (AIST), Tokyo, 169-8555, Japan
- Yuki Koda
  - Department of Biosciences and Informatics, Keio University, Yokohama, 223-8522, Japan
- Kengo Sato
  - Department of Biosciences and Informatics, Keio University, Yokohama, 223-8522, Japan
- Yasubumi Sakakibara
  - Department of Biosciences and Informatics, Keio University, Yokohama, 223-8522, Japan
|