1
Yan M, Dong Z, Zhu Z, Qiao C, Wang M, Teng Z, Xing Y, Liu G, Liu G, Cai L, Meng H. Cancer type and survival prediction based on transcriptomic feature map. Comput Biol Med 2025; 192:110267. [PMID: 40311464 DOI: 10.1016/j.compbiomed.2025.110267] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/11/2024] [Revised: 04/05/2025] [Accepted: 04/22/2025] [Indexed: 05/03/2025]
Abstract
This study predicts cancer type and survival time by transforming transcriptomic features into feature maps and applying deep learning models. Using transcriptomic data from 27 cancer types and survival data from 10 types in the TCGA database, a pan-cancer transcriptomic feature map was constructed through data cleaning, feature extraction, and visualization. An Inception network combined with gated convolutional modules yielded a pan-cancer classification accuracy of 91.8%. Additionally, 31 differential genes were extracted from the feature maps of different cancers and an interaction network diagram was drawn, identifying two key genes, ANXA5 and ACTB. These genes are potential biomarkers related to cancer progression, angiogenesis, metastasis, and treatment resistance. In survival prediction on 10 cancer types, combining feature maps with data amplification gave prediction accuracies ranging from 0.75 to 0.91. This transcriptomic feature map provides a novel approach for cancer omics analysis that can facilitate personalized treatment and reflect individual differences.
Affiliation(s)
- Ming Yan
- Inner Mongolia Key Laboratory of Life Health and Bioinformatics, College of Life Science and Technology, Inner Mongolia University of Science and Technology, Baotou, 014010, China
| | - Zirou Dong
- Inner Mongolia Key Laboratory of Life Health and Bioinformatics, College of Life Science and Technology, Inner Mongolia University of Science and Technology, Baotou, 014010, China
| | - Zhaopo Zhu
- Center for Medical Genetics & Hunan Key Laboratory, School of Life Sciences, Central South University, Changsha, Hunan, 410008, China
| | - Chengliang Qiao
- Inner Mongolia Key Laboratory of Life Health and Bioinformatics, College of Life Science and Technology, Inner Mongolia University of Science and Technology, Baotou, 014010, China
| | - Meizhi Wang
- Inner Mongolia Key Laboratory of Life Health and Bioinformatics, College of Life Science and Technology, Inner Mongolia University of Science and Technology, Baotou, 014010, China
| | - Zhixia Teng
- College of Information and Computer Engineering, Northeast Forestry University, Harbin, China
| | - Yongqiang Xing
- Inner Mongolia Key Laboratory of Life Health and Bioinformatics, College of Life Science and Technology, Inner Mongolia University of Science and Technology, Baotou, 014010, China
| | - Guojun Liu
- Inner Mongolia Key Laboratory of Life Health and Bioinformatics, College of Life Science and Technology, Inner Mongolia University of Science and Technology, Baotou, 014010, China
| | - Guoqing Liu
- Inner Mongolia Key Laboratory of Life Health and Bioinformatics, College of Life Science and Technology, Inner Mongolia University of Science and Technology, Baotou, 014010, China
| | - Lu Cai
- Inner Mongolia Key Laboratory of Life Health and Bioinformatics, College of Life Science and Technology, Inner Mongolia University of Science and Technology, Baotou, 014010, China.
| | - Hu Meng
- Inner Mongolia Key Laboratory of Life Health and Bioinformatics, College of Life Science and Technology, Inner Mongolia University of Science and Technology, Baotou, 014010, China.
2
Alshammry N. Developing a method for predicting DNA nucleosomal sequences using deep learning. Technol Health Care 2025; 33:989-999. [PMID: 40105177 DOI: 10.1177/09287329241297900] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 03/20/2025]
Abstract
Background: Deep learning excels at processing raw data because it automatically extracts and classifies high-level features. Despite biology's low popularity in data analysis, incorporating computer technology can improve biological research. Objective: To create a deep learning model that can identify nucleosomes from nucleotide sequences and to show that simpler models outperform more complicated ones in solving biological challenges. Methods: A classifier was created utilising deep learning and machine learning approaches. The final model consists of two convolutional layers, one max pooling layer, two fully connected layers, and a dropout regularisation layer. This structure was chosen on the basis of the 'less is frequently more' approach, which emphasises simple design without large hidden layers. Results: Experimental results show that deep learning methods, specifically deep neural networks, outperform typical machine learning algorithms for recognising nucleosomes. The simplified network architecture proved suitable without the requirement for numerous hidden neurons, resulting in effective network performance. Conclusion: This study demonstrates that machine learning and other computational techniques may streamline and expedite the resolution of biological issues. The model helps identify nucleosomes and can be used in future research or labs. This study discusses the challenges of understanding and addressing simple biological problems with sophisticated computer technology and offers practical solutions for academic and economic sectors.
Affiliation(s)
- Nizal Alshammry
- Department of Computer Sciences, Faculty of Computing and Information Technology, Northern Border University, Rafha, Saudi Arabia
3
Masoudi-Sobhanzadeh Y, Li S, Peng Y, Panchenko A. Interpretable deep residual network uncovers nucleosome positioning and associated features. Nucleic Acids Res 2024; 52:8734-8745. [PMID: 39036965 PMCID: PMC11347144 DOI: 10.1093/nar/gkae623] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/21/2024] [Revised: 05/31/2024] [Accepted: 07/04/2024] [Indexed: 07/23/2024] Open
Abstract
Nucleosomes represent elementary building units of eukaryotic chromosomes and consist of DNA wrapped around a histone octamer flanked by linker DNA segments. Nucleosomes are central in epigenetic pathways and their genomic positioning is associated with regulation of gene expression, DNA replication, DNA methylation and DNA repair, among other functions. Building on prior discoveries that DNA sequences noticeably affect nucleosome positioning, our objective is to identify nucleosome positions and related features across the entire genome. Here, we introduce an interpretable framework based on the concepts of deep residual networks (NuPoSe). Trained on high-coverage human experimental MNase-seq data, NuPoSe is able to learn sequence and structural patterns associated with nucleosome organization in the human genome. NuPoSe can also be applied to unseen data from different organisms and cell types. Our findings point to 43 informative features, most of which are tri-nucleotides and di-nucleotides, plus one tetra-nucleotide. Most features are significantly associated with the nucleosomal structural characteristics, namely, periodicity of nucleosomal DNA and its location with respect to a histone octamer. Importantly, we show that features derived from the 27 bp linker DNA flanking nucleosomes contribute up to 10% to the quality of the prediction model. This, along with the comprehensive training sets, deep-learning architecture, and feature selection method, may contribute to NuPoSe's 80-89% classification accuracy on different independent datasets.
Affiliation(s)
| | - Shuxiang Li
- Department of Pathology and Molecular Medicine, Queen's University, Kingston, K7L3N6, Canada
| | - Yunhui Peng
- Institute of Biophysics and Department of Physics, Central China Normal University, Wuhan, 430079, China
| | - Anna R Panchenko
- Department of Pathology and Molecular Medicine, Queen's University, Kingston, K7L3N6, Canada
- Department of Biology and Molecular Sciences, Queen's University, Kingston, K7L3N6, Canada
- School of Computing, Queen's University, Kingston, K7L3N6, Canada
- Ontario Institute of Cancer Research, Toronto, M5G 0A3, Canada
4
Sahrhage M, Paul NB, Beißbarth T, Haubrock M. The importance of DNA sequence for nucleosome positioning in transcriptional regulation. Life Sci Alliance 2024; 7:e202302380. [PMID: 38830772 PMCID: PMC11147951 DOI: 10.26508/lsa.202302380] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/18/2023] [Revised: 05/15/2024] [Accepted: 05/16/2024] [Indexed: 06/05/2024] Open
Abstract
Nucleosome positioning is a key factor for transcriptional regulation. Nucleosomes regulate the dynamic accessibility of chromatin and interact with the transcription machinery at every stage. The influences that steer nucleosome positioning are diverse, and the relative importance of DNA sequence versus active chromatin remodeling has long been debated. In this study, we evaluate the functional role of DNA sequence for all major elements along the process of transcription. We developed a random forest classifier based on local DNA structure that assesses the sequence-intrinsic support for nucleosome positioning. On this basis, we created a simple data resource that we applied genome-wide to the human genome. In our comprehensive analysis, we found a special role of DNA in mediating the competition of nucleosomes with cis-regulatory elements, in enabling steady transcription, for positioning of stable nucleosomes in exons, and for repelling nucleosomes during transcription termination. In contrast, we relate these findings to concurrent processes that generate strongly positioned nucleosomes in vivo that are not mediated by sequence, such as energy-dependent remodeling of chromatin.
Affiliation(s)
- Malte Sahrhage
- Department of Medical Bioinformatics, University Medical Center, Göttingen, Germany
| | - Niels Benjamin Paul
- Department of Medical Bioinformatics, University Medical Center, Göttingen, Germany
- Department of Cardiology and Pneumology, University Medical Center, Göttingen, Germany
| | - Tim Beißbarth
- Department of Medical Bioinformatics, University Medical Center, Göttingen, Germany
| | - Martin Haubrock
- Department of Medical Bioinformatics, University Medical Center, Göttingen, Germany
5
Zhao L, Xue Q, Zhang H, Hao Y, Yi H, Liu X, Pan W, Fu J, Zhang A. CatNet: Sequence-based deep learning with cross-attention mechanism for identifying endocrine-disrupting chemicals. JOURNAL OF HAZARDOUS MATERIALS 2024; 465:133055. [PMID: 38016311 DOI: 10.1016/j.jhazmat.2023.133055] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/19/2023] [Revised: 11/02/2023] [Accepted: 11/20/2023] [Indexed: 11/30/2023]
Abstract
Endocrine-disrupting chemicals (EDCs) pose significant environmental and health risks due to their potential to interfere with nuclear receptors (NRs), key regulators of physiological processes. Despite the evident risks, the majority of existing research narrows its focus on the interaction between compounds and the individual NR target, neglecting a comprehensive assessment across the entire NR family. In response, this study assembled a comprehensive human NR dataset, capturing 49,244 interactions between 35,467 unique compounds and 42 NRs. We introduced a cross-attention network framework, "CatNet", innovatively integrating compound and protein representations through cross-attention mechanisms. The results showed that CatNet model achieved excellent performance with an area under the receiver operating characteristic curve (AUCROC) = 0.916 on the test set, and exhibited reliable generalization on unseen compound-NR pairs. A distinguishing feature of our research is its capacity to expand to novel targets. Beyond its predictive accuracy, CatNet offers a valuable mechanistic perspective on compound-NR interactions through feature visualization. Augmenting the utility of our research, we have also developed a graphical user interface, empowering researchers to predict chemical binding to diverse NRs. Our model enables the prediction of human NR-related EDCs and shows the potential to identify EDCs related to other targets.
Affiliation(s)
- Lu Zhao
- State Key Laboratory of Environmental Chemistry and Ecotoxicology, Research Center for Eco-Environmental Sciences, Chinese Academy of Sciences, Beijing 100085, PR China; Sino-Danish College, University of Chinese Academy of Sciences, Beijing 100049, PR China
| | - Qiao Xue
- State Key Laboratory of Environmental Chemistry and Ecotoxicology, Research Center for Eco-Environmental Sciences, Chinese Academy of Sciences, Beijing 100085, PR China.
| | - Huazhou Zhang
- State Key Laboratory of Environmental Chemistry and Ecotoxicology, Research Center for Eco-Environmental Sciences, Chinese Academy of Sciences, Beijing 100085, PR China; Sino-Danish College, University of Chinese Academy of Sciences, Beijing 100049, PR China
| | - Yuxing Hao
- State Key Laboratory of Environmental Chemistry and Ecotoxicology, Research Center for Eco-Environmental Sciences, Chinese Academy of Sciences, Beijing 100085, PR China; Sino-Danish College, University of Chinese Academy of Sciences, Beijing 100049, PR China
| | - Hang Yi
- State Key Laboratory of Environmental Chemistry and Ecotoxicology, Research Center for Eco-Environmental Sciences, Chinese Academy of Sciences, Beijing 100085, PR China; College of Resources and Environment, University of Chinese Academy of Sciences, Beijing 100190, PR China
| | - Xian Liu
- State Key Laboratory of Environmental Chemistry and Ecotoxicology, Research Center for Eco-Environmental Sciences, Chinese Academy of Sciences, Beijing 100085, PR China
| | - Wenxiao Pan
- State Key Laboratory of Environmental Chemistry and Ecotoxicology, Research Center for Eco-Environmental Sciences, Chinese Academy of Sciences, Beijing 100085, PR China
| | - Jianjie Fu
- State Key Laboratory of Environmental Chemistry and Ecotoxicology, Research Center for Eco-Environmental Sciences, Chinese Academy of Sciences, Beijing 100085, PR China; College of Resources and Environment, University of Chinese Academy of Sciences, Beijing 100190, PR China; School of Environment, Hangzhou Institute for Advanced Study, University of Chinese Academy of Sciences, Hangzhou 310012, PR China
| | - Aiqian Zhang
- State Key Laboratory of Environmental Chemistry and Ecotoxicology, Research Center for Eco-Environmental Sciences, Chinese Academy of Sciences, Beijing 100085, PR China; Sino-Danish College, University of Chinese Academy of Sciences, Beijing 100049, PR China; College of Resources and Environment, University of Chinese Academy of Sciences, Beijing 100190, PR China; School of Environment, Hangzhou Institute for Advanced Study, University of Chinese Academy of Sciences, Hangzhou 310012, PR China.
6
Comprehensive computational analysis of epigenetic descriptors affecting CRISPR-Cas9 off-target activity. BMC Genomics 2022; 23:805. [PMID: 36474180 PMCID: PMC9724382 DOI: 10.1186/s12864-022-09012-7] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 04/20/2022] [Accepted: 10/17/2022] [Indexed: 12/12/2022] Open
Abstract
BACKGROUND A common issue in CRISPR-Cas9 genome editing is off-target activity, which prevents the widespread use of CRISPR-Cas9 in medical applications. Among other factors, primary chromatin structure and epigenetics may influence off-target activity. METHODS In this work, we utilize crisprSQL, an off-target database, to analyze the effect of 19 epigenetic descriptors on CRISPR-Cas9 off-target activity. Termed as 19 epigenetic features/scores, they consist of 6 experimental epigenetic and 13 computed nucleosome organization-related features. In terms of novel features, 15 of the epigenetic scores are newly considered. The 15 newly considered scores consist of 13 freshly computed nucleosome occupancy/positioning scores and 2 experimental features (MNase and DRIP). The other 4 existing scores are experimental features (CTCF, DNase I, H3K4me3, RRBS) commonly used in deep learning models for off-target activity prediction. For data curation, MNase was aggregated from existing experimental nucleosome occupancy data. Based on the sequence context information available in crisprSQL, we also computed nucleosome occupancy/positioning scores for off-target sites. RESULTS To investigate the relationship between the 19 epigenetic features and off-target activity, we first conducted Spearman and Pearson correlation analysis. Such analysis shows that some computed scores derived from training-based models and training-free algorithms outperform all experimental epigenetic features. Next, we evaluated the contribution of all epigenetic features in two successful machine/deep learning models which predict off-target activity. We found that some computed scores, unlike all 6 experimental features, significantly contribute to the predictions of both models. As a practical research contribution, we make the off-target dataset containing all 19 epigenetic features available to the research community. 
CONCLUSIONS Our comprehensive computational analysis helps the CRISPR-Cas9 community better understand the relationship between epigenetic features and CRISPR-Cas9 off-target activity.
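The Pearson and Spearman correlation analysis mentioned in the abstract above can be sketched in a few lines of standard-library Python. This is only an illustration of the two statistics, not the authors' pipeline: the function names `pearson`, `average_ranks`, and `spearman` are hypothetical, and the example scores and activities are made-up values, not data from crisprSQL.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical values: one epigenetic score and one activity per off-target site.
nucleosome_score = [0.10, 0.40, 0.35, 0.80, 0.90]
off_target_activity = [0.02, 0.10, 0.08, 0.30, 0.90]
print("Pearson: ", pearson(nucleosome_score, off_target_activity))
print("Spearman:", spearman(nucleosome_score, off_target_activity))
```

Spearman depends only on rank order, so it is less sensitive than Pearson to the skewed activity values that are common in off-target data, which is one reason to report both.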
7
Zhou Y, Wu T, Jiang Y, Li Y, Li K, Quan L, Lyu Q. DeepNup: Prediction of Nucleosome Positioning from DNA Sequences Using Deep Neural Network. Genes (Basel) 2022; 13:1983. [PMID: 36360220 PMCID: PMC9689664 DOI: 10.3390/genes13111983] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 09/30/2022] [Revised: 10/25/2022] [Accepted: 10/26/2022] [Indexed: 10/29/2024] Open
Abstract
Nucleosome positioning plays a vital role in diverse cellular biological processes by regulating the accessibility of DNA sequences to DNA-binding proteins. Previous studies have shown that the intrinsic preference of nucleosomes for DNA sequences may play a dominant role in nucleosome positioning. Consequently, developing computational methods that accurately identify nucleosome positioning from DNA sequence information alone is a nontrivial task, and one that helps verify the contribution of DNA sequence to nucleosome positioning. In this work, we propose a new deep learning-based method, named DeepNup, which improves the prediction of nucleosome positioning from DNA sequences alone. Specifically, we first use a hybrid feature encoding scheme that combines one-hot encoding and trinucleotide composition encoding to encode raw DNA sequences; afterwards, we employ multiscale convolutional neural network modules that consist of two parallel convolution kernels with different sizes and gated recurrent units to effectively learn the local and global correlation feature representations; lastly, we use a fully connected layer and a sigmoid unit serving as a classifier to integrate these learned high-order feature representations and generate the final prediction outcomes. On two benchmark nucleosome positioning datasets, DeepNup achieves better experimental evaluation metrics for nucleosome positioning prediction than several state-of-the-art methods. These results demonstrate that DeepNup is a powerful deep learning-based tool that enables one to accurately identify potential nucleosome sequences.
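The hybrid encoding step described in the abstract above (one-hot plus trinucleotide composition) can be illustrated with a minimal standard-library sketch. The abstract does not say how DeepNup combines the two encodings or how it handles ambiguous bases, so returning the two parts separately, normalising trinucleotide counts to frequencies, and mapping non-ACGT bases to an all-zero vector are assumptions made here; the function names are hypothetical.

```python
from itertools import product

BASES = "ACGT"
# All 64 trinucleotides in a fixed (lexicographic) order.
TRINUCLEOTIDES = ["".join(p) for p in product(BASES, repeat=3)]


def one_hot(seq):
    """Per-position one-hot vectors in A, C, G, T order.

    Assumption: any other character (e.g. N) becomes an all-zero vector.
    """
    table = {b: [1.0 if i == j else 0.0 for j in range(4)]
             for i, b in enumerate(BASES)}
    return [list(table.get(b, [0.0] * 4)) for b in seq.upper()]


def trinucleotide_composition(seq):
    """64-dimensional frequency vector over overlapping trinucleotides."""
    seq = seq.upper()
    counts = dict.fromkeys(TRINUCLEOTIDES, 0)
    total = 0
    for i in range(len(seq) - 2):
        kmer = seq[i:i + 3]
        if kmer in counts:  # windows containing non-ACGT bases are skipped
            counts[kmer] += 1
            total += 1
    return [counts[k] / total if total else 0.0 for k in TRINUCLEOTIDES]


def hybrid_encode(seq):
    """Return (L x 4 one-hot matrix, 64-dim composition vector) for one sequence."""
    return one_hot(seq), trinucleotide_composition(seq)


positional, composition = hybrid_encode("ACGTAC")
print(len(positional), "positions,", len(composition), "composition features")
```

The one-hot part preserves position-specific information for the convolutional layers, while the composition vector summarises the whole sequence; how a model fuses them is a design choice left open in this sketch.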
Affiliation(s)
- Yiting Zhou
- School of Computer Science and Technology, Soochow University, Suzhou Ganjiang East Streat 333, Suzhou 215006, China
| | - Tingfang Wu
- School of Computer Science and Technology, Soochow University, Suzhou Ganjiang East Streat 333, Suzhou 215006, China
- Key Lab for Information Processing Technologies, Soochow University, Suzhou Ganjiang East Streat 333, Suzhou 215006, China
- Collaborative Innovation Center of Novel Software Technology and Industrialization, Organization, Nanjing 210000, China
| | - Yelu Jiang
- School of Computer Science and Technology, Soochow University, Suzhou Ganjiang East Streat 333, Suzhou 215006, China
| | - Yan Li
- School of Computer Science and Technology, Soochow University, Suzhou Ganjiang East Streat 333, Suzhou 215006, China
| | - Kailong Li
- School of Computer Science and Technology, Soochow University, Suzhou Ganjiang East Streat 333, Suzhou 215006, China
| | - Lijun Quan
- School of Computer Science and Technology, Soochow University, Suzhou Ganjiang East Streat 333, Suzhou 215006, China
- Key Lab for Information Processing Technologies, Soochow University, Suzhou Ganjiang East Streat 333, Suzhou 215006, China
- Collaborative Innovation Center of Novel Software Technology and Industrialization, Organization, Nanjing 210000, China
| | - Qiang Lyu
- School of Computer Science and Technology, Soochow University, Suzhou Ganjiang East Streat 333, Suzhou 215006, China
- Key Lab for Information Processing Technologies, Soochow University, Suzhou Ganjiang East Streat 333, Suzhou 215006, China
- Collaborative Innovation Center of Novel Software Technology and Industrialization, Organization, Nanjing 210000, China
8
Zhao Y, Shao J, Asmann YW. Assessment and Optimization of Explainable Machine Learning Models Applied to Transcriptomic Data. GENOMICS, PROTEOMICS & BIOINFORMATICS 2022; 20:899-911. [PMID: 35931322 PMCID: PMC10025763 DOI: 10.1016/j.gpb.2022.07.003] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 11/10/2021] [Revised: 06/05/2022] [Accepted: 07/25/2022] [Indexed: 01/12/2023]
Abstract
Explainable artificial intelligence aims to interpret how machine learning models make decisions, and many model explainers have been developed in the computer vision field. However, understanding of the applicability of these model explainers to biological data is still lacking. In this study, we comprehensively evaluated multiple explainers by interpreting pre-trained models for predicting tissue types from transcriptomic data and by identifying the top contributing genes from each sample with the greatest impacts on model prediction. To improve the reproducibility and interpretability of results generated by model explainers, we proposed a series of optimization strategies for each explainer on two different model architectures of multilayer perceptron (MLP) and convolutional neural network (CNN). We observed three groups of explainer and model architecture combinations with high reproducibility. Group II, which contains three model explainers on aggregated MLP models, identified top contributing genes in different tissues that exhibited tissue-specific manifestation and were potential cancer biomarkers. In summary, our work provides novel insights and guidance for exploring biological mechanisms using explainable machine learning models.
Affiliation(s)
- Yongbing Zhao
- Department of Quantitative Health Sciences, Mayo Clinic, Jacksonville, FL 32224, USA.
| | - Jinfeng Shao
- The Laboratory of Malaria and Vector Research, National Institute of Allergy and Infectious Diseases, National Institutes of Health, Rockville, MD 20852, USA
| | - Yan W Asmann
- Department of Quantitative Health Sciences, Mayo Clinic, Jacksonville, FL 32224, USA.
9
Abstract
The tremendous amount of biological sequence data available, combined with the recent methodological breakthrough in deep learning in domains such as computer vision or natural language processing, is leading today to the transformation of bioinformatics through the emergence of deep genomics, the application of deep learning to genomic sequences. We review here the new applications that the use of deep learning enables in the field, focusing on three aspects: the functional annotation of genomes, the sequence determinants of the genome functions and the possibility to write synthetic genomic sequences.
10
Liu J, Zhou D, Jin W. Prediction of nucleosome dynamic interval based on long-short-term memory network (LSTM). J Bioinform Comput Biol 2022; 20:2250009. [PMID: 35603935 DOI: 10.1142/s0219720022500093] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 02/21/2025]
Abstract
Nucleosome localization is a dynamic process and consists of nucleosome dynamic intervals (NDIs). We preprocessed nucleosome sequence data as time series data (TSD) and developed a long short-term memory (LSTM) network model trained on the TSD (the LSTM-TSD model), using iterative training and feature learning, that predicts NDIs with high accuracy. The Sn, Sp, Acc, and MCC of the resulting LSTM model are 91.88%, 92.72%, 92.30%, and 84.61%, respectively. The LSTM model could precisely predict the NDIs of yeast 16 chromosome. The NDIs contain 90.29% of nucleosome core DNA and 91.20% of nucleosome central sites, indicating that NDIs have high confidence. We found that the binding sites of transcriptional proteins and other proteins lie outside NDIs rather than within them. These results are important for analysis of nucleosome localization and gene transcriptional regulation.
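The abstract above reports Sn, Sp, Acc, and MCC without defining them. Assuming they carry their usual meaning for a binary classifier (sensitivity, specificity, accuracy, and Matthews correlation coefficient), they can be computed from a confusion matrix as sketched below. The function name and the example counts are hypothetical and are not taken from the paper.

```python
from math import sqrt


def classification_metrics(tp, fn, tn, fp):
    """Standard binary-classification metrics from confusion-matrix counts.

    Assumes at least one positive and one negative sample, so that
    sensitivity and specificity are defined.
    """
    sn = tp / (tp + fn)                      # sensitivity (recall on positives)
    sp = tn / (tn + fp)                      # specificity (recall on negatives)
    acc = (tp + tn) / (tp + fn + tn + fp)    # overall accuracy
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is conventionally set to 0 when a marginal total is zero.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"Sn": sn, "Sp": sp, "Acc": acc, "MCC": mcc}


# Hypothetical counts for illustration only.
metrics = classification_metrics(tp=90, fn=10, tn=80, fp=20)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Unlike accuracy, MCC uses all four cells of the confusion matrix, so it stays informative when the positive and negative classes are unbalanced.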
Affiliation(s)
- Jianli Liu
- School of Water Resource and Environment Engineering, China University of Geosciences (Beijing), Beijing 100083, P. R. China
| | - Deliang Zhou
- Beijing Zhongdianyida Technology Co., Ltd, Beijing 100190, P. R. China
| | - Wen Jin
- Department of Clinical Medical Research Center/Inner Mongolia, Key Laboratory of Gene Regulation of the Metabolic Disease, Inner Mongolia People's Hospital, Hohhot 010010, P. R. China
11
Galaxy Dnpatterntools for Computational Analysis of Nucleosome Positioning Sequence Patterns. Int J Mol Sci 2022; 23:ijms23094869. [PMID: 35563261 PMCID: PMC9102330 DOI: 10.3390/ijms23094869] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/30/2022] [Revised: 04/25/2022] [Accepted: 04/26/2022] [Indexed: 01/25/2023] Open
Abstract
Nucleosomes are basic units of DNA packing in eukaryotes. Their structure is well conserved from yeast to human and consists of the histone octamer core and 147 bp of DNA wrapped around it. Nucleosomes are bound to a majority of the eukaryotic genomic DNA, including its regulatory regions. Hence, they also play a major role in gene regulation, for which their precise positioning on DNA is essential. In the present paper, we describe Galaxy dnpatterntools, a software package for nucleosome DNA sequence analysis and mapping. This software will be useful for computational biology practitioners conducting more in-depth studies of gene regulatory mechanisms.
12
Han GS, Li Q, Li Y. Nucleosome positioning based on DNA sequence embedding and deep learning. BMC Genomics 2022; 23:301. [PMID: 35418074 PMCID: PMC9006412 DOI: 10.1186/s12864-022-08508-6] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 03/28/2022] [Accepted: 03/28/2022] [Indexed: 11/25/2022] Open
Abstract
Background: Nucleosome positioning is the precise determination of the location of nucleosomes on a DNA sequence. With the continuous advancement of biotechnology and computer technology, biological data is showing explosive growth. It is of practical significance to develop an efficient nucleosome positioning algorithm. Convolutional neural networks (CNNs) can capture local features in DNA sequences but ignore the order of bases, whereas a bidirectional recurrent neural network can make up for this shortcoming and extract the long-term dependency features of a DNA sequence. Results: In this work, we use word vectors to represent DNA sequences and propose three new deep learning models for nucleosome positioning; the integrative model NP_CBiR reaches the best prediction performance. The overall accuracies of NP_CBiR on H. sapiens, C. elegans, and D. melanogaster datasets are 86.18%, 89.39%, and 85.55%, respectively. Conclusions: Benefiting from its different network structures, NP_CBiR can effectively extract local features and base-order features of DNA sequences, and can thus be considered a complementary tool for nucleosome positioning.
Affiliation(s)
- Guo-Sheng Han
- Department of Mathematics and Computational Science, Xiangtan University, Xiangtan, 411105, Hunan, China; Key Laboratory of Intelligent Computing and Information Processing of Ministry of Education and Hunan Key Laboratory for Computation and Simulation in Science and Engineering, Xiangtan University, Xiangtan, 411105, Hunan, China
| | - Qi Li
- Department of Mathematics and Computational Science, Xiangtan University, Xiangtan, 411105, Hunan, China; Xiangtan Medicine Health Vocational College, Xiangtan, 411102, Hunan, China
| | - Ying Li
- Department of Mathematics and Computational Science, Xiangtan University, Xiangtan, 411105, Hunan, China; Key Laboratory of Intelligent Computing and Information Processing of Ministry of Education and Hunan Key Laboratory for Computation and Simulation in Science and Engineering, Xiangtan University, Xiangtan, 411105, Hunan, China
13
Li K, Carroll M, Vafabakhsh R, Wang XA, Wang JP. OUP accepted manuscript. Nucleic Acids Res 2022; 50:3142-3154. [PMID: 35288750 PMCID: PMC8989542 DOI: 10.1093/nar/gkac162] [Citation(s) in RCA: 16] [Impact Index Per Article: 5.3] [Received: 12/23/2021] [Revised: 02/16/2022] [Accepted: 02/23/2022] [Indexed: 11/16/2022] Open
Abstract
DNA mechanical properties play a critical role in every aspect of DNA-dependent biological processes. Recently a high throughput assay named loop-seq has been developed to quantify the intrinsic bendability of a massive number of DNA fragments simultaneously. Using the loop-seq data, we develop a software tool, DNAcycP, based on a deep-learning approach for intrinsic DNA cyclizability prediction. We demonstrate DNAcycP predicts intrinsic DNA cyclizability with high fidelity compared to the experimental data. Using an independent dataset from in vitro selection for enrichment of loopable sequences, we further verified the predicted cyclizability score, termed C-score, can well distinguish DNA fragments with different loopability. We applied DNAcycP to multiple species and compared the C-scores with available high-resolution chemical nucleosome maps. Our analyses showed that both yeast and mouse genomes share a conserved feature of high DNA bendability spanning nucleosome dyads. Additionally, we extended our analysis to transcription factor binding sites and surprisingly found that the cyclizability is substantially elevated at CTCF binding sites in the mouse genome. We further demonstrate this distinct mechanical property is conserved across mammalian species and is inherent to CTCF binding DNA motif.
Affiliation(s)
- Keren Li
- Department of Statistics, Northwestern University, 633 Clark Street, Evanston, IL 60208, USA
- NSF-Simons Center for Quantitative Biology, Northwestern University, Evanston, IL 60208, USA
| | - Matthew Carroll
- Weinberg College IT Solutions (WITS), Northwestern University, 633 Clark Street, Evanston, IL 60208, USA
| | - Reza Vafabakhsh
- Department of Molecular Biosciences, Northwestern University, Evanston, IL 60208, USA
| | - Xiaozhong A Wang
- Correspondence may also be addressed to Xiaozhong A. Wang. Tel: +1 847 467 4897;
| | - Ji-Ping Wang
- To whom correspondence should be addressed. Tel: +1 847 467 6896;
| |
|
14
|
Shi K, Lin W, Zhao XM. Identifying Molecular Biomarkers for Diseases With Machine Learning Based on Integrative Omics. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2021; 18:2514-2525. [PMID: 32305934 DOI: 10.1109/tcbb.2020.2986387] [Citation(s) in RCA: 9] [Impact Index Per Article: 2.3] [Indexed: 06/11/2023]
Abstract
Molecular biomarkers are certain molecules or sets of molecules that can help in the diagnosis or prognosis of diseases or disorders. In the past decades, thanks to the advances in high-throughput technologies, a huge amount of molecular 'omics' data, e.g., transcriptomics and proteomics, has been accumulated. The availability of these omics data makes it possible to screen biomarkers for diseases or disorders. Accordingly, a number of computational approaches have been developed to identify biomarkers by exploring the omics data. In this review, we present a comprehensive survey on the recent progress of identification of molecular biomarkers with machine learning approaches. Specifically, we categorize the machine learning approaches into supervised, unsupervised and recommendation approaches, where the biomarkers include single genes, gene sets and small gene networks. In addition, we further discuss potential problems underlying biomedical data that may pose challenges for machine learning, and provide possible directions for future biomarker identification.
|
15
|
Li JY, Jin S, Tu XM, Ding Y, Gao G. Identifying complex motifs in massive omics data with a variable-convolutional layer in deep neural network. Brief Bioinform 2021; 22:6312656. [PMID: 34219140 DOI: 10.1093/bib/bbab233] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Received: 02/25/2021] [Revised: 05/25/2021] [Accepted: 05/28/2021] [Indexed: 01/10/2023]
Abstract
Motif identification is among the most common and essential computational tasks for bioinformatics and genomics. Here we proposed a novel convolutional layer for deep neural network, named variable convolutional (vConv) layer, for effective motif identification in high-throughput omics data by learning kernel length from data adaptively. Empirical evaluations on DNA-protein binding and DNase footprinting cases well demonstrated that vConv-based networks have superior performance to their convolutional counterparts regardless of model complexity. Meanwhile, vConv could be readily integrated into multi-layer neural networks as an 'in-place replacement' of canonical convolutional layer. All source codes are freely available on GitHub for academic usage.
Affiliation(s)
- Jing-Yi Li
- Biomedical Pioneering Innovation Center & Beijing Advanced Innovation Center for Genomics, Center for Bioinformatics, and State Key Laboratory of Protein and Plant Gene Research at School of Life Sciences, Peking University, Beijing 100871, China
| | - Shen Jin
- Computational Biology Department, Carnegie Mellon University, Pittsburgh, PA 15213, USA
| | - Xin-Ming Tu
- Biomedical Pioneering Innovation Center & Beijing Advanced Innovation Center for Genomics, Center for Bioinformatics, and State Key Laboratory of Protein and Plant Gene Research at School of Life Sciences, Peking University, Beijing 100871, China
| | - Yang Ding
- Biomedical Pioneering Innovation Center & Beijing Advanced Innovation Center for Genomics, Center for Bioinformatics, and State Key Laboratory of Protein and Plant Gene Research at School of Life Sciences, Peking University, Beijing 100871, China
| | - Ge Gao
- Biomedical Pioneering Innovation Center & Beijing Advanced Innovation Center for Genomics, Center for Bioinformatics, and State Key Laboratory of Protein and Plant Gene Research at School of Life Sciences, Peking University, Beijing 100871, China
| |
|
16
|
Han GS, Li Q, Li Y. Comparative analysis and prediction of nucleosome positioning using integrative feature representation and machine learning algorithms. BMC Bioinformatics 2021; 22:129. [PMID: 34078256 PMCID: PMC8170966 DOI: 10.1186/s12859-021-04006-w] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 02/04/2021] [Accepted: 02/08/2021] [Indexed: 12/01/2022]
Abstract
Background: The nucleosome plays an important role in genome expression, DNA replication, DNA repair and transcription. Therefore, research on nucleosome positioning has received extensive attention. Considering the diversity of DNA sequence representation methods, we tried to integrate multiple features and analyze their effect on nucleosome positioning analysis. This process can also deepen our understanding of the theoretical analysis of nucleosome positioning. Results: Here, we not only used frequency chaos game representation (FCGR) to construct DNA sequence features, but also integrated it with other features and adopted the principal component analysis (PCA) algorithm. Support vector machine (SVM), extreme learning machine (ELM), extreme gradient boosting (XGBoost), multilayer perceptron (MLP) and convolutional neural networks (CNN) were each used as predictors for nucleosome positioning. The prediction quality of the integrated feature vector is significantly superior to that of a single feature. After using PCA to reduce the feature dimension, the prediction quality on the H. sapiens dataset was significantly improved. Conclusions: Comparative analysis and prediction on H. sapiens, C. elegans, D. melanogaster and S. cerevisiae datasets demonstrate that the application of FCGR to nucleosome positioning is feasible, and we also found that integrative feature representation performs better.
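The frequency chaos game representation named in this abstract can be sketched as a grid of k-mer counts. The following is a minimal illustration only, not the authors' implementation: the function name `fcgr`, the corner layout for A, C, G and T, and the choice to skip k-mers with ambiguous bases are all assumptions made for the example.

```python
def fcgr(seq, k=2):
    """Return a 2^k x 2^k grid of k-mer counts arranged by the chaos game.

    Assumed corner layout (x_bit, y_bit): A=(0,0), C=(0,1), G=(1,1), T=(1,0).
    """
    corner = {"A": (0, 0), "C": (0, 1), "G": (1, 1), "T": (1, 0)}
    size = 2 ** k
    grid = [[0] * size for _ in range(size)]
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if any(base not in corner for base in kmer):
            continue  # assumption: skip k-mers containing N or other symbols
        x = y = 0
        # In the chaos game iteration the last base of a k-mer selects the
        # coarsest quadrant, so it supplies the most significant bit.
        for base in reversed(kmer):
            bx, by = corner[base]
            x = (x << 1) | bx
            y = (y << 1) | by
        grid[y][x] += 1
    return grid
```

Flattening the grid row by row gives a fixed-length feature vector of 4^k counts, which is the kind of input a classifier such as an SVM or CNN could consume.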
Affiliation(s)
- Guo-Sheng Han
- Department of Mathematics and Computational Science, Xiangtan University, Xiangtan, 411105, Hunan, China.,Key Laboratory of Intelligent Computing and Information Processing of Ministry of Education and Hunan Key Laboratory for Computation and Simulation in Science and Engineering, Xiangtan University, Xiangtan, 411105, Hunan, China
| | - Qi Li
- Department of Mathematics and Computational Science, Xiangtan University, Xiangtan, 411105, Hunan, China.,Key Laboratory of Intelligent Computing and Information Processing of Ministry of Education and Hunan Key Laboratory for Computation and Simulation in Science and Engineering, Xiangtan University, Xiangtan, 411105, Hunan, China
| | - Ying Li
- Department of Mathematics and Computational Science, Xiangtan University, Xiangtan, 411105, Hunan, China.,Key Laboratory of Intelligent Computing and Information Processing of Ministry of Education and Hunan Key Laboratory for Computation and Simulation in Science and Engineering, Xiangtan University, Xiangtan, 411105, Hunan, China
| |
|
17
|
Guo Y, Zhou D, Li W, Cao J, Nie R, Xiong L, Ruan X. Identifying polyadenylation signals with biological embedding via self-attentive gated convolutional highway networks. Appl Soft Comput 2021. [DOI: 10.1016/j.asoc.2021.107133] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Indexed: 12/31/2022]
|
18
|
Routhier E, Pierre E, Khodabandelou G, Mozziconacci J. Genome-wide prediction of DNA mutation effect on nucleosome positions for yeast synthetic genomics. Genome Res 2021; 31:317-326. [PMID: 33355297 PMCID: PMC7849406 DOI: 10.1101/gr.264416.120] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Received: 04/15/2020] [Accepted: 12/11/2020] [Indexed: 12/15/2022]
Abstract
Genetically modified genomes are often used today in many areas of fundamental and applied research. In many studies, coding or noncoding regions are modified in order to change protein sequences or gene expression levels. Modifying one or several nucleotides in a genome can also lead to unexpected changes in the epigenetic regulation of genes. When designing a synthetic genome with many mutations, it would thus be very informative to be able to predict the effect of these mutations on chromatin. We develop here a deep learning approach that quantifies the effect of every possible single mutation on nucleosome positions on the full Saccharomyces cerevisiae genome. This type of annotation track can be used when designing a modified S. cerevisiae genome. We further highlight how this track can provide new insights on the sequence-dependent mechanisms that drive nucleosomes' positions in vivo.
Affiliation(s)
- Etienne Routhier
- Sorbonne Universite, CNRS, Laboratoire de Physique Théorique de la Matière Condensée, LPTMC, Paris F-75252, France
| | - Edgard Pierre
- Sorbonne Universite, CNRS, Laboratoire de Physique Théorique de la Matière Condensée, LPTMC, Paris F-75252, France
| | | | - Julien Mozziconacci
- Sorbonne Universite, CNRS, Laboratoire de Physique Théorique de la Matière Condensée, LPTMC, Paris F-75252, France
- Muséum National d'Histoire Naturelle, Structure et Instabilité des Génomes, UMR7196, Paris 75231, France
- Institut Universitaire de France, Paris 75005, France
| |
|
19
|
Jing F, Zhang SW, Cao Z, Zhang S. An Integrative Framework for Combining Sequence and Epigenomic Data to Predict Transcription Factor Binding Sites Using Deep Learning. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2021; 18:355-364. [PMID: 30835229 DOI: 10.1109/tcbb.2019.2901789] [Citation(s) in RCA: 12] [Impact Index Per Article: 3.0] [Indexed: 06/09/2023]
Abstract
Knowing the transcription factor binding sites (TFBSs) is essential for modeling the underlying binding mechanisms and follow-up cellular functions. Convolutional neural networks (CNNs) have outperformed conventional methods in predicting TFBSs from the primary DNA sequence. In addition to DNA sequences, histone modifications and chromatin accessibility are also important factors influencing their activity. They have been explored to predict TFBSs recently. However, current methods rarely take into account histone modifications and chromatin accessibility using CNN in an integrative framework. To this end, we developed a general CNN model to integrate these data for predicting TFBSs. We systematically benchmarked a series of architecture variants by changing network structure in terms of width and depth, and explored the effects of sample length at flanking regions. We evaluated the performance of the three types of data and their combinations using 256 ChIP-seq experiments and also compared it with competing machine learning methods. We find that contributions from these three types of data are complementary to each other. Moreover, the integrative CNN framework is superior to traditional machine learning methods with significant improvements.
|
20
|
Amato D, Bosco GL, Rizzo R. CORENup: a combination of convolutional and recurrent deep neural networks for nucleosome positioning identification. BMC Bioinformatics 2020; 21:326. [PMID: 32938377 PMCID: PMC7493859 DOI: 10.1186/s12859-020-03627-x] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Received: 06/16/2020] [Accepted: 06/22/2020] [Indexed: 12/14/2022]
Abstract
Background: Nucleosomes wrap the DNA inside the nucleus of the eukaryotic cell and regulate its transcription. Several studies indicate that nucleosome positioning is determined by the combined effects of several factors, including DNA sequence organization. Interestingly, the identification of nucleosomes on a genomic scale has been successfully performed by computational methods using DNA sequence as input data. Results: In this work, we propose CORENup, a deep learning model for nucleosome identification. CORENup processes a DNA sequence as input using one-hot representation and combines, in a parallel fashion, a fully convolutional neural network and a recurrent layer. These two parallel levels are devoted to catching both non-periodic and periodic DNA string features. A dense layer is devoted to their combination to give a final classification. Conclusions: Results computed on public data sets of different organisms show that CORENup is a state-of-the-art methodology for nucleosome positioning identification based on a deep neural network architecture. The comparisons have been carried out using two groups of datasets, currently adopted by the best performing methods, and CORENup has shown top performance both in terms of classification metrics and elapsed computation time.
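The one-hot representation mentioned in this abstract maps each base to a four-element indicator vector. The sketch below shows the general idea under stated assumptions: the column order A, C, G, T, the all-zero row for ambiguous bases, and the function name `one_hot` are illustrative choices, not details taken from CORENup.

```python
def one_hot(seq):
    """Encode a DNA string as a list of 4-element rows (assumed order: A, C, G, T)."""
    index = {"A": 0, "C": 1, "G": 2, "T": 3}
    encoded = []
    for base in seq.upper():
        row = [0, 0, 0, 0]
        if base in index:
            row[index[base]] = 1
        # Assumption: any other symbol (for example N) stays as an all-zero row.
        encoded.append(row)
    return encoded
```

A sequence of length L becomes an L x 4 matrix, which is the usual input shape for a one-dimensional convolutional or recurrent layer.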
Affiliation(s)
- Domenico Amato
- Dipartimento di Matematica e Informatica, Università degli studi di Palermo, Via Archirafi, 34, Palermo, 90123, Italy
| | - Giosue' Lo Bosco
- Dipartimento di Matematica e Informatica, Università degli studi di Palermo, Via Archirafi, 34, Palermo, 90123, Italy.,Dipartimento di Scienze per l'Innovazione tecnologica, Istituto Euro-Mediterraneo di Scienza e Tecnologia, Via Michele Miraglia, 20, Palermo, 9039, Italy
| | - Riccardo Rizzo
- CNR-ICAR, National Research Council of Italy, Via Ugo La Malfa, 153, Palermo, 90146, Italy
| |
|
21
|
Urso A, Fiannaca A, La Rosa M, La Paglia L, Lo Bosco G, Rizzo R. BITS2019: the sixteenth annual meeting of the Italian society of bioinformatics. BMC Bioinformatics 2020; 21:363. [PMID: 32938383 PMCID: PMC7493178 DOI: 10.1186/s12859-020-03708-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/21/2022]
Abstract
The 16th Annual Meeting of the Bioinformatics Italian Society was held in Palermo, Italy, on June 26-28, 2019. More than 80 scientific contributions were presented, including 4 keynote lectures, 31 oral communications and 49 posters. Also, three workshops were organised before and during the meeting. Full papers from some of the works presented in Palermo were submitted for this Supplement of BMC Bioinformatics. Here, we provide an overview of the meeting's aims and scope. We also briefly introduce selected papers that have been accepted for publication in this Supplement, for a complete presentation of the outcomes of the meeting.
Affiliation(s)
- Alfonso Urso
- ICAR-CNR, Institute for high performance computing and networking, National Research Council of Italy, Palermo, 90146, Italy.
| | - Antonino Fiannaca
- ICAR-CNR, Institute for high performance computing and networking, National Research Council of Italy, Palermo, 90146, Italy
| | - Massimo La Rosa
- ICAR-CNR, Institute for high performance computing and networking, National Research Council of Italy, Palermo, 90146, Italy
| | - Laura La Paglia
- ICAR-CNR, Institute for high performance computing and networking, National Research Council of Italy, Palermo, 90146, Italy
| | - Giosue' Lo Bosco
- Department of Mathematics and Computer Science, University of Palermo, Palermo, 90128, Italy
| | - Riccardo Rizzo
- ICAR-CNR, Institute for high performance computing and networking, National Research Council of Italy, Palermo, 90146, Italy
| |
|
22
|
Cui Y, Xu Z, Li J. [Identification of nucleosome positioning using support vector machine method based on comprehensive DNA sequence feature]. Sheng Wu Yi Xue Gong Cheng Xue Za Zhi (Journal of Biomedical Engineering) 2020; 37:496-501. [PMID: 32597092 PMCID: PMC10319573 DOI: 10.7507/1001-5515.201911064] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/23/2019] [Indexed: 11/03/2022]
Abstract
In this article, a model for nucleosome sequences was constructed based on z-curve theory and the position weight matrix (PWM). The nucleosome sequence dataset was transformed into three-dimensional coordinates, the PWM of the nucleosome sequences was calculated, and the similarity score was obtained. After integrating them, a nucleosome feature model based on the comprehensive DNA sequences was obtained and named CSeqFM. We calculated the Euclidean distance between nucleosome sequence candidates or linker sequences and the CSeqFM model as the feature dataset, and fed the feature datasets into a support vector machine (SVM) for training and testing by ten-fold cross-validation. The results showed that the sensitivity, specificity, accuracy and Matthews correlation coefficient (MCC) of identifying nucleosome positioning for S. cerevisiae were 97.1%, 96.9%, 94.2% and 0.89, respectively, and the area under the receiver operating characteristic curve (AUC) was 0.9801. Compared with another z-curve method, our method identified nucleosomes better and performed better on each evaluation metric. The CSeqFM method was applied to identify nucleosome positioning for three other species, C. elegans, H. sapiens and D. melanogaster. The results showed that the AUCs of the three species were all higher than 0.90, and the CSeqFM method also showed better stability and effectiveness compared with the iNuc-STNC and iNuc-PseKNC methods, which further demonstrates that the CSeqFM method has strong reliability and good identification performance.
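The transformation of a sequence into three-dimensional coordinates can be illustrated with the standard cumulative Z-curve definition. This is a minimal sketch, not the CSeqFM code: the function name `z_curve` and the handling of ambiguous bases are assumptions made for the example.

```python
def z_curve(seq):
    """Return the cumulative Z-curve points (x, y, z) for each prefix of seq.

    With A, C, G, T the running base counts:
      x = (A + G) - (C + T)   purine vs pyrimidine
      y = (A + C) - (G + T)   amino vs keto
      z = (A + T) - (C + G)   weak vs strong hydrogen bonding
    """
    counts = {"A": 0, "C": 0, "G": 0, "T": 0}
    points = []
    for base in seq.upper():
        if base in counts:
            counts[base] += 1
        # Assumption: an ambiguous base leaves the counts unchanged,
        # so the curve repeats the previous point at that position.
        a, c, g, t = counts["A"], counts["C"], counts["G"], counts["T"]
        points.append((a + g - c - t, a + c - g - t, a + t - c - g))
    return points
```

Each base moves the curve by one unit along every axis, so the three coordinate series together encode the same information as the original sequence.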
Affiliation(s)
- Ying Cui
- Electronic Engineering College, Heilongjiang University, Harbin 150080, P.R. China
- School of Bioinformatics Sciences and Technology, Harbin Medical University, Harbin 150081, P.R. China
| | - Zelong Xu
- Electronic Engineering College, Heilongjiang University, Harbin 150080, P.R. China
| | - Jianzhong Li
- Electronic Engineering College, Heilongjiang University, Harbin 150080, P.R. China
- School of Bioinformatics Sciences and Technology, Harbin Medical University, Harbin 150081, P.R. China
| |
|
23
|
|
24
|
Zhao Y, Wang J, Liang F, Liu Y, Wang Q, Zhang H, Jiang M, Zhang Z, Zhao W, Bao Y, Zhang Z, Wu J, Asmann YW, Li R, Xiao J. NucMap: a database of genome-wide nucleosome positioning map across species. Nucleic Acids Res 2019; 47:D163-D169. [PMID: 30335176 PMCID: PMC6323900 DOI: 10.1093/nar/gky980] [Citation(s) in RCA: 24] [Impact Index Per Article: 4.8] [Received: 08/14/2018] [Accepted: 10/10/2018] [Indexed: 12/16/2022]
Abstract
The dynamics of nucleosome positioning affect chromatin state, transcription and all other biological processes occurring on genomic DNA. As MNase-Seq has been used to depict nucleosome positioning maps in eukaryotes in recent years, nucleosome positioning data is increasing dramatically. To facilitate the usage of published data across studies, we developed a database named nucleosome positioning map (NucMap, http://bigd.big.ac.cn/nucmap). NucMap includes 798 experimental datasets from 477 samples across 15 species. With a series of functional modules, users can search the profile of nucleosome positioning at the promoter region of each gene across all samples and perform enrichment analysis on nucleosome positioning data in all genomic regions. A nucleosome browser was built to visualize the profiles of nucleosome positioning. Users can also visualize multiple sources of omics data with the nucleosome browser and make side-by-side comparisons. All processed data in the database are freely available. NucMap is the first comprehensive nucleosome positioning platform and it will serve as an important resource to facilitate the understanding of chromatin regulation.
Affiliation(s)
- Yongbing Zhao
- Department of Health Sciences Research, Mayo Clinic, Jacksonville, FL 32224, USA
| | - Jinyue Wang
- BIG Data Center, Beijing Institute of Genomics, Chinese Academy of Sciences, Beijing 100101, China.,CAS Key Laboratory of Genome Sciences and Information, Beijing Institute of Genomics, Chinese Academy of Sciences, Beijing 100101, China.,College of Life Sciences, University of Chinese Academy of Sciences, Beijing 100049, China
| | - Fang Liang
- BIG Data Center, Beijing Institute of Genomics, Chinese Academy of Sciences, Beijing 100101, China
| | - Yanxia Liu
- BIG Data Center, Beijing Institute of Genomics, Chinese Academy of Sciences, Beijing 100101, China.,CAS Key Laboratory of Genome Sciences and Information, Beijing Institute of Genomics, Chinese Academy of Sciences, Beijing 100101, China.,College of Life Sciences, University of Chinese Academy of Sciences, Beijing 100049, China
| | - Qi Wang
- BIG Data Center, Beijing Institute of Genomics, Chinese Academy of Sciences, Beijing 100101, China.,CAS Key Laboratory of Genome Sciences and Information, Beijing Institute of Genomics, Chinese Academy of Sciences, Beijing 100101, China.,College of Life Sciences, University of Chinese Academy of Sciences, Beijing 100049, China
| | - Hao Zhang
- BIG Data Center, Beijing Institute of Genomics, Chinese Academy of Sciences, Beijing 100101, China.,CAS Key Laboratory of Genome Sciences and Information, Beijing Institute of Genomics, Chinese Academy of Sciences, Beijing 100101, China.,College of Life Sciences, University of Chinese Academy of Sciences, Beijing 100049, China
| | - Meiye Jiang
- BIG Data Center, Beijing Institute of Genomics, Chinese Academy of Sciences, Beijing 100101, China.,CAS Key Laboratory of Genome Sciences and Information, Beijing Institute of Genomics, Chinese Academy of Sciences, Beijing 100101, China
| | - Zhewen Zhang
- BIG Data Center, Beijing Institute of Genomics, Chinese Academy of Sciences, Beijing 100101, China.,CAS Key Laboratory of Genome Sciences and Information, Beijing Institute of Genomics, Chinese Academy of Sciences, Beijing 100101, China
| | - Wenming Zhao
- BIG Data Center, Beijing Institute of Genomics, Chinese Academy of Sciences, Beijing 100101, China.,CAS Key Laboratory of Genome Sciences and Information, Beijing Institute of Genomics, Chinese Academy of Sciences, Beijing 100101, China
| | - Yiming Bao
- BIG Data Center, Beijing Institute of Genomics, Chinese Academy of Sciences, Beijing 100101, China.,CAS Key Laboratory of Genome Sciences and Information, Beijing Institute of Genomics, Chinese Academy of Sciences, Beijing 100101, China
| | - Zhang Zhang
- BIG Data Center, Beijing Institute of Genomics, Chinese Academy of Sciences, Beijing 100101, China.,CAS Key Laboratory of Genome Sciences and Information, Beijing Institute of Genomics, Chinese Academy of Sciences, Beijing 100101, China.,Collaborative Innovation Center of Genetics and Development, Fudan University, Shanghai 200438, China
| | - Jiayan Wu
- CAS Key Laboratory of Genome Sciences and Information, Beijing Institute of Genomics, Chinese Academy of Sciences, Beijing 100101, China
| | - Yan W Asmann
- Department of Health Sciences Research, Mayo Clinic, Jacksonville, FL 32224, USA
| | - Rujiao Li
- BIG Data Center, Beijing Institute of Genomics, Chinese Academy of Sciences, Beijing 100101, China
| | - Jingfa Xiao
- BIG Data Center, Beijing Institute of Genomics, Chinese Academy of Sciences, Beijing 100101, China.,CAS Key Laboratory of Genome Sciences and Information, Beijing Institute of Genomics, Chinese Academy of Sciences, Beijing 100101, China.,College of Life Sciences, University of Chinese Academy of Sciences, Beijing 100049, China.,Collaborative Innovation Center of Genetics and Development, Fudan University, Shanghai 200438, China
| |
|
25
|
Wang L, Zhang J. Prediction of sgRNA on-target activity in bacteria by deep learning. BMC Bioinformatics 2019; 20:517. [PMID: 31651233 PMCID: PMC6814057 DOI: 10.1186/s12859-019-3151-4] [Citation(s) in RCA: 14] [Impact Index Per Article: 2.3] [Received: 05/26/2019] [Accepted: 10/04/2019] [Indexed: 12/26/2022]
Abstract
Background: One of the main challenges for the CRISPR-Cas9 system is selecting optimal single-guide RNAs (sgRNAs). Recently, deep learning has enhanced sgRNA prediction in eukaryotes. However, the prokaryotic chromatin structure is different from eukaryotes, so models trained on eukaryotes may not apply to prokaryotes. Results: We designed and implemented a convolutional neural network to predict sgRNA activity in Escherichia coli. The network was trained and tested on the recently released sgRNA activity dataset. Our convolutional neural network achieved excellent performance, yielding average Spearman correlation coefficients of 0.5817, 0.7105, and 0.3602, respectively, for Cas9, eSpCas9 and Cas9 with a recA coding region deletion. We confirmed that the sgRNA prediction models trained on prokaryotes do not apply to eukaryotes and vice versa. We adopted perturbation-based approaches to analyze distinct biological patterns between prokaryotic and eukaryotic editing. Then, we improved the predictive performance of the prokaryotic Cas9 system by transfer learning. Finally, we determined that potential off-target scores accumulated on a genome-wide scale affect on-target activity, which could slightly improve on-target predictive performance. Conclusions: We developed convolutional neural networks to predict sgRNA activity for wild type and mutant Cas9 in prokaryotes. Our results show that the prediction accuracy of our method is improved over state-of-the-art models.
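The Spearman correlation coefficients reported in this abstract measure how well the ranking of predicted activities matches the ranking of measured ones. The sketch below shows one way to compute it with only the standard library, as the Pearson correlation of average ranks. The function names and the choice to return NaN for constant input are assumptions for illustration and do not come from the paper.

```python
import math

def average_ranks(values):
    """Rank values from 1 to n; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # sorted positions i..j share one rank
        for pos in range(i, j + 1):
            ranks[order[pos]] = mean_rank
        i = j + 1
    return ranks

def spearman(predicted, measured):
    """Spearman correlation: Pearson correlation computed on average ranks."""
    if len(predicted) != len(measured) or len(predicted) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    rx, ry = average_ranks(predicted), average_ranks(measured)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        return float("nan")  # assumption: undefined for constant input
    return cov / math.sqrt(var_x * var_y)
```

Because only ranks matter, the coefficient is unchanged by any monotonic rescaling of the predicted scores, which makes it a natural choice when the goal is to rank candidate sgRNAs rather than to predict absolute activity.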
Affiliation(s)
- Lei Wang
- School of Life Science, Beijing Institute of Technology, South Zhongguancun Street, Beijing, 100081 China
| | - Juhua Zhang
- School of Life Science, Beijing Institute of Technology, South Zhongguancun Street, Beijing, 100081 China
- Key Laboratory of Convergence Medical Engineering System and Healthcare Technology, The Ministry of Industry and Information Technology, Beijing Institute of Technology, Beijing, China
| |
|
26
|
ZCMM: A Novel Method Using Z-Curve Theory-Based and Position Weight Matrix for Predicting Nucleosome Positioning. Genes (Basel) 2019; 10:genes10100765. [PMID: 31569414 PMCID: PMC6827144 DOI: 10.3390/genes10100765] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Received: 08/17/2019] [Revised: 09/25/2019] [Accepted: 09/26/2019] [Indexed: 02/04/2023]
Abstract
Nucleosomes are the basic units of eukaryotes. The accurate positioning of nucleosomes plays a significant role in understanding many biological processes such as transcriptional regulation mechanisms and DNA replication and repair. Here, we describe the development of a novel method, termed ZCMM, based on Z-curve theory and position weight matrix (PWM). The ZCMM was trained and tested using the nucleosomal and linker sequences determined by support vector machine (SVM) in Saccharomyces cerevisiae (S. cerevisiae), and experimental results showed that the sensitivity (Sn), specificity (Sp), accuracy (Acc), and Matthews correlation coefficient (MCC) values for ZCMM were 91.40%, 96.56%, 96.75%, and 0.88, respectively, and the average area under the receiver operating characteristic curve (AUC) value was 0.972. A ZCMM predictor was developed to predict nucleosome positioning in Homo sapiens (H. sapiens), Caenorhabditis elegans (C. elegans), and Drosophila melanogaster (D. melanogaster) genomes, and the accuracy (Acc) values were 77.72%, 85.34%, and 93.62%, respectively. The maximum AUC values of the four species were 0.982, 0.861, 0.912 and 0.911, respectively. Another independent dataset for S. cerevisiae was used to predict nucleosome positioning. Compared with the results of Wu's method, it was found that the Sn, Sp, Acc, and MCC of ZCMM results for S. cerevisiae were all higher, reaching 96.72%, 96.54%, 94.10%, and 0.88. Compared with Guo's method 'iNuc-PseKNC', the results of ZCMM for D. melanogaster were better. Meanwhile, the ZCMM was compared with some experimental data in vitro and in vivo for S. cerevisiae, and the results showed that the nucleosomes predicted by ZCMM were highly consistent with those confirmed by these experiments. Therefore, it was further confirmed that the ZCMM method has good accuracy and reliability in predicting nucleosome positioning.
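The four evaluation measures quoted in this abstract (Sn, Sp, Acc and MCC) all derive from the counts of a binary confusion matrix. The following sketch uses their standard definitions; the function name `classification_metrics` and the convention of returning 0.0 for MCC when its denominator is zero are assumptions, and the example counts in the note are made up for illustration rather than taken from the paper.

```python
import math

def classification_metrics(tp, fp, tn, fn):
    """Compute Sn, Sp, Acc and MCC from binary confusion-matrix counts.

    Assumes both classes are present, i.e. tp + fn > 0 and tn + fp > 0.
    """
    sn = tp / (tp + fn)                      # sensitivity (recall on positives)
    sp = tn / (tn + fp)                      # specificity (recall on negatives)
    acc = (tp + tn) / (tp + fp + tn + fn)    # overall accuracy
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Assumption: report 0.0 when the MCC denominator vanishes.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"Sn": sn, "Sp": sp, "Acc": acc, "MCC": mcc}
```

For instance, `classification_metrics(tp=90, fp=5, tn=95, fn=10)` describes a hypothetical test set of 100 nucleosomal and 100 linker sequences with 90 and 95 of them classified correctly.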
|
27
|
Abstract
Background: The DNase I hypersensitive sites (DHSs) are associated with the cis-regulatory DNA elements. An efficient method of identifying DHSs can enhance the understanding of chromatin accessibility. Despite a multitude of resources available online, including experimental datasets and computational tools, the complex language of DHSs remains incompletely understood. Methods: Here, we address this challenge using an approach based on a state-of-the-art machine learning method. We present a novel convolutional neural network (CNN) that combines Inception-like networks with a gating mechanism for the response of multiple patterns and long-term association in DNA sequences to predict multi-scale DHSs in Arabidopsis, rice and Homo sapiens. Results: Our method obtains 0.961 area under curve (AUC) on Arabidopsis, 0.969 AUC on rice and 0.918 AUC on Homo sapiens. Conclusions: Our method provides an efficient and accurate way to identify multi-scale DHS sequences by deep learning.
Affiliation(s)
- Chuqiao Lyu
- School of Life Science, Beijing Institute of Technology, South Zhongguancun Street, Beijing, 100081, China
| | - Lei Wang
- School of Life Science, Beijing Institute of Technology, South Zhongguancun Street, Beijing, 100081, China
| | - Juhua Zhang
- School of Life Science, Beijing Institute of Technology, South Zhongguancun Street, Beijing, 100081, China.,Key Laboratory of Convergence Medical Engineering System and Healthcare Technology, The Ministry of Industry and Information Technology, Beijing Institute of Technology, Beijing, China
| |
|
28
|
|
29
|
Grapov D, Fahrmann J, Wanichthanarak K, Khoomrung S. Rise of Deep Learning for Genomic, Proteomic, and Metabolomic Data Integration in Precision Medicine. OMICS: A Journal of Integrative Biology 2018; 22:630-636. [PMID: 30124358 PMCID: PMC6207407 DOI: 10.1089/omi.2018.0097] [Citation(s) in RCA: 125] [Impact Index Per Article: 17.9] [Indexed: 12/16/2022]
Abstract
Machine learning (ML) is being ubiquitously incorporated into everyday products such as Internet search, email spam filters, product recommendations, image classification, and speech recognition. New approaches for highly integrated manufacturing and automation such as the Industry 4.0 and the Internet of things are also converging with ML methodologies. Many approaches incorporate complex artificial neural network architectures and are collectively referred to as deep learning (DL) applications. These methods have been shown capable of representing and learning predictable relationships in many diverse forms of data and hold promise for transforming the future of omics research and applications in precision medicine. Omics and electronic health record data pose considerable challenges for DL. This is due to many factors such as low signal to noise, analytical variance, and complex data integration requirements. However, DL models have already been shown capable of both improving the ease of data encoding and predictive model performance over alternative approaches. It may not be surprising that concepts encountered in DL share similarities with those observed in biological message relay systems such as gene, protein, and metabolite networks. This expert review examines the challenges and opportunities for DL at a systems and biological scale for a precision medicine readership.
Affiliation(s)
- Dmitry Grapov
- CDS-Creative Data Solutions LLC, Ballwin, Missouri, www.createdatasol.com
| | - Johannes Fahrmann
- Department of Clinical Cancer Prevention, University of Texas MD Anderson, Houston, Texas
| | - Kwanjeera Wanichthanarak
- Department of Biochemistry, Faculty of Medicine Siriraj Hospital, Mahidol University, Bangkok, Thailand
- Siriraj Metabolomics and Phenomics Center, Faculty of Medicine Siriraj Hospital, Mahidol University, Bangkok, Thailand
| | - Sakda Khoomrung
- Department of Biochemistry, Faculty of Medicine Siriraj Hospital, Mahidol University, Bangkok, Thailand
- Siriraj Metabolomics and Phenomics Center, Faculty of Medicine Siriraj Hospital, Mahidol University, Bangkok, Thailand
| |
|