1
|
Suntsova M, Gaifullin N, Allina D, Reshetun A, Li X, Mendeleeva L, Surin V, Sergeeva A, Spirin P, Prassolov V, Morgan A, Garazha A, Sorokin M, Buzdin A. Atlas of RNA sequencing profiles for normal human tissues. Sci Data 2019; 6:36. [PMID: 31015567 PMCID: PMC6478850 DOI: 10.1038/s41597-019-0043-4] [Citation(s) in RCA: 86] [Impact Index Per Article: 14.3] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/19/2018] [Accepted: 03/12/2019] [Indexed: 11/09/2022] Open
Abstract
Comprehensive analysis of molecular pathology requires a collection of reference samples representing normal tissues from healthy donors. For the available limited collections of normal tissues from postmortal donors, there is a problem of data incompatibility, as different datasets generated using different experimental platforms often cannot be merged in a single panel. Here, we constructed and deposited the gene expression database of normal human tissues based on uniformly screened original sequencing data. In total, 142 solid tissue samples representing 20 organs were taken from post-mortal human healthy donors of different age killed in road accidents no later than 36 hours after death. Blood samples were taken from 17 healthy volunteers. We then compared them with the 758 transcriptomic profiles taken from the other databases. We found that overall 463 biosamples showed tissue-specific rather than platform- or database-specific clustering and could be aggregated in a single database termed Oncobox Atlas of Normal Tissue Expression (ANTE). Our data will be useful to all those working with the analysis of human gene expression.
Collapse
Affiliation(s)
- Maria Suntsova
- Shemyakin-Ovchinnikov Institute of Bioorganic Chemistry, Moscow, 117997, Russia
| | - Nurshat Gaifullin
- Department of Pathology, Faculty of Medicine, Lomonosov Moscow State University, Moscow, 119991, Russia
| | - Daria Allina
- Pathology Department, Morozov Children's City Hospital, 4th Dobryninsky Lane 1/9, Moscow, 119049, Russia
| | | | - Xinmin Li
- Department of Pathology and Laboratory Medicine, University of California Los Angeles, Los Angeles, CA, 90095, USA
| | - Larisa Mendeleeva
- National Research Center for Hematology, Novy Zykovsky proezd, 4, Moscow, 125167, Russia
| | - Vadim Surin
- National Research Center for Hematology, Novy Zykovsky proezd, 4, Moscow, 125167, Russia
| | - Anna Sergeeva
- National Research Center for Hematology, Novy Zykovsky proezd, 4, Moscow, 125167, Russia
| | - Pavel Spirin
- Engelhardt Institute of Molecular Biology, Russian Academy of Sciences, Vavilova Street, 32, Moscow, 119991, Russia
| | - Vladimir Prassolov
- Engelhardt Institute of Molecular Biology, Russian Academy of Sciences, Vavilova Street, 32, Moscow, 119991, Russia
| | | | - Andrew Garazha
- Omicsway Corp., 340S Lemon Ave, 6040, Walnut, 91789 CA, USA
- Oncobox ltd., Moscow, 121205, Russia
| | - Maxim Sorokin
- Omicsway Corp., 340S Lemon Ave, 6040, Walnut, 91789 CA, USA.
- I.M. Sechenov First Moscow State Medical University, Moscow, 119991, Russia.
| | - Anton Buzdin
- Omicsway Corp., 340S Lemon Ave, 6040, Walnut, 91789 CA, USA
- Oncobox ltd., Moscow, 121205, Russia
- I.M. Sechenov First Moscow State Medical University, Moscow, 119991, Russia
| |
Collapse
|
2
|
Recurrent Neural Network for Predicting Transcription Factor Binding Sites. Sci Rep 2018; 8:15270. [PMID: 30323198 PMCID: PMC6189047 DOI: 10.1038/s41598-018-33321-1] [Citation(s) in RCA: 111] [Impact Index Per Article: 15.9] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/27/2018] [Accepted: 09/25/2018] [Indexed: 12/23/2022] Open
Abstract
It is well known that DNA sequence contains a certain amount of transcription factors (TF) binding sites, and only part of them are identified through biological experiments. However, these experiments are expensive and time-consuming. To overcome these problems, some computational methods, based on k-mer features or convolutional neural networks, have been proposed to identify TF binding sites from DNA sequences. Although these methods have good performance, the context information that relates to TF binding sites is still lacking. Research indicates that standard recurrent neural networks (RNN) and its variants have better performance in time-series data compared with other models. In this study, we propose a model, named KEGRU, to identify TF binding sites by combining Bidirectional Gated Recurrent Unit (GRU) network with k-mer embedding. Firstly, DNA sequences are divided into k-mer sequences with a specified length and stride window. And then, we treat each k-mer as a word and pre-trained word representation model though word2vec algorithm. Thirdly, we construct a deep bidirectional GRU model for feature learning and classification. Experimental results have shown that our method has better performance compared with some state-of-the-art methods. Additional experiments about embedding strategy show that k-mer embedding will be helpful to enhance model performance. The robustness of KEGRU is proved by experiments with different k-mer length, stride window and embedding vector dimension.
Collapse
|
3
|
Niu M, Tabari E, Ni P, Su Z. Towards a map of cis-regulatory sequences in the human genome. Nucleic Acids Res 2018; 46:5395-5409. [PMID: 29733395 PMCID: PMC6009671 DOI: 10.1093/nar/gky338] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.7] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/17/2018] [Revised: 04/14/2018] [Accepted: 04/19/2018] [Indexed: 01/10/2023] Open
Abstract
Accumulating evidence indicates that transcription factor (TF) binding sites, or cis-regulatory elements (CREs), and their clusters termed cis-regulatory modules (CRMs) play a more important role than do gene-coding sequences in specifying complex traits in humans, including the susceptibility to common complex diseases. To fully characterize their roles in deriving the complex traits/diseases, it is necessary to annotate all CREs and CRMs encoded in the human genome. However, the current annotations of CREs and CRMs in the human genome are still very limited and mostly coarse-grained, as they often lack the detailed information of CREs in CRMs. Here, we integrated 620 TF ChIP-seq datasets produced by the ENCODE project for 168 TFs in 79 different cell/tissue types and predicted an unprecedentedly completely map of CREs in CRMs in the human genome at single nucleotide resolution. The map includes 305 912 CRMs containing a total of 1 178 913 CREs belonging to 736 unique TF binding motifs. The predicted CREs and CRMs tend to be subject to either purifying selection or positive selection, thus are likely to be functional. Based on the results, we also examined the status of available ChIP-seq datasets for predicting the entire regulatory genome of humans.
Collapse
Affiliation(s)
- Meng Niu
- Department of Bioinformatics and Genomics, College of Computing and Informatics, The University of North Carolina at Charlotte, 9201 University City Blvd., Charlotte, NC 28223, USA
| | - Ehsan Tabari
- Department of Bioinformatics and Genomics, College of Computing and Informatics, The University of North Carolina at Charlotte, 9201 University City Blvd., Charlotte, NC 28223, USA
| | - Pengyu Ni
- Department of Bioinformatics and Genomics, College of Computing and Informatics, The University of North Carolina at Charlotte, 9201 University City Blvd., Charlotte, NC 28223, USA
| | - Zhengchang Su
- Department of Bioinformatics and Genomics, College of Computing and Informatics, The University of North Carolina at Charlotte, 9201 University City Blvd., Charlotte, NC 28223, USA
| |
Collapse
|
4
|
Qin Q, Feng J. Imputation for transcription factor binding predictions based on deep learning. PLoS Comput Biol 2017; 13:e1005403. [PMID: 28234893 PMCID: PMC5345877 DOI: 10.1371/journal.pcbi.1005403] [Citation(s) in RCA: 74] [Impact Index Per Article: 9.3] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/18/2016] [Revised: 03/10/2017] [Accepted: 02/09/2017] [Indexed: 01/11/2023] Open
Abstract
Understanding the cell-specific binding patterns of transcription factors (TFs) is fundamental to studying gene regulatory networks in biological systems, for which ChIP-seq not only provides valuable data but is also considered as the gold standard. Despite tremendous efforts from the scientific community to conduct TF ChIP-seq experiments, the available data represent only a limited percentage of ChIP-seq experiments, considering all possible combinations of TFs and cell lines. In this study, we demonstrate a method for accurately predicting cell-specific TF binding for TF-cell line combinations based on only a small fraction (4%) of the combinations using available ChIP-seq data. The proposed model, termed TFImpute, is based on a deep neural network with a multi-task learning setting to borrow information across transcription factors and cell lines. Compared with existing methods, TFImpute achieves comparable accuracy on TF-cell line combinations with ChIP-seq data; moreover, TFImpute achieves better accuracy on TF-cell line combinations without ChIP-seq data. This approach can predict cell line specific enhancer activities in K562 and HepG2 cell lines, as measured by massively parallel reporter assays, and predicts the impact of SNPs on TF binding. Transcription factors play a central role in regulating various cellular processes. They bind to DNA in a cell-specific way. To study where a TF would bind to DNA, ChIP-seq experiment has been developed and widely adopted by the science community to study genome-wide in vivo protein-DNA interactions. However, for each TF, only a limited number of cell types have been explored by ChIP-seq experiments. To study the binding of a TF to a DNA sequence in a cell line without corresponding ChIP-seq data, researchers would check whether there is a motif for the TF in the sequence. However, motif alone contains only sequence information and therefore cannot reflect the cell specificity of TF binding. In this work, we demonstrate how to model the TF binding problem using deep learning and achieve cell specific binding prediction for TF-cell line combinations without ChIP-seq data.
Collapse
Affiliation(s)
- Qian Qin
- Department of Bioinformatics, School of Life Sciences and Technology, Tongji University, Shanghai, China
| | - Jianxing Feng
- Department of Bioinformatics, School of Life Sciences and Technology, Tongji University, Shanghai, China
- * E-mail:
| |
Collapse
|