Reference Citation Analysis: Find an Article, Find a Category, Find a Journal, Find a Scholar

For: Ji Y, Zhou Z, Liu H, Davuluri RV. DNABERT: pre-trained Bidirectional Encoder Representations from Transformers model for DNA-language in genome. Bioinformatics 2021;37:2112-2120. [PMID: 33538820 PMCID: PMC11025658 DOI: 10.1093/bioinformatics/btab083] [Citation(s) in RCA: 337] [Impact Index Per Article: 84.3] [Reference Citation Analysis] [What about the content of this article? (0)] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/10/2020] [Revised: 12/31/2020] [Accepted: 02/01/2021] [Indexed: 12/19/2022] Open

For:	Ji Y, Zhou Z, Liu H, Davuluri RV. DNABERT: pre-trained Bidirectional Encoder Representations from Transformers model for DNA-language in genome. Bioinformatics 2021;37:2112-2120. [PMID: 33538820 PMCID: PMC11025658 DOI: 10.1093/bioinformatics/btab083] [Citation(s) in RCA: 337] [Impact Index Per Article: 84.3] [Reference Citation Analysis] [What about the content of this article? (0)] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/10/2020] [Revised: 12/31/2020] [Accepted: 02/01/2021] [Indexed: 12/19/2022] Open

Number

Cited by Other Article(s)

301

He Y, Zhang Q, Wang S, Chen Z, Cui Z, Guo ZH, Huang DS. Predicting the Sequence Specificities of DNA-Binding Proteins by DNA Fine-Tuned Language Model With Decaying Learning Rates. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2023;20:616-624. [PMID: 35389869 DOI: 10.1109/tcbb.2022.3165592] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 05/04/2023]

302

Li J, Wu Z, Lin W, Luo J, Zhang J, Chen Q, Chen J. iEnhancer-ELM: improve enhancer identification by extracting position-related multiscale contextual information based on enhancer language models. BIOINFORMATICS ADVANCES 2023;3:vbad043. [PMID: 37113248 PMCID: PMC10125906 DOI: 10.1093/bioadv/vbad043] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Figures] [Subscribe] [Scholar Register] [Received: 12/05/2022] [Revised: 02/04/2023] [Accepted: 03/24/2023] [Indexed: 04/29/2023]

303

Zeng W, Gautam A, Huson DH. MuLan-Methyl-multiple transformer-based language models for accurate DNA methylation prediction. Gigascience 2022;12:giad054. [PMID: 37489753 PMCID: PMC10367125 DOI: 10.1093/gigascience/giad054] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/19/2023] [Revised: 05/09/2023] [Accepted: 07/18/2023] [Indexed: 07/26/2023] Open

304

Genome-wide identification and characterization of DNA enhancers with a stacked multivariate fusion framework. PLoS Comput Biol 2022;18:e1010779. [PMID: 36520922 PMCID: PMC9836277 DOI: 10.1371/journal.pcbi.1010779] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/27/2022] [Revised: 01/12/2023] [Accepted: 11/29/2022] [Indexed: 12/23/2022] Open

Abstract

Enhancers are short non-coding DNA sequences outside of the target promoter regions that can be bound by specific proteins to increase a gene's transcriptional activity, which has a crucial role in the spatiotemporal and quantitative regulation of gene expression. However, enhancers do not have a specific sequence motifs or structures, and their scattered distribution in the genome makes the identification of enhancers from human cell lines particularly challenging. Here we present a novel, stacked multivariate fusion framework called SMFM, which enables a comprehensive identification and analysis of enhancers from regulatory DNA sequences as well as their interpretation. Specifically, to characterize the hierarchical relationships of enhancer sequences, multi-source biological information and dynamic semantic information are fused to represent regulatory DNA enhancer sequences. Then, we implement a deep learning-based sequence network to learn the feature representation of the enhancer sequences comprehensively and to extract the implicit relationships in the dynamic semantic information. Ultimately, an ensemble machine learning classifier is trained based on the refined multi-source features and dynamic implicit relations obtained from the deep learning-based sequence network. Benchmarking experiments demonstrated that SMFM significantly outperforms other existing methods using several evaluation metrics. In addition, an independent test set was used to validate the generalization performance of SMFM by comparing it to other state-of-the-art enhancer identification methods. Moreover, we performed motif analysis based on the contribution scores of different bases of enhancer sequences to the final identification results. Besides, we conducted interpretability analysis of the identified enhancer sequences based on attention weights of EnhancerBERT, a fine-tuned BERT model that provides new insights into exploring the gene semantic information likely to underlie the discovered enhancers in an interpretable manner. Finally, in a human placenta study with 4,562 active distal gene regulatory enhancers, SMFM successfully exposed tissue-related placental development and the differential mechanism, demonstrating the generalizability and stability of our proposed framework.

Collapse

305

Chen Y, Xie M, Wen J. Predicting gene expression from histone modifications with self-attention based neural networks and transfer learning. Front Genet 2022;13:1081842. [PMID: 36588793 PMCID: PMC9797047 DOI: 10.3389/fgene.2022.1081842] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/27/2022] [Accepted: 11/28/2022] [Indexed: 12/15/2022] Open

306

miRBind: A Deep Learning Method for miRNA Binding Classification. Genes (Basel) 2022;13:genes13122323. [PMID: 36553590 PMCID: PMC9777820 DOI: 10.3390/genes13122323] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/20/2022] [Revised: 11/30/2022] [Accepted: 12/08/2022] [Indexed: 12/15/2022] Open

307

Markova EA, Shaw RE, Reynolds CR. Prediction of strain engineerings that amplify recombinant protein secretion through the machine learning approach MaLPHAS. ENGINEERING BIOLOGY 2022;6:82-90. [PMID: 36968340 PMCID: PMC9995161 DOI: 10.1049/enb2.12025] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/30/2022] [Revised: 09/01/2022] [Accepted: 09/05/2022] [Indexed: 11/19/2022] Open

308

Mai DHA, Nguyen LT, Lee EY. TSSNote-CyaPromBERT: Development of an integrated platform for highly accurate promoter prediction and visualization of Synechococcus sp. and Synechocystis sp. through a state-of-the-art natural language processing model BERT. Front Genet 2022;13:1067562. [PMID: 36523764 PMCID: PMC9745317 DOI: 10.3389/fgene.2022.1067562] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/12/2022] [Accepted: 11/17/2022] [Indexed: 07/30/2023] Open

309

IUP-BERT: Identification of Umami Peptides Based on BERT Features. Foods 2022;11:foods11223742. [PMID: 36429332 PMCID: PMC9689418 DOI: 10.3390/foods11223742] [Citation(s) in RCA: 10] [Impact Index Per Article: 3.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/23/2022] [Revised: 11/14/2022] [Accepted: 11/16/2022] [Indexed: 11/23/2022] Open

310

Lan AY, Corces MR. Deep learning approaches for noncoding variant prioritization in neurodegenerative diseases. Front Aging Neurosci 2022;14:1027224. [PMID: 36466610 PMCID: PMC9716280 DOI: 10.3389/fnagi.2022.1027224] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/24/2022] [Accepted: 10/24/2022] [Indexed: 11/19/2022] Open

311

Lee D, Yang J, Kim S. Learning the histone codes with large genomic windows and three-dimensional chromatin interactions using transformer. Nat Commun 2022;13:6678. [PMID: 36335101 PMCID: PMC9637148 DOI: 10.1038/s41467-022-34152-5] [Citation(s) in RCA: 16] [Impact Index Per Article: 5.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/21/2022] [Accepted: 10/17/2022] [Indexed: 11/06/2022] Open

312

Rahardja S, Wang M, Nguyen BP, Fränti P, Rahardja S. A lightweight classification of adaptor proteins using transformer networks. BMC Bioinformatics 2022;23:461. [PMID: 36333658 PMCID: PMC9635127 DOI: 10.1186/s12859-022-05000-6] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/03/2022] [Accepted: 09/13/2022] [Indexed: 11/06/2022] Open

313

Zhang Y, Liu Y, Wang Z, Wang M, Xiong S, Huang G, Gong M. Uncovering the Relationship between Tissue-Specific TF-DNA Binding and Chromatin Features through a Transformer-Based Model. Genes (Basel) 2022;13:1952. [PMID: 36360189 PMCID: PMC9690320 DOI: 10.3390/genes13111952] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/16/2022] [Revised: 10/19/2022] [Accepted: 10/23/2022] [Indexed: 09/08/2024] Open

314

Piao C, Lv M, Wang S, Zhou R, Wang Y, Wei J, Liu J. Multi-objective data enhancement for deep learning-based ultrasound analysis. BMC Bioinformatics 2022;23:438. [PMID: 36266626 PMCID: PMC9583467 DOI: 10.1186/s12859-022-04985-4] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/31/2022] [Accepted: 10/10/2022] [Indexed: 11/10/2022] Open

315

Jin J, Yu Y, Wang R, Zeng X, Pang C, Jiang Y, Li Z, Dai Y, Su R, Zou Q, Nakai K, Wei L. iDNA-ABF: multi-scale deep biological language learning model for the interpretable prediction of DNA methylations. Genome Biol 2022;23:219. [PMID: 36253864 PMCID: PMC9575223 DOI: 10.1186/s13059-022-02780-1] [Citation(s) in RCA: 76] [Impact Index Per Article: 25.3] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/22/2022] [Accepted: 10/03/2022] [Indexed: 11/29/2022] Open

Affiliation(s)

Junru Jin School of Software, Shandong University, Jinan, 250101, China Joint SDU-NTU Centre for Artificial Intelligence Research (C-FAIR), Shandong University, Jinan, 250101, China
Yingying Yu School of Software, Shandong University, Jinan, 250101, China Joint SDU-NTU Centre for Artificial Intelligence Research (C-FAIR), Shandong University, Jinan, 250101, China
Ruheng Wang School of Software, Shandong University, Jinan, 250101, China Joint SDU-NTU Centre for Artificial Intelligence Research (C-FAIR), Shandong University, Jinan, 250101, China
Xin Zeng Human Genome Center, The Institute of Medical Science, The University of Tokyo, Tokyo, 108-8639, Japan Department of Computational Biology and Medical Sciences, The University of Tokyo, Kashiwa, 277-8563, Japan
Chao Pang School of Software, Shandong University, Jinan, 250101, China Joint SDU-NTU Centre for Artificial Intelligence Research (C-FAIR), Shandong University, Jinan, 250101, China
Yi Jiang School of Software, Shandong University, Jinan, 250101, China Joint SDU-NTU Centre for Artificial Intelligence Research (C-FAIR), Shandong University, Jinan, 250101, China
Zhongshen Li School of Software, Shandong University, Jinan, 250101, China Joint SDU-NTU Centre for Artificial Intelligence Research (C-FAIR), Shandong University, Jinan, 250101, China
Yutong Dai Human Genome Center, The Institute of Medical Science, The University of Tokyo, Tokyo, 108-8639, Japan Department of Computational Biology and Medical Sciences, The University of Tokyo, Kashiwa, 277-8563, Japan
Ran Su College of Intelligence and Computing, Tianjin University, Tianjin, 300350, China
Quan Zou Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, 610054, China
Kenta Nakai Human Genome Center, The Institute of Medical Science, The University of Tokyo, Tokyo, 108-8639, Japan. Department of Computational Biology and Medical Sciences, The University of Tokyo, Kashiwa, 277-8563, Japan.
Leyi Wei School of Software, Shandong University, Jinan, 250101, China. Joint SDU-NTU Centre for Artificial Intelligence Research (C-FAIR), Shandong University, Jinan, 250101, China.

Collapse

316

PD-BertEDL: An Ensemble Deep Learning Method Using BERT and Multivariate Representation to Predict Peptide Detectability. Int J Mol Sci 2022;23:ijms232012385. [PMID: 36293242 PMCID: PMC9604182 DOI: 10.3390/ijms232012385] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/01/2022] [Revised: 10/11/2022] [Accepted: 10/12/2022] [Indexed: 12/03/2022] Open

317

Towards a better understanding of TF-DNA binding prediction from genomic features. Comput Biol Med 2022;149:105993. [DOI: 10.1016/j.compbiomed.2022.105993] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/15/2022] [Revised: 07/12/2022] [Accepted: 08/14/2022] [Indexed: 11/17/2022]

318

JLCRB: A unified multi-view-based joint representation learning for CircRNA binding sites prediction. J Biomed Inform 2022;136:104231. [DOI: 10.1016/j.jbi.2022.104231] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/05/2022] [Revised: 10/14/2022] [Accepted: 10/14/2022] [Indexed: 11/07/2022]

319

Zhang P, Zhang H, Wu H. iPro-WAEL: a comprehensive and robust framework for identifying promoters in multiple species. Nucleic Acids Res 2022;50:10278-10289. [PMID: 36161334 PMCID: PMC9561371 DOI: 10.1093/nar/gkac824] [Citation(s) in RCA: 46] [Impact Index Per Article: 15.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/18/2022] [Revised: 08/24/2022] [Accepted: 09/14/2022] [Indexed: 11/27/2022] Open

320

Grønbæk C, Liang Y, Elliott D, Krogh A. Context dependent prediction in DNA sequence using neural networks. PeerJ 2022;10:e13666. [PMID: 36157058 PMCID: PMC9504454 DOI: 10.7717/peerj.13666] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/04/2022] [Accepted: 06/10/2022] [Indexed: 01/17/2023] Open

Abstract

One way to better understand the structure in DNA is by learning to predict the sequence. Here, we trained a model to predict the missing base at any given position, given its left and right flanking contexts. Our best-performing model was a neural network that obtained an accuracy close to 54% on the human genome, which is 2% points better than modelling the data using a Markov model. In likelihood-ratio tests, the neural network performed significantly better than any of the alternative models by a large margin. We report on where the accuracy was obtained, first observing that the performance appeared to be uniform over the chromosomes. The models performed best in repetitive sequences, as expected, although their performance far from random in the more difficult coding sections, the proportions being ~70:40%. We further explored the sources of the accuracy, Fourier transforming the predictions revealed weak but clear periodic signals. In the human genome the characteristic periods hinted at connections to nucleosome positioning. We found similar periodic signals in GC/AT content in the human genome, which to the best of our knowledge have not been reported before. On other large genomes similarly high accuracy was found, while lower predictive accuracy was observed on smaller genomes. Only in the mouse genome did we see periodic signals in the same range as in the human genome, though weaker and of a different type. This indicates that the sources of these signals are other or more than nucleosome arrangement. Interestingly, applying a model trained on the mouse genome to the human genome resulted in a performance far below that of the human model, except in the difficult coding regions. Despite the clear outcomes of the likelihood-ratio tests, there is currently a limited superiority of the neural network methods over the Markov model. We expect, however, that there is great potential for better modelling DNA using different neural network architectures.

Collapse

321

Kshirsagar M, Yuan H, Ferres JL, Leslie C. BindVAE: Dirichlet variational autoencoders for de novo motif discovery from accessible chromatin. Genome Biol 2022;23:174. [PMID: 35971180 PMCID: PMC9380350 DOI: 10.1186/s13059-022-02723-w] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.7] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/07/2021] [Accepted: 06/28/2022] [Indexed: 11/10/2022] Open

322

A Neural Networks Approach for the Analysis of Reproducible Ribo–Seq Profiles. ALGORITHMS 2022. [DOI: 10.3390/a15080274] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 11/17/2022]

323

Bai Z, Zhang YZ, Miyano S, Yamaguchi R, Fujimoto K, Uematsu S, Imoto S. Identification of bacteriophage genome sequences with representation learning. Bioinformatics 2022;38:4264-4270. [PMID: 35920769 PMCID: PMC9477532 DOI: 10.1093/bioinformatics/btac509] [Citation(s) in RCA: 16] [Impact Index Per Article: 5.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/22/2021] [Revised: 05/20/2022] [Indexed: 12/24/2022] Open

324

Gwak HJ, Rho M. ViBE: a hierarchical BERT model to identify eukaryotic viruses using metagenome sequencing data. Brief Bioinform 2022;23:6603436. [PMID: 35667011 DOI: 10.1093/bib/bbac204] [Citation(s) in RCA: 12] [Impact Index Per Article: 4.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/12/2022] [Revised: 05/02/2022] [Accepted: 05/04/2022] [Indexed: 11/13/2022] Open

325

Serna Garcia G, Leone M, Bernasconi A, Carman MJ. GeMI: interactive interface for transformer-based Genomic Metadata Integration. Database (Oxford) 2022;2022:6600540. [PMID: 35657113 PMCID: PMC9216561 DOI: 10.1093/database/baac036] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/05/2022] [Revised: 03/26/2022] [Accepted: 04/26/2022] [Indexed: 11/15/2022]

326

Lu AX, Lu AX, Pritišanac I, Zarin T, Forman-Kay JD, Moses AM. Discovering molecular features of intrinsically disordered regions by using evolution for contrastive learning. PLoS Comput Biol 2022;18:e1010238. [PMID: 35767567 PMCID: PMC9275697 DOI: 10.1371/journal.pcbi.1010238] [Citation(s) in RCA: 21] [Impact Index Per Article: 7.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/04/2021] [Revised: 07/12/2022] [Accepted: 05/23/2022] [Indexed: 02/07/2023] Open

327

Nakai K, Wei L. Recent Advances in the Prediction of Subcellular Localization of Proteins and Related Topics. FRONTIERS IN BIOINFORMATICS 2022;2:910531. [PMID: 36304291 PMCID: PMC9580943 DOI: 10.3389/fbinf.2022.910531] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/01/2022] [Accepted: 04/25/2022] [Indexed: 11/13/2022] Open

328

Yang M, Huang L, Huang H, Tang H, Zhang N, Yang H, Wu J, Mu F. Integrating convolution and self-attention improves language model of human genome for interpreting non-coding regions at base-resolution. Nucleic Acids Res 2022;50:e81. [PMID: 35536244 PMCID: PMC9371931 DOI: 10.1093/nar/gkac326] [Citation(s) in RCA: 20] [Impact Index Per Article: 6.7] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/22/2021] [Revised: 02/22/2022] [Accepted: 05/09/2022] [Indexed: 12/12/2022] Open

Abstract

Interpretation of non-coding genome remains an unsolved challenge in human genetics due to impracticality of exhaustively annotating biochemically active elements in all conditions. Deep learning based computational approaches emerge recently to help interpret non-coding regions. Here, we present LOGO (Language of Genome), a self-attention based contextualized pre-trained language model containing only two self-attention layers with 1 million parameters as a substantially light architecture that applies self-supervision techniques to learn bidirectional representations of the unlabelled human reference genome. LOGO is then fine-tuned for sequence labelling task, and further extended to variant prioritization task via a special input encoding scheme of alternative alleles followed by adding a convolutional module. Experiments show that LOGO achieves 15% absolute improvement for promoter identification and up to 4.5% absolute improvement for enhancer-promoter interaction prediction. LOGO exhibits state-of-the-art multi-task predictive power on thousands of chromatin features with only 3% parameterization benchmarking against the fully supervised model, DeepSEA and 1% parameterization against a recent BERT-based DNA language model. For allelic-effect prediction, locality introduced by one dimensional convolution shows improved sensitivity and specificity for prioritizing non-coding variants associated with human diseases. In addition, we apply LOGO to interpret type 2 diabetes (T2D) GWAS signals and infer underlying regulatory mechanisms. We make a conceptual analogy between natural language and human genome and demonstrate LOGO is an accurate, fast, scalable, and robust framework to interpret non-coding regions for global sequence labeling as well as for variant prioritization at base-resolution.

Collapse

329

Abdalla M, Abdalla M. A general framework for predicting the transcriptomic consequences of non-coding variation and small molecules. PLoS Comput Biol 2022;18:e1010028. [PMID: 35421087 PMCID: PMC9041867 DOI: 10.1371/journal.pcbi.1010028] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/03/2021] [Revised: 04/26/2022] [Accepted: 03/16/2022] [Indexed: 11/18/2022] Open

330

Yamada K, Hamada M. Prediction of RNA-protein interactions using a nucleotide language model. BIOINFORMATICS ADVANCES 2022;2:vbac023. [PMID: 36699410 PMCID: PMC9710633 DOI: 10.1093/bioadv/vbac023] [Citation(s) in RCA: 22] [Impact Index Per Article: 7.3] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Figures] [Subscribe] [Scholar Register] [Received: 07/26/2021] [Revised: 02/28/2022] [Accepted: 04/05/2022] [Indexed: 01/28/2023]

331

Perez Martell RI, Ziesel A, Jabbari H, Stege U. Supervised promoter recognition: a benchmark framework. BMC Bioinformatics 2022;23:118. [PMID: 35366794 PMCID: PMC8976979 DOI: 10.1186/s12859-022-04647-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/10/2022] [Accepted: 03/16/2022] [Indexed: 11/10/2022] Open

332

Lin K, Quan X, Jin C, Shi Z, Yang J. An Interpretable Double-Scale Attention Model for Enzyme Protein Class Prediction Based on Transformer Encoders and Multi-Scale Convolutions. Front Genet 2022;13:885627. [PMID: 35432476 PMCID: PMC9012241 DOI: 10.3389/fgene.2022.885627] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/28/2022] [Accepted: 03/07/2022] [Indexed: 12/01/2022] Open

333

Wang JX, Li Y, Li X, Lu ZH. Alzheimer's Disease Classification Through Imaging Genetic Data With IGnet. Front Neurosci 2022;16:846638. [PMID: 35310099 PMCID: PMC8927016 DOI: 10.3389/fnins.2022.846638] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/31/2021] [Accepted: 02/02/2022] [Indexed: 11/13/2022] Open

334

Yang Y, Hou Z, Wang Y, Ma H, Sun P, Ma Z, Wong KC, Li X. HCRNet: high-throughput circRNA-binding event identification from CLIP-seq data using deep temporal convolutional network. Brief Bioinform 2022;23:6533504. [PMID: 35189638 DOI: 10.1093/bib/bbac027] [Citation(s) in RCA: 10] [Impact Index Per Article: 3.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/09/2021] [Revised: 01/03/2022] [Accepted: 01/17/2022] [Indexed: 01/11/2023] Open

335

Context dependency of nucleotide probabilities and variants in human DNA. BMC Genomics 2022;23:87. [PMID: 35100973 PMCID: PMC8802520 DOI: 10.1186/s12864-021-08246-1] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/31/2021] [Accepted: 12/10/2021] [Indexed: 12/20/2022] Open

Abstract

Background

Genomic DNA has been shaped by mutational processes through evolution. The cellular machinery for error correction and repair has left its marks in the nucleotide composition along with structural and functional constraints. Therefore, the probability of observing a base in a certain position in the human genome is highly context-dependent.

Results

Here we develop context-dependent nucleotide models. We first investigate models of nucleotides conditioned on sequence context. We develop a bidirectional Markov model that use an average of the probability from a Markov model applied to both strands of the sequence and thus depends on up to 14 bases to each side of the nucleotide. We show how the genome predictability varies across different types of genomic regions. Surprisingly, this model can predict a base from its context with an average of more than 50% accuracy. For somatic variants we show a tendency towards higher probability for the variant base than for the reference base. Inspired by DNA substitution models, we develop a model of mutability that estimates a mutation matrix (called the alpha matrix) on top of the nucleotide distribution. The alpha matrix can be estimated from a much smaller context than the nucleotide model, but the final model will still depend on the full context of the nucleotide model. With the bidirectional Markov model of order 14 and an alpha matrix dependent on just one base to each side, we obtain a model that compares well with a model of mutability that estimates mutation probabilities directly conditioned on three nucleotides to each side. For somatic variants in particular, our model fits better than the simpler model. Interestingly, the model is not very sensitive to the size of the context for the alpha matrix.

Conclusions

Our study found strong context dependencies of nucleotides in the human genome. The best model uses a context of 14 nucleotides to each side. Based on these models, a substitution model was constructed that separates into the context model and a matrix dependent on a small context. The model fit somatic variants particularly well.

Supplementary Information

The online version contains supplementary material available at (10.1186/s12864-021-08246-1).

Collapse

336

Kashiwagi S, Sato K, Sakakibara Y. A Max-Margin Model for Predicting Residue-Base Contacts in Protein-RNA Interactions. Life (Basel) 2021;11:1135. [PMID: 34833011 PMCID: PMC8624843 DOI: 10.3390/life11111135] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/02/2021] [Revised: 10/21/2021] [Accepted: 10/22/2021] [Indexed: 11/17/2022] Open

337

Avsec Ž, Agarwal V, Visentin D, Ledsam JR, Grabska-Barwinska A, Taylor KR, Assael Y, Jumper J, Kohli P, Kelley DR. Effective gene expression prediction from sequence by integrating long-range interactions. Nat Methods 2021;18:1196-1203. [PMID: 34608324 PMCID: PMC8490152 DOI: 10.1038/s41592-021-01252-x] [Citation(s) in RCA: 460] [Impact Index Per Article: 115.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/17/2021] [Accepted: 07/27/2021] [Indexed: 02/08/2023]

338

Schreiber J, Singh R. Machine learning for profile prediction in genomics. Curr Opin Chem Biol 2021;65:35-41. [PMID: 34107341 DOI: 10.1016/j.cbpa.2021.04.008] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/31/2021] [Revised: 04/21/2021] [Accepted: 04/24/2021] [Indexed: 02/08/2023]

339

Iuchi H, Matsutani T, Yamada K, Iwano N, Sumi S, Hosoda S, Zhao S, Fukunaga T, Hamada M. Representation learning applications in biological sequence analysis. Comput Struct Biotechnol J 2021;19:3198-3208. [PMID: 34141139 PMCID: PMC8190442 DOI: 10.1016/j.csbj.2021.05.039] [Citation(s) in RCA: 35] [Impact Index Per Article: 8.8] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/26/2021] [Revised: 05/10/2021] [Accepted: 05/20/2021] [Indexed: 12/16/2022] Open

Affiliation(s)

Hitoshi Iuchi Waseda Research Institute for Science and Engineering, Waseda University, Tokyo 169-8555, Japan Computational Bio Big-Data Open Innovation Laboratory (CBBD-OIL), National Institute of Advanced Industrial Science and Technology (AIST), Tokyo 169-8555, Japan
Taro Matsutani Computational Bio Big-Data Open Innovation Laboratory (CBBD-OIL), National Institute of Advanced Industrial Science and Technology (AIST), Tokyo 169-8555, Japan Graduate School of Advanced Science and Engineering, Waseda University, Tokyo 169-8555, Japan
Keisuke Yamada School of Advanced Science and Engineering, Waseda University, Tokyo 169-8555, Japan
Natsuki Iwano Graduate School of Advanced Science and Engineering, Waseda University, Tokyo 169-8555, Japan
Shunsuke Sumi Graduate School of Advanced Science and Engineering, Waseda University, Tokyo 169-8555, Japan Department of Life Science Frontiers, Center for iPS Cell Research and Application, Kyoto University, Kyoto 606-8507, Japan
Shion Hosoda Computational Bio Big-Data Open Innovation Laboratory (CBBD-OIL), National Institute of Advanced Industrial Science and Technology (AIST), Tokyo 169-8555, Japan Graduate School of Advanced Science and Engineering, Waseda University, Tokyo 169-8555, Japan
Shitao Zhao Waseda Research Institute for Science and Engineering, Waseda University, Tokyo 169-8555, Japan
Tsukasa Fukunaga Waseda Institute for Advanced Study, Waseda University, Tokyo 169-0051, Japan Department of Computer Science, Graduate School of Information Science and Technology, The University of Tokyo, Tokyo 113-0032, Japan
Michiaki Hamada Computational Bio Big-Data Open Innovation Laboratory (CBBD-OIL), National Institute of Advanced Industrial Science and Technology (AIST), Tokyo 169-8555, Japan Graduate School of Advanced Science and Engineering, Waseda University, Tokyo 169-8555, Japan School of Advanced Science and Engineering, Waseda University, Tokyo 169-8555, Japan Graduate School of Medicine, Nippon Medical School, Tokyo 113-8602, Japan

Collapse

340

Ofer D, Brandes N, Linial M. The language of proteins: NLP, machine learning & protein sequences. Comput Struct Biotechnol J 2021;19:1750-1758. [PMID: 33897979 PMCID: PMC8050421 DOI: 10.1016/j.csbj.2021.03.022] [Citation(s) in RCA: 145] [Impact Index Per Article: 36.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/28/2021] [Revised: 03/19/2021] [Accepted: 03/19/2021] [Indexed: 12/12/2022] Open