251
|
Zhang J, Liu B, Wu J, Wang Z, Li J. DeepCAC: a deep learning approach on DNA transcription factors classification based on multi-head self-attention and concatenate convolutional neural network. BMC Bioinformatics 2023; 24:345. [PMID: 37723425 PMCID: PMC10506269 DOI: 10.1186/s12859-023-05469-9] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/20/2023] [Accepted: 09/06/2023] [Indexed: 09/20/2023] Open
Abstract
Understanding gene expression processes necessitates the accurate classification and identification of transcription factors, which is supported by high-throughput sequencing technologies. However, these techniques suffer from inherent limitations such as time consumption and high costs. To address these challenges, the field of bioinformatics has increasingly turned to deep learning technologies for analyzing gene sequences. Nevertheless, the pursuit of improved experimental results has led to the inclusion of numerous complex analysis function modules, resulting in models with a growing number of parameters. To overcome these limitations, it is proposed a novel approach for analyzing DNA transcription factor sequences, which is named as DeepCAC. This method leverages deep convolutional neural networks with a multi-head self-attention mechanism. By employing convolutional neural networks, it can effectively capture local hidden features in the sequences. Simultaneously, the multi-head self-attention mechanism enhances the identification of hidden features with long-distant dependencies. This approach reduces the overall number of parameters in the model while harnessing the computational power of sequence data from multi-head self-attention. Through training with labeled data, experiments demonstrate that this approach significantly improves performance while requiring fewer parameters compared to existing methods. Additionally, the effectiveness of our approach is validated in accurately predicting DNA transcription factor sequences.
Collapse
Affiliation(s)
- Jidong Zhang
- Faculty of Information Technology, Beijing University of Technology, Beijing, 100124, China
| | - Bo Liu
- School of Mathematical and Computational Sciences, Massey University, Auckland, 0745, New Zealand.
| | - Jiahui Wu
- Faculty of Information Technology, Beijing University of Technology, Beijing, 100124, China
| | - Zhihan Wang
- Faculty of Information Technology, Beijing University of Technology, Beijing, 100124, China
| | - Jianqiang Li
- Faculty of Information Technology, Beijing University of Technology, Beijing, 100124, China
| |
Collapse
|
252
|
Chatterjee S, Bhattacharya M, Lee SS, Chakraborty C. Can artificial intelligence-strengthened ChatGPT or other large language models transform nucleic acid research? MOLECULAR THERAPY. NUCLEIC ACIDS 2023; 33:205-207. [PMID: 37727444 PMCID: PMC10505907 DOI: 10.1016/j.omtn.2023.06.019] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Key Words] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Indexed: 09/21/2023]
Affiliation(s)
- Srijan Chatterjee
- Institute for Skeletal Aging & Orthopaedic Surgery, Hallym University-Chuncheon Sacred Heart Hospital, Chuncheon-si, Gangwon-do 24252, Republic of Korea
| | - Manojit Bhattacharya
- Department of Zoology, Fakir Mohan University, Vyasa Vihar, Balasore, Odisha 756020, India
| | - Sang-Soo Lee
- Institute for Skeletal Aging & Orthopaedic Surgery, Hallym University-Chuncheon Sacred Heart Hospital, Chuncheon-si, Gangwon-do 24252, Republic of Korea
| | - Chiranjib Chakraborty
- Department of Biotechnology, School of Life Science and Biotechnology, Adamas University, Kolkata, West Bengal 700126, India
| |
Collapse
|
253
|
Gündüz HA, Binder M, To XY, Mreches R, Bischl B, McHardy AC, Münch PC, Rezaei M. A self-supervised deep learning method for data-efficient training in genomics. Commun Biol 2023; 6:928. [PMID: 37696966 PMCID: PMC10495322 DOI: 10.1038/s42003-023-05310-2] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/08/2023] [Accepted: 09/01/2023] [Indexed: 09/13/2023] Open
Abstract
Deep learning in bioinformatics is often limited to problems where extensive amounts of labeled data are available for supervised classification. By exploiting unlabeled data, self-supervised learning techniques can improve the performance of machine learning models in the presence of limited labeled data. Although many self-supervised learning methods have been suggested before, they have failed to exploit the unique characteristics of genomic data. Therefore, we introduce Self-GenomeNet, a self-supervised learning technique that is custom-tailored for genomic data. Self-GenomeNet leverages reverse-complement sequences and effectively learns short- and long-term dependencies by predicting targets of different lengths. Self-GenomeNet performs better than other self-supervised methods in data-scarce genomic tasks and outperforms standard supervised training with ~10 times fewer labeled training data. Furthermore, the learned representations generalize well to new datasets and tasks. These findings suggest that Self-GenomeNet is well suited for large-scale, unlabeled genomic datasets and could substantially improve the performance of genomic models.
Collapse
Affiliation(s)
- Hüseyin Anil Gündüz
- Department of Statistics, LMU Munich, Munich, Germany
- Munich Center for Machine Learning, Munich, Germany
| | - Martin Binder
- Department of Statistics, LMU Munich, Munich, Germany
- Munich Center for Machine Learning, Munich, Germany
| | - Xiao-Yin To
- Department of Statistics, LMU Munich, Munich, Germany
- Munich Center for Machine Learning, Munich, Germany
- Department for Computational Biology of Infection Research, Helmholtz Center for Infection Research, 38124, Braunschweig, Germany
- Braunschweig Integrated Centre of Systems Biology (BRICS), Technische Universität Braunschweig, Braunschweig, Germany
| | - René Mreches
- Department for Computational Biology of Infection Research, Helmholtz Center for Infection Research, 38124, Braunschweig, Germany
- Braunschweig Integrated Centre of Systems Biology (BRICS), Technische Universität Braunschweig, Braunschweig, Germany
| | - Bernd Bischl
- Department of Statistics, LMU Munich, Munich, Germany
- Munich Center for Machine Learning, Munich, Germany
| | - Alice C McHardy
- Department for Computational Biology of Infection Research, Helmholtz Center for Infection Research, 38124, Braunschweig, Germany
- Braunschweig Integrated Centre of Systems Biology (BRICS), Technische Universität Braunschweig, Braunschweig, Germany
| | - Philipp C Münch
- Department for Computational Biology of Infection Research, Helmholtz Center for Infection Research, 38124, Braunschweig, Germany.
- Braunschweig Integrated Centre of Systems Biology (BRICS), Technische Universität Braunschweig, Braunschweig, Germany.
- German Center for Infection Research (DZIF), partner site Hannover Braunschweig, Braunschweig, Germany.
- Department of Biostatistics, Harvard School of Public Health, Boston, MA, USA.
| | - Mina Rezaei
- Department of Statistics, LMU Munich, Munich, Germany.
- Munich Center for Machine Learning, Munich, Germany.
| |
Collapse
|
254
|
Tan W, Shen Y. Multimodal learning of noncoding variant effects using genome sequence and chromatin structure. Bioinformatics 2023; 39:btad541. [PMID: 37669132 PMCID: PMC10502240 DOI: 10.1093/bioinformatics/btad541] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/22/2022] [Revised: 08/28/2023] [Accepted: 09/04/2023] [Indexed: 09/07/2023] Open
Abstract
MOTIVATION A growing amount of noncoding genetic variants, including single-nucleotide polymorphisms, are found to be associated with complex human traits and diseases. Their mechanistic interpretation is relatively limited and can use the help from computational prediction of their effects on epigenetic profiles. However, current models often focus on local, 1D genome sequence determinants and disregard global, 3D chromatin structure that critically affects epigenetic events. RESULTS We find that noncoding variants of unexpected high similarity in epigenetic profiles, with regards to their relatively low similarity in local sequences, can be largely attributed to their proximity in chromatin structure. Accordingly, we have developed a multimodal deep learning scheme that incorporates both data of 1D genome sequence and 3D chromatin structure for predicting noncoding variant effects. Specifically, we have integrated convolutional and recurrent neural networks for sequence embedding and graph neural networks for structure embedding despite the resolution gap between the two types of data, while utilizing recent DNA language models. Numerical results show that our models outperform competing sequence-only models in predicting epigenetic profiles and their use of long-range interactions complement sequence-only models in extracting regulatory motifs. They prove to be excellent predictors for noncoding variant effects in gene expression and pathogenicity, whether in unsupervised "zero-shot" learning or supervised "few-shot" learning. AVAILABILITY AND IMPLEMENTATION Codes and data can be accessed at https://github.com/Shen-Lab/ncVarPred-1D3D and https://zenodo.org/record/7975777.
Collapse
Affiliation(s)
- Wuwei Tan
- Department of Electrical and Computer Engineering, Texas A&M University, College Station, TX 77843, United States
| | - Yang Shen
- Department of Electrical and Computer Engineering, Texas A&M University, College Station, TX 77843, United States
- Department of Computer Science and Engineering, Texas A&M University, College Station, TX 77843, United States
- Institute of Biosciences and Technology and Department of Translational Medical Sciences, College of Medicine, Texas A&M University, Houston, TX 77030, United States
| |
Collapse
|
255
|
Lentzen M, Linden T, Veeranki S, Madan S, Kramer D, Leodolter W, Frohlich H. A Transformer-Based Model Trained on Large Scale Claims Data for Prediction of Severe COVID-19 Disease Progression. IEEE J Biomed Health Inform 2023; 27:4548-4558. [PMID: 37347632 DOI: 10.1109/jbhi.2023.3288768] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 06/24/2023]
Abstract
In situations like the COVID-19 pandemic, healthcare systems are under enormous pressure as they can rapidly collapse under the burden of the crisis. Machine learning (ML) based risk models could lift the burden by identifying patients with a high risk of severe disease progression. Electronic Health Records (EHRs) provide crucial sources of information to develop these models because they rely on routinely collected healthcare data. However, EHR data is challenging for training ML models because it contains irregularly timestamped diagnosis, prescription, and procedure codes. For such data, transformer-based models are promising. We extended the previously published Med-BERT model by including age, sex, medications, quantitative clinical measures, and state information. After pre-training on approximately 988 million EHRs from 3.5 million patients, we developed models to predict Acute Respiratory Manifestations (ARM) risk using the medical history of 80,211 COVID-19 patients. Compared to Random Forests, XGBoost, and RETAIN, our transformer-based models more accurately forecast the risk of developing ARM after COVID-19 infection. We used Integrated Gradients and Bayesian networks to understand the link between the essential features of our model. Finally, we evaluated adapting our model to Austrian in-patient data. Our study highlights the promise of predictive transformer-based models for precision medicine.
Collapse
|
256
|
Li Z, Jin J, Long W, Wei L. PLPMpro: Enhancing promoter sequence prediction with prompt-learning based pre-trained language model. Comput Biol Med 2023; 164:107260. [PMID: 37557052 DOI: 10.1016/j.compbiomed.2023.107260] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/01/2023] [Revised: 06/27/2023] [Accepted: 07/16/2023] [Indexed: 08/11/2023]
Abstract
The promoter region, positioned proximal to the transcription start sites, exerts control over the initiation of gene transcription by modulating the interaction with RNA polymerase. Consequently, the accurate recognition of promoter regions represents a critical focus within the bioinformatics domain. Although some methods leveraging pre-trained language models (PLMs) for promoter prediction have been proposed, the full potential of such PLMs remains largely untapped. In this study, we introduce PLPMpro, a model that capitalizes on prompt-learning and the pre-trained language model to enhance the prediction of promoter sequences. PLPMpro effectively harnesses the prompt learning paradigm to fully exploit the inherent capacities of the PLM, resulting in substantial improvements in prediction performance. Experiment results unequivocally demonstrate the efficacy of prompt learning in bolstering the capabilities of the pre-trained model. Consequently, PLPMpro surpasses both typical pre-trained model-based methods for promoter prediction and typical deep learning methods. Furthermore, we conduct various experiments to meticulously scrutinize the effects of different prompt learning settings and different numbers of soft modules on the model performance. More importantly, the interpretation experiment reveals that the pre-trained model captures biological semantics. Collectively, this research imparts a novel perspective on the optimal utilization of PLMs for addressing biological problems.
Collapse
Affiliation(s)
- Zhongshen Li
- School of Software, Shandong University, Jinan 250101, China; Joint SDU-NTU Centre for Artificial Intelligence Research (C-FAIR), Shandong University, Jinan 250101, China
| | - Junru Jin
- School of Software, Shandong University, Jinan 250101, China; Joint SDU-NTU Centre for Artificial Intelligence Research (C-FAIR), Shandong University, Jinan 250101, China
| | - Wentao Long
- School of Software, Shandong University, Jinan 250101, China; Joint SDU-NTU Centre for Artificial Intelligence Research (C-FAIR), Shandong University, Jinan 250101, China
| | - Leyi Wei
- School of Software, Shandong University, Jinan 250101, China; Joint SDU-NTU Centre for Artificial Intelligence Research (C-FAIR), Shandong University, Jinan 250101, China.
| |
Collapse
|
257
|
Zhang Y, Ge F, Li F, Yang X, Song J, Yu DJ. Prediction of Multiple Types of RNA Modifications via Biological Language Model. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2023; 20:3205-3214. [PMID: 37289599 DOI: 10.1109/tcbb.2023.3283985] [Citation(s) in RCA: 7] [Impact Index Per Article: 3.5] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/10/2023]
Abstract
It has been demonstrated that RNA modifications play essential roles in multiple biological processes. Accurate identification of RNA modifications in the transcriptome is critical for providing insights into the biological functions and mechanisms. Many tools have been developed for predicting RNA modifications at single-base resolution, which employ conventional feature engineering methods that focus on feature design and feature selection processes that require extensive biological expertise and may introduce redundant information. With the rapid development of artificial intelligence technologies, end-to-end methods are favorably received by researchers. Nevertheless, each well-trained model is only suitable for a specific RNA methylation modification type for nearly all of these approaches. In this study, we present MRM-BERT by feeding task-specific sequences into the powerful BERT (Bidirectional Encoder Representations from Transformers) model and implementing fine-tuning, which exhibits competitive performance to the state-of-the-art methods. MRM-BERT avoids repeated de novo training of the model and can predict multiple RNA modifications such as pseudouridine, m6A, m5C, and m1A in Mus musculus, Arabidopsis thaliana, and Saccharomyces cerevisiae. In addition, we analyse the attention heads to provide high attention regions for the prediction, and conduct saturated in silico mutagenesis of the input sequences to discover potential changes of RNA modifications, which can better assist researchers in their follow-up research.
Collapse
|
258
|
Liang S, Zhao Y, Jin J, Qiao J, Wang D, Wang Y, Wei L. Rm-LR: A long-range-based deep learning model for predicting multiple types of RNA modifications. Comput Biol Med 2023; 164:107238. [PMID: 37515874 DOI: 10.1016/j.compbiomed.2023.107238] [Citation(s) in RCA: 10] [Impact Index Per Article: 5.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/24/2023] [Revised: 06/16/2023] [Accepted: 07/07/2023] [Indexed: 07/31/2023]
Abstract
Recent research has highlighted the pivotal role of RNA post-transcriptional modifications in the regulation of RNA expression and function. Accurate identification of RNA modification sites is important for understanding RNA function. In this study, we propose a novel RNA modification prediction method, namely Rm-LR, which leverages a long-range-based deep learning approach to accurately predict multiple types of RNA modifications using RNA sequences only. Rm-LR incorporates two large-scale RNA language pre-trained models to capture discriminative sequential information and learn local important features, which are subsequently integrated through a bilinear attention network. Rm-LR supports a total of ten RNA modification types (m6A, m1A, m5C, m5U, m6Am, Ψ, Am, Cm, Gm, and Um) and significantly outperforms the state-of-the-art methods in terms of predictive capability on benchmark datasets. Experimental results show the effectiveness and superiority of Rm-LR in prediction of various RNA modifications, demonstrating the strong adaptability and robustness of our proposed model. We demonstrate that RNA language pretrained models enable to learn dense biological sequential representations from large-scale long-range RNA corpus, and meanwhile enhance the interpretability of the models. This work contributes to the development of accurate and reliable computational models for RNA modification prediction, providing insights into the complex landscape of RNA modifications.
Collapse
Affiliation(s)
- Sirui Liang
- School of Software, Shandong University, Jinan, 250101, China; Joint SDU-NTU Centre for Artificial Intelligence Research (C-FAIR), Shandong University, Jinan, 250101, China
| | - Yanxi Zhao
- School of Software, Shandong University, Jinan, 250101, China; Joint SDU-NTU Centre for Artificial Intelligence Research (C-FAIR), Shandong University, Jinan, 250101, China
| | - Junru Jin
- School of Software, Shandong University, Jinan, 250101, China; Joint SDU-NTU Centre for Artificial Intelligence Research (C-FAIR), Shandong University, Jinan, 250101, China
| | - Jianbo Qiao
- School of Software, Shandong University, Jinan, 250101, China; Joint SDU-NTU Centre for Artificial Intelligence Research (C-FAIR), Shandong University, Jinan, 250101, China
| | - Ding Wang
- School of Software, Shandong University, Jinan, 250101, China; Joint SDU-NTU Centre for Artificial Intelligence Research (C-FAIR), Shandong University, Jinan, 250101, China
| | - Yu Wang
- School of Software, Shandong University, Jinan, 250101, China; Joint SDU-NTU Centre for Artificial Intelligence Research (C-FAIR), Shandong University, Jinan, 250101, China
| | - Leyi Wei
- School of Software, Shandong University, Jinan, 250101, China; Joint SDU-NTU Centre for Artificial Intelligence Research (C-FAIR), Shandong University, Jinan, 250101, China.
| |
Collapse
|
259
|
Wichmann A, Buschong E, Müller A, Jünger D, Hildebrandt A, Hankeln T, Schmidt B. MetaTransformer: deep metagenomic sequencing read classification using self-attention models. NAR Genom Bioinform 2023; 5:lqad082. [PMID: 37705831 PMCID: PMC10495543 DOI: 10.1093/nargab/lqad082] [Citation(s) in RCA: 7] [Impact Index Per Article: 3.5] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/02/2023] [Revised: 07/14/2023] [Accepted: 08/30/2023] [Indexed: 09/15/2023] Open
Abstract
Deep learning has emerged as a paradigm that revolutionizes numerous domains of scientific research. Transformers have been utilized in language modeling outperforming previous approaches. Therefore, the utilization of deep learning as a tool for analyzing the genomic sequences is promising, yielding convincing results in fields such as motif identification and variant calling. DeepMicrobes, a machine learning-based classifier, has recently been introduced for taxonomic prediction at species and genus level. However, it relies on complex models based on bidirectional long short-term memory cells resulting in slow runtimes and excessive memory requirements, hampering its effective usability. We present MetaTransformer, a self-attention-based deep learning metagenomic analysis tool. Our transformer-encoder-based models enable efficient parallelization while outperforming DeepMicrobes in terms of species and genus classification abilities. Furthermore, we investigate approaches to reduce memory consumption and boost performance using different embedding schemes. As a result, we are able to achieve 2× to 5× speedup for inference compared to DeepMicrobes while keeping a significantly smaller memory footprint. MetaTransformer can be trained in 9 hours for genus and 16 hours for species prediction. Our results demonstrate performance improvements due to self-attention models and the impact of embedding schemes in deep learning on metagenomic sequencing data.
Collapse
Affiliation(s)
- Alexander Wichmann
- Institute of Computer Science, Johannes Gutenberg University, Staudingerweg 9, 55128 Mainz, Rhineland-Palatinate, Germany
| | - Etienne Buschong
- Institute of Computer Science, Johannes Gutenberg University, Staudingerweg 9, 55128 Mainz, Rhineland-Palatinate, Germany
| | - André Müller
- Institute of Computer Science, Johannes Gutenberg University, Staudingerweg 9, 55128 Mainz, Rhineland-Palatinate, Germany
| | - Daniel Jünger
- Institute of Computer Science, Johannes Gutenberg University, Staudingerweg 9, 55128 Mainz, Rhineland-Palatinate, Germany
| | - Andreas Hildebrandt
- Institute of Computer Science, Johannes Gutenberg University, Staudingerweg 9, 55128 Mainz, Rhineland-Palatinate, Germany
| | - Thomas Hankeln
- Institute of Organic and Molecular Evolution (iomE), Johannes Gutenberg University, J.-J. Becher-Weg 30A, 55128 Mainz, Rhineland-Palatinate, Germany
| | - Bertil Schmidt
- Institute of Computer Science, Johannes Gutenberg University, Staudingerweg 9, 55128 Mainz, Rhineland-Palatinate, Germany
| |
Collapse
|
260
|
Strutt JPB, Natarajan M, Lee E, Teo DBL, Sin WX, Cheung KW, Chew M, Thazin K, Barone PW, Wolfrum JM, Williams RBH, Rice SA, Springs SL. Machine learning-based detection of adventitious microbes in T-cell therapy cultures using long-read sequencing. Microbiol Spectr 2023; 11:e0135023. [PMID: 37646508 PMCID: PMC10580871 DOI: 10.1128/spectrum.01350-23] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/29/2023] [Accepted: 07/03/2023] [Indexed: 09/01/2023] Open
Abstract
Assuring that cell therapy products are safe before releasing them for use in patients is critical. Currently, compendial sterility testing for bacteria and fungi can take 7-14 days. The goal of this work was to develop a rapid untargeted approach for the sensitive detection of microbial contaminants at low abundance from low volume samples during the manufacturing process of cell therapies. We developed a long-read sequencing methodology using Oxford Nanopore Technologies MinION platform with 16S and 18S amplicon sequencing to detect USP <71> organisms and other microbial species. Reads are classified metagenomically to predict the microbial species. We used an extreme gradient boosting machine learning algorithm (XGBoost) to first assess if a sample is contaminated, and second, determine whether the predicted contaminant is correctly classified or misclassified. The model was used to make a final decision on the sterility status of the input sample. An optimized experimental and bioinformatics pipeline starting from spiked species through to sequenced reads allowed for the detection of microbial samples at 10 colony-forming units (CFU)/mL using metagenomic classification. Machine learning can be coupled with long-read sequencing to detect and identify sample sterility status and microbial species present in T-cell cultures, including the USP <71> organisms to 10 CFU/mL. IMPORTANCE This research presents a novel method for rapidly and accurately detecting microbial contaminants in cell therapy products, which is essential for ensuring patient safety. Traditional testing methods are time-consuming, taking 7-14 days, while our approach can significantly reduce this time. By combining advanced long-read nanopore sequencing techniques and machine learning, we can effectively identify the presence and types of microbial contaminants at low abundance levels. This breakthrough has the potential to improve the safety and efficiency of cell therapy manufacturing, leading to better patient outcomes and a more streamlined production process.
Collapse
Affiliation(s)
- James P. B. Strutt
- Singapore-MIT Alliance for Research and Technology, Singapore, Singapore
| | | | - Elizabeth Lee
- Singapore-MIT Alliance for Research and Technology, Singapore, Singapore
| | - Denise Bei Lin Teo
- Singapore-MIT Alliance for Research and Technology, Singapore, Singapore
| | - Wei-Xiang Sin
- Singapore-MIT Alliance for Research and Technology, Singapore, Singapore
| | - Ka-Wai Cheung
- Singapore-MIT Alliance for Research and Technology, Singapore, Singapore
| | - Marvin Chew
- Singapore-MIT Alliance for Research and Technology, Singapore, Singapore
| | - Khaing Thazin
- Singapore-MIT Alliance for Research and Technology, Singapore, Singapore
| | - Paul W. Barone
- MIT Center for Biomedical Innovation, Massachusetts Institute of Technology, Boston, USA
| | - Jacqueline M. Wolfrum
- MIT Center for Biomedical Innovation, Massachusetts Institute of Technology, Boston, USA
| | - Rohan B. H. Williams
- Singapore-MIT Alliance for Research and Technology, Singapore, Singapore
- Singapore Centre for Environmental Life Sciences Engineering, Life Sciences Institute, National University of Singapore, Singapore, Singapore
- Singapore Centre for Environmental Life Sciences Engineering, Nanyang Technological University, Singapore, Singapore
| | - Scott A. Rice
- Singapore Centre for Environmental Life Sciences Engineering, Nanyang Technological University, Singapore, Singapore
- CSIRO Microbiomes for One Systems Health, Agriculture and Food, Westmead, Australia
| | - Stacy L. Springs
- Singapore-MIT Alliance for Research and Technology, Singapore, Singapore
- MIT Center for Biomedical Innovation, Massachusetts Institute of Technology, Boston, USA
| |
Collapse
|
261
|
Tang X, Shang J, Ji Y, Sun Y. PLASMe: a tool to identify PLASMid contigs from short-read assemblies using transformer. Nucleic Acids Res 2023; 51:e83. [PMID: 37427782 PMCID: PMC10450166 DOI: 10.1093/nar/gkad578] [Citation(s) in RCA: 24] [Impact Index Per Article: 12.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/16/2022] [Revised: 06/19/2023] [Accepted: 06/26/2023] [Indexed: 07/11/2023] Open
Abstract
Plasmids are mobile genetic elements that carry important accessory genes. Cataloging plasmids is a fundamental step to elucidate their roles in promoting horizontal gene transfer between bacteria. Next generation sequencing (NGS) is the main source for discovering new plasmids today. However, NGS assembly programs tend to return contigs, making plasmid detection difficult. This problem is particularly grave for metagenomic assemblies, which contain short contigs of heterogeneous origins. Available tools for plasmid contig detection still suffer from some limitations. In particular, alignment-based tools tend to miss diverged plasmids while learning-based tools often have lower precision. In this work, we develop a plasmid detection tool PLASMe that capitalizes on the strength of alignment and learning-based methods. Closely related plasmids can be easily identified using the alignment component in PLASMe while diverged plasmids can be predicted using order-specific Transformer models. By encoding plasmid sequences as a language defined on the protein cluster-based token set, Transformer can learn the importance of proteins and their correlation through positionally token embedding and the attention mechanism. We compared PLASMe and other tools on detecting complete plasmids, plasmid contigs, and contigs assembled from CAMI2 simulated data. PLASMe achieved the highest F1-score. After validating PLASMe on data with known labels, we also tested it on real metagenomic and plasmidome data. The examination of some commonly used marker genes shows that PLASMe exhibits more reliable performance than other tools.
Collapse
Affiliation(s)
- Xubo Tang
- Department of Electrical Engineering, City University of Hong Kong, Tat Chee Avenue, Kowloon, Hong Kong SAR, China
| | - Jiayu Shang
- Department of Electrical Engineering, City University of Hong Kong, Tat Chee Avenue, Kowloon, Hong Kong SAR, China
| | - Yongxin Ji
- Department of Electrical Engineering, City University of Hong Kong, Tat Chee Avenue, Kowloon, Hong Kong SAR, China
| | - Yanni Sun
- Department of Electrical Engineering, City University of Hong Kong, Tat Chee Avenue, Kowloon, Hong Kong SAR, China
| |
Collapse
|
262
|
Ju H, Bai J, Jiang J, Che Y, Chen X. Comparative evaluation and analysis of DNA N4-methylcytosine methylation sites using deep learning. Front Genet 2023; 14:1254827. [PMID: 37671040 PMCID: PMC10476523 DOI: 10.3389/fgene.2023.1254827] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/07/2023] [Accepted: 07/31/2023] [Indexed: 09/07/2023] Open
Abstract
DNA N4-methylcytosine (4mC) is significantly involved in biological processes, such as DNA expression, repair, and replication. Therefore, accurate prediction methods are urgently needed. Deep learning methods have transformed applications that previously require sequencing expertise into engineering challenges that do not require expertise to solve. Here, we compare a variety of state-of-the-art deep learning models on six benchmark datasets to evaluate their performance in 4mC methylation site detection. We visualize the statistical analysis of the datasets and the performance of different deep-learning models. We conclude that deep learning can greatly expand the potential of methylation site prediction.
Collapse
Affiliation(s)
- Hong Ju
- Heilongjiang Agricultural Engineering Vocational College, Harbin, China
| | - Jie Bai
- Engineering Research Center of Integration and Application of Digital Learning Technology, Ministry of Education, Hangzhou, China
| | - Jing Jiang
- Beidahuang Industry Group General Hospital, Harbin, China
| | - Yusheng Che
- Heilongjiang Agricultural Engineering Vocational College, Harbin, China
| | - Xin Chen
- Department of Neurosurgical Laboratory, The First Affiliated Hospital of Harbin Medical University, Harbin, China
| |
Collapse
|
263
|
Abstract
Following the widespread use of deep learning for genomics, deep generative modeling is also becoming a viable methodology for the broad field. Deep generative models (DGMs) can learn the complex structure of genomic data and allow researchers to generate novel genomic instances that retain the real characteristics of the original dataset. Aside from data generation, DGMs can also be used for dimensionality reduction by mapping the data space to a latent space, as well as for prediction tasks via exploitation of this learned mapping or supervised/semi-supervised DGM designs. In this review, we briefly introduce generative modeling and two currently prevailing architectures, we present conceptual applications along with notable examples in functional and evolutionary genomics, and we provide our perspective on potential challenges and future directions.
Collapse
Affiliation(s)
- Burak Yelmen
- Laboratoire Interdisciplinaire des Sciences du Numérique, CNRS UMR 9015, INRIA, Université Paris-Saclay, Orsay, France;
- Institute of Genomics, University of Tartu, Tartu, Estonia
| | - Flora Jay
- Laboratoire Interdisciplinaire des Sciences du Numérique, CNRS UMR 9015, INRIA, Université Paris-Saclay, Orsay, France;
| |
Collapse
|
264
|
Aspromonte MC, Conte AD, Zhu S, Tan W, Shen Y, Zhang Y, Li Q, Wang MH, Babbi G, Bovo S, Martelli PL, Casadio R, Althagafi A, Toonsi S, Kulmanov M, Hoehndorf R, Katsonis P, Williams A, Lichtarge O, Xian S, Surento W, Pejaver V, Mooney SD, Sunderam U, Srinivasan R, Murgia A, Piovesan D, Tosatto SCE, Leonardi E. CAGI6 ID-Challenge: Assessment of phenotype and variant predictions in 415 children with Neurodevelopmental Disorders (NDDs). RESEARCH SQUARE 2023:rs.3.rs-3209168. [PMID: 37577579 PMCID: PMC10418555 DOI: 10.21203/rs.3.rs-3209168/v1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 08/15/2023]
Abstract
In the context of the Critical Assessment of the Genome Interpretation, 6th edition (CAGI6), the Genetics of Neurodevelopmental Disorders Lab in Padua proposed a new ID-challenge to give the opportunity of developing computational methods for predicting patient's phenotype and the causal variants. Eight research teams and 30 models had access to the phenotype details and real genetic data, based on the sequences of 74 genes (VCF format) in 415 pediatric patients affected by Neurodevelopmental Disorders (NDDs). NDDs are clinically and genetically heterogeneous conditions, with onset in infant age. In this study we evaluate the ability and accuracy of computational methods to predict comorbid phenotypes based on clinical features described in each patient and causal variants. Finally, we asked to develop a method to find new possible genetic causes for patients without a genetic diagnosis. As already done for the CAGI5, seven clinical features (ID, ASD, ataxia, epilepsy, microcephaly, macrocephaly, hypotonia), and variants (causative, putative pathogenic and contributing factors) were provided. Considering the overall clinical manifestation of our cohort, we give out the variant data and phenotypic traits of the 150 patients from CAGI5 ID-Challenge as training and validation for the prediction methods development.
Collapse
Affiliation(s)
| | | | - Shaowen Zhu
- Department of Electrical and Computer Engineering, Texas A&M University, College Station, TX 77843
| | - Wuwei Tan
- Department of Electrical and Computer Engineering, Texas A&M University, College Station, TX 77843
| | - Yang Shen
- Department of Electrical and Computer Engineering, Texas A&M University, College Station, TX 77843
| | | | - Qi Li
- CUHK Shenzhen Research Institute, Shenzhen
| | | | - Giulia Babbi
- Biocomputing Group, Department of Pharmacy and Biotechnology, University of Bologna
| | - Samuele Bovo
- Department of Agricultural and Food Sciences, University of Bologna
| | - Pier Luigi Martelli
- Biocomputing Group, Department of Pharmacy and Biotechnology, University of Bologna
| | - Rita Casadio
- Biocomputing Group, Department of Pharmacy and Biotechnology, University of Bologna
| | - Azza Althagafi
- Computational Bioscience Research Center (CBRC), Computer, Electrical and Mathematical Sciences & Engineering Division (CEMSE), King Abdullah University of Science and Technology (KAUST), Thuwal 23
| | - Sumyyah Toonsi
- Computational Bioscience Research Center (CBRC), Computer, Electrical and Mathematical Sciences & Engineering Division (CEMSE), King Abdullah University of Science and Technology (KAUST), Thuwal 23
| | - Maxat Kulmanov
- Computational Bioscience Research Center (CBRC), Computer, Electrical and Mathematical Sciences & Engineering Division (CEMSE), King Abdullah University of Science and Technology (KAUST), Thuwal 23
| | - Robert Hoehndorf
- Computational Bioscience Research Center (CBRC), Computer, Electrical and Mathematical Sciences & Engineering Division (CEMSE), King Abdullah University of Science and Technology (KAUST), Thuwal 23
| | - Panagiotis Katsonis
- Department of Molecular and Human Genetics, Baylor College of Medicine, One Baylor Plaza, Houston, TX 77030
| | - Amanda Williams
- Department of Molecular and Human Genetics, Baylor College of Medicine, One Baylor Plaza, Houston, TX 77030
| | - Olivier Lichtarge
- Department of Molecular and Human Genetics, Baylor College of Medicine, One Baylor Plaza, Houston, TX 77030
| | - Su Xian
- Department of Biomedical Informatics and Medical Education, University of Washington, Seattle, WA 98195
| | - Wesley Surento
- Department of Biomedical Informatics and Medical Education, University of Washington, Seattle, WA 98195
| | - Vikas Pejaver
- Institute for Genomic Health, Icahn School of Medicine at Mount Sinai, New York, NY 10029
| | - Sean D Mooney
- Department of Biomedical Informatics and Medical Education, University of Washington, Seattle, WA 98195
| | - Uma Sunderam
- Innovation Labs, Tata Consultancy Services, Hyderabad
| | | | | | | | | | | |
Collapse
|
265
|
Myronov A, Mazzocco G, Król P, Plewczynski D. BERTrand-peptide:TCR binding prediction using Bidirectional Encoder Representations from Transformers augmented with random TCR pairing. Bioinformatics 2023; 39:btad468. [PMID: 37535685 PMCID: PMC10444968 DOI: 10.1093/bioinformatics/btad468] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/04/2023] [Revised: 06/28/2023] [Accepted: 08/01/2023] [Indexed: 08/05/2023] Open
Abstract
MOTIVATION The advent of T-cell receptor (TCR) sequencing experiments allowed for a significant increase in the amount of peptide:TCR binding data available and a number of machine-learning models appeared in recent years. High-quality prediction models for a fixed epitope sequence are feasible, provided enough known binding TCR sequences are available. However, their performance drops significantly for previously unseen peptides. RESULTS We prepare the dataset of known peptide:TCR binders and augment it with negative decoys created using healthy donors' T-cell repertoires. We employ deep learning methods commonly applied in Natural Language Processing to train part a peptide:TCR binding model with a degree of cross-peptide generalization (0.69 AUROC). We demonstrate that BERTrand outperforms the published methods when evaluated on peptide sequences not used during model training. AVAILABILITY AND IMPLEMENTATION The datasets and the code for model training are available at https://github.com/SFGLab/bertrand.
Collapse
Affiliation(s)
- Alexander Myronov
- Faculty of Mathematics and Information Science, Warsaw University of Technology, Warsaw, Poland
- Ardigen, Krakow, Poland
| | | | | | - Dariusz Plewczynski
- Faculty of Mathematics and Information Science, Warsaw University of Technology, Warsaw, Poland
| |
Collapse
|
266
|
Li L, Xue Z, Du X. ASCRB: Multi-view based attentional feature selection for CircRNA-binding site prediction. Comput Biol Med 2023; 162:107077. [PMID: 37290390 DOI: 10.1016/j.compbiomed.2023.107077] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/23/2023] [Revised: 05/15/2023] [Accepted: 05/27/2023] [Indexed: 06/10/2023]
Abstract
CircRNA is a non-coding RNA with a special circular structure, which plays a key role in a variety of life activities by interacting with RNA-binding proteins through CircRNA binding sites. Therefore, accurately identifying CircRNA binding sites is of great importance for gene regulation. In previous studies, most of the methods are based on single-view or multi-view features. Considering that single-view methods provide less effective information, the current mainstream methods mainly focus on extracting rich relevant features by constructing multiple views. However, the increasing number of views leads to a large amount of redundant information, which is detrimental to the detection of CircRNA binding sites. Therefore, to solve this problem, we propose to use the channel attention mechanism to further obtain useful multi-view features by filtering out invalid information in each view. First, we use five feature encoding schemes to construct multi-view. Then, we calibrate the features by generating the global representation of each view, filtering out redundant information to retain important feature information. Finally, features obtained from multiple views are fused to detect RNA binding sites. To validate the effectiveness of the method, we compared its performance on 37 CircRNA-RBP datasets with existing methods. Experimental results show that the average AUC performance of our method is 93.85%, which is better than the current state-of-the-art methods. We also provide the source code, which can be accessed at https://github.com/dxqllp/ASCRB for access.
Collapse
Affiliation(s)
- Lei Li
- Department of Neurology, Shuyang Hospital Affiliated to Yangzhou University School of Medicine (Shuyang Hospital of Traditional Chinese Medicine, Suqian, China
| | - Zhigang Xue
- School of Computer Science and Technology, Anhui University, Hefei, China
| | - Xiuquan Du
- Key Laboratory of Intelligent Computing and Signal Processing of Ministry of Education, Anhui University, Hefei, China; School of Computer Science and Technology, Anhui University, Hefei, China.
| |
Collapse
|
267
|
Chao KH, Mao A, Salzberg SL, Pertea M. Splam: a deep-learning-based splice site predictor that improves spliced alignments. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2023:2023.07.27.550754. [PMID: 37546880 PMCID: PMC10402160 DOI: 10.1101/2023.07.27.550754] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 08/08/2023]
Abstract
The process of splicing messenger RNA to remove introns plays a central role in creating genes and gene variants. Here we describe Splam, a novel method for predicting splice junctions in DNA based on deep residual convolutional neural networks. Unlike some previous models, Splam looks at a relatively limited window of 400 base pairs flanking each splice site, motivated by the observation that the biological process of splicing relies primarily on signals within this window. Additionally, Splam introduces the idea of training the network on donor and acceptor pairs together, based on the principle that the splicing machinery recognizes both ends of each intron at once. We compare Splam's accuracy to recent state-of-the-art splice site prediction methods, particularly SpliceAI, another method that uses deep neural networks. Our results show that Splam is consistently more accurate than SpliceAI, with an overall accuracy of 96% at predicting human splice junctions. Splam generalizes even to non-human species, including distant ones like the flowering plant Arabidopsis thaliana. Finally, we demonstrate the use of Splam on a novel application: processing the spliced alignments of RNA-seq data to identify and eliminate errors. We show that when used in this manner, Splam yields substantial improvements in the accuracy of downstream transcriptome analysis of both poly(A) and ribo-depleted RNA-seq libraries. Overall, Splam offers a faster and more accurate approach to detecting splice junctions, while also providing a reliable and efficient solution for cleaning up erroneous spliced alignments.
Collapse
Affiliation(s)
- Kuan-Hao Chao
- Department of Computer Science, Johns Hopkins University, Baltimore, MD 21218, USA
- Center for Computational Biology, Johns Hopkins University, Baltimore, MD 21218, USA
| | - Alan Mao
- Department of Computer Science, Johns Hopkins University, Baltimore, MD 21218, USA
- Center for Computational Biology, Johns Hopkins University, Baltimore, MD 21218, USA
- Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD 21218, USA
| | - Steven L Salzberg
- Department of Computer Science, Johns Hopkins University, Baltimore, MD 21218, USA
- Center for Computational Biology, Johns Hopkins University, Baltimore, MD 21218, USA
- Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD 21218, USA
- Department of Biostatistics, Johns Hopkins University, Baltimore, MD 21211, USA
| | - Mihaela Pertea
- Center for Computational Biology, Johns Hopkins University, Baltimore, MD 21218, USA
- Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD 21218, USA
| |
Collapse
|
268
|
Zhu C, Baumgarten N, Wu M, Wang Y, Das AP, Kaur J, Ardakani FB, Duong TT, Pham MD, Duda M, Dimmeler S, Yuan T, Schulz MH, Krishnan J. CVD-associated SNPs with regulatory potential reveal novel non-coding disease genes. Hum Genomics 2023; 17:69. [PMID: 37491351 PMCID: PMC10369730 DOI: 10.1186/s40246-023-00513-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/26/2023] [Accepted: 07/12/2023] [Indexed: 07/27/2023] Open
Abstract
BACKGROUND Cardiovascular diseases (CVDs) are the leading cause of death worldwide. Genome-wide association studies (GWAS) have identified many single nucleotide polymorphisms (SNPs) appearing in non-coding genomic regions in CVDs. The SNPs may alter gene expression by modifying transcription factor (TF) binding sites and lead to functional consequences in cardiovascular traits or diseases. To understand the underlying molecular mechanisms, it is crucial to identify which variations are involved and how they affect TF binding. METHODS The SNEEP (SNP exploration and analysis using epigenomics data) pipeline was used to identify regulatory SNPs, which alter the binding behavior of TFs and link GWAS SNPs to their potential target genes for six CVDs. The human-induced pluripotent stem cells derived cardiomyocytes (hiPSC-CMs), monoculture cardiac organoids (MCOs) and self-organized cardiac organoids (SCOs) were used in the study. Gene expression, cardiomyocyte size and cardiac contractility were assessed. RESULTS By using our integrative computational pipeline, we identified 1905 regulatory SNPs in CVD GWAS data. These were associated with hundreds of genes, half of them non-coding RNAs (ncRNAs), suggesting novel CVD genes. We experimentally tested 40 CVD-associated non-coding RNAs, among them RP11-98F14.11, RPL23AP92, IGBP1P1, and CTD-2383I20.1, which were upregulated in hiPSC-CMs, MCOs and SCOs under hypoxic conditions. Further experiments showed that IGBP1P1 depletion rescued expression of hypertrophic marker genes, reduced hypoxia-induced cardiomyocyte size and improved hypoxia-reduced cardiac contractility in hiPSC-CMs and MCOs. CONCLUSIONS IGBP1P1 is a novel ncRNA with key regulatory functions in modulating cardiomyocyte size and cardiac function in our disease models. Our data suggest ncRNA IGBP1P1 as a potential therapeutic target to improve cardiac function in CVDs.
Collapse
Affiliation(s)
- Chaonan Zhu
- Institute for Cardiovascular Regeneration, Goethe University, 60590, Frankfurt Am Main, Germany
- Cardio-Pulmonary Institute, Goethe University Hospital, 60590, Frankfurt Am Main, Germany
| | - Nina Baumgarten
- Institute for Cardiovascular Regeneration, Goethe University, 60590, Frankfurt Am Main, Germany
- German Center for Cardiovascular Research, Partner Site Rhein-Main, 60590, Frankfurt Am Main, Germany
- Cardio-Pulmonary Institute, Goethe University Hospital, 60590, Frankfurt Am Main, Germany
| | - Meiqian Wu
- Institute for Cardiovascular Regeneration, Goethe University, 60590, Frankfurt Am Main, Germany
| | - Yue Wang
- Institute for Cardiovascular Regeneration, Goethe University, 60590, Frankfurt Am Main, Germany
| | - Arka Provo Das
- Institute for Cardiovascular Regeneration, Goethe University, 60590, Frankfurt Am Main, Germany
- Cardio-Pulmonary Institute, Goethe University Hospital, 60590, Frankfurt Am Main, Germany
| | - Jaskiran Kaur
- Institute for Cardiovascular Regeneration, Goethe University, 60590, Frankfurt Am Main, Germany
| | - Fatemeh Behjati Ardakani
- Institute for Cardiovascular Regeneration, Goethe University, 60590, Frankfurt Am Main, Germany
- German Center for Cardiovascular Research, Partner Site Rhein-Main, 60590, Frankfurt Am Main, Germany
- Cardio-Pulmonary Institute, Goethe University Hospital, 60590, Frankfurt Am Main, Germany
| | - Thanh Thuy Duong
- Genome Biologics, Theodor-Stern-Kai 7, 60590, Frankfurt Am Main, Germany
| | - Minh Duc Pham
- Institute for Cardiovascular Regeneration, Goethe University, 60590, Frankfurt Am Main, Germany
- Cardio-Pulmonary Institute, Goethe University Hospital, 60590, Frankfurt Am Main, Germany
- Department of Medicine III, Cardiology/Angiology/ Nephrology, Goethe University Hospital, Frankfurt, Germany
- Genome Biologics, Theodor-Stern-Kai 7, 60590, Frankfurt Am Main, Germany
| | - Maria Duda
- Genome Biologics, Theodor-Stern-Kai 7, 60590, Frankfurt Am Main, Germany
| | - Stefanie Dimmeler
- Institute for Cardiovascular Regeneration, Goethe University, 60590, Frankfurt Am Main, Germany
- German Center for Cardiovascular Research, Partner Site Rhein-Main, 60590, Frankfurt Am Main, Germany
- Cardio-Pulmonary Institute, Goethe University Hospital, 60590, Frankfurt Am Main, Germany
| | - Ting Yuan
- Institute for Cardiovascular Regeneration, Goethe University, 60590, Frankfurt Am Main, Germany.
- Cardio-Pulmonary Institute, Goethe University Hospital, 60590, Frankfurt Am Main, Germany.
- Department of Medicine III, Cardiology/Angiology/ Nephrology, Goethe University Hospital, Frankfurt, Germany.
| | - Marcel H Schulz
- Institute for Cardiovascular Regeneration, Goethe University, 60590, Frankfurt Am Main, Germany.
- German Center for Cardiovascular Research, Partner Site Rhein-Main, 60590, Frankfurt Am Main, Germany.
- Cardio-Pulmonary Institute, Goethe University Hospital, 60590, Frankfurt Am Main, Germany.
| | - Jaya Krishnan
- Institute for Cardiovascular Regeneration, Goethe University, 60590, Frankfurt Am Main, Germany.
- German Center for Cardiovascular Research, Partner Site Rhein-Main, 60590, Frankfurt Am Main, Germany.
- Cardio-Pulmonary Institute, Goethe University Hospital, 60590, Frankfurt Am Main, Germany.
- Department of Medicine III, Cardiology/Angiology/ Nephrology, Goethe University Hospital, Frankfurt, Germany.
| |
Collapse
|
269
|
Li H, He X, Kurowski L, Zhang R, Zhao D, Zeng J. Improving comparative analyses of Hi-C data via contrastive self-supervised learning. Brief Bioinform 2023; 24:bbad193. [PMID: 37287135 DOI: 10.1093/bib/bbad193] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/06/2023] [Revised: 04/12/2023] [Accepted: 04/27/2023] [Indexed: 06/09/2023] Open
Abstract
Hi-C is a widely applied chromosome conformation capture (3C)-based technique, which has produced a large number of genomic contact maps with high sequencing depths for a wide range of cell types, enabling comprehensive analyses of the relationships between biological functionalities (e.g. gene regulation and expression) and the three-dimensional genome structure. Comparative analyses play significant roles in Hi-C data studies, which are designed to make comparisons between Hi-C contact maps, thus evaluating the consistency of replicate Hi-C experiments (i.e. reproducibility measurement) and detecting statistically differential interacting regions with biological significance (i.e. differential chromatin interaction detection). However, due to the complex and hierarchical nature of Hi-C contact maps, it remains challenging to conduct systematic and reliable comparative analyses of Hi-C data. Here, we proposed sslHiC, a contrastive self-supervised representation learning framework, for precisely modeling the multi-level features of chromosome conformation and automatically producing informative feature embeddings for genomic loci and their interactions to facilitate comparative analyses of Hi-C contact maps. Comprehensive computational experiments on both simulated and real datasets demonstrated that our method consistently outperformed the state-of-the-art baseline methods in providing reliable measurements of reproducibility and detecting differential interactions with biological meanings.
Collapse
Affiliation(s)
- Han Li
- Institute for Interdisciplinary Information Sciences, Tsinghua University, 100084 Beijing, China
| | - Xuan He
- Machine Learning Department, Silexon AI Technology Co., Ltd., 210000 Nanjing, China
| | - Lawrence Kurowski
- Institute for Interdisciplinary Information Sciences, Tsinghua University, 100084 Beijing, China
| | - Ruotian Zhang
- Institute for Interdisciplinary Information Sciences, Tsinghua University, 100084 Beijing, China
| | - Dan Zhao
- Institute for Interdisciplinary Information Sciences, Tsinghua University, 100084 Beijing, China
| | - Jianyang Zeng
- Institute for Interdisciplinary Information Sciences, Tsinghua University, 100084 Beijing, China
| |
Collapse
|
270
|
Prakash A, Banerjee M. An interpretable block-attention network for identifying regulatory feature interactions. Brief Bioinform 2023; 24:bbad250. [PMID: 37401370 DOI: 10.1093/bib/bbad250] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/28/2022] [Revised: 05/15/2023] [Accepted: 06/16/2023] [Indexed: 07/05/2023] Open
Abstract
The importance of regulatory features in health and disease is increasing, making it crucial to identify the hallmarks of these features. Self-attention networks (SAN) have given rise to numerous models for the prediction of complex phenomena. But the potential of SANs in biological models was limited because of high memory requirement proportional to input token length and lack of interpretability of self-attention scores. To overcome these constraints, we propose a deep learning model named Interpretable Self-Attention Network for REGulatory interactions (ISANREG) that combines both block self-attention and attention-attribution mechanisms. This model predicts transcription factor-bound motif instances and DNA-mediated TF-TF interactions using self-attention attribution scores derived from the network, overcoming the limitations of previous deep learning models. ISANREG will serve as a framework for other biological models in interpreting the contribution of the input with single-nucleotide resolution.
Collapse
Affiliation(s)
- Anil Prakash
- Human Molecular Genetics Lab, Neurobiology and Genetics Division, Rajiv Gandhi Centre for Biotechnology, Thiruvananthapuram, Kerala, 695014, India
- Department of Biotechnology, University of Kerala, Kariavattom, Thiruvananthapuram, Kerala, India
| | - Moinak Banerjee
- Human Molecular Genetics Lab, Neurobiology and Genetics Division, Rajiv Gandhi Centre for Biotechnology, Thiruvananthapuram, Kerala, 695014, India
| |
Collapse
|
271
|
Wu KE, Zou JY, Chang H. Machine learning modeling of RNA structures: methods, challenges and future perspectives. Brief Bioinform 2023; 24:bbad210. [PMID: 37280185 DOI: 10.1093/bib/bbad210] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/13/2023] [Revised: 05/12/2023] [Accepted: 05/17/2023] [Indexed: 06/08/2023] Open
Abstract
The three-dimensional structure of RNA molecules plays a critical role in a wide range of cellular processes encompassing functions from riboswitches to epigenetic regulation. These RNA structures are incredibly dynamic and can indeed be described aptly as an ensemble of structures that shifts in distribution depending on different cellular conditions. Thus, the computational prediction of RNA structure poses a unique challenge, even as computational protein folding has seen great advances. In this review, we focus on a variety of machine learning-based methods that have been developed to predict RNA molecules' secondary structure, as well as more complex tertiary structures. We survey commonly used modeling strategies, and how many are inspired by or incorporate thermodynamic principles. We discuss the shortcomings that various design decisions entail and propose future directions that could build off these methods to yield more robust, accurate RNA structure predictions.
Collapse
Affiliation(s)
- Kevin E Wu
- Department of Computer Science, Stanford University, Stanford, CA 94305, USA
- Center for Personal Dynamic Regulomes, Stanford University, Stanford, CA 94305, USA
- Department of Biomedical Data Science, Stanford University School of Medicine, Stanford, CA 94305, USA
| | - James Y Zou
- Department of Computer Science, Stanford University, Stanford, CA 94305, USA
- Department of Biomedical Data Science, Stanford University School of Medicine, Stanford, CA 94305, USA
| | - Howard Chang
- Howard Hughes Medical Institute, Stanford University, Stanford, CA 94305, USA
- Department of Biomedical Data Science, Stanford University School of Medicine, Stanford, CA 94305, USA
| |
Collapse
|
272
|
Liu R, Hu YF, Huang JD, Fan X. A Bayesian approach to estimate MHC-peptide binding threshold. Brief Bioinform 2023; 24:bbad208. [PMID: 37279464 DOI: 10.1093/bib/bbad208] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/22/2023] [Revised: 05/08/2023] [Accepted: 05/16/2023] [Indexed: 06/08/2023] Open
Abstract
Major histocompatibility complex (MHC)-peptide binding is a critical step in enabling a peptide to serve as an antigen for T-cell recognition. Accurate prediction of this binding can facilitate various applications in immunotherapy. While many existing methods offer good predictive power for the binding affinity of a peptide to a specific MHC, few models attempt to infer the binding threshold that distinguishes binding sequences. These models often rely on experience-based ad hoc criteria, such as 500 or 1000nM. However, different MHCs may have different binding thresholds. As such, there is a need for an automatic, data-driven method to determine an accurate binding threshold. In this study, we proposed a Bayesian model that jointly infers core locations (binding sites), the binding affinity and the binding threshold. Our model provided the posterior distribution of the binding threshold, enabling accurate determination of an appropriate threshold for each MHC. To evaluate the performance of our method under different scenarios, we conducted simulation studies with varying dominant levels of motif distributions and proportions of random sequences. These simulation studies showed desirable estimation accuracy and robustness of our model. Additionally, when applied to real data, our results outperformed commonly used thresholds.
Collapse
Affiliation(s)
- Ran Liu
- Department of Statistics, The Chinese University of Hong Kong, Hong Kong SAR, China
| | - Ye-Fan Hu
- School of Biomedical Sciences, Li Ka Shing Faculty of Medicine, The University of Hong Kong, 3/F, Laboratory Block, 21 Sassoon Road, Hong Kong SAR, China
- Department of Medicine, Li Ka Shing Faculty of Medicine, The University of Hong Kong, 4/F Professional Block, Queen Mary Hospital, 102 Pokfulam Road, Hong Kong SAR, China
- BayVax Biotech Limited, Hong Kong Science Park, Pak Shek Kok, New Territories, Hong Kong SAR, China
| | - Jian-Dong Huang
- School of Biomedical Sciences, Li Ka Shing Faculty of Medicine, The University of Hong Kong, 3/F, Laboratory Block, 21 Sassoon Road, Hong Kong SAR, China
- CAS Key Laboratory of Quantitative Engineering Biology, Shenzhen Institute of Synthetic Biology, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen 518055, China
- Clinical Oncology Center, Shenzhen Key Laboratory for Cancer Metastasis and Personalized Therapy, The University of Hong Kong-Shenzhen Hospital, Shenzhen 518053, China
- Guangdong-Hong Kong Joint Laboratory for RNA Medicine, Sun Yat-Sen University, Guangzhou 510120, China
- State Key Laboratory of Cognitive and Brain Research, The University of Hong Kong, Hong Kong SAR, China
| | - Xiaodan Fan
- Department of Statistics, The Chinese University of Hong Kong, Hong Kong SAR, China
| |
Collapse
|
273
|
Zhang Z, Feng F, Qiu Y, Liu J. A generalizable framework to comprehensively predict epigenome, chromatin organization, and transcriptome. Nucleic Acids Res 2023; 51:5931-5947. [PMID: 37224527 PMCID: PMC10325920 DOI: 10.1093/nar/gkad436] [Citation(s) in RCA: 10] [Impact Index Per Article: 5.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/20/2022] [Revised: 03/31/2023] [Accepted: 05/09/2023] [Indexed: 05/26/2023] Open
Abstract
Many deep learning approaches have been proposed to predict epigenetic profiles, chromatin organization, and transcription activity. While these approaches achieve satisfactory performance in predicting one modality from another, the learned representations are not generalizable across predictive tasks or across cell types. In this paper, we propose a deep learning approach named EPCOT which employs a pre-training and fine-tuning framework, and is able to accurately and comprehensively predict multiple modalities including epigenome, chromatin organization, transcriptome, and enhancer activity for new cell types, by only requiring cell-type specific chromatin accessibility profiles. Many of these predicted modalities, such as Micro-C and ChIA-PET, are quite expensive to get in practice, and the in silico prediction from EPCOT should be quite helpful. Furthermore, this pre-training and fine-tuning framework allows EPCOT to identify generic representations generalizable across different predictive tasks. Interpreting EPCOT models also provides biological insights including mapping between different genomic modalities, identifying TF sequence binding patterns, and analyzing cell-type specific TF impacts on enhancer activity.
Collapse
Affiliation(s)
- Zhenhao Zhang
- Department of Computational Medicine and Bioinformatics, University of Michigan, 500 S. State St, Ann Arbor, MI 48109, USA
| | - Fan Feng
- Department of Computational Medicine and Bioinformatics, University of Michigan, 500 S. State St, Ann Arbor, MI 48109, USA
| | - Yiyang Qiu
- Department of Computer Science and Engineering, University of Michigan, 500 S. State St, Ann Arbor, MI 48109, USA
| | - Jie Liu
- Department of Computational Medicine and Bioinformatics, University of Michigan, 500 S. State St, Ann Arbor, MI 48109, USA
- Department of Computer Science and Engineering, University of Michigan, 500 S. State St, Ann Arbor, MI 48109, USA
| |
Collapse
|
274
|
Umerenkov D, Herbert A, Konovalov D, Danilova A, Beknazarov N, Kokh V, Fedorov A, Poptsova M. Z-flipon variants reveal the many roles of Z-DNA and Z-RNA in health and disease. Life Sci Alliance 2023; 6:e202301962. [PMID: 37164635 PMCID: PMC10172764 DOI: 10.26508/lsa.202301962] [Citation(s) in RCA: 6] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/31/2023] [Revised: 04/25/2023] [Accepted: 04/28/2023] [Indexed: 05/12/2023] Open
Abstract
Identifying roles for Z-DNA remains challenging given their dynamic nature. Here, we perform genome-wide interrogation with the DNABERT transformer algorithm trained on experimentally identified Z-DNA forming sequences (Z-flipons). The algorithm yields large performance enhancements (F1 = 0.83) over existing approaches and implements computational mutagenesis to assess the effects of base substitution on Z-DNA formation. We show Z-flipons are enriched in promoters and telomeres, overlapping quantitative trait loci for RNA expression, RNA editing, splicing, and disease-associated variants. We cross-validate across a number of orthogonal databases and define BZ junction motifs. Surprisingly, many effects we delineate are likely mediated through Z-RNA formation. A shared Z-RNA motif is identified in SCARF2, SMAD1, and CACNA1 transcripts, whereas other motifs are present in noncoding RNAs. We provide evidence for a Z-RNA fold that promotes adaptive immunity through alternative splicing of KRAB domain zinc finger proteins. An analysis of OMIM and presumptive gnomAD loss-of-function datasets reveals an overlap of Z-flipons with disease-causing variants in 8.6% and 2.9% of Mendelian disease genes, respectively, greatly extending the range of phenotypes mapped to Z-flipons.
Collapse
Affiliation(s)
| | - Alan Herbert
- Laboratory of Bioinformatics, Faculty of Computer Science, HSE University, Moscow, Russia
- InsideOutBio, Charlestown, MA, USA
| | - Dmitrii Konovalov
- Laboratory of Bioinformatics, Faculty of Computer Science, HSE University, Moscow, Russia
| | - Anna Danilova
- Laboratory of Bioinformatics, Faculty of Computer Science, HSE University, Moscow, Russia
| | - Nazar Beknazarov
- Laboratory of Bioinformatics, Faculty of Computer Science, HSE University, Moscow, Russia
| | | | - Aleksandr Fedorov
- Laboratory of Bioinformatics, Faculty of Computer Science, HSE University, Moscow, Russia
| | - Maria Poptsova
- Laboratory of Bioinformatics, Faculty of Computer Science, HSE University, Moscow, Russia
| |
Collapse
|
275
|
Biharie K, Michielsen L, Reinders MJT, Mahfouz A. Cell type matching across species using protein embeddings and transfer learning. Bioinformatics 2023; 39:i404-i412. [PMID: 37387141 PMCID: PMC10311290 DOI: 10.1093/bioinformatics/btad248] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 07/01/2023] Open
Abstract
MOTIVATION Knowing the relation between cell types is crucial for translating experimental results from mice to humans. Establishing cell type matches, however, is hindered by the biological differences between the species. A substantial amount of evolutionary information between genes that could be used to align the species is discarded by most of the current methods since they only use one-to-one orthologous genes. Some methods try to retain the information by explicitly including the relation between genes, however, not without caveats. RESULTS In this work, we present a model to transfer and align cell types in cross-species analysis (TACTiCS). First, TACTiCS uses a natural language processing model to match genes using their protein sequences. Next, TACTiCS employs a neural network to classify cell types within a species. Afterward, TACTiCS uses transfer learning to propagate cell type labels between species. We applied TACTiCS on scRNA-seq data of the primary motor cortex of human, mouse, and marmoset. Our model can accurately match and align cell types on these datasets. Moreover, our model outperforms Seurat and the state-of-the-art method SAMap. Finally, we show that our gene matching method results in better cell type matches than BLAST in our model. AVAILABILITY AND IMPLEMENTATION The implementation is available on GitHub (https://github.com/kbiharie/TACTiCS). The preprocessed datasets and trained models can be downloaded from Zenodo (https://doi.org/10.5281/zenodo.7582460).
Collapse
Affiliation(s)
- Kirti Biharie
- Delft Bioinformatics Lab, Delft University of Technology, Delft 2628XE, The Netherlands
- Department of Human Genetics, Leiden University Medical Center, Leiden 2333ZC, The Netherlands
- Leiden Computational Biology Center, Leiden University Medical Center, Leiden 2333ZC, The Netherlands
| | - Lieke Michielsen
- Delft Bioinformatics Lab, Delft University of Technology, Delft 2628XE, The Netherlands
- Department of Human Genetics, Leiden University Medical Center, Leiden 2333ZC, The Netherlands
- Leiden Computational Biology Center, Leiden University Medical Center, Leiden 2333ZC, The Netherlands
| | - Marcel J T Reinders
- Delft Bioinformatics Lab, Delft University of Technology, Delft 2628XE, The Netherlands
- Department of Human Genetics, Leiden University Medical Center, Leiden 2333ZC, The Netherlands
- Leiden Computational Biology Center, Leiden University Medical Center, Leiden 2333ZC, The Netherlands
| | - Ahmed Mahfouz
- Delft Bioinformatics Lab, Delft University of Technology, Delft 2628XE, The Netherlands
- Department of Human Genetics, Leiden University Medical Center, Leiden 2333ZC, The Netherlands
- Leiden Computational Biology Center, Leiden University Medical Center, Leiden 2333ZC, The Netherlands
| |
Collapse
|
276
|
Valeri JA, Soenksen LR, Collins KM, Ramesh P, Cai G, Powers R, Angenent-Mari NM, Camacho DM, Wong F, Lu TK, Collins JJ. BioAutoMATED: An end-to-end automated machine learning tool for explanation and design of biological sequences. Cell Syst 2023; 14:525-542.e9. [PMID: 37348466 PMCID: PMC10700034 DOI: 10.1016/j.cels.2023.05.007] [Citation(s) in RCA: 7] [Impact Index Per Article: 3.5] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/30/2022] [Revised: 02/17/2023] [Accepted: 05/22/2023] [Indexed: 06/24/2023]
Abstract
The design choices underlying machine-learning (ML) models present important barriers to entry for many biologists who aim to incorporate ML in their research. Automated machine-learning (AutoML) algorithms can address many challenges that come with applying ML to the life sciences. However, these algorithms are rarely used in systems and synthetic biology studies because they typically do not explicitly handle biological sequences (e.g., nucleotide, amino acid, or glycan sequences) and cannot be easily compared with other AutoML algorithms. Here, we present BioAutoMATED, an AutoML platform for biological sequence analysis that integrates multiple AutoML methods into a unified framework. Users are automatically provided with relevant techniques for analyzing, interpreting, and designing biological sequences. BioAutoMATED predicts gene regulation, peptide-drug interactions, and glycan annotation, and designs optimized synthetic biology components, revealing salient sequence characteristics. By automating sequence modeling, BioAutoMATED allows life scientists to incorporate ML more readily into their work.
Collapse
Affiliation(s)
- Jacqueline A Valeri
- Department of Biological Engineering, Massachusetts Institute of Technology, 77 Massachusetts Ave, Cambridge, MA 02139, USA; Institute for Medical Engineering and Science, Massachusetts Institute of Technology, 77 Massachusetts Ave, Cambridge, MA 02139, USA; Wyss Institute for Biologically Inspired Engineering, Harvard University, Boston, MA 02115, USA; Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA
| | - Luis R Soenksen
- Institute for Medical Engineering and Science, Massachusetts Institute of Technology, 77 Massachusetts Ave, Cambridge, MA 02139, USA; Wyss Institute for Biologically Inspired Engineering, Harvard University, Boston, MA 02115, USA; Department of Mechanical Engineering, Massachusetts Institute of Technology, 77 Massachusetts Ave, Cambridge, MA 02139, USA
| | - Katherine M Collins
- Wyss Institute for Biologically Inspired Engineering, Harvard University, Boston, MA 02115, USA; Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, 77 Massachusetts Ave, Cambridge, MA 02139, USA; Department of Brain and Cognitive Sciences, Massachusetts Institute of Technology, 77 Massachusetts Ave, Cambridge, MA 02139, USA; Department of Engineering, University of Cambridge, Trumpington St, Cambridge CB2 1PZ, UK
| | - Pradeep Ramesh
- Wyss Institute for Biologically Inspired Engineering, Harvard University, Boston, MA 02115, USA
| | - George Cai
- Wyss Institute for Biologically Inspired Engineering, Harvard University, Boston, MA 02115, USA
| | - Rani Powers
- Wyss Institute for Biologically Inspired Engineering, Harvard University, Boston, MA 02115, USA; Pluto Biosciences, Golden, CO 80402, USA
| | - Nicolaas M Angenent-Mari
- Department of Biological Engineering, Massachusetts Institute of Technology, 77 Massachusetts Ave, Cambridge, MA 02139, USA; Institute for Medical Engineering and Science, Massachusetts Institute of Technology, 77 Massachusetts Ave, Cambridge, MA 02139, USA; Wyss Institute for Biologically Inspired Engineering, Harvard University, Boston, MA 02115, USA
| | - Diogo M Camacho
- Wyss Institute for Biologically Inspired Engineering, Harvard University, Boston, MA 02115, USA
| | - Felix Wong
- Department of Biological Engineering, Massachusetts Institute of Technology, 77 Massachusetts Ave, Cambridge, MA 02139, USA; Institute for Medical Engineering and Science, Massachusetts Institute of Technology, 77 Massachusetts Ave, Cambridge, MA 02139, USA; Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA
| | - Timothy K Lu
- Department of Biological Engineering, Massachusetts Institute of Technology, 77 Massachusetts Ave, Cambridge, MA 02139, USA; Institute for Medical Engineering and Science, Massachusetts Institute of Technology, 77 Massachusetts Ave, Cambridge, MA 02139, USA; Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA; Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, 77 Massachusetts Ave, Cambridge, MA 02139, USA; Synthetic Biology Group, Research Laboratory of Electronics, Massachusetts Institute of Technology, Cambridge, MA 02139, USA
| | - James J Collins
- Department of Biological Engineering, Massachusetts Institute of Technology, 77 Massachusetts Ave, Cambridge, MA 02139, USA; Institute for Medical Engineering and Science, Massachusetts Institute of Technology, 77 Massachusetts Ave, Cambridge, MA 02139, USA; Wyss Institute for Biologically Inspired Engineering, Harvard University, Boston, MA 02115, USA; Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA; Harvard-MIT Program in Health Sciences and Technology, Cambridge, MA 02139, USA; Abdul Latif Jameel Clinic for Machine Learning in Health, Massachusetts Institute of Technology, Cambridge, MA 02139, USA.
| |
Collapse
|
277
|
Luo H, Li Y, Liu H, Ding P, Yu Y, Luo L. SENet: A deep learning framework for discriminating super- and typical enhancers by sequence information. Comput Biol Chem 2023; 105:107905. [PMID: 37348298 DOI: 10.1016/j.compbiolchem.2023.107905] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/12/2023] [Revised: 05/08/2023] [Accepted: 06/09/2023] [Indexed: 06/24/2023]
Abstract
Super-enhancers are large domains on the genome where multiple short typical enhancers within a specific genomic distance are stitched together. Typically, they are cell type-specific and responsible for defining cell identity and regulating gene transcription. Numerous studies have demonstrated that super-enhancers are enriched for trait-associated variants, and mutations in super-enhancers are possibly related to known diseases. Recently, several machine learning-based methods have been used to distinguish super-enhancers from typical enhancers by using high-throughput data from various experimental methods. The acquisition of such experimental data is usually costly and time-consuming. In this paper, we innovatively proposed SENet, a groundbreaking method based on a deep neural network model, for discriminating between the two categories solely utilizing sequence information. SENet employs dna2vec feature embedding, convolution for local feature extraction, attention pooling for refined feature retention, and Transformer for contextual information extraction. Experiments demonstrate that SENet outperforms all current state-of-the-art computational methods and shows satisfactory performance in cross-species validation. Our method pioneers the distinction between super-enhancers and typical ones using only sequence information. The source code and datasets are stored in https://github.com/lhy0322/SENet.
Collapse
Affiliation(s)
- Hanyu Luo
- School of Computer Sciences, University of South China, Hengyang 421001, China
| | - Ye Li
- School of Computer Sciences, University of South China, Hengyang 421001, China
| | - Huan Liu
- School of Computer Sciences, University of South China, Hengyang 421001, China
| | - Pingjian Ding
- School of Computer Sciences, University of South China, Hengyang 421001, China
| | - Ying Yu
- School of Computer Sciences, University of South China, Hengyang 421001, China
| | - Lingyun Luo
- School of Computer Sciences, University of South China, Hengyang 421001, China.
| |
Collapse
|
278
|
Zhu H, Liu T, Wang Z. scHiMe: predicting single-cell DNA methylation levels based on single-cell Hi-C data. Brief Bioinform 2023:7193585. [PMID: 37302805 PMCID: PMC10359091 DOI: 10.1093/bib/bbad223] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/09/2023] [Revised: 05/10/2023] [Accepted: 05/23/2023] [Indexed: 06/13/2023] Open
Abstract
Recently a biochemistry experiment named methyl-3C was developed to simultaneously capture the chromosomal conformations and DNA methylation levels on individual single cells. However, the number of data sets generated from this experiment is still small in the scientific community compared with the greater amount of single-cell Hi-C data generated from separate single cells. Therefore, a computational tool to predict single-cell methylation levels based on single-cell Hi-C data on the same individual cells is needed. We developed a graph transformer named scHiMe to accurately predict the base-pair-specific (bp-specific) methylation levels based on both single-cell Hi-C data and DNA nucleotide sequences. We benchmarked scHiMe for predicting the bp-specific methylation levels on all of the promoters of the human genome, all of the promoter regions together with the corresponding first exon and intron regions, and random regions on the whole genome. Our evaluation showed a high consistency between the predicted and methyl-3C-detected methylation levels. Moreover, the predicted DNA methylation levels resulted in accurate classifications of cells into different cell types, which indicated that our algorithm successfully captured the cell-to-cell variability in the single-cell Hi-C data. scHiMe is freely available at http://dna.cs.miami.edu/scHiMe/.
Collapse
Affiliation(s)
- Hao Zhu
- Department of Computer Science, University of Miami, 330M Ungar Building, 1365 Memorial Drive, Coral Gables, 33124-4245, FL, USA
| | - Tong Liu
- Department of Computer Science, University of Miami, 330M Ungar Building, 1365 Memorial Drive, Coral Gables, 33124-4245, FL, USA
| | - Zheng Wang
- Department of Computer Science, University of Miami, 330M Ungar Building, 1365 Memorial Drive, Coral Gables, 33124-4245, FL, USA
| |
Collapse
|
279
|
Yang R, Das A, Gao VR, Karbalayghareh A, Noble WS, Bilmes JA, Leslie CS. Epiphany: predicting Hi-C contact maps from 1D epigenomic signals. Genome Biol 2023; 24:134. [PMID: 37280678 PMCID: PMC10242996 DOI: 10.1186/s13059-023-02934-9] [Citation(s) in RCA: 28] [Impact Index Per Article: 14.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/20/2021] [Accepted: 04/06/2023] [Indexed: 06/08/2023] Open
Abstract
Recent deep learning models that predict the Hi-C contact map from DNA sequence achieve promising accuracy but cannot generalize to new cell types and or even capture differences among training cell types. We propose Epiphany, a neural network to predict cell-type-specific Hi-C contact maps from widely available epigenomic tracks. Epiphany uses bidirectional long short-term memory layers to capture long-range dependencies and optionally a generative adversarial network architecture to encourage contact map realism. Epiphany shows excellent generalization to held-out chromosomes within and across cell types, yields accurate TAD and interaction calls, and predicts structural changes caused by perturbations of epigenomic signals.
Collapse
Affiliation(s)
- Rui Yang
- Memorial Sloan Kettering Cancer Center, New York, USA
| | - Arnav Das
- University of Washington, Seattle, USA
| | - Vianne R Gao
- Memorial Sloan Kettering Cancer Center, New York, USA
| | | | | | | | | |
Collapse
|
280
|
Zhou M, Zhang H, Baii Z, Mann-Krzisnik D, Wang F, Li Y. Single-cell multi-omic topic embedding reveals cell-type-specific and COVID-19 severity-related immune signatures. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2023:2023.01.31.526312. [PMID: 36778483 PMCID: PMC9915637 DOI: 10.1101/2023.01.31.526312] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Indexed: 02/04/2023]
Abstract
The advent of single-cell multi-omics sequencing technology makes it possible for re-searchers to leverage multiple modalities for individual cells and explore cell heterogeneity. However, the high dimensional, discrete, and sparse nature of the data make the downstream analysis particularly challenging. Most of the existing computational methods for single-cell data analysis are either limited to single modality or lack flexibility and interpretability. In this study, we propose an interpretable deep learning method called multi-omic embedded topic model (moETM) to effectively perform integrative analysis of high-dimensional single-cell multimodal data. moETM integrates multiple omics data via a product-of-experts in the encoder for efficient variational inference and then employs multiple linear decoders to learn the multi-omic signatures of the gene regulatory programs. Through comprehensive experiments on public single-cell transcriptome and chromatin accessibility data (i.e., scRNA+scATAC), as well as scRNA and proteomic data (i.e., CITE-seq), moETM demonstrates superior performance compared with six state-of-the-art single-cell data analysis methods on seven publicly available datasets. By applying moETM to the scRNA+scATAC data in human bone marrow mononuclear cells (BMMCs), we identified sequence motifs corresponding to the transcription factors that regulate immune gene signatures. Applying moETM analysis to CITE-seq data from the COVID-19 patients revealed not only known immune cell-type-specific signatures but also composite multi-omic biomarkers of critical conditions due to COVID-19, thus providing insights from both biological and clinical perspectives.
Collapse
Affiliation(s)
- Manqi Zhou
- Department of Computational Biology, Cornell University
- Institute of Artificial Intelligence for Digital Health, Weill Cornell Medicine
| | - Hao Zhang
- Division of Health Informatics, Department of Population Health Sciences, Weill Cornell Medicine
| | - Zilong Baii
- Institute of Artificial Intelligence for Digital Health, Weill Cornell Medicine
- Division of Health Informatics, Department of Population Health Sciences, Weill Cornell Medicine
| | | | - Fei Wang
- Institute of Artificial Intelligence for Digital Health, Weill Cornell Medicine
- Division of Health Informatics, Department of Population Health Sciences, Weill Cornell Medicine
| | - Yue Li
- Quantitative Life Science, McGill University
- School of Computer Science, McGill University
- Mila - Quebec AI Institute
| |
Collapse
|
281
|
Cao C, Yang S, Li M, Li C. CircSSNN: circRNA-binding site prediction via sequence self-attention neural networks with pre-normalization. BMC Bioinformatics 2023; 24:220. [PMID: 37254080 DOI: 10.1186/s12859-023-05352-7] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/24/2023] [Accepted: 05/25/2023] [Indexed: 06/01/2023] Open
Abstract
BACKGROUND Circular RNAs (circRNAs) play a significant role in some diseases by acting as transcription templates. Therefore, analyzing the interaction mechanism between circRNA and RNA-binding proteins (RBPs) has far-reaching implications for the prevention and treatment of diseases. Existing models for circRNA-RBP identification usually adopt convolution neural network (CNN), recurrent neural network (RNN), or their variants as feature extractors. Most of them have drawbacks such as poor parallelism, insufficient stability, and inability to capture long-term dependencies. METHODS In this paper, we propose a new method completely using the self-attention mechanism to capture deep semantic features of RNA sequences. On this basis, we construct a CircSSNN model for the cirRNA-RBP identification. The proposed model constructs a feature scheme by fusing circRNA sequence representations with statistical distributions, static local contexts, and dynamic global contexts. With a stable and efficient network architecture, the distance between any two positions in a sequence is reduced to a constant, so CircSSNN can quickly capture the long-term dependencies and extract the deep semantic features. RESULTS Experiments on 37 circRNA datasets show that the proposed model has overall advantages in stability, parallelism, and prediction performance. Keeping the network structure and hyperparameters unchanged, we directly apply the CircSSNN to linRNA datasets. The favorable results show that CircSSNN can be transformed simply and efficiently without task-oriented tuning. CONCLUSIONS In conclusion, CircSSNN can serve as an appealing circRNA-RBP identification tool with good identification performance, excellent scalability, and wide application scope without the need for task-oriented fine-tuning of parameters, which is expected to reduce the professional threshold required for hyperparameter tuning in bioinformatics analysis.
Collapse
Affiliation(s)
- Chao Cao
- School of Computer Science and Technology, Guangxi University of Science and Technology, Liuzhou, China
| | - Shuhong Yang
- Key Laboratory of Guangxi Universities on Intelligent Computing and Distributed Information Processing, Guangxi University of Science and Technology, Liuzhou, China.
| | - Mengli Li
- School of Technology, Guilin University, Guilin, China
| | - Chungui Li
- School of Computer Science and Technology, Guangxi University of Science and Technology, Liuzhou, China.
| |
Collapse
|
282
|
Shen L, Feng H, Qiu Y, Wei GW. SVSBI: sequence-based virtual screening of biomolecular interactions. Commun Biol 2023; 6:536. [PMID: 37202415 PMCID: PMC10195826 DOI: 10.1038/s42003-023-04866-3] [Citation(s) in RCA: 10] [Impact Index Per Article: 5.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/10/2023] [Accepted: 04/24/2023] [Indexed: 05/20/2023] Open
Abstract
Virtual screening (VS) is a critical technique in understanding biomolecular interactions, particularly in drug design and discovery. However, the accuracy of current VS models heavily relies on three-dimensional (3D) structures obtained through molecular docking, which is often unreliable due to the low accuracy. To address this issue, we introduce a sequence-based virtual screening (SVS) as another generation of VS models that utilize advanced natural language processing (NLP) algorithms and optimized deep K-embedding strategies to encode biomolecular interactions without relying on 3D structure-based docking. We demonstrate that SVS outperforms state-of-the-art performance for four regression datasets involving protein-ligand binding, protein-protein, protein-nucleic acid binding, and ligand inhibition of protein-protein interactions and five classification datasets for protein-protein interactions in five biological species. SVS has the potential to transform current practices in drug discovery and protein engineering.
Collapse
Affiliation(s)
- Li Shen
- Department of Mathematics, Michigan State University, East Lansing, MI, 48824, USA
| | - Hongsong Feng
- Department of Mathematics, Michigan State University, East Lansing, MI, 48824, USA
| | - Yuchi Qiu
- Department of Mathematics, Michigan State University, East Lansing, MI, 48824, USA
| | - Guo-Wei Wei
- Department of Mathematics, Michigan State University, East Lansing, MI, 48824, USA.
- Department of Electrical and Computer Engineering, Michigan State University, East Lansing, MI, 48824, USA.
- Department of Biochemistry and Molecular Biology, Michigan State University, East Lansing, MI, 48824, USA.
| |
Collapse
|
283
|
Smith GD, Ching WH, Cornejo-Páramo P, Wong ES. Decoding enhancer complexity with machine learning and high-throughput discovery. Genome Biol 2023; 24:116. [PMID: 37173718 PMCID: PMC10176946 DOI: 10.1186/s13059-023-02955-4] [Citation(s) in RCA: 12] [Impact Index Per Article: 6.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/18/2022] [Accepted: 04/28/2023] [Indexed: 05/15/2023] Open
Abstract
Enhancers are genomic DNA elements controlling spatiotemporal gene expression. Their flexible organization and functional redundancies make deciphering their sequence-function relationships challenging. This article provides an overview of the current understanding of enhancer organization and evolution, with an emphasis on factors that influence these relationships. Technological advancements, particularly in machine learning and synthetic biology, are discussed in light of how they provide new ways to understand this complexity. Exciting opportunities lie ahead as we continue to unravel the intricacies of enhancer function.
Collapse
Affiliation(s)
- Gabrielle D Smith
- Victor Chang Cardiac Research Institute, 405 Liverpool Street, Darlinghurst, NSW, Australia
- School of Biotechnology and Biomolecular Sciences, UNSW Sydney, Kensington, NSW, Australia
| | - Wan Hern Ching
- Victor Chang Cardiac Research Institute, 405 Liverpool Street, Darlinghurst, NSW, Australia
| | - Paola Cornejo-Páramo
- Victor Chang Cardiac Research Institute, 405 Liverpool Street, Darlinghurst, NSW, Australia
- School of Biotechnology and Biomolecular Sciences, UNSW Sydney, Kensington, NSW, Australia
| | - Emily S Wong
- Victor Chang Cardiac Research Institute, 405 Liverpool Street, Darlinghurst, NSW, Australia.
- School of Biotechnology and Biomolecular Sciences, UNSW Sydney, Kensington, NSW, Australia.
| |
Collapse
|
284
|
Mourad R. Semi-supervised learning improves regulatory sequence prediction with unlabeled sequences. BMC Bioinformatics 2023; 24:186. [PMID: 37147561 PMCID: PMC10163727 DOI: 10.1186/s12859-023-05303-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/08/2022] [Accepted: 04/25/2023] [Indexed: 05/07/2023] Open
Abstract
MOTIVATION Genome-wide association studies have systematically identified thousands of single nucleotide polymorphisms (SNPs) associated with complex genetic diseases. However, the majority of those SNPs were found in non-coding genomic regions, preventing the understanding of the underlying causal mechanism. Predicting molecular processes based on the DNA sequence represents a promising approach to understand the role of those non-coding SNPs. Over the past years, deep learning was successfully applied to regulatory sequence prediction using supervised learning. Supervised learning required DNA sequences associated with functional data for training, whose amount is strongly limited by the finite size of the human genome. Conversely, the amount of mammalian DNA sequences is exponentially increasing due to ongoing large sequencing projects, but without functional data in most cases. RESULTS To alleviate the limitations of supervised learning, we propose a paradigm shift with semi-supervised learning, which does not only exploit labeled sequences (e.g. human genome with ChIP-seq experiment), but also unlabeled sequences available in much larger amounts (e.g. from other species without ChIP-seq experiment, such as chimpanzee). Our approach is flexible and can be plugged into any neural architecture including shallow and deep networks, and shows strong predictive performance improvements compared to supervised learning in most cases (up to [Formula: see text]). AVAILABILITY AND IMPLEMENTATION https://forgemia.inra.fr/raphael.mourad/deepgnn .
Collapse
Affiliation(s)
- Raphaël Mourad
- MIAT, INRAE, 31320, Castanet-Tolosan, France.
- University of Toulouse, UPS, 31062, Toulouse, France.
| |
Collapse
|
285
|
Grešová K, Martinek V, Čechák D, Šimeček P, Alexiou P. Genomic benchmarks: a collection of datasets for genomic sequence classification. BMC Genom Data 2023; 24:25. [PMID: 37127596 PMCID: PMC10150520 DOI: 10.1186/s12863-023-01123-8] [Citation(s) in RCA: 16] [Impact Index Per Article: 8.0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/18/2022] [Accepted: 03/31/2023] [Indexed: 05/03/2023] Open
Abstract
BACKGROUND Recently, deep neural networks have been successfully applied in many biological fields. In 2020, a deep learning model AlphaFold won the protein folding competition with predicted structures within the error tolerance of experimental methods. However, this solution to the most prominent bioinformatic challenge of the past 50 years has been possible only thanks to a carefully curated benchmark of experimentally predicted protein structures. In Genomics, we have similar challenges (annotation of genomes and identification of functional elements) but currently, we lack benchmarks similar to protein folding competition. RESULTS Here we present a collection of curated and easily accessible sequence classification datasets in the field of genomics. The proposed collection is based on a combination of novel datasets constructed from the mining of publicly available databases and existing datasets obtained from published articles. The collection currently contains nine datasets that focus on regulatory elements (promoters, enhancers, open chromatin region) from three model organisms: human, mouse, and roundworm. A simple convolution neural network is also included in a repository and can be used as a baseline model. Benchmarks and the baseline model are distributed as the Python package 'genomic-benchmarks', and the code is available at https://github.com/ML-Bioinfo-CEITEC/genomic_benchmarks . CONCLUSIONS Deep learning techniques revolutionized many biological fields but mainly thanks to the carefully curated benchmarks. For the field of Genomics, we propose a collection of benchmark datasets for the classification of genomic sequences with an interface for the most commonly used deep learning libraries, implementation of the simple neural network and a training framework that can be used as a starting point for future research. The main aim of this effort is to create a repository for shared datasets that will make machine learning for genomics more comparable and reproducible while reducing the overhead of researchers who want to enter the field, leading to healthy competition and new discoveries.
Collapse
Affiliation(s)
- Katarína Grešová
- Centre for Molecular Medicine, Central European Institute of Technology (CEITEC), Masaryk University, Brno, Czechia
- National Centre for Biomolecular Research, Faculty of Science, Masaryk University, Brno, Czechia
| | - Vlastimil Martinek
- Centre for Molecular Medicine, Central European Institute of Technology (CEITEC), Masaryk University, Brno, Czechia
- National Centre for Biomolecular Research, Faculty of Science, Masaryk University, Brno, Czechia
| | - David Čechák
- Centre for Molecular Medicine, Central European Institute of Technology (CEITEC), Masaryk University, Brno, Czechia
- National Centre for Biomolecular Research, Faculty of Science, Masaryk University, Brno, Czechia
| | - Petr Šimeček
- Centre for Molecular Medicine, Central European Institute of Technology (CEITEC), Masaryk University, Brno, Czechia.
| | - Panagiotis Alexiou
- Centre for Molecular Medicine, Central European Institute of Technology (CEITEC), Masaryk University, Brno, Czechia
| |
Collapse
|
286
|
Rios-Martinez C, Bhattacharya N, Amini AP, Crawford L, Yang KK. Deep self-supervised learning for biosynthetic gene cluster detection and product classification. PLoS Comput Biol 2023; 19:e1011162. [PMID: 37220151 PMCID: PMC10241353 DOI: 10.1371/journal.pcbi.1011162] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/24/2022] [Revised: 06/05/2023] [Accepted: 05/07/2023] [Indexed: 05/25/2023] Open
Abstract
Natural products are chemical compounds that form the basis of many therapeutics used in the pharmaceutical industry. In microbes, natural products are synthesized by groups of colocalized genes called biosynthetic gene clusters (BGCs). With advances in high-throughput sequencing, there has been an increase of complete microbial isolate genomes and metagenomes, from which a vast number of BGCs are undiscovered. Here, we introduce a self-supervised learning approach designed to identify and characterize BGCs from such data. To do this, we represent BGCs as chains of functional protein domains and train a masked language model on these domains. We assess the ability of our approach to detect BGCs and characterize BGC properties in bacterial genomes. We also demonstrate that our model can learn meaningful representations of BGCs and their constituent domains, detect BGCs in microbial genomes, and predict BGC product classes. These results highlight self-supervised neural networks as a promising framework for improving BGC prediction and classification.
Collapse
Affiliation(s)
- Carolina Rios-Martinez
- Microsoft Research New England, Cambridge, Massachusetts, United States of America
- Department of Bioengineering, Stanford University, Stanford, California, United States of America
| | - Nicholas Bhattacharya
- Microsoft Research New England, Cambridge, Massachusetts, United States of America
- Department of Mathematics, University of California, Berkeley, Berkeley, California, United States of America
| | - Ava P. Amini
- Microsoft Research New England, Cambridge, Massachusetts, United States of America
| | - Lorin Crawford
- Microsoft Research New England, Cambridge, Massachusetts, United States of America
| | - Kevin K. Yang
- Microsoft Research New England, Cambridge, Massachusetts, United States of America
| |
Collapse
|
287
|
Soylu NN, Sefer E. BERT2OME: Prediction of 2'-O-Methylation Modifications From RNA Sequence by Transformer Architecture Based on BERT. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2023; 20:2177-2189. [PMID: 37819796 DOI: 10.1109/tcbb.2023.3237769] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 10/13/2023]
Abstract
Recent work on language models has resulted in state-of-the-art performance on various language tasks. Among these, Bidirectional Encoder Representations from Transformers (BERT) has focused on contextualizing word embeddings to extract context and semantics of the words. On the other hand, post-transcriptional 2'-O-methylation (Nm) RNA modification is important in various cellular tasks and related to a number of diseases. The existing high-throughput experimental techniques take longer time to detect these modifications, and costly in exploring these functional processes. Here, to deeply understand the associated biological processes faster, we come up with an efficient method Bert2Ome to infer 2'-O-methylation RNA modification sites from RNA sequences. Bert2Ome combines BERT-based model with convolutional neural networks (CNN) to infer the relationship between the modification sites and RNA sequence content. Unlike the methods proposed so far, Bert2Ome assumes each given RNA sequence as a text and focuses on improving the modification prediction performance by integrating the pretrained deep learning-based language model BERT. Additionally, our transformer-based approach could infer modification sites across multiple species. According to 5-fold cross-validation, human and mouse accuracies were 99.15% and 94.35% respectively. Similarly, ROC AUC scores were 0.99, 0.94 for the same species. Detailed results show that Bert2Ome reduces the time consumed in biological experiments and outperforms the existing approaches across different datasets and species over multiple metrics. Additionally, deep learning approaches such as 2D CNNs are more promising in learning BERT attributes than more conventional machine learning methods.
Collapse
|
288
|
Nikolados EM, Oyarzún DA. Deep learning for optimization of protein expression. Curr Opin Biotechnol 2023; 81:102941. [PMID: 37087839 DOI: 10.1016/j.copbio.2023.102941] [Citation(s) in RCA: 6] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/22/2022] [Revised: 02/02/2023] [Accepted: 03/17/2023] [Indexed: 04/25/2023]
Abstract
Recent progress in high-throughput DNA synthesis and sequencing has enabled the development of massively parallel reporter assays for strain characterization. These datasets map a large number of DNA sequences to protein expression levels, sparking increased interest in data-driven methods for sequence-to-expression modeling. Here, we highlight advances in deep learning models of protein expression and their potential for optimizing strains engineered to produce recombinant proteins. We review recent works that built highly accurate models and discuss challenges that hinder adoption by end users. There is a need to better align this technology with the constraints encountered in strain engineering, particularly the cost of acquiring large amounts of data and the requirement for interpretable models that generalize beyond the training data. Overcoming these barriers will help to incentivize academic and industrial laboratories to tap into a new era of data-centric strain engineering.
Collapse
Affiliation(s)
| | - Diego A Oyarzún
- School of Biological Sciences, University of Edinburgh, Edinburgh EH9 3JH, UK; School of Informatics, University of Edinburgh, Edinburgh EH8 9AB, UK; The Alan Turing Institute, London NW1 2DB, UK.
| |
Collapse
|
289
|
Kravchuk EV, Ashniev GA, Gladkova MG, Orlov AV, Vasileva AV, Boldyreva AV, Burenin AG, Skirda AM, Nikitin PI, Orlova NN. Experimental Validation and Prediction of Super-Enhancers: Advances and Challenges. Cells 2023; 12:cells12081191. [PMID: 37190100 DOI: 10.3390/cells12081191] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/07/2023] [Revised: 04/07/2023] [Accepted: 04/14/2023] [Indexed: 05/17/2023] Open
Abstract
Super-enhancers (SEs) are cis-regulatory elements of the human genome that have been widely discussed since the discovery and origin of the term. Super-enhancers have been shown to be strongly associated with the expression of genes crucial for cell differentiation, cell stability maintenance, and tumorigenesis. Our goal was to systematize research studies dedicated to the investigation of structure and functions of super-enhancers as well as to define further perspectives of the field in various applications, such as drug development and clinical use. We overviewed the fundamental studies which provided experimental data on various pathologies and their associations with particular super-enhancers. The analysis of mainstream approaches for SE search and prediction allowed us to accumulate existing data and propose directions for further algorithmic improvements of SEs' reliability levels and efficiency. Thus, here we provide the description of the most robust algorithms such as ROSE, imPROSE, and DEEPSEN and suggest their further use for various research and development tasks. The most promising research direction, which is based on topic and number of published studies, are cancer-associated super-enhancers and prospective SE-targeted therapy strategies, most of which are discussed in this review.
Collapse
Affiliation(s)
- Ekaterina V Kravchuk
- Prokhorov General Physics Institute of the Russian Academy of Sciences, 38 Vavilov St., 119991 Moscow, Russia
- Faculty of Biology, Lomonosov Moscow State University, Leninskiye Gory, MSU, 1-12, 119991 Moscow, Russia
| | - German A Ashniev
- Prokhorov General Physics Institute of the Russian Academy of Sciences, 38 Vavilov St., 119991 Moscow, Russia
- Faculty of Biology, Lomonosov Moscow State University, Leninskiye Gory, MSU, 1-12, 119991 Moscow, Russia
- Faculty of Bioengineering and Bioinformatics, Lomonosov Moscow State University, GSP-1, Leninskiye Gory, MSU, 1-73, 119234 Moscow, Russia
| | - Marina G Gladkova
- Faculty of Bioengineering and Bioinformatics, Lomonosov Moscow State University, GSP-1, Leninskiye Gory, MSU, 1-73, 119234 Moscow, Russia
| | - Alexey V Orlov
- Prokhorov General Physics Institute of the Russian Academy of Sciences, 38 Vavilov St., 119991 Moscow, Russia
| | - Anastasiia V Vasileva
- Prokhorov General Physics Institute of the Russian Academy of Sciences, 38 Vavilov St., 119991 Moscow, Russia
| | - Anna V Boldyreva
- Prokhorov General Physics Institute of the Russian Academy of Sciences, 38 Vavilov St., 119991 Moscow, Russia
| | - Alexandr G Burenin
- Prokhorov General Physics Institute of the Russian Academy of Sciences, 38 Vavilov St., 119991 Moscow, Russia
| | - Artemiy M Skirda
- Prokhorov General Physics Institute of the Russian Academy of Sciences, 38 Vavilov St., 119991 Moscow, Russia
| | - Petr I Nikitin
- Prokhorov General Physics Institute of the Russian Academy of Sciences, 38 Vavilov St., 119991 Moscow, Russia
| | - Natalia N Orlova
- Prokhorov General Physics Institute of the Russian Academy of Sciences, 38 Vavilov St., 119991 Moscow, Russia
| |
Collapse
|
290
|
Barbero-Aparicio JA, Olivares-Gil A, Díez-Pastor JF, García-Osorio C. Deep learning and support vector machines for transcription start site identification. PeerJ Comput Sci 2023; 9:e1340. [PMID: 37346545 PMCID: PMC10280436 DOI: 10.7717/peerj-cs.1340] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/26/2022] [Accepted: 03/21/2023] [Indexed: 06/23/2023]
Abstract
Recognizing transcription start sites is key to gene identification. Several approaches have been employed in related problems such as detecting translation initiation sites or promoters, many of the most recent ones based on machine learning. Deep learning methods have been proven to be exceptionally effective for this task, but their use in transcription start site identification has not yet been explored in depth. Also, the very few existing works do not compare their methods to support vector machines (SVMs), the most established technique in this area of study, nor provide the curated dataset used in the study. The reduced amount of published papers in this specific problem could be explained by this lack of datasets. Given that both support vector machines and deep neural networks have been applied in related problems with remarkable results, we compared their performance in transcription start site predictions, concluding that SVMs are computationally much slower, and deep learning methods, specially long short-term memory neural networks (LSTMs), are best suited to work with sequences than SVMs. For such a purpose, we used the reference human genome GRCh38. Additionally, we studied two different aspects related to data processing: the proper way to generate training examples and the imbalanced nature of the data. Furthermore, the generalization performance of the models studied was also tested using the mouse genome, where the LSTM neural network stood out from the rest of the algorithms. To sum up, this article provides an analysis of the best architecture choices in transcription start site identification, as well as a method to generate transcription start site datasets including negative instances on any species available in Ensembl. We found that deep learning methods are better suited than SVMs to solve this problem, being more efficient and better adapted to long sequences and large amounts of data. We also create a transcription start site (TSS) dataset large enough to be used in deep learning experiments.
Collapse
Affiliation(s)
| | - Alicia Olivares-Gil
- Departamento de Ingeniería Informática, Universidad de Burgos, Burgos, Spain
| | - José F. Díez-Pastor
- Departamento de Ingeniería Informática, Universidad de Burgos, Burgos, Spain
| | - César García-Osorio
- Departamento de Ingeniería Informática, Universidad de Burgos, Burgos, Spain
| |
Collapse
|
291
|
Joiret M, Leclercq M, Lambrechts G, Rapino F, Close P, Louppe G, Geris L. Cracking the genetic code with neural networks. Front Artif Intell 2023; 6:1128153. [PMID: 37091301 PMCID: PMC10117997 DOI: 10.3389/frai.2023.1128153] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/21/2022] [Accepted: 03/21/2023] [Indexed: 04/09/2023] Open
Abstract
The genetic code is textbook scientific knowledge that was soundly established without resorting to Artificial Intelligence (AI). The goal of our study was to check whether a neural network could re-discover, on its own, the mapping links between codons and amino acids and build the complete deciphering dictionary upon presentation of transcripts proteins data training pairs. We compared different Deep Learning neural network architectures and estimated quantitatively the size of the required human transcriptomic training set to achieve the best possible accuracy in the codon-to-amino-acid mapping. We also investigated the effect of a codon embedding layer assessing the semantic similarity between codons on the rate of increase of the training accuracy. We further investigated the benefit of quantifying and using the unbalanced representations of amino acids within real human proteins for a faster deciphering of rare amino acids codons. Deep neural networks require huge amount of data to train them. Deciphering the genetic code by a neural network is no exception. A test accuracy of 100% and the unequivocal deciphering of rare codons such as the tryptophan codon or the stop codons require a training dataset of the order of 4–22 millions cumulated pairs of codons with their associated amino acids presented to the neural network over around 7–40 training epochs, depending on the architecture and settings. We confirm that the wide generic capacities and modularity of deep neural networks allow them to be customized easily to learn the deciphering task of the genetic code efficiently.
Collapse
Affiliation(s)
- Marc Joiret
- Biomechanics Research Unit, GIGA in Silico Medicine, Liège University, Liège, Belgium
- *Correspondence: Marc Joiret
| | - Marine Leclercq
- Cancer Signaling, GIGA Stem Cells, Liège University, Liège, Belgium
| | - Gaspard Lambrechts
- Department of Electrical Engineering and Computer Science, Artificial Intelligence and Deep Learning, Montefiore Institute, Liège University, Liège, Belgium
| | - Francesca Rapino
- Cancer Signaling, GIGA Stem Cells, Liège University, Liège, Belgium
| | - Pierre Close
- Cancer Signaling, GIGA Stem Cells, Liège University, Liège, Belgium
| | - Gilles Louppe
- Department of Electrical Engineering and Computer Science, Artificial Intelligence and Deep Learning, Montefiore Institute, Liège University, Liège, Belgium
| | - Liesbet Geris
- Biomechanics Research Unit, GIGA in Silico Medicine, Liège University, Liège, Belgium
- Skeletal Biology and Engineering Research Center, KU Leuven, Leuven, Belgium
- Biomechanics Section, KU Leuven, Heverlee, Belgium
| |
Collapse
|
292
|
Rozowsky J, Gao J, Borsari B, Yang YT, Galeev T, Gürsoy G, Epstein CB, Xiong K, Xu J, Li T, Liu J, Yu K, Berthel A, Chen Z, Navarro F, Sun MS, Wright J, Chang J, Cameron CJF, Shoresh N, Gaskell E, Drenkow J, Adrian J, Aganezov S, Aguet F, Balderrama-Gutierrez G, Banskota S, Corona GB, Chee S, Chhetri SB, Cortez Martins GC, Danyko C, Davis CA, Farid D, Farrell NP, Gabdank I, Gofin Y, Gorkin DU, Gu M, Hecht V, Hitz BC, Issner R, Jiang Y, Kirsche M, Kong X, Lam BR, Li S, Li B, Li X, Lin KZ, Luo R, Mackiewicz M, Meng R, Moore JE, Mudge J, Nelson N, Nusbaum C, Popov I, Pratt HE, Qiu Y, Ramakrishnan S, Raymond J, Salichos L, Scavelli A, Schreiber JM, Sedlazeck FJ, See LH, Sherman RM, Shi X, Shi M, Sloan CA, Strattan JS, Tan Z, Tanaka FY, Vlasova A, Wang J, Werner J, Williams B, Xu M, Yan C, Yu L, Zaleski C, Zhang J, Ardlie K, Cherry JM, Mendenhall EM, Noble WS, Weng Z, Levine ME, Dobin A, Wold B, Mortazavi A, Ren B, Gillis J, Myers RM, Snyder MP, Choudhary J, Milosavljevic A, Schatz MC, Bernstein BE, et alRozowsky J, Gao J, Borsari B, Yang YT, Galeev T, Gürsoy G, Epstein CB, Xiong K, Xu J, Li T, Liu J, Yu K, Berthel A, Chen Z, Navarro F, Sun MS, Wright J, Chang J, Cameron CJF, Shoresh N, Gaskell E, Drenkow J, Adrian J, Aganezov S, Aguet F, Balderrama-Gutierrez G, Banskota S, Corona GB, Chee S, Chhetri SB, Cortez Martins GC, Danyko C, Davis CA, Farid D, Farrell NP, Gabdank I, Gofin Y, Gorkin DU, Gu M, Hecht V, Hitz BC, Issner R, Jiang Y, Kirsche M, Kong X, Lam BR, Li S, Li B, Li X, Lin KZ, Luo R, Mackiewicz M, Meng R, Moore JE, Mudge J, Nelson N, Nusbaum C, Popov I, Pratt HE, Qiu Y, Ramakrishnan S, Raymond J, Salichos L, Scavelli A, Schreiber JM, Sedlazeck FJ, See LH, Sherman RM, Shi X, Shi M, Sloan CA, Strattan JS, Tan Z, Tanaka FY, Vlasova A, Wang J, Werner J, Williams B, Xu M, Yan C, Yu L, Zaleski C, Zhang J, Ardlie K, Cherry JM, Mendenhall EM, Noble WS, Weng Z, Levine ME, Dobin A, Wold B, Mortazavi A, Ren B, Gillis J, Myers RM, Snyder MP, Choudhary J, Milosavljevic A, Schatz MC, Bernstein BE, Guigó R, Gingeras TR, Gerstein M. The EN-TEx resource of multi-tissue personal epigenomes & variant-impact models. Cell 2023; 186:1493-1511.e40. [PMID: 37001506 PMCID: PMC10074325 DOI: 10.1016/j.cell.2023.02.018] [Show More Authors] [Citation(s) in RCA: 40] [Impact Index Per Article: 20.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/03/2022] [Revised: 10/16/2022] [Accepted: 02/10/2023] [Indexed: 04/03/2023]
Abstract
Understanding how genetic variants impact molecular phenotypes is a key goal of functional genomics, currently hindered by reliance on a single haploid reference genome. Here, we present the EN-TEx resource of 1,635 open-access datasets from four donors (∼30 tissues × ∼15 assays). The datasets are mapped to matched, diploid genomes with long-read phasing and structural variants, instantiating a catalog of >1 million allele-specific loci. These loci exhibit coordinated activity along haplotypes and are less conserved than corresponding, non-allele-specific ones. Surprisingly, a deep-learning transformer model can predict the allele-specific activity based only on local nucleotide-sequence context, highlighting the importance of transcription-factor-binding motifs particularly sensitive to variants. Furthermore, combining EN-TEx with existing genome annotations reveals strong associations between allele-specific and GWAS loci. It also enables models for transferring known eQTLs to difficult-to-profile tissues (e.g., from skin to heart). Overall, EN-TEx provides rich data and generalizable models for more accurate personal functional genomics.
Collapse
Affiliation(s)
- Joel Rozowsky
- Section on Biomedical Informatics and Data Science, Yale University, New Haven, CT, USA; Program in Computational Biology and Bioinformatics, Yale University, New Haven, CT, USA; Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, CT, USA
| | - Jiahao Gao
- Program in Computational Biology and Bioinformatics, Yale University, New Haven, CT, USA; Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, CT, USA
| | - Beatrice Borsari
- Program in Computational Biology and Bioinformatics, Yale University, New Haven, CT, USA; Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, CT, USA; Centre for Genomic Regulation, The Barcelona Institute of Science and Technology, Barcelona, Catalonia, Spain
| | - Yucheng T Yang
- Institute of Science and Technology for Brain-Inspired Intelligence; MOE Key Laboratory of Computational Neuroscience and Brain-Inspired Intelligence; MOE Frontiers Center for Brain Science, Fudan University, Shanghai 200433, China; Program in Computational Biology and Bioinformatics, Yale University, New Haven, CT, USA; Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, CT, USA
| | - Timur Galeev
- Program in Computational Biology and Bioinformatics, Yale University, New Haven, CT, USA; Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, CT, USA
| | - Gamze Gürsoy
- Program in Computational Biology and Bioinformatics, Yale University, New Haven, CT, USA; Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, CT, USA
| | | | - Kun Xiong
- Program in Computational Biology and Bioinformatics, Yale University, New Haven, CT, USA; Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, CT, USA
| | - Jinrui Xu
- Program in Computational Biology and Bioinformatics, Yale University, New Haven, CT, USA; Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, CT, USA
| | - Tianxiao Li
- Program in Computational Biology and Bioinformatics, Yale University, New Haven, CT, USA; Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, CT, USA
| | - Jason Liu
- Program in Computational Biology and Bioinformatics, Yale University, New Haven, CT, USA; Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, CT, USA
| | - Keyang Yu
- Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX, USA
| | - Ana Berthel
- Program in Computational Biology and Bioinformatics, Yale University, New Haven, CT, USA; Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, CT, USA
| | - Zhanlin Chen
- Department of Statistics and Data Science, Yale University, New Haven, CT, USA
| | - Fabio Navarro
- Program in Computational Biology and Bioinformatics, Yale University, New Haven, CT, USA; Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, CT, USA
| | - Maxwell S Sun
- Program in Computational Biology and Bioinformatics, Yale University, New Haven, CT, USA; Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, CT, USA
| | | | - Justin Chang
- Program in Computational Biology and Bioinformatics, Yale University, New Haven, CT, USA; Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, CT, USA
| | - Christopher J F Cameron
- Program in Computational Biology and Bioinformatics, Yale University, New Haven, CT, USA; Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, CT, USA
| | - Noam Shoresh
- Broad Institute of MIT and Harvard, Cambridge, MA, USA
| | | | - Jorg Drenkow
- Functional Genomics, Cold Spring Harbor Laboratory, Cold Spring Harbor, NY, USA
| | - Jessika Adrian
- Department of Genetics, School of Medicine, Stanford University, Palo Alto, CA, USA
| | - Sergey Aganezov
- Departments of Computer Science and Biology, Johns Hopkins University, Baltimore, MD, USA
| | | | | | | | | | - Sora Chee
- Ludwig Institute for Cancer Research, University of California, San Diego, La Jolla, CA, USA
| | - Surya B Chhetri
- HudsonAlpha Institute for Biotechnology, Huntsville, AL, USA
| | - Gabriel Conte Cortez Martins
- Program in Computational Biology and Bioinformatics, Yale University, New Haven, CT, USA; Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, CT, USA
| | - Cassidy Danyko
- Functional Genomics, Cold Spring Harbor Laboratory, Cold Spring Harbor, NY, USA
| | - Carrie A Davis
- Functional Genomics, Cold Spring Harbor Laboratory, Cold Spring Harbor, NY, USA
| | - Daniel Farid
- Program in Computational Biology and Bioinformatics, Yale University, New Haven, CT, USA; Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, CT, USA
| | | | - Idan Gabdank
- Department of Genetics, School of Medicine, Stanford University, Palo Alto, CA, USA
| | - Yoel Gofin
- Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX, USA
| | - David U Gorkin
- Ludwig Institute for Cancer Research, University of California, San Diego, La Jolla, CA, USA
| | - Mengting Gu
- Program in Computational Biology and Bioinformatics, Yale University, New Haven, CT, USA; Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, CT, USA
| | - Vivian Hecht
- Broad Institute of MIT and Harvard, Cambridge, MA, USA
| | - Benjamin C Hitz
- Department of Genetics, School of Medicine, Stanford University, Palo Alto, CA, USA
| | - Robbyn Issner
- Broad Institute of MIT and Harvard, Cambridge, MA, USA
| | - Yunzhe Jiang
- Program in Computational Biology and Bioinformatics, Yale University, New Haven, CT, USA; Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, CT, USA
| | - Melanie Kirsche
- Departments of Computer Science and Biology, Johns Hopkins University, Baltimore, MD, USA
| | - Xiangmeng Kong
- Program in Computational Biology and Bioinformatics, Yale University, New Haven, CT, USA; Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, CT, USA
| | - Bonita R Lam
- Department of Genetics, School of Medicine, Stanford University, Palo Alto, CA, USA
| | - Shantao Li
- Program in Computational Biology and Bioinformatics, Yale University, New Haven, CT, USA; Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, CT, USA
| | - Bian Li
- Program in Computational Biology and Bioinformatics, Yale University, New Haven, CT, USA; Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, CT, USA
| | - Xiqi Li
- Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX, USA
| | - Khine Zin Lin
- Department of Genetics, School of Medicine, Stanford University, Palo Alto, CA, USA
| | - Ruibang Luo
- Department of Computer Science, The University of Hong Kong, Hong Kong, CHN
| | - Mark Mackiewicz
- HudsonAlpha Institute for Biotechnology, Huntsville, AL, USA
| | - Ran Meng
- Program in Computational Biology and Bioinformatics, Yale University, New Haven, CT, USA; Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, CT, USA
| | - Jill E Moore
- Program in Bioinformatics and Integrative Biology, University of Massachusetts Medical School, Worcester, MA, USA
| | - Jonathan Mudge
- European Bioinformatics Institute, Cambridge, Cambridgeshire, GB
| | | | - Chad Nusbaum
- Broad Institute of MIT and Harvard, Cambridge, MA, USA
| | - Ioann Popov
- Program in Computational Biology and Bioinformatics, Yale University, New Haven, CT, USA; Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, CT, USA
| | - Henry E Pratt
- Program in Bioinformatics and Integrative Biology, University of Massachusetts Medical School, Worcester, MA, USA
| | - Yunjiang Qiu
- Ludwig Institute for Cancer Research, University of California, San Diego, La Jolla, CA, USA
| | - Srividya Ramakrishnan
- Departments of Computer Science and Biology, Johns Hopkins University, Baltimore, MD, USA
| | - Joe Raymond
- Broad Institute of MIT and Harvard, Cambridge, MA, USA
| | - Leonidas Salichos
- Program in Computational Biology and Bioinformatics, Yale University, New Haven, CT, USA; Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, CT, USA; Department of Biological and Chemical Sciences, New York Institute of Technology, Old Westbury, NY, USA
| | - Alexandra Scavelli
- Functional Genomics, Cold Spring Harbor Laboratory, Cold Spring Harbor, NY, USA
| | - Jacob M Schreiber
- Department of Genome Sciences, University of Washington, Seattle, WA, USA
| | - Fritz J Sedlazeck
- Departments of Computer Science and Biology, Johns Hopkins University, Baltimore, MD, USA; Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, Cold Spring Harbor, NY, USA; Human Genome Sequencing Center, Baylor College of Medicine, Houston, TX, USA
| | - Lei Hoon See
- Functional Genomics, Cold Spring Harbor Laboratory, Cold Spring Harbor, NY, USA
| | - Rachel M Sherman
- Departments of Computer Science and Biology, Johns Hopkins University, Baltimore, MD, USA
| | - Xu Shi
- Program in Computational Biology and Bioinformatics, Yale University, New Haven, CT, USA; Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, CT, USA
| | - Minyi Shi
- Department of Genetics, School of Medicine, Stanford University, Palo Alto, CA, USA
| | - Cricket Alicia Sloan
- Department of Genetics, School of Medicine, Stanford University, Palo Alto, CA, USA
| | - J Seth Strattan
- Department of Genetics, School of Medicine, Stanford University, Palo Alto, CA, USA
| | - Zhen Tan
- Program in Computational Biology and Bioinformatics, Yale University, New Haven, CT, USA; Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, CT, USA
| | - Forrest Y Tanaka
- Department of Genetics, School of Medicine, Stanford University, Palo Alto, CA, USA
| | - Anna Vlasova
- Centre for Genomic Regulation, The Barcelona Institute of Science and Technology, Barcelona, Catalonia, Spain; Comparative Genomics Group, Life Science Programme, Barcelona Supercomputing Centre, Barcelona, Spain; Institute of Research in Biomedicine, Barcelona, Spain
| | - Jun Wang
- Program in Computational Biology and Bioinformatics, Yale University, New Haven, CT, USA; Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, CT, USA
| | - Jonathan Werner
- Functional Genomics, Cold Spring Harbor Laboratory, Cold Spring Harbor, NY, USA
| | - Brian Williams
- Division of Biology and Biological Engineering, California Institute of Technology, Pasadena, CA, USA
| | - Min Xu
- Program in Computational Biology and Bioinformatics, Yale University, New Haven, CT, USA; Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, CT, USA
| | - Chengfei Yan
- Program in Computational Biology and Bioinformatics, Yale University, New Haven, CT, USA; Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, CT, USA
| | - Lu Yu
- Institute of Cancer Research, London, UK
| | - Christopher Zaleski
- Functional Genomics, Cold Spring Harbor Laboratory, Cold Spring Harbor, NY, USA
| | - Jing Zhang
- Department of Computer Science, University of California, Irvine, Irvine, CA, USA
| | | | - J Michael Cherry
- Department of Genetics, School of Medicine, Stanford University, Palo Alto, CA, USA
| | | | - William S Noble
- Department of Genome Sciences, University of Washington, Seattle, WA, USA
| | - Zhiping Weng
- Program in Bioinformatics and Integrative Biology, University of Massachusetts Medical School, Worcester, MA, USA
| | - Morgan E Levine
- Program in Computational Biology and Bioinformatics, Yale University, New Haven, CT, USA; Department of Pathology, Yale University School of Medicine, New Haven, CT, USA
| | - Alexander Dobin
- Functional Genomics, Cold Spring Harbor Laboratory, Cold Spring Harbor, NY, USA
| | - Barbara Wold
- Division of Biology and Biological Engineering, California Institute of Technology, Pasadena, CA, USA
| | - Ali Mortazavi
- Department of Developmental and Cell Biology, University of California, Irvine, Irvine, CA, USA
| | - Bing Ren
- Ludwig Institute for Cancer Research, University of California, San Diego, La Jolla, CA, USA
| | - Jesse Gillis
- Functional Genomics, Cold Spring Harbor Laboratory, Cold Spring Harbor, NY, USA; Department of Physiology, University of Toronto, Toronto, ON, Canada
| | - Richard M Myers
- HudsonAlpha Institute for Biotechnology, Huntsville, AL, USA
| | - Michael P Snyder
- Department of Genetics, School of Medicine, Stanford University, Palo Alto, CA, USA
| | | | | | - Michael C Schatz
- Departments of Computer Science and Biology, Johns Hopkins University, Baltimore, MD, USA; Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, Cold Spring Harbor, NY, USA.
| | - Bradley E Bernstein
- Broad Institute of MIT and Harvard, Cambridge, MA, USA; Department of Cancer Biology, Dana-Farber Cancer Institute, Boston, MA, USA.
| | - Roderic Guigó
- Centre for Genomic Regulation, The Barcelona Institute of Science and Technology, Barcelona, Catalonia, Spain; Universitat Pompeu Fabra, Barcelona, Catalonia, Spain.
| | - Thomas R Gingeras
- Functional Genomics, Cold Spring Harbor Laboratory, Cold Spring Harbor, NY, USA.
| | - Mark Gerstein
- Section on Biomedical Informatics and Data Science, Yale University, New Haven, CT, USA; Program in Computational Biology and Bioinformatics, Yale University, New Haven, CT, USA; Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, CT, USA; Department of Statistics and Data Science, Yale University, New Haven, CT, USA; Department of Computer Science, Yale University, New Haven, CT, USA.
| |
Collapse
|
293
|
Wang L, Sun J, Ma S, Xia J, Li X. PredDSMC: A predictor for driver synonymous mutations in human cancers. Front Genet 2023; 14:1164593. [PMID: 37051593 PMCID: PMC10083435 DOI: 10.3389/fgene.2023.1164593] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/14/2023] [Accepted: 03/09/2023] [Indexed: 03/29/2023] Open
Abstract
Introduction: Driver mutations play a critical role in the occurrence and development of human cancers. Most studies have focused on missense mutations that function as drivers in cancer. However, accumulating experimental evidence indicates that synonymous mutations can also act as driver mutations.Methods: Here, we proposed a computational method called PredDSMC to accurately predict driver synonymous mutations in human cancers. We first systematically explored four categories of multimodal features, including sequence features, splicing features, conservation scores, and functional scores. Further feature selection was carried out to remove redundant features and improve the model performance. Finally, we utilized the random forest classifier to build PredDSMC.Results: The results of two independent test sets indicated that PredDSMC outperformed the state-of-the-art methods in differentiating driver synonymous mutations from passenger mutations.Discussion: In conclusion, we expect that PredDSMC, as a driver synonymous mutation prediction method, will be a valuable method for gaining a deeper understanding of synonymous mutations in human cancers.
Collapse
|
294
|
Huang Z, Wang J, Lu X, Mohd Zain A, Yu G. scGGAN: single-cell RNA-seq imputation by graph-based generative adversarial network. Brief Bioinform 2023; 24:7024714. [PMID: 36733262 DOI: 10.1093/bib/bbad040] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/08/2022] [Revised: 12/21/2022] [Accepted: 01/18/2023] [Indexed: 02/04/2023] Open
Abstract
Single-cell RNA sequencing (scRNA-seq) data are typically with a large number of missing values, which often results in the loss of critical gene signaling information and seriously limit the downstream analysis. Deep learning-based imputation methods often can better handle scRNA-seq data than shallow ones, but most of them do not consider the inherent relations between genes, and the expression of a gene is often regulated by other genes. Therefore, it is essential to impute scRNA-seq data by considering the regional gene-to-gene relations. We propose a novel model (named scGGAN) to impute scRNA-seq data that learns the gene-to-gene relations by Graph Convolutional Networks (GCN) and global scRNA-seq data distribution by Generative Adversarial Networks (GAN). scGGAN first leverages single-cell and bulk genomics data to explore inherent relations between genes and builds a more compact gene relation network to jointly capture the homogeneous and heterogeneous information. Then, it constructs a GCN-based GAN model to integrate the scRNA-seq, gene sequencing data and gene relation network for generating scRNA-seq data, and trains the model through adversarial learning. Finally, it utilizes data generated by the trained GCN-based GAN model to impute scRNA-seq data. Experiments on simulated and real scRNA-seq datasets show that scGGAN can effectively identify dropout events, recover the biologically meaningful expressions, determine subcellular states and types, improve the differential expression analysis and temporal dynamics analysis. Ablation experiments confirm that both the gene relation network and gene sequence data help the imputation of scRNA-seq data.
Collapse
Affiliation(s)
- Zimo Huang
- MEng student at School of Software, Shandong University, China
| | - Jun Wang
- Joint SDU-NTU Centre for Artificial Intelligence Research (C-FAIR), Shandong University, China
| | - Xudong Lu
- Joint SDU-NTU Centre for Artificial Intelligence Research (C-FAIR), Shandong University, China
| | | | - Guoxian Yu
- School of Software, Shandong University, China
| |
Collapse
|
295
|
M6A-BERT-Stacking: A Tissue-Specific Predictor for Identifying RNA N6-Methyladenosine Sites Based on BERT and Stacking Strategy. Symmetry (Basel) 2023. [DOI: 10.3390/sym15030731] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 03/17/2023] Open
Abstract
As the most abundant RNA methylation modification, N6-methyladenosine (m6A) could regulate asymmetric and symmetric division of hematopoietic stem cells and play an important role in various diseases. Therefore, the precise identification of m6A sites around the genomes of different species is a critical step to further revealing their biological functions and influence on these diseases. However, the traditional wet-lab experimental methods for identifying m6A sites are often laborious and expensive. In this study, we proposed an ensemble deep learning model called m6A-BERT-Stacking, a powerful predictor for the detection of m6A sites in various tissues of three species. First, we utilized two encoding methods, i.e., di ribonucleotide index of RNA (DiNUCindex_RNA) and k-mer word segmentation, to extract RNA sequence features. Second, two encoding matrices together with the original sequences were respectively input into three different deep learning models in parallel to train three sub-models, namely residual networks with convolutional block attention module (Resnet-CBAM), bidirectional long short-term memory with attention (BiLSTM-Attention), and pre-trained bidirectional encoder representations from transformers model for DNA-language (DNABERT). Finally, the outputs of all sub-models were ensembled based on the stacking strategy to obtain the final prediction of m6A sites through the fully connected layer. The experimental results demonstrated that m6A-BERT-Stacking outperformed most of the existing methods based on the same independent datasets.
Collapse
|
296
|
Luo H, Shan W, Chen C, Ding P, Luo L. Improving language model of human genome for DNA-protein binding prediction based on task-specific pre-training. Interdiscip Sci 2023; 15:32-43. [PMID: 36136096 DOI: 10.1007/s12539-022-00537-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/31/2022] [Revised: 08/30/2022] [Accepted: 09/07/2022] [Indexed: 11/27/2022]
Abstract
The DNA-protein binding plays a pivotal role in regulating gene expression and evolution, and computational identification of DNA-protein has drawn more and more attention in bioinformatics. Recently, variants of BERT are also used to capture the semantic information of DNA sequences for predicting DNA-protein bindings. In this study, we leverage a task-specific pre-training strategy on BERT using large-scale multi-source DNA-protein binding data and present TFBert. TFBert treats DNA sequences as natural sentences and k-mer nucleotides as words. It can effectively extract upstream and downstream nucleotide context information by pre-training the 690 unlabeled ChIP-seq datasets. Experiments show that the pre-trained model can achieve promising performance on every single dataset in the 690 ChIP-seq datasets after simple fine tuning, especially on small datasets. The average AUC is 94.7%, outperforming existing popular methods. In conclusion, this study provides a variant of BERT based on pre-training and achieved state-of-the-art results in predicting DNA-protein bindings. We believe that TFBert can provide insights into other biological sequence classification problems.
Collapse
Affiliation(s)
- Hanyu Luo
- School of Computer Science, University of South China, Hengyang, Hunan, 421001, People's Republic of China
| | - Wenyu Shan
- School of Computer Science, University of South China, Hengyang, Hunan, 421001, People's Republic of China
| | - Cheng Chen
- School of Computer Science, University of South China, Hengyang, Hunan, 421001, People's Republic of China
| | - Pingjian Ding
- School of Computer Science, University of South China, Hengyang, Hunan, 421001, People's Republic of China
| | - Lingyun Luo
- School of Computer Science, University of South China, Hengyang, Hunan, 421001, People's Republic of China. .,Hunan Medical Big Data International Science and Technology Innovation Cooperation Base, Hengyang, Hunan, 421001, People's Republic of China.
| |
Collapse
|
297
|
Wang X, Zhang M, Long C, Yao L, Zhu M. Self-Attention Based Neural Network for Predicting RNA-Protein Binding Sites. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2023; 20:1469-1479. [PMID: 36067103 DOI: 10.1109/tcbb.2022.3204661] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 05/04/2023]
Abstract
Proteins binding to Ribonucleic Acid (RNA) inside cells are called RNA-binding proteins (RBP), which play a crucial role in gene regulation. The identification of RNA-protein binding sites helps to understand the function of RBP better. Although many computational methods have been developed to predict RNA-protein binding sites, their prediction accuracy on small sample datasets needs improvement. To overcome this limitation, we propose a novel model called SA-Net, which utilizes k-mer embedding to encode RNA sequences and a self-attention-based neural network to extract sequence features. K-mer embedding assists the model to discover significant subsequence fragments associated with binding sites. The self-attention mechanism captures contextual information from the entire input sequence globally, performing well in small sample sequence learning. Experimental results demonstrate that SA-Net attains state-of-the-art results on the RBP-24 dataset. We find that 4-mer embedding aids the model to achieve optimal performance. We also show that the self-attention network outperforms the commonly used CNN and CNN-BLSTM models in sequence feature extraction.
Collapse
|
298
|
Hwang H, Chang HR, Baek D. Determinants of Functional MicroRNA Targeting. Mol Cells 2023; 46:21-32. [PMID: 36697234 PMCID: PMC9880601 DOI: 10.14348/molcells.2023.2157] [Citation(s) in RCA: 23] [Impact Index Per Article: 11.5] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/14/2022] [Revised: 11/09/2022] [Accepted: 11/15/2022] [Indexed: 01/27/2023] Open
Abstract
MicroRNAs (miRNAs) play cardinal roles in regulating biological pathways and processes, resulting in significant physiological effects. To understand the complex regulatory network of miRNAs, previous studies have utilized massivescale datasets of miRNA targeting and attempted to computationally predict the functional targets of miRNAs. Many miRNA target prediction tools have been developed and are widely used by scientists from various fields of biology and medicine. Most of these tools consider seed pairing between miRNAs and their mRNA targets and additionally consider other determinants to improve prediction accuracy. However, these tools exhibit limited prediction accuracy and high false positive rates. The utilization of additional determinants, such as RNA modifications and RNA-binding protein binding sites, may further improve miRNA target prediction. In this review, we discuss the determinants of functional miRNA targeting that are currently used in miRNA target prediction and the potentially predictive but unappreciated determinants that may improve prediction accuracy.
Collapse
Affiliation(s)
- Hyeonseo Hwang
- School of Biological Sciences, Seoul National University, Seoul 08826, Korea
| | - Hee Ryung Chang
- School of Biological Sciences, Seoul National University, Seoul 08826, Korea
| | - Daehyun Baek
- School of Biological Sciences, Seoul National University, Seoul 08826, Korea
| |
Collapse
|
299
|
Jonnakuti VS, Wagner EJ, Maletić-Savatić M, Liu Z, Yalamanchili HK. PolyAMiner-Bulk: A Machine Learning Based Bioinformatics Algorithm to Infer and Decode Alternative Polyadenylation Dynamics from bulk RNA-seq data. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2023:2023.01.23.523471. [PMID: 36747700 PMCID: PMC9900750 DOI: 10.1101/2023.01.23.523471] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Indexed: 01/26/2023]
Abstract
More than half of human genes exercise alternative polyadenylation (APA) and generate mRNA transcripts with varying 3' untranslated regions (UTR). However, current computational approaches for identifying cleavage and polyadenylation sites (C/PASs) and quantifying 3'UTR length changes from bulk RNA-seq data fail to unravel tissue- and disease-specific APA dynamics. Here, we developed a next-generation bioinformatics algorithm and application, PolyAMiner-Bulk, that utilizes an attention-based machine learning architecture and an improved vector projection-based engine to infer differential APA dynamics accurately. When applied to earlier studies, PolyAMiner-Bulk accurately identified more than twice the number of APA changes in an RBM17 knockdown bulk RNA-seq dataset compared to current generation tools. Moreover, on a separate dataset, PolyAMiner-Bulk revealed novel APA dynamics and pathways in scleroderma pathology and identified differential APA in a gene that was identified as being involved in scleroderma pathogenesis in an independent study. Lastly, we used PolyAMiner-Bulk to analyze the RNA-seq data of post-mortem prefrontal cortexes from the ROSMAP data consortium and unraveled novel APA dynamics in Alzheimer's Disease. Our method, PolyAMiner-Bulk, creates a paradigm for future alternative polyadenylation analysis from bulk RNA-seq data.
Collapse
Affiliation(s)
- Venkata Soumith Jonnakuti
- Department of Pediatrics, Baylor College of Medicine, Houston, TX, 77030, USA
- Jan and Dan Duncan Neurological Research Institute, Texas Children’s Hospital, Houston, TX, 77030, USA
- Program in Quantitative and Computational Biology, Baylor College of Medicine, Houston, TX, 77030, USA
- Medical Scientist Training Program, Baylor College of Medicine, Houston, TX, 77030, USA
| | - Eric J. Wagner
- Department of Biochemistry and Biophysics, University of Rochester School of Medicine and Dentistry, Rochester, NY 14642, USA
| | - Mirjana Maletić-Savatić
- Department of Pediatrics, Baylor College of Medicine, Houston, TX, 77030, USA
- Jan and Dan Duncan Neurological Research Institute, Texas Children’s Hospital, Houston, TX, 77030, USA
| | - Zhandong Liu
- Department of Pediatrics, Baylor College of Medicine, Houston, TX, 77030, USA
- Jan and Dan Duncan Neurological Research Institute, Texas Children’s Hospital, Houston, TX, 77030, USA
- Program in Quantitative and Computational Biology, Baylor College of Medicine, Houston, TX, 77030, USA
| | - Hari Krishna Yalamanchili
- Department of Pediatrics, Baylor College of Medicine, Houston, TX, 77030, USA
- Jan and Dan Duncan Neurological Research Institute, Texas Children’s Hospital, Houston, TX, 77030, USA
| |
Collapse
|
300
|
Li Z, Gao E, Zhou J, Han W, Xu X, Gao X. Applications of deep learning in understanding gene regulation. CELL REPORTS METHODS 2023; 3:100384. [PMID: 36814848 PMCID: PMC9939384 DOI: 10.1016/j.crmeth.2022.100384] [Citation(s) in RCA: 14] [Impact Index Per Article: 7.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 05/22/2023]
Abstract
Gene regulation is a central topic in cell biology. Advances in omics technologies and the accumulation of omics data have provided better opportunities for gene regulation studies than ever before. For this reason deep learning, as a data-driven predictive modeling approach, has been successfully applied to this field during the past decade. In this article, we aim to give a brief yet comprehensive overview of representative deep-learning methods for gene regulation. Specifically, we discuss and compare the design principles and datasets used by each method, creating a reference for researchers who wish to replicate or improve existing methods. We also discuss the common problems of existing approaches and prospectively introduce the emerging deep-learning paradigms that will potentially alleviate them. We hope that this article will provide a rich and up-to-date resource and shed light on future research directions in this area.
Collapse
Affiliation(s)
- Zhongxiao Li
- Computer Science Program, Computer, Electrical and Mathematical Sciences and Engineering (CEMSE) Division, King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, Kingdom of Saudi Arabia
- KAUST Computational Bioscience Research Center (CBRC), King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, Kingdom of Saudi Arabia
| | - Elva Gao
- The KAUST School, King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, Kingdom of Saudi Arabia
| | - Juexiao Zhou
- Computer Science Program, Computer, Electrical and Mathematical Sciences and Engineering (CEMSE) Division, King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, Kingdom of Saudi Arabia
- KAUST Computational Bioscience Research Center (CBRC), King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, Kingdom of Saudi Arabia
| | - Wenkai Han
- Computer Science Program, Computer, Electrical and Mathematical Sciences and Engineering (CEMSE) Division, King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, Kingdom of Saudi Arabia
- KAUST Computational Bioscience Research Center (CBRC), King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, Kingdom of Saudi Arabia
| | - Xiaopeng Xu
- Computer Science Program, Computer, Electrical and Mathematical Sciences and Engineering (CEMSE) Division, King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, Kingdom of Saudi Arabia
- KAUST Computational Bioscience Research Center (CBRC), King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, Kingdom of Saudi Arabia
| | - Xin Gao
- Computer Science Program, Computer, Electrical and Mathematical Sciences and Engineering (CEMSE) Division, King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, Kingdom of Saudi Arabia
- KAUST Computational Bioscience Research Center (CBRC), King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, Kingdom of Saudi Arabia
| |
Collapse
|