1
|
Lei X, Wang X, Chen G, Liang C, Li Q, Jiang H, Xiong W. Combining diffusion and transformer models for enhanced promoter synthesis and strength prediction in deep learning. mSystems 2025; 10:e0018325. [PMID: 40105319 PMCID: PMC12013266 DOI: 10.1128/msystems.00183-25] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/12/2025] [Accepted: 02/13/2025] [Indexed: 03/20/2025] Open
Abstract
In the field of synthetic biology, the engineering of synthetic promoters that outperform their natural counterparts is of paramount importance, which can optimize the expression of exogenous genes, enhance the efficiency of metabolic pathways, and possess substantial commercial value. Research indicates that some synthetic promoters have higher transcriptional activity compared to strong natural promoters. However, with the exponential increase in complexity due to the 4n potential combinations in a promoter sequence of length n, identifying effective synthetic promoters remains a formidable challenge. Deep learning models, by adaptively learning from extensive data sets, have become instrumental in analyzing biological data. This study introduces a diffusion model-based approach for designing promoters viable in model bacteria such as Escherichia coli and cyanobacteria. This model proficiently assimilates and utilizes inherent biological features from natural promoter sequences to engineer synthetic variants. Additionally, we employed a transformer model to evaluate the efficacy of these synthetic promoters, aiming at screening those with high performance. The experimental findings suggest that the synthetic promoters by the diffusion model not only share key biological features with their natural counterparts but also demonstrate greater similarity to natural promoters than those generated by a variational autoencoder. In predicting promoter strength, the transformer model demonstrated improved performance over the convolutional neural network. Finally, we developed an integrated platform for generating promoters and predicting their strength. IMPORTANCE We demonstrated that diffusion models are superior in accomplishing the promoter synthesis task compared to other state-of-the-art deep learning models. The effectiveness of our method was validated using data sets of Escherichia coli and cyanobacteria promoters, showing more stable and prompt convergence and more natural-like promoters than the variational autoencoder model. We extracted sequence information, dimer information, and position information from promoters and combined them with a transformer model to predict promoter strength. Our prediction results were more accurate than those obtained with a convolutional neural network model. Our in silico experiments systematically introduced mutations in promoter sequences and explored their contribution to promoter strength, highlighting the depth of learning in our model.
Collapse
Affiliation(s)
- Xin Lei
- School of Future Technology, South China University of Technology, Guangzhou, Guangdong, China
| | - Xing Wang
- School of Biology and Biological Engineering, South China University of Technology, Guangzhou, Guangdong, China
| | - Guanlin Chen
- School of Future Technology, South China University of Technology, Guangzhou, Guangdong, China
| | - Ce Liang
- Research Projects Department, Guangdong Artificial Intelligence and Digital Economy Laboratory (Guangzhou), Guangzhou, Guangdong, China
| | - Quhuan Li
- School of Biology and Biological Engineering, South China University of Technology, Guangzhou, Guangdong, China
| | - Huaiguang Jiang
- School of Future Technology, South China University of Technology, Guangzhou, Guangdong, China
- Research Projects Department, Guangdong Artificial Intelligence and Digital Economy Laboratory (Guangzhou), Guangzhou, Guangdong, China
| | - Wei Xiong
- School of Biology and Biological Engineering, South China University of Technology, Guangzhou, Guangdong, China
| |
Collapse
|
2
|
Asim MN, Ibrahim MA, Zaib A, Dengel A. DNA sequence analysis landscape: a comprehensive review of DNA sequence analysis task types, databases, datasets, word embedding methods, and language models. Front Med (Lausanne) 2025; 12:1503229. [PMID: 40265190 PMCID: PMC12011883 DOI: 10.3389/fmed.2025.1503229] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/28/2024] [Accepted: 03/10/2025] [Indexed: 04/24/2025] Open
Abstract
Deoxyribonucleic acid (DNA) serves as fundamental genetic blueprint that governs development, functioning, growth, and reproduction of all living organisms. DNA can be altered through germline and somatic mutations. Germline mutations underlie hereditary conditions, while somatic mutations can be induced by various factors including environmental influences, chemicals, lifestyle choices, and errors in DNA replication and repair mechanisms which can lead to cancer. DNA sequence analysis plays a pivotal role in uncovering the intricate information embedded within an organism's genetic blueprint and understanding the factors that can modify it. This analysis helps in early detection of genetic diseases and the design of targeted therapies. Traditional wet-lab experimental DNA sequence analysis through traditional wet-lab experimental methods is costly, time-consuming, and prone to errors. To accelerate large-scale DNA sequence analysis, researchers are developing AI applications that complement wet-lab experimental methods. These AI approaches can help generate hypotheses, prioritize experiments, and interpret results by identifying patterns in large genomic datasets. Effective integration of AI methods with experimental validation requires scientists to understand both fields. Considering the need of a comprehensive literature that bridges the gap between both fields, contributions of this paper are manifold: It presents diverse range of DNA sequence analysis tasks and AI methodologies. It equips AI researchers with essential biological knowledge of 44 distinct DNA sequence analysis tasks and aligns these tasks with 3 distinct AI-paradigms, namely, classification, regression, and clustering. It streamlines the integration of AI into DNA sequence analysis tasks by consolidating information of 36 diverse biological databases that can be used to develop benchmark datasets for 44 different DNA sequence analysis tasks. To ensure performance comparisons between new and existing AI predictors, it provides insights into 140 benchmark datasets related to 44 distinct DNA sequence analysis tasks. It presents word embeddings and language models applications across 44 distinct DNA sequence analysis tasks. It streamlines the development of new predictors by providing a comprehensive survey of 39 word embeddings and 67 language models based predictive pipeline performance values as well as top performing traditional sequence encoding-based predictors and their performances across 44 DNA sequence analysis tasks.
Collapse
Affiliation(s)
- Muhammad Nabeel Asim
- German Research Center for Artificial Intelligence GmbH, Kaiserslautern, Germany
- Intelligentx GmbH (intelligentx.com), Kaiserslautern, Germany
| | - Muhammad Ali Ibrahim
- German Research Center for Artificial Intelligence GmbH, Kaiserslautern, Germany
- Department of Computer Science, Technical University of Kaiserslautern, Kaiserslautern, Germany
| | - Arooj Zaib
- Department of Computer Science, Technical University of Kaiserslautern, Kaiserslautern, Germany
| | - Andreas Dengel
- German Research Center for Artificial Intelligence GmbH, Kaiserslautern, Germany
- Intelligentx GmbH (intelligentx.com), Kaiserslautern, Germany
- Department of Computer Science, Technical University of Kaiserslautern, Kaiserslautern, Germany
| |
Collapse
|
3
|
Cheng S, Wei Y, Zhou Y, Xu Z, Wright DN, Liu J, Peng Y. Deciphering genomic codes using advanced natural language processing techniques: a scoping review. J Am Med Inform Assoc 2025; 32:761-772. [PMID: 39998912 PMCID: PMC12005631 DOI: 10.1093/jamia/ocaf029] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/19/2024] [Revised: 01/20/2025] [Accepted: 02/05/2025] [Indexed: 02/27/2025] Open
Abstract
OBJECTIVES The vast and complex nature of human genomic sequencing data presents challenges for effective analysis. This review aims to investigate the application of natural language processing (NLP) techniques, particularly large language models (LLMs) and transformer architectures, in deciphering genomic codes, focusing on tokenization, transformer models, and regulatory annotation prediction. The goal of this review is to assess data and model accessibility in the most recent literature, gaining a better understanding of the existing capabilities and constraints of these tools in processing genomic sequencing data. MATERIALS AND METHODS Following Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines, our scoping review was conducted across PubMed, Medline, Scopus, Web of Science, Embase, and ACM Digital Library. Studies were included if they focused on NLP methodologies applied to genomic sequencing data analysis, without restrictions on publication date or article type. RESULTS A total of 26 studies published between 2021 and April 2024 were selected for review. The review highlights that tokenization and transformer models enhance the processing and understanding of genomic data, with applications in predicting regulatory annotations like transcription-factor binding sites and chromatin accessibility. DISCUSSION The application of NLP and LLMs to genomic sequencing data interpretation is a promising field that can help streamline the processing of large-scale genomic data while also providing a better understanding of its complex structures. It has the potential to drive advancements in personalized medicine by offering more efficient and scalable solutions for genomic analysis. Further research is also needed to discuss and overcome current limitations, enhancing model transparency and applicability. CONCLUSION This review highlights the growing role of NLP, particularly LLMs, in genomic sequencing data analysis. While these models improve data processing and regulatory annotation prediction, challenges remain in accessibility and interpretability. Further research is needed to refine their application in genomics.
Collapse
Affiliation(s)
- Shuyan Cheng
- Department of Population Health Sciences, Weill Cornell Medicine, New York, NY 10065, United States
| | - Yishu Wei
- Department of Population Health Sciences, Weill Cornell Medicine, New York, NY 10065, United States
| | - Yiliang Zhou
- Department of Population Health Sciences, Weill Cornell Medicine, New York, NY 10065, United States
| | - Zihan Xu
- Department of Population Health Sciences, Weill Cornell Medicine, New York, NY 10065, United States
| | - Drew N Wright
- Samuel J. Wood Library & C.V. Starr Biomedical Information Center, Weill Cornell Medicine, New York, NY 10065, United States
| | - Jinze Liu
- School of Public Health, Virginia Commonwealth University, Richmond, VA 23219, United States
| | - Yifan Peng
- Department of Population Health Sciences, Weill Cornell Medicine, New York, NY 10065, United States
| |
Collapse
|
4
|
Asim MN, Asif T, Mehmood F, Dengel A. Peptide classification landscape: An in-depth systematic literature review on peptide types, databases, datasets, predictors architectures and performance. Comput Biol Med 2025; 188:109821. [PMID: 39987697 DOI: 10.1016/j.compbiomed.2025.109821] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/28/2024] [Revised: 02/03/2025] [Accepted: 02/05/2025] [Indexed: 02/25/2025]
Abstract
Peptides are gaining significant attention in diverse fields such as the pharmaceutical market has seen a steady rise in peptide-based therapeutics over the past six decades. Peptides have been utilized in the development of distinct applications including inhibitors of SARS-COV-2 and treatments for conditions like cancer and diabetes. Distinct types of peptides possess unique characteristics, and development of peptide-specific applications require the discrimination of one peptide type from others. To the best of our knowledge, approximately 230 Artificial Intelligence (AI) driven applications have been developed for 22 distinct types of peptides, yet there remains significant room for development of new predictors. A Comprehensive review addresses the critical gap by providing a consolidated platform for the development of AI-driven peptide classification applications. This paper offers several key contributions, including presenting the biological foundations of 22 unique peptide types and categorizes them into four main classes: Regulatory, Therapeutic, Nutritional, and Delivery Peptides. It offers an in-depth overview of 47 databases that have been used to develop peptide classification benchmark datasets. It summarizes details of 288 benchmark datasets that are used in development of diverse types AI-driven peptide classification applications. It provides a detailed summary of 197 sequence representation learning methods and 94 classifiers that have been used to develop 230 distinct AI-driven peptide classification applications. Across 22 distinct types peptide classification tasks related to 288 benchmark datasets, it demonstrates performance values of 230 AI-driven peptide classification applications. It summarizes experimental settings and various evaluation measures that have been employed to assess the performance of AI-driven peptide classification applications. The primary focus of this manuscript is to consolidate scattered information into a single comprehensive platform. This resource will greatly assist researchers who are interested in developing new AI-driven peptide classification applications.
Collapse
Affiliation(s)
- Muhammad Nabeel Asim
- German Research Center for Artificial Intelligence, Kaiserslautern, 67663, Germany; Intelligentx GmbH (intelligentx.com), Kaiserslautern, Germany.
| | - Tayyaba Asif
- Department of Computer Science, Rhineland-Palatinate Technical University of Kaiserslautern-Landau, Kaiserslautern, 67663, Germany
| | - Faiza Mehmood
- Department of Computer Science, Rhineland-Palatinate Technical University of Kaiserslautern-Landau, Kaiserslautern, 67663, Germany; Institute of Data Sciences, University of Engineering and Technology, Lahore, Pakistan
| | - Andreas Dengel
- German Research Center for Artificial Intelligence, Kaiserslautern, 67663, Germany; Department of Computer Science, Rhineland-Palatinate Technical University of Kaiserslautern-Landau, Kaiserslautern, 67663, Germany; Intelligentx GmbH (intelligentx.com), Kaiserslautern, Germany
| |
Collapse
|
5
|
De Waele G, Menschaert G, Vandamme P, Waegeman W. Pre-trained Maldi Transformers improve MALDI-TOF MS-based prediction. Comput Biol Med 2025; 186:109695. [PMID: 39847945 DOI: 10.1016/j.compbiomed.2025.109695] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/18/2024] [Revised: 01/10/2025] [Accepted: 01/13/2025] [Indexed: 01/25/2025]
Abstract
For the last decade, matrix-assisted laser desorption/ionization time-of-flight mass spectrometry (MALDI-TOF MS) has been the reference method for species identification in clinical microbiology. Hampered by a historical lack of open data, machine learning research towards models specifically adapted to MALDI-TOF MS remains in its infancy. Given the growing complexity of available datasets (such as large-scale antimicrobial resistance prediction), a need for models that (1) are specifically designed for MALDI-TOF MS data, and (2) have high representational capacity, presents itself. Here, we introduce Maldi Transformer, an adaptation of the state-of-the-art transformer architecture to the MALDI-TOF mass spectral domain. We propose the first self-supervised pre-training technique specifically designed for mass spectra. The technique is based on shuffling peaks across spectra, and pre-training the transformer as a peak discriminator. Extensive benchmarks confirm the efficacy of this novel design. The final result is a model exhibiting state-of-the-art (or competitive) performance on downstream prediction tasks. In addition, we show that Maldi Transformer's identification of noisy spectra may be leveraged towards higher predictive performance. All code supporting this study is distributed on PyPI and is packaged under: https://github.com/gdewael/maldi-nn.
Collapse
Affiliation(s)
- Gaetan De Waele
- Department of Data Analysis and Mathematical Modelling, Ghent University, Coupure Links 653, Ghent, 9000, Belgium.
| | - Gerben Menschaert
- Department of Data Analysis and Mathematical Modelling, Ghent University, Coupure Links 653, Ghent, 9000, Belgium
| | - Peter Vandamme
- Laboratory of Microbiology, Ghent University, K. L. Ledeganckstraat 35, Ghent, 9000, Belgium
| | - Willem Waegeman
- Department of Data Analysis and Mathematical Modelling, Ghent University, Coupure Links 653, Ghent, 9000, Belgium
| |
Collapse
|
6
|
Abbasi AF, Asim MN, Dengel A. Transitioning from wet lab to artificial intelligence: a systematic review of AI predictors in CRISPR. J Transl Med 2025; 23:153. [PMID: 39905452 DOI: 10.1186/s12967-024-06013-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/27/2024] [Accepted: 12/18/2024] [Indexed: 02/06/2025] Open
Abstract
The revolutionary CRISPR-Cas9 system leverages a programmable guide RNA (gRNA) and Cas9 proteins to precisely cleave problematic regions within DNA sequences. This groundbreaking technology holds immense potential for the development of targeted therapies for a wide range of diseases, including cancers, genetic disorders, and hereditary diseases. CRISPR-Cas9 based genome editing is a multi-step process such as designing a precise gRNA, selecting the appropriate Cas protein, and thoroughly evaluating both on-target and off-target activity of the Cas9-gRNA complex. To ensure the accuracy and effectiveness of CRISPR-Cas9 system, after the targeted DNA cleavage, the process requires careful analysis of the resultant outcomes such as indels and deletions. Following the success of artificial intelligence (AI) in various fields, researchers are now leveraging AI algorithms to catalyze and optimize the multi-step process of CRISPR-Cas9 system. To achieve this goal AI-driven applications are being integrated into each step, but existing AI predictors have limited performance and many steps still rely on expensive and time-consuming wet-lab experiments. The primary reason behind low performance of AI predictors is the gap between CRISPR and AI fields. Effective integration of AI into multi-step CRISPR-Cas9 system demands comprehensive knowledge of both domains. This paper bridges the knowledge gap between AI and CRISPR-Cas9 research. It offers a unique platform for AI researchers to grasp deep understanding of the biological foundations behind each step in the CRISPR-Cas9 multi-step process. Furthermore, it provides details of 80 available CRISPR-Cas9 system-related datasets that can be utilized to develop AI-driven applications. Within the landscape of AI predictors in CRISPR-Cas9 multi-step process, it provides insights of representation learning methods, machine and deep learning methods trends, and performance values of existing 50 predictive pipelines. In the context of representation learning methods and classifiers/regressors, a thorough analysis of existing predictive pipelines is utilized for recommendations to develop more robust and precise predictive pipelines.
Collapse
Affiliation(s)
- Ahtisham Fazeel Abbasi
- Smart Data and Knowledge Services, German Research Center for Artificial Intelligence, 67663, Kaiserslautern, Germany.
- Department of Computer Science, Rhineland-Palatinate Technical University Kaiserslautern-Landau, 67663, Kaiserslautern, Germany.
| | - Muhammad Nabeel Asim
- Department of Computer Science, Rhineland-Palatinate Technical University Kaiserslautern-Landau, 67663, Kaiserslautern, Germany
| | - Andreas Dengel
- Smart Data and Knowledge Services, German Research Center for Artificial Intelligence, 67663, Kaiserslautern, Germany
- Department of Computer Science, Rhineland-Palatinate Technical University Kaiserslautern-Landau, 67663, Kaiserslautern, Germany
| |
Collapse
|
7
|
Asim MN, Ibrahim MA, Asif T, Dengel A. RNA sequence analysis landscape: A comprehensive review of task types, databases, datasets, word embedding methods, and language models. Heliyon 2025; 11:e41488. [PMID: 39897847 PMCID: PMC11783440 DOI: 10.1016/j.heliyon.2024.e41488] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/28/2024] [Revised: 12/23/2024] [Accepted: 12/24/2024] [Indexed: 02/04/2025] Open
Abstract
Deciphering information of RNA sequences reveals their diverse roles in living organisms, including gene regulation and protein synthesis. Aberrations in RNA sequence such as dysregulation and mutations can drive a diverse spectrum of diseases including cancers, genetic disorders, and neurodegenerative conditions. Furthermore, researchers are harnessing RNA's therapeutic potential for transforming traditional treatment paradigms into personalized therapies through the development of RNA-based drugs and gene therapies. To gain insights of biological functions and to detect diseases at early stages and develop potent therapeutics, researchers are performing diverse types RNA sequence analysis tasks. RNA sequence analysis through conventional wet-lab methods is expensive, time-consuming and error prone. To enable large-scale RNA sequence analysis, empowerment of wet-lab experimental methods with Artificial Intelligence (AI) applications necessitates scientists to have a comprehensive knowledge of both DNA and AI fields. While molecular biologists encounter challenges in understanding AI methods, computer scientists often lack basic foundations of RNA sequence analysis tasks. Considering the absence of a comprehensive literature that bridges this research gap and promotes the development of AI-driven RNA sequence analysis applications, the contributions of this manuscript are manifold: It equips AI researchers with biological foundations of 47 distinct RNA sequence analysis tasks. It sets a stage for development of benchmark datasets related to 47 distinct RNA sequence analysis tasks by facilitating cruxes of 64 different biological databases. It presents word embeddings and language models applications across 47 distinct RNA sequence analysis tasks. It streamlines the development of new predictors by providing a comprehensive survey of 58 word embeddings and 70 language models based predictive pipelines performance values as well as top performing traditional sequence encoding based predictors and their performances across 47 RNA sequence analysis tasks.
Collapse
Affiliation(s)
- Muhammad Nabeel Asim
- German Research Center for Artificial Intelligence GmbH, Kaiserslautern, 67663, Germany
| | - Muhammad Ali Ibrahim
- German Research Center for Artificial Intelligence GmbH, Kaiserslautern, 67663, Germany
- Department of Computer Science, Rhineland-Palatinate Technical University of Kaiserslautern-Landau, Kaiserslautern, 67663, Germany
| | - Tayyaba Asif
- Department of Computer Science, Rhineland-Palatinate Technical University of Kaiserslautern-Landau, Kaiserslautern, 67663, Germany
| | - Andreas Dengel
- German Research Center for Artificial Intelligence GmbH, Kaiserslautern, 67663, Germany
- Department of Computer Science, Rhineland-Palatinate Technical University of Kaiserslautern-Landau, Kaiserslautern, 67663, Germany
| |
Collapse
|
8
|
Kwon J, Hwang J, Sung JE, Im CH. Speech synthesis from three-axis accelerometer signals using conformer-based deep neural network. Comput Biol Med 2024; 182:109090. [PMID: 39232406 DOI: 10.1016/j.compbiomed.2024.109090] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/22/2024] [Revised: 08/23/2024] [Accepted: 08/29/2024] [Indexed: 09/06/2024]
Abstract
Silent speech interfaces (SSIs) have emerged as innovative non-acoustic communication methods, and our previous study demonstrated the significant potential of three-axis accelerometer-based SSIs to identify silently spoken words with high classification accuracy. The developed accelerometer-based SSI with only four accelerometers and a small training dataset outperformed a conventional surface electromyography (sEMG)-based SSI. In this study, motivated by the promising initial results, we investigated the feasibility of synthesizing spoken speech from three-axis accelerometer signals. This exploration aimed to assess the potential of accelerometer-based SSIs for practical silent communication applications. Nineteen healthy individuals participated in our experiments. Five accelerometers were attached to the face to acquire speech-related facial movements while the participants read 270 Korean sentences aloud. For the speech synthesis, we used a convolution-augmented Transformer (Conformer)-based deep neural network model to convert the accelerometer signals into a Mel spectrogram, from which an audio waveform was synthesized using HiFi-GAN. To evaluate the quality of the generated Mel spectrograms, ten-fold cross-validation was performed, and the Mel cepstral distortion (MCD) was chosen as the evaluation metric. As a result, an average MCD of 5.03 ± 0.65 was achieved using four optimized accelerometers based on our previous study. Furthermore, the quality of generated Mel spectrograms was significantly enhanced by adding one more accelerometer attached under the chin, achieving an average MCD of 4.86 ± 0.65 (p < 0.001, Wilcoxon signed-rank test). Although an objective comparison is difficult, these results surpass those obtained using conventional SSIs based on sEMG, electromagnetic articulography, and electropalatography with the fewest sensors and a similar or smaller number of sentences to train the model. Our proposed approach will contribute to the widespread adoption of accelerometer-based SSIs, leveraging the advantages of accelerometers like low power consumption, invulnerability to physiological artifacts, and high portability.
Collapse
Affiliation(s)
- Jinuk Kwon
- Department of Electronic Engineering, Hanyang University, Seoul, South Korea.
| | - Jihun Hwang
- Department of Electronic Engineering, Hanyang University, Seoul, South Korea.
| | - Jee Eun Sung
- Department of Communication Disorders, Ewha Womans University, Seoul, South Korea.
| | - Chang-Hwan Im
- Department of Electronic Engineering, Hanyang University, Seoul, South Korea; Department of Biomedical Engineering, Hanyang University, Seoul, South Korea; Department of Artificial Intelligence, Hanyang University, Seoul, South Korea; Department of HY-KIST Bio-Convergence, Hanyang University, Seoul, South Korea.
| |
Collapse
|
9
|
John C, Sahoo J, Sajan IK, Madhavan M, Mathew OK. CNN-BLSTM based deep learning framework for eukaryotic kinome classification: An explainability based approach. Comput Biol Chem 2024; 112:108169. [PMID: 39137619 DOI: 10.1016/j.compbiolchem.2024.108169] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/26/2024] [Revised: 05/08/2024] [Accepted: 08/03/2024] [Indexed: 08/15/2024]
Abstract
Classification of protein families from their sequences is an enduring task in Proteomics and related studies. Numerous deep-learning models have been moulded to tackle this challenge, but due to the black-box character, they still fall short in reliability. Here, we present a novel explainability pipeline that explains the pivotal decisions of the deep learning model on the classification of the Eukaryotic kinome. Based on a comparative and experimental analysis of the most cutting-edge deep learning algorithms, the best deep learning model CNN-BLSTM was chosen to classify the eight eukaryotic kinase sequences to their corresponding families. As a substitution for the conventional class activation map-based interpretation of CNN-based models in the domain, we have cascaded the GRAD CAM and Integrated Gradient (IG) explainability modus operandi for improved and responsible results. To ensure the trustworthiness of the classifier, we have masked the kinase domain traces, identified from the explainability pipeline and observed a class-specific drop in F1-score from 0.96 to 0.76. In compliance with the Explainable AI paradigm, our results are promising and contribute to enhancing the trustworthiness of deep learning models for biological sequence-associated studies.
Collapse
Affiliation(s)
- Chinju John
- Department of Computer Science and Engineering, Indian Institute of Information Technology Kottayam, Kottayam, 686635, Kerala, India.
| | - Jayakrushna Sahoo
- Department of Computer Science and Engineering, Indian Institute of Information Technology Kottayam, Kottayam, 686635, Kerala, India
| | - Irish K Sajan
- Department of Computer Science and Engineering, Indian Institute of Information Technology Kottayam, Kottayam, 686635, Kerala, India
| | - Manu Madhavan
- Department of Computer Science and Engineering, Indian Institute of Information Technology Kottayam, Kottayam, 686635, Kerala, India
| | - Oommen K Mathew
- Department of Computer Science and Engineering, Indian Institute of Information Technology Kottayam, Kottayam, 686635, Kerala, India
| |
Collapse
|
10
|
Madan S, Lentzen M, Brandt J, Rueckert D, Hofmann-Apitius M, Fröhlich H. Transformer models in biomedicine. BMC Med Inform Decis Mak 2024; 24:214. [PMID: 39075407 PMCID: PMC11287876 DOI: 10.1186/s12911-024-02600-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/18/2023] [Accepted: 07/08/2024] [Indexed: 07/31/2024] Open
Abstract
Deep neural networks (DNN) have fundamentally revolutionized the artificial intelligence (AI) field. The transformer model is a type of DNN that was originally used for the natural language processing tasks and has since gained more and more attention for processing various kinds of sequential data, including biological sequences and structured electronic health records. Along with this development, transformer-based models such as BioBERT, MedBERT, and MassGenie have been trained and deployed by researchers to answer various scientific questions originating in the biomedical domain. In this paper, we review the development and application of transformer models for analyzing various biomedical-related datasets such as biomedical textual data, protein sequences, medical structured-longitudinal data, and biomedical images as well as graphs. Also, we look at explainable AI strategies that help to comprehend the predictions of transformer-based models. Finally, we discuss the limitations and challenges of current models, and point out emerging novel research directions.
Collapse
Affiliation(s)
- Sumit Madan
- Department of Bioinformatics, Fraunhofer Institute for Algorithms and Scientific Computing (SCAI), Schloss Birlinghoven, Sankt Augustin, 53757, Germany.
- Institute of Computer Science, University of Bonn, Bonn, 53115, Germany.
| | - Manuel Lentzen
- Department of Bioinformatics, Fraunhofer Institute for Algorithms and Scientific Computing (SCAI), Schloss Birlinghoven, Sankt Augustin, 53757, Germany
- Bonn-Aachen International Center for Information Technology (B-IT), University of Bonn, Bonn, 53115, Germany
| | - Johannes Brandt
- School of Medicine, Klinikum Rechts der Isar, Technical University Munich, Munich, Germany
| | - Daniel Rueckert
- School of Medicine, Klinikum Rechts der Isar, Technical University Munich, Munich, Germany
- School of Computation, Information and Technology, Technical University Munich, Munich, Germany
- Department of Computing, Imperial College London, London, UK
| | - Martin Hofmann-Apitius
- Department of Bioinformatics, Fraunhofer Institute for Algorithms and Scientific Computing (SCAI), Schloss Birlinghoven, Sankt Augustin, 53757, Germany
- Bonn-Aachen International Center for Information Technology (B-IT), University of Bonn, Bonn, 53115, Germany
| | - Holger Fröhlich
- Department of Bioinformatics, Fraunhofer Institute for Algorithms and Scientific Computing (SCAI), Schloss Birlinghoven, Sankt Augustin, 53757, Germany.
- Bonn-Aachen International Center for Information Technology (B-IT), University of Bonn, Bonn, 53115, Germany.
| |
Collapse
|
11
|
Zhang Y, Liu C, Liu M, Liu T, Lin H, Huang CB, Ning L. Attention is all you need: utilizing attention in AI-enabled drug discovery. Brief Bioinform 2023; 25:bbad467. [PMID: 38189543 PMCID: PMC10772984 DOI: 10.1093/bib/bbad467] [Citation(s) in RCA: 10] [Impact Index Per Article: 5.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/30/2023] [Revised: 11/03/2023] [Accepted: 11/25/2023] [Indexed: 01/09/2024] Open
Abstract
Recently, attention mechanism and derived models have gained significant traction in drug development due to their outstanding performance and interpretability in handling complex data structures. This review offers an in-depth exploration of the principles underlying attention-based models and their advantages in drug discovery. We further elaborate on their applications in various aspects of drug development, from molecular screening and target binding to property prediction and molecule generation. Finally, we discuss the current challenges faced in the application of attention mechanisms and Artificial Intelligence technologies, including data quality, model interpretability and computational resource constraints, along with future directions for research. Given the accelerating pace of technological advancement, we believe that attention-based models will have an increasingly prominent role in future drug discovery. We anticipate that these models will usher in revolutionary breakthroughs in the pharmaceutical domain, significantly accelerating the pace of drug development.
Collapse
Affiliation(s)
- Yang Zhang
- Innovative Institute of Chinese Medicine and Pharmacy, Academy for Interdiscipline, Chengdu University of Traditional Chinese Medicine, Chengdu, China
| | - Caiqi Liu
- Department of Gastrointestinal Medical Oncology, Harbin Medical University Cancer Hospital, No.150 Haping Road, Nangang District, Harbin, Heilongjiang 150081, China
- Key Laboratory of Molecular Oncology of Heilongjiang Province, No.150 Haping Road, Nangang District, Harbin, Heilongjiang 150081, China
| | - Mujiexin Liu
- Chongqing Key Laboratory of Sichuan-Chongqing Co-construction for Diagnosis and Treatment of Infectious Diseases Integrated Traditional Chinese and Western Medicine, College of Medical Technology, Chengdu University of Traditional Chinese Medicine, Chengdu, China
| | - Tianyuan Liu
- Graduate School of Science and Technology, University of Tsukuba, Tsukuba, Japan
| | - Hao Lin
- School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu 610054, China
| | - Cheng-Bing Huang
- School of Computer Science and Technology, Aba Teachers University, Aba, China
| | - Lin Ning
- Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou, Zhejiang, China
- School of Healthcare Technology, Chengdu Neusoft University, Chengdu 611844, China
| |
Collapse
|
12
|
Park H, Lim SJ, Cosme J, O'Connell K, Sandeep J, Gayanilo F, Cutter Jr. GR, Montes E, Nitikitpaiboon C, Fisher S, Moustahfid H, Thompson LR. Investigation of machine learning algorithms for taxonomic classification of marine metagenomes. Microbiol Spectr 2023; 11:e0523722. [PMID: 37695074 PMCID: PMC10580933 DOI: 10.1128/spectrum.05237-22] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/21/2022] [Accepted: 06/30/2023] [Indexed: 09/12/2023] Open
Abstract
IMPORTANCE Taxonomic profiling of microbial communities is essential to model microbial interactions and inform habitat conservation. This work develops approaches in constructing training/testing data sets from publicly available marine metagenomes and evaluates the performance of machine learning (ML) approaches in read-based taxonomic classification of marine metagenomes. Predictions from two models are used to test accuracy in metagenomic classification and to guide improvements in ML approaches. Our study provides insights on the methods, results, and challenges of deep learning on marine microbial metagenomic data sets. Future machine learning approaches can be improved by rectifying genome coverage and class imbalance in the training data sets, developing alternative models, and increasing the accessibility of computational resources for model training and refinement.
Collapse
Affiliation(s)
- Helen Park
- Center for Synthetic and Systems Biology, School of Life Sciences, Tsinghua-Peking Center for Life Sciences, Tsinghua University, Beijing, China
- EPSRC/BBSRC Future Biomanufacturing Research Hub, EPSRC Synthetic Biology Research Centre SYNBIOCHEM Manchester Institute of Biotechnology and School of Chemistry, The University of Manchester, Manchester, United Kingdom
| | - Shen Jean Lim
- Cooperative Institute for Marine and Atmospheric Studies, Rosenstiel School of Marine, Atmospheric, and Earth Science, University of Miami, Miami, Florida, USA
- Ocean Chemistry and Ecosystems Division, Atlantic Oceanographic and Meteorological Laboratory, National Oceanic and Atmospheric Administration, Miami, Florida, USA
- College of Marine Science, University of South Florida, St Petersburg, Florida, USA
| | | | - Kyle O'Connell
- Deloitte Consulting LLP, Biomedical Data Science Team, Arlington, Virginia, USA
- Department of Vertebrate Zoology, National Museum of Natural History, Smithsonian Institution, Northwest, Washington, DC, USA
| | - Jilla Sandeep
- Harte Research Institute, Texas A&M University-Corpus Christi, Corpus Christi, Texas, USA
| | - Felimon Gayanilo
- Harte Research Institute, Texas A&M University-Corpus Christi, Corpus Christi, Texas, USA
| | - George R. Cutter Jr.
- Southwest Fisheries Science Center, Antarctic Ecosystem Research Division, National Oceanic and Atmospheric Administration, La Jolla, California, USA
| | - Enrique Montes
- Cooperative Institute for Marine and Atmospheric Studies, Rosenstiel School of Marine, Atmospheric, and Earth Science, University of Miami, Miami, Florida, USA
- Ocean Chemistry and Ecosystems Division, Atlantic Oceanographic and Meteorological Laboratory, National Oceanic and Atmospheric Administration, Miami, Florida, USA
| | - Chotinan Nitikitpaiboon
- Department of Computational Biology and Medical Sciences, Graduate School of Frontier Sciences, The University of Tokyo, Tokyo, Japan
| | - Sam Fisher
- Deloitte Consulting LLP, Biomedical Data Science Team, Arlington, Virginia, USA
| | - Hassan Moustahfid
- NOAA/US Integrated Ocean Observing System (IOOS), Silver Spring, Maryland, USA
| | - Luke R. Thompson
- Ocean Chemistry and Ecosystems Division, Atlantic Oceanographic and Meteorological Laboratory, National Oceanic and Atmospheric Administration, Miami, Florida, USA
- Northern Gulf Institute, Mississippi State University, Mississippi, USA
| |
Collapse
|
13
|
Gurdo N, Volke DC, McCloskey D, Nikel PI. Automating the design-build-test-learn cycle towards next-generation bacterial cell factories. N Biotechnol 2023; 74:1-15. [PMID: 36736693 DOI: 10.1016/j.nbt.2023.01.002] [Citation(s) in RCA: 22] [Impact Index Per Article: 11.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/04/2022] [Revised: 01/15/2023] [Accepted: 01/22/2023] [Indexed: 02/04/2023]
Abstract
Automation is playing an increasingly significant role in synthetic biology. Groundbreaking technologies, developed over the past 20 years, have enormously accelerated the construction of efficient microbial cell factories. Integrating state-of-the-art tools (e.g. for genome engineering and analytical techniques) into the design-build-test-learn cycle (DBTLc) will shift the metabolic engineering paradigm from an almost artisanal labor towards a fully automated workflow. Here, we provide a perspective on how a fully automated DBTLc could be harnessed to construct the next-generation bacterial cell factories in a fast, high-throughput fashion. Innovative toolsets and approaches that pushed the boundaries in each segment of the cycle are reviewed to this end. We also present the most recent efforts on automation of the DBTLc, which heralds a fully autonomous pipeline for synthetic biology in the near future.
Collapse
Affiliation(s)
- Nicolás Gurdo
- The Novo Nordisk Foundation Center for Biosustainability, Technical University of Denmark, 2800 Kongens, Lyngby, Denmark
| | - Daniel C Volke
- The Novo Nordisk Foundation Center for Biosustainability, Technical University of Denmark, 2800 Kongens, Lyngby, Denmark
| | - Douglas McCloskey
- The Novo Nordisk Foundation Center for Biosustainability, Technical University of Denmark, 2800 Kongens, Lyngby, Denmark
| | - Pablo Iván Nikel
- The Novo Nordisk Foundation Center for Biosustainability, Technical University of Denmark, 2800 Kongens, Lyngby, Denmark.
| |
Collapse
|
14
|
Ranjan A, Fahad MS, Fernandez-Baca D, Tripathi S, Deepak A. MCWS-Transformers: Towards an Efficient Modeling of Protein Sequences via Multi Context-Window Based Scaled Self-Attention. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2023; 20:1188-1199. [PMID: 35536815 DOI: 10.1109/tcbb.2022.3173789] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 05/04/2023]
Abstract
This paper advances the self-attention mechanism in the standard transformer network specific to the modeling of the protein sequences. We introduce a novel context-window based scaled self-attention mechanism for processing protein sequences that is based on the notion of (i) local context and (ii) large contextual pattern. Both notions are essential to building a good representation for protein sequences. The proposed context-window based scaled self-attention mechanism is further used to build the multi context-window based scaled (MCWS) transformer network for the protein function prediction task at the protein sub-sequence level. Overall, the proposed MCWS transformer network produced improved predictive performances, outperforming existing state-of-the-art approaches by substantial margins. With respect to the standard transformer network, the proposed network produced improvements in F1-score of +2.30% and +2.08% on the biological process (BP) and molecular function (MF) datasets, respectively. The corresponding improvements over the state-of-the-art ProtVecGen-Plus+ProtVecGen-Ensemble approach are +3.38% (BP) and +2.86% (MF). Equally important, robust performances were obtained across protein sequences of different lengths.
Collapse
|
15
|
Rahardja S, Wang M, Nguyen BP, Fränti P, Rahardja S. A lightweight classification of adaptor proteins using transformer networks. BMC Bioinformatics 2022; 23:461. [PMID: 36333658 PMCID: PMC9635127 DOI: 10.1186/s12859-022-05000-6] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/03/2022] [Accepted: 09/13/2022] [Indexed: 11/06/2022] Open
Abstract
BACKGROUND Adaptor proteins play a key role in intercellular signal transduction, and dysfunctional adaptor proteins result in diseases. Understanding its structure is the first step to tackling the associated conditions, spurring ongoing interest in research into adaptor proteins with bioinformatics and computational biology. Our study aims to introduce a small, new, and superior model for protein classification, pushing the boundaries with new machine learning algorithms. RESULTS We propose a novel transformer based model which includes convolutional block and fully connected layer. We input protein sequences from a database, extract PSSM features, then process it via our deep learning model. The proposed model is efficient and highly compact, achieving state-of-the-art performance in terms of area under the receiver operating characteristic curve, Matthew's Correlation Coefficient and Receiver Operating Characteristics curve. Despite merely 20 hidden nodes translating to approximately 1% of the complexity of previous best known methods, the proposed model is still superior in results and computational efficiency. CONCLUSIONS The proposed model is the first transformer model used for recognizing adaptor protein, and outperforms all existing methods, having PSSM profiles as inputs that comprises convolutional blocks, transformer and fully connected layers for the use of classifying adaptor proteins.
Collapse
Affiliation(s)
- Sylwan Rahardja
- grid.9668.10000 0001 0726 2490School of Computing, University of Eastern Finland, Joensuu, Finland
| | - Mou Wang
- grid.440588.50000 0001 0307 1240School of Marine Science and Technology, Northwestern Polytechnical University and Singapore Institute of Technology, 710072 Xi’an, China
| | - Binh P. Nguyen
- grid.267827.e0000 0001 2292 3111School of Mathematics and Statistics, Victoria University of Wellington, Wellington, New Zealand
| | - Pasi Fränti
- grid.9668.10000 0001 0726 2490School of Computing, University of Eastern Finland, Joensuu, Finland
| | - Susanto Rahardja
- grid.440588.50000 0001 0307 1240School of Marine Science and Technology, Northwestern Polytechnical University and Singapore Institute of Technology, 710072 Xi’an, China ,grid.486188.b0000 0004 1790 4399Singapore Institute of Technology, Singapore, 138683 Singapore
| |
Collapse
|
16
|
Iuchi H, Matsutani T, Yamada K, Iwano N, Sumi S, Hosoda S, Zhao S, Fukunaga T, Hamada M. Representation learning applications in biological sequence analysis. Comput Struct Biotechnol J 2021; 19:3198-3208. [PMID: 34141139 PMCID: PMC8190442 DOI: 10.1016/j.csbj.2021.05.039] [Citation(s) in RCA: 35] [Impact Index Per Article: 8.8] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/26/2021] [Revised: 05/10/2021] [Accepted: 05/20/2021] [Indexed: 12/16/2022] Open
Abstract
Although remarkable advances have been reported in high-throughput sequencing, the ability to aptly analyze a substantial amount of rapidly generated biological (DNA/RNA/protein) sequencing data remains a critical hurdle. To tackle this issue, the application of natural language processing (NLP) to biological sequence analysis has received increased attention. In this method, biological sequences are regarded as sentences while the single nucleic acids/amino acids or k-mers in these sequences represent the words. Embedding is an essential step in NLP, which performs the conversion of these words into vectors. Specifically, representation learning is an approach used for this transformation process, which can be applied to biological sequences. Vectorized biological sequences can then be applied for function and structure estimation, or as input for other probabilistic models. Considering the importance and growing trend for the application of representation learning to biological research, in the present study, we have reviewed the existing knowledge in representation learning for biological sequence analysis.
Collapse
Affiliation(s)
- Hitoshi Iuchi
- Waseda Research Institute for Science and Engineering, Waseda University, Tokyo 169-8555, Japan
- Computational Bio Big-Data Open Innovation Laboratory (CBBD-OIL), National Institute of Advanced Industrial Science and Technology (AIST), Tokyo 169-8555, Japan
| | - Taro Matsutani
- Computational Bio Big-Data Open Innovation Laboratory (CBBD-OIL), National Institute of Advanced Industrial Science and Technology (AIST), Tokyo 169-8555, Japan
- Graduate School of Advanced Science and Engineering, Waseda University, Tokyo 169-8555, Japan
| | - Keisuke Yamada
- School of Advanced Science and Engineering, Waseda University, Tokyo 169-8555, Japan
| | - Natsuki Iwano
- Graduate School of Advanced Science and Engineering, Waseda University, Tokyo 169-8555, Japan
| | - Shunsuke Sumi
- Graduate School of Advanced Science and Engineering, Waseda University, Tokyo 169-8555, Japan
- Department of Life Science Frontiers, Center for iPS Cell Research and Application, Kyoto University, Kyoto 606-8507, Japan
| | - Shion Hosoda
- Computational Bio Big-Data Open Innovation Laboratory (CBBD-OIL), National Institute of Advanced Industrial Science and Technology (AIST), Tokyo 169-8555, Japan
- Graduate School of Advanced Science and Engineering, Waseda University, Tokyo 169-8555, Japan
| | - Shitao Zhao
- Waseda Research Institute for Science and Engineering, Waseda University, Tokyo 169-8555, Japan
| | - Tsukasa Fukunaga
- Waseda Institute for Advanced Study, Waseda University, Tokyo 169-0051, Japan
- Department of Computer Science, Graduate School of Information Science and Technology, The University of Tokyo, Tokyo 113-0032, Japan
| | - Michiaki Hamada
- Computational Bio Big-Data Open Innovation Laboratory (CBBD-OIL), National Institute of Advanced Industrial Science and Technology (AIST), Tokyo 169-8555, Japan
- Graduate School of Advanced Science and Engineering, Waseda University, Tokyo 169-8555, Japan
- School of Advanced Science and Engineering, Waseda University, Tokyo 169-8555, Japan
- Graduate School of Medicine, Nippon Medical School, Tokyo 113-8602, Japan
| |
Collapse
|
17
|
Clauwaert J, Menschaert G, Waegeman W. Explainability in transformer models for functional genomics. Brief Bioinform 2021; 22:6214646. [PMID: 33834200 PMCID: PMC8425421 DOI: 10.1093/bib/bbab060] [Citation(s) in RCA: 9] [Impact Index Per Article: 2.3] [Reference Citation Analysis] [Abstract] [Key Words] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/28/2020] [Revised: 01/28/2021] [Accepted: 02/05/2021] [Indexed: 11/16/2022] Open
Abstract
The effectiveness of deep learning methods can be largely attributed to the automated extraction of relevant features from raw data. In the field of functional genomics, this generally concerns the automatic selection of relevant nucleotide motifs from DNA sequences. To benefit from automated learning methods, new strategies are required that unveil the decision-making process of trained models. In this paper, we present a new approach that has been successful in gathering insights on the transcription process in Escherichia coli. This work builds upon a transformer-based neural network framework designed for prokaryotic genome annotation purposes. We find that the majority of subunits (attention heads) of the model are specialized towards identifying transcription factors and are able to successfully characterize both their binding sites and consensus sequences, uncovering both well-known and potentially novel elements involved in the initiation of the transcription process. With the specialization of the attention heads occurring automatically, we believe transformer models to be of high interest towards the creation of explainable neural networks in this field.
Collapse
Affiliation(s)
- Jim Clauwaert
- Department of Data Analysis and Mathematical Modelling, Ghent University, Coupure Links 653, 9000 Gent, Belgium
| | - Gerben Menschaert
- Department of Data Analysis and Mathematical Modelling, Ghent University, Coupure Links 653, 9000 Gent, Belgium
| | - Willem Waegeman
- Department of Data Analysis and Mathematical Modelling, Ghent University, Coupure Links 653, 9000 Gent, Belgium
| |
Collapse
|
18
|
Yuan L, Yang Y. DeCban: Prediction of circRNA-RBP Interaction Sites by Using Double Embeddings and Cross-Branch Attention Networks. Front Genet 2021; 11:632861. [PMID: 33552144 PMCID: PMC7862712 DOI: 10.3389/fgene.2020.632861] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/24/2020] [Accepted: 12/23/2020] [Indexed: 12/17/2022] Open
Abstract
Circular RNAs (circRNAs), as a rising star in the RNA world, play important roles in various biological processes. Understanding the interactions between circRNAs and RNA binding proteins (RBPs) can help reveal the functions of circRNAs. For the past decade, the emergence of high-throughput experimental data, like CLIP-Seq, has made the computational identification of RNA-protein interactions (RPIs) possible based on machine learning methods. However, as the underlying mechanisms of RPIs have not been fully understood yet and the information sources of circRNAs are limited, the computational tools for predicting circRNA-RBP interactions have been very few. In this study, we propose a deep learning method to identify circRNA-RBP interactions, called DeCban, which is featured by hybrid double embeddings for representing RNA sequences and a cross-branch attention neural network for classification. To capture more information from RNA sequences, the double embeddings include pre-trained embedding vectors for both RNA segments and their converted amino acids. Meanwhile, the cross-branch attention network aims to address the learning of very long sequences by integrating features of different scales and focusing on important information. The experimental results on 37 benchmark datasets show that both double embeddings and the cross-branch attention model contribute to the improvement of performance. DeCban outperforms the mainstream deep learning-based methods on not only prediction accuracy but also computational efficiency. The data sets and source code of this study are freely available at: https://github.com/AaronYll/DECban.
Collapse
Affiliation(s)
- Liangliang Yuan
- Department of Computer Science and Engineering, Center for Brain-Like Computing and Machine Intelligence, Shanghai Jiao Tong University, Shanghai, China
| | - Yang Yang
- Department of Computer Science and Engineering, Center for Brain-Like Computing and Machine Intelligence, Shanghai Jiao Tong University, Shanghai, China.,Key Laboratory of Shanghai Education Commission for Intelligent Interaction and Cognitive Engineering, Shanghai, China
| |
Collapse
|