1
|
Crauwels C, Díaz A, Vranken W. GPCRchimeraDB: A Database of Chimeric G Protein-Coupled Receptors (GPCRs) to Assist Their Design. J Mol Biol 2025; 437:169164. [PMID: 40268234 DOI: 10.1016/j.jmb.2025.169164] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/16/2024] [Revised: 04/11/2025] [Accepted: 04/16/2025] [Indexed: 04/25/2025]
Abstract
G protein-coupled receptors (GPCRs) are membrane proteins crucial to numerous diseases, yet many remain poorly characterized and untargeted by drugs. Chimeric GPCRs have emerged as valuable tools for elucidating GPCR function by facilitating the identification of signaling pathways, resolving structures, and discovering novel ligands of poorly understood GPCRs. Such chimeric GPCRs are obtained by merging a well- and less-well-characterized GPCR at the intracellular limits of their transmembrane regions or intracellular loops, leveraging knowledge transfer from the well-characterized GPCR. However, despite the engineering of over 200 chimeric GPCRs to date, the design process remains largely trial-and-error and lacks a standardized approach. To address this gap, we introduce GPCRchimeraDB (https://www.bio2byte.be/gpcrchimeradb/), the first comprehensive database dedicated to chimeric GPCRs. It catalogs 212 chimeric receptors, identified through literature review, and includes 1,755 class A natural GPCRs, enabling connections between chimeras and their parent receptors while facilitating the exploration of novel parent combinations. Both chimeric and natural GPCR entries are extensively described at the sequence, structural, and biophysical level through a range of visualization tools, with annotations from resources like UniProt and GPCRdb and predictions from AlphaFold2 and b2btools. Additionally, GPCRchimeraDB offers a GPCR sequence aligner and a feature comparator to investigate differences between natural and chimeric receptors. It also provides design guidelines to support rational chimera engineering. GPCRchimeraDB is therefore a resource to facilitate and optimize the design of new chimeras, so helping to gain insights into poorly characterized receptors and contributing to advances in GPCR therapeutic development.
Collapse
Affiliation(s)
- Charlotte Crauwels
- Interuniversity Institute of Bioinformatics in Brussels, ULB-VUB, Brussels, Belgium; Structural Biology Brussels, Vrije Universiteit Brussel, Brussels, Belgium; AI Lab, Vrije Universiteit Brussel, Brussels, Belgium
| | - Adrián Díaz
- Interuniversity Institute of Bioinformatics in Brussels, ULB-VUB, Brussels, Belgium; Structural Biology Brussels, Vrije Universiteit Brussel, Brussels, Belgium; AI Lab, Vrije Universiteit Brussel, Brussels, Belgium
| | - Wim Vranken
- Interuniversity Institute of Bioinformatics in Brussels, ULB-VUB, Brussels, Belgium; Structural Biology Brussels, Vrije Universiteit Brussel, Brussels, Belgium; AI Lab, Vrije Universiteit Brussel, Brussels, Belgium; Chemistry Department, Vrije Universiteit Brussel, Brussels, Belgium; Biomedical Sciences, Vrije Universiteit Brussel, Brussels, Belgium.
| |
Collapse
|
2
|
Iovino BG, Tang H, Ye Y. Protein domain embeddings for fast and accurate similarity search. Genome Res 2024; 34:1434-1444. [PMID: 39237301 PMCID: PMC11529836 DOI: 10.1101/gr.279127.124] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/15/2024] [Accepted: 09/03/2024] [Indexed: 09/07/2024]
Abstract
Recently developed protein language models have enabled a variety of applications with the protein contextual embeddings they produce. Per-protein representations (each protein is represented as a vector of fixed dimension) can be derived via averaging the embeddings of individual residues, or applying matrix transformation techniques such as the discrete cosine transformation (DCT) to matrices of residue embeddings. Such protein-level embeddings have been applied to enable fast searches of similar proteins; however, limitations have been found; for example, PROST is good at detecting global homologs but not local homologs, and knnProtT5 excels for proteins with single domains but not multidomain proteins. Here, we propose a novel approach that first segments proteins into domains (or subdomains) and then applies the DCT to the vectorized embeddings of residues in each domain to infer domain-level contextual vectors. Our approach, called DCTdomain, uses predicted contact maps from ESM-2 for domain segmentation, which is formulated as a domain segmentation problem and can be solved using a recursive cut algorithm (RecCut in short) in quadratic time to the protein length; for comparison, an existing approach for domain segmentation uses a cubic-time algorithm. We show such domain-level contextual vectors (termed as DCT fingerprints) enable fast and accurate detection of similarity between proteins that share global similarities but with undefined extended regions between shared domains, and those that only share local similarities. In addition, tests on a database search benchmark show that the DCTdomain is able to detect distant homologs by leveraging the structural information in the contextual embeddings.
Collapse
Affiliation(s)
- Benjamin Giovanni Iovino
- Luddy School of Informatics, Computing and Engineering, Indiana University, Bloomington, Indiana 47408, USA
| | - Haixu Tang
- Luddy School of Informatics, Computing and Engineering, Indiana University, Bloomington, Indiana 47408, USA
| | - Yuzhen Ye
- Luddy School of Informatics, Computing and Engineering, Indiana University, Bloomington, Indiana 47408, USA
| |
Collapse
|
3
|
Ao C, Jiao S, Wang Y, Yu L, Zou Q. Biological Sequence Classification: A Review on Data and General Methods. RESEARCH (WASHINGTON, D.C.) 2022; 2022:0011. [PMID: 39285948 PMCID: PMC11404319 DOI: 10.34133/research.0011] [Citation(s) in RCA: 47] [Impact Index Per Article: 15.7] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 07/25/2022] [Accepted: 10/25/2022] [Indexed: 09/19/2024]
Abstract
With the rapid development of biotechnology, the number of biological sequences has grown exponentially. The continuous expansion of biological sequence data promotes the application of machine learning in biological sequences to construct predictive models for mining biological sequence information. There are many branches of biological sequence classification research. In this review, we mainly focus on the function and modification classification of biological sequences based on machine learning. Sequence-based prediction and analysis are the basic tasks to understand the biological functions of DNA, RNA, proteins, and peptides. However, there are hundreds of classification models developed for biological sequences, and the quite varied specific methods seem dizzying at first glance. Here, we aim to establish a long-term support website (http://lab.malab.cn/~acy/BioseqData/home.html), which provides readers with detailed information on the classification method and download links to relevant datasets. We briefly introduce the steps to build an effective model framework for biological sequence data. In addition, a brief introduction to single-cell sequencing data analysis methods and applications in biology is also included. Finally, we discuss the current challenges and future perspectives of biological sequence classification research.
Collapse
Affiliation(s)
- Chunyan Ao
- School of Computer Science and Technology, Xidian University, Xi'an, China
- Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou, China
- Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, China
| | - Shihu Jiao
- Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou, China
| | - Yansu Wang
- Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, China
| | - Liang Yu
- School of Computer Science and Technology, Xidian University, Xi'an, China
| | - Quan Zou
- Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou, China
- Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, China
| |
Collapse
|
4
|
Barukab O, Ali F, Khan SA. DBP-GAPred: An intelligent method for prediction of DNA-binding proteins types by enhanced evolutionary profile features with ensemble learning. J Bioinform Comput Biol 2021; 19:2150018. [PMID: 34291709 DOI: 10.1142/s0219720021500189] [Citation(s) in RCA: 22] [Impact Index Per Article: 5.5] [Reference Citation Analysis] [Abstract] [Key Words] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/18/2022]
Abstract
DNA-binding proteins (DBPs) perform an influential role in diverse biological activities like DNA replication, slicing, repair, and transcription. Some DBPs are indispensable for understanding many types of human cancers (i.e. lung, breast, and liver cancer) and chronic diseases (i.e. AIDS/HIV, asthma), while other kinds are involved in antibiotics, steroids, and anti-inflammatory drugs designing. These crucial processes are closely related to DBPs types. DBPs are categorized into single-stranded DNA-binding proteins (ssDBPs) and double-stranded DNA-binding proteins (dsDBPs). Few computational predictors have been reported for discriminating ssDBPs and dsDBPs. However, due to the limitations of the existing methods, an intelligent computational system is still highly desirable. In this work, features from protein sequences are discovered by extending the notion of dipeptide composition (DPC), evolutionary difference formula (EDF), and K-separated bigram (KSB) into the position-specific scoring matrix (PSSM). The highly intrinsic information was encoded by a compression approach named discrete cosine transform (DCT) and the model was trained with support vector machine (SVM). The prediction performance was further boosted by the genetic algorithm (GA) ensemble strategy. The novel predictor (DBP-GAPred) acquired 1.89%, 0.28%, and 6.63% higher accuracies on jackknife, 10-fold, and independent dataset tests, respectively than the best predictor. These outcomes confirm the superiority of our method over the existing predictors.
Collapse
Affiliation(s)
- Omar Barukab
- Faculty of Computing and Information Technology, King Abdulaziz University, Rabigh 21911 Jeddah, Saudi Arabia
| | - Farman Ali
- School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing, P. R. China
| | - Sher Afzal Khan
- Department of Computer Science, Abdul Wali Khan University Mardan, Pakistan
| |
Collapse
|
5
|
Raimondi D, Orlando G, Michiels E, Pakravan D, Bratek-Skicki A, Van Den Bosch L, Moreau Y, Rousseau F, Schymkowitz J. In-silico prediction of in-vitro protein liquid-liquid phase separation experiments outcomes with multi-head neural attention. Bioinformatics 2021; 37:3473-3479. [PMID: 33983381 DOI: 10.1093/bioinformatics/btab350] [Citation(s) in RCA: 13] [Impact Index Per Article: 3.3] [Reference Citation Analysis] [Abstract] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/05/2020] [Revised: 04/15/2021] [Accepted: 05/10/2021] [Indexed: 01/21/2023] Open
Abstract
MOTIVATION Proteins able to undergo Liquid-Liquid Phase Separation (LLPS) in-vivo and in-vitro are drawing a lot of interest, due to their functional relevance for cell life. Nevertheless, the proteome-scale experimental screening of these proteins seems unfeasible, because besides being expensive and time consuming, LLPS is heavily influenced by multiple environmental conditions such as concentration, pH and temperature, thus requiring a combinatorial number of experiments for each protein. RESULTS To overcome this problem, we propose an Neural Network model able to predict the LLPS behavior of proteins given specified experimental conditions, effectively predicting the outcome of in-vitro experiments. Our model can be used to rapidly screen proteins and experimental conditions searching for LLPS, thus reducing the search space that needs to be covered experimentally. We experimentally validate Droppler's prediction on the the TAR DNA-binding protein in different experimental conditions, showing the consistency of its predictions. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
| | | | | | - Donya Pakravan
- KU Leuven, Department of Neurosciences, Leuven, LBI, Belgium.,VIB, Center for Brain and Disease Research, Laboratory of Neurobiology, Leuven, Belgium
| | | | - Ludo Van Den Bosch
- KU Leuven, Department of Neurosciences, Leuven, LBI, Belgium.,VIB, Center for Brain and Disease Research, Laboratory of Neurobiology, Leuven, Belgium
| | - Yves Moreau
- ESAT-STADIUS, 3001 Leuven, Leuven, KU, Belgium
| | | | | |
Collapse
|
6
|
Gao S, Yu S, Yao S. An efficient protein homology detection approach based on seq2seq model and ranking. BIOTECHNOL BIOTEC EQ 2021. [DOI: 10.1080/13102818.2021.1892522] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/21/2022] Open
Affiliation(s)
- Song Gao
- Department of Information and Electronic Science, School of Information Science and Engineering, Yunnan University, Kunming, PR China
| | - Shui Yu
- School of Computer Science, Faculty of Engineering and Information Technology, University of Technology Sydney, Sydney, Australia
| | - Shaowen Yao
- Department of Cyberspace Security, National Pilot School of Software, Yunnan University, Kunming, PR China
| |
Collapse
|
7
|
Orlando G, Raimondi D, Kagami LP, Vranken WF. ShiftCrypt: a web server to understand and biophysically align proteins through their NMR chemical shift values. Nucleic Acids Res 2020; 48:W36-W40. [PMID: 32459331 PMCID: PMC7319548 DOI: 10.1093/nar/gkaa391] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/29/2020] [Revised: 04/21/2020] [Accepted: 05/04/2020] [Indexed: 02/06/2023] Open
Abstract
Nuclear magnetic resonance (NMR) spectroscopy data provides valuable information on the behaviour of proteins in solution. The primary data to determine when studying proteins are the per-atom NMR chemical shifts, which reflect the local environment of atoms and provide insights into amino acid residue dynamics and conformation. Within an amino acid residue, chemical shifts present multi-dimensional and complexly cross-correlated information, making them difficult to analyse. The ShiftCrypt method, based on neural network auto-encoder architecture, compresses the per-amino acid chemical shift information in a single, interpretable, amino acid-type independent value that reflects the biophysical state of a residue. We here present the ShiftCrypt web server, which makes the method readily available. The server accepts chemical shifts input files in the NMR Exchange Format (NEF) or NMR-STAR format, executes ShiftCrypt and visualises the results, which are also accessible via an API. It also enables the ”biophysically-based” pairwise alignment of two proteins based on their ShiftCrypt values. This approach uses Dynamic Time Warping and can optionally include their amino acid code information, and has applications in, for example, the alignment of disordered regions. The server uses a token-based system to ensure the anonymity of the users and results. The web server is available at www.bio2byte.be/shiftcrypt.
Collapse
Affiliation(s)
- Gabriele Orlando
- Interuniversity Institute of Bioinformatics in Brussels, ULB-VUB, Triomflaan, Brussels 1050, Belgium.,Switch Laboratory, VIB, Leuven, Belgium
| | - Daniele Raimondi
- ESAT-STADIUS, KU Leuven, Kasteelpark Arenberg 10, 3001 Leuven, Belgium
| | - Luciano Porto Kagami
- Interuniversity Institute of Bioinformatics in Brussels, ULB-VUB, Triomflaan, Brussels 1050, Belgium
| | - Wim F Vranken
- Interuniversity Institute of Bioinformatics in Brussels, ULB-VUB, Triomflaan, Brussels 1050, Belgium.,Structural Biology Brussels, Vrije Universiteit Brussel, Pleinlaan 2, Brussels 1050, Belgium.,VIB Structural Biology Research Centre, Pleinlaan 2, Brussels 1050, Belgium
| |
Collapse
|
8
|
Margelevičius M. COMER2: GPU-accelerated sensitive and specific homology searches. Bioinformatics 2020; 36:3570-3572. [PMID: 32167522 PMCID: PMC7267824 DOI: 10.1093/bioinformatics/btaa185] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/23/2019] [Revised: 03/02/2020] [Accepted: 03/10/2020] [Indexed: 11/30/2022] Open
Abstract
Summary Searching for homology in the vast amount of sequence data has a particular emphasis on its speed. We present a completely rewritten version of the sensitive homology search method COMER based on alignment of protein sequence profiles, which is capable of searching big databases even on a lightweight laptop. By harnessing the power of CUDA-enabled graphics processing units, it is up to 20 times faster than HHsearch, a state-of-the-art method using vectorized instructions on modern CPUs. Availability and implementation COMER2 is cross-platform open-source software available at https://sourceforge.net/projects/comer2 and https://github.com/minmarg/comer2. It can be easily installed from source code or using stand-alone installers. Contact mindaugas.margelevicius@bti.vu.lt Supplementary information Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
- Mindaugas Margelevičius
- Institute of Biotechnology, Life Sciences Center, Vilnius University, Vilnius 10257, Lithuania
| |
Collapse
|
9
|
Raimondi D, Orlando G, Fariselli P, Moreau Y. Insight into the protein solubility driving forces with neural attention. PLoS Comput Biol 2020; 16:e1007722. [PMID: 32352965 PMCID: PMC7217484 DOI: 10.1371/journal.pcbi.1007722] [Citation(s) in RCA: 27] [Impact Index Per Article: 5.4] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/03/2019] [Revised: 05/12/2020] [Accepted: 02/10/2020] [Indexed: 12/29/2022] Open
Abstract
Protein solubility is a key aspect for many biotechnological, biomedical and industrial processes, such as the production of active proteins and antibodies. In addition, understanding the molecular determinants of the solubility of proteins may be crucial to shed light on the molecular mechanisms of diseases caused by aggregation processes such as amyloidosis. Here we present SKADE, a novel Neural Network protein solubility predictor and we show how it can provide novel insight into the protein solubility mechanisms, thanks to its neural attention architecture. First, we show that SKADE positively compares with state of the art tools while using just the protein sequence as input. Then, thanks to the neural attention mechanism, we use SKADE to investigate the patterns learned during training and we analyse its decision process. We use this peculiarity to show that, while the attention profiles do not correlate with obvious sequence aspects such as biophysical properties of the aminoacids, they suggest that N- and C-termini are the most relevant regions for solubility prediction and are predictive for complex emergent properties such as aggregation-prone regions involved in beta-amyloidosis and contact density. Moreover, SKADE is able to identify mutations that increase or decrease the overall solubility of the protein, allowing it to be used to perform large scale in-silico mutagenesis of proteins in order to maximize their solubility. The solubility of proteins is a crucial biophysical aspect when it comes to understanding many human diseases and to improve the industrial processes for protein production. Due to its relevance, computational methods have been devised in order to study and possibly optimize the solubility of proteins. In this work we apply a deep-learning technique, called neural attention to predict protein solubility while “opening” the model itself to interpretability, even though Machine Learning models are usually considered black boxes. Thank to the attention mechanism, we show that i) our model implicitly learns complex patterns related to emergent, protein folding-related, aspects such as to recognize β-amyloidosis regions and that ii) the N-and C-termini are the regions with the highes signal fro solubility prediction. When it comes to enhancing the solubility of proteins, we, for the first time, propose to investigate the synergistic effects of tandem mutations instead of “single” mutations, suggesting that this could minimize the number of required proposed mutations.
Collapse
Affiliation(s)
| | | | | | - Yves Moreau
- ESAT-STADIUS, KU Leuven, Leuven, Belgium
- * E-mail:
| |
Collapse
|
10
|
Raimondi D, Orlando G, Vranken WF, Moreau Y. Exploring the limitations of biophysical propensity scales coupled with machine learning for protein sequence analysis. Sci Rep 2019; 9:16932. [PMID: 31729443 PMCID: PMC6858301 DOI: 10.1038/s41598-019-53324-w] [Citation(s) in RCA: 14] [Impact Index Per Article: 2.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/05/2019] [Accepted: 10/25/2019] [Indexed: 11/21/2022] Open
Abstract
Machine learning (ML) is ubiquitous in bioinformatics, due to its versatility. One of the most crucial aspects to consider while training a ML model is to carefully select the optimal feature encoding for the problem at hand. Biophysical propensity scales are widely adopted in structural bioinformatics because they describe amino acids properties that are intuitively relevant for many structural and functional aspects of proteins, and are thus commonly used as input features for ML methods. In this paper we reproduce three classical structural bioinformatics prediction tasks to investigate the main assumptions about the use of propensity scales as input features for ML methods. We investigate their usefulness with different randomization experiments and we show that their effectiveness varies among the ML methods used and the tasks. We show that while linear methods are more dependent on the feature encoding, the specific biophysical meaning of the features is less relevant for non-linear methods. Moreover, we show that even among linear ML methods, the simpler one-hot encoding can surprisingly outperform the “biologically meaningful” scales. We also show that feature selection performed with non-linear ML methods may not be able to distinguish between randomized and “real” propensity scales by properly prioritizing to the latter. Finally, we show that learning problem-specific embeddings could be a simple, assumptions-free and optimal way to perform feature learning/engineering for structural bioinformatics tasks.
Collapse
Affiliation(s)
| | - Gabriele Orlando
- Interuniversity Institute of Bioinformatics in Brussels, ULB-VUB, 1050, Brussels, Belgium
| | - Wim F Vranken
- Interuniversity Institute of Bioinformatics in Brussels, ULB-VUB, 1050, Brussels, Belgium.,Structural Biology Brussels, Vrije Universiteit Brussel, Brussels, 1050, Belgium
| | - Yves Moreau
- ESAT-STADIUS, KU Leuven, 3001, Leuven, Belgium.
| |
Collapse
|