1
Contreras-Torres E, Marrero-Ponce Y. MD-LAIs Software: Computing Whole-Sequence and Amino Acid-Level "Embeddings" for Peptides and Proteins. J Chem Inf Model 2024; 64:8665-8672. [PMID: 39552512 DOI: 10.1021/acs.jcim.3c01189] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/19/2024]
Abstract
Several computational tools have been developed to calculate sequence-based molecular descriptors (MDs) for peptides and proteins. However, these tools have certain limitations: 1) They generally lack capabilities for curating input data. 2) Their outputs often exhibit significant overlap. 3) There is limited availability of MDs at the amino acid (aa) level. 4) They lack flexibility in computing specific MDs. To address these issues, we developed MD-LAIs (Molecular Descriptors from Local Amino acid Invariants), Java-based software designed to compute both whole-sequence and aa-level MDs for peptides and proteins. These MDs are generated by applying aggregation operators (AOs) to macromolecular vectors containing the chemical-physical and structural properties of aas. The set of AOs includes both nonclassical (e.g., Minkowski norms) and classical AOs (e.g., Radial Distribution Function). Classical AOs capture neighborhood structural information at different k levels, while nonclassical AOs are applied using a sliding window to generalize the aa-level output. A weighting system based on fuzzy membership functions is also included to account for the contributions of individual aas. MD-LAIs features: 1) a module for data curation tasks, 2) a feature selection module, 3) projects of highly relevant MDs, and 4) low-dimensional lists of informative global and aa-level MDs. Overall, we expect that MD-LAIs will be a valuable tool for encoding protein or peptide sequences. The software is freely available as a stand-alone system on GitHub (https://github.com/Grupo-Medicina-Molecular-y-Traslacional/MD_LAIS).
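As a rough illustration of the idea described in this abstract (an aggregation operator applied to a per-residue property vector, both over the whole sequence and through a sliding window), the following minimal Python sketch may help. It is not MD-LAIs code: the function names, the window size, and the choice of the Kyte-Doolittle hydropathy scale as the amino acid property are all assumptions made for illustration, and the truncated windows at the termini are one possible convention among several.

```python
# Illustrative sketch only; not the MD-LAIs implementation.
# Kyte-Doolittle hydropathy values, used here as an example amino acid property.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def property_vector(sequence, scale=KYTE_DOOLITTLE):
    """Map each residue to its property value (hypothetical helper)."""
    return [scale[aa] for aa in sequence.upper()]


def minkowski_norm(values, p=2.0):
    """Minkowski norm of order p: (sum |v|^p)^(1/p)."""
    return sum(abs(v) ** p for v in values) ** (1.0 / p)


def sliding_window_descriptors(values, window=3, p=2.0):
    """Apply the norm to a window centered on each residue.

    Windows are truncated at the sequence termini (an assumption),
    so the output has one value per residue.
    """
    half = window // 2
    descriptors = []
    for i in range(len(values)):
        lo = max(0, i - half)
        hi = min(len(values), i + half + 1)
        descriptors.append(minkowski_norm(values[lo:hi], p))
    return descriptors


if __name__ == "__main__":
    vec = property_vector("ACDKLV")
    print("whole-sequence descriptor:", minkowski_norm(vec, p=2))
    print("aa-level descriptors:", sliding_window_descriptors(vec, window=3))
```

Calling `minkowski_norm` on the full vector yields one whole-sequence value, while `sliding_window_descriptors` returns one value per residue, mirroring the global versus aa-level distinction drawn in the abstract.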
Affiliation(s)
- Ernesto Contreras-Torres
  - Norwegian Cruise Line Holdings Limited, Corporate Center Drive, Miami, Florida 33216, United States
  - Facultad de Ingeniería, Universidad Panamericana, Augusto Rodin No. 498, Insurgentes Mixcoac, Benito Juárez, Ciudad de México 03920, México
- Yovani Marrero-Ponce
  - Facultad de Ingeniería, Universidad Panamericana, Augusto Rodin No. 498, Insurgentes Mixcoac, Benito Juárez, Ciudad de México 03920, México
  - Universidad San Francisco de Quito (USFQ), Grupo de Medicina Molecular y Traslacional (MeM&T), Colegio de Ciencias de la Salud (COCSA), Escuela de Medicina, Edificio de Especialidades Médicas, Quito 170157 Pichincha, Ecuador
2
Pratyush P, Bahmani S, Pokharel S, Ismail HD, KC DB. LMCrot: an enhanced protein crotonylation site predictor by leveraging an interpretable window-level embedding from a transformer-based protein language model. Bioinformatics 2024; 40:btae290. [PMID: 38662579 PMCID: PMC11088740 DOI: 10.1093/bioinformatics/btae290] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/14/2023] [Revised: 02/13/2024] [Accepted: 04/24/2024] [Indexed: 05/13/2024] [Open Access]
Abstract
MOTIVATION Recent advancements in natural language processing have highlighted the effectiveness of global contextualized representations from protein language models (pLMs) in numerous downstream tasks. Nonetheless, strategies to encode the site-of-interest leveraging pLMs for per-residue prediction tasks, such as crotonylation (Kcr) prediction, remain largely uncharted.
RESULTS Herein, we adopt a range of approaches for utilizing pLMs by experimenting with different input sequence types (full-length protein sequence versus window sequence), assessing the implications of utilizing per-residue embedding of the site-of-interest as well as embeddings of window residues centered around it. Building upon these insights, we developed a novel residual ConvBiLSTM network designed to process window-level embeddings of the site-of-interest generated by the ProtT5-XL-UniRef50 pLM using full-length sequences as input. This model, termed T5ResConvBiLSTM, surpasses existing state-of-the-art Kcr predictors in performance across three diverse datasets. To validate our approach of utilizing full sequence-based window-level embeddings, we also delved into the interpretability of ProtT5-derived embedding tensors in two ways: firstly, by scrutinizing the attention weights obtained from the transformer's encoder block; and secondly, by computing SHAP values for these tensors, providing a model-agnostic interpretation of the prediction results. Additionally, we enhance the latent representation of ProtT5 by incorporating two additional local representations, one derived from amino acid properties and the other from a supervised embedding layer, through an intermediate fusion stacked generalization approach, using an n-mer window sequence (or peptide/fragment). The resultant stacked model, dubbed LMCrot, exhibits a more pronounced improvement in predictive performance across the tested datasets.
AVAILABILITY AND IMPLEMENTATION LMCrot is publicly available at https://github.com/KCLabMTU/LMCrot.
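The notion of a window-level embedding taken from full-length per-residue embeddings can be sketched as follows. This is not LMCrot code: the embeddings here are toy placeholder vectors rather than ProtT5-XL-UniRef50 output, and the function names, the flank size, and the zero-padding at the termini are assumptions made for illustration.

```python
# Illustrative sketch only; not the LMCrot implementation.

def lysine_sites(sequence):
    """Return 0-based positions of lysine (K), the candidate Kcr sites."""
    return [i for i, aa in enumerate(sequence) if aa == "K"]


def toy_embeddings(sequence, dim=4):
    """Placeholder per-residue vectors standing in for pLM embeddings.

    A real pipeline would embed the full-length sequence with a pLM;
    these deterministic toy values only give the example something to slice.
    """
    return [[float((ord(aa) + j) % 7) for j in range(dim)] for aa in sequence]


def window_embedding(embeddings, center, flank):
    """Slice the rows within `flank` residues of `center`.

    Positions falling outside the sequence are filled with zero vectors
    (an assumed padding choice), so every window has 2 * flank + 1 rows.
    """
    dim = len(embeddings[0])
    window = []
    for pos in range(center - flank, center + flank + 1):
        if 0 <= pos < len(embeddings):
            window.append(list(embeddings[pos]))
        else:
            window.append([0.0] * dim)
    return window


if __name__ == "__main__":
    seq = "MKAKLGT"
    emb = toy_embeddings(seq)            # embed the full-length sequence once
    for site in lysine_sites(seq):       # then cut one window per lysine
        win = window_embedding(emb, site, flank=2)
        print(site, len(win), "rows x", len(win[0]), "dims")
```

Because the window is cut after the full sequence has been embedded, each row can still carry context from residues outside the window, which is the distinction the abstract draws between full-length and window-sequence inputs.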
Affiliation(s)
- Pawel Pratyush
  - Department of Computer Science, Michigan Technological University, Houghton, MI 49931, United States
- Soufia Bahmani
  - Department of Computer Science, Michigan Technological University, Houghton, MI 49931, United States
- Suresh Pokharel
  - Department of Computer Science, Michigan Technological University, Houghton, MI 49931, United States
- Hamid D Ismail
  - Department of Computer Science, Michigan Technological University, Houghton, MI 49931, United States
- Dukka B KC
  - Department of Computer Science, Michigan Technological University, Houghton, MI 49931, United States
3
Zhao W, Luo X, Tong F, Zheng X, Li J, Zhao G, Zhao D. Improving antibody optimization ability of generative adversarial network through large language model. Comput Struct Biotechnol J 2023; 21:5839-5850. [PMID: 38074472 PMCID: PMC10698008 DOI: 10.1016/j.csbj.2023.11.041] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 10/07/2023] [Revised: 11/21/2023] [Accepted: 11/21/2023] [Indexed: 10/16/2024] [Open Access]
Abstract
Generative adversarial networks (GANs) have successfully generated functional protein sequences. However, traditional GANs often suffer from inherent randomness, resulting in a lower probability of obtaining desirable sequences. Due to the high cost of wet-lab experiments, the main goal of computer-aided antibody optimization is to identify high-quality candidate antibodies from a large range of possibilities, yet improving the ability of GANs to generate these desired antibodies is a challenge. In this study, we propose and evaluate a new GAN called the Language Model Guided Antibody Generative Adversarial Network (AbGAN-LMG). This GAN uses a language model as an input, harnessing such models' powerful representational capabilities to improve the GAN's generation of high-quality antibodies. We conducted a comprehensive evaluation of the antibody libraries and sequences generated by AbGAN-LMG for COVID-19 (SARS-CoV-2) and Middle East Respiratory Syndrome (MERS-CoV). Results indicate that AbGAN-LMG has learned the fundamental characteristics of antibodies and that it improved the diversity of the generated libraries. Additionally, when generating sequences using AZD-8895 as the target antibody for optimization, over 50% of the generated sequences exhibited better developability than AZD-8895 itself. Through molecular docking, we identified 70 antibodies that demonstrated higher affinity for the wild-type receptor-binding domain (RBD) of SARS-CoV-2 compared to AZD-8895. In conclusion, AbGAN-LMG demonstrates that language models used in conjunction with GANs can enable the generation of higher-quality libraries and candidate sequences, thereby improving the efficiency of antibody optimization. AbGAN-LMG is available at http://39.102.71.224:88/.
Affiliation(s)
- Wenbin Zhao
  - Academy of Military Medical Sciences, Beijing 100850, China
- Xiaowei Luo
  - Academy of Military Medical Sciences, Beijing 100850, China
- Fan Tong
  - Academy of Military Medical Sciences, Beijing 100850, China
- Xiangwen Zheng
  - Academy of Military Medical Sciences, Beijing 100850, China
- Jing Li
  - Beijing Institute of Microbiology and Epidemiology, State Key Laboratory of Pathogen and Biosecurity, Beijing 100071, China
- Guangyu Zhao
  - Beijing Institute of Microbiology and Epidemiology, State Key Laboratory of Pathogen and Biosecurity, Beijing 100071, China
- Dongsheng Zhao
  - Academy of Military Medical Sciences, Beijing 100850, China
4
Gorostiola González M, van den Broek RL, Braun TGM, Chatzopoulou M, Jespers W, IJzerman AP, Heitman LH, van Westen GJP. 3DDPDs: describing protein dynamics for proteochemometric bioactivity prediction. A case for (mutant) G protein-coupled receptors. J Cheminform 2023; 15:74. [PMID: 37641107 PMCID: PMC10463931 DOI: 10.1186/s13321-023-00745-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/11/2023] [Accepted: 08/10/2023] [Indexed: 08/31/2023] [Open Access]
Abstract
Proteochemometric (PCM) modelling is a powerful computational drug discovery tool used in bioactivity prediction of potential drug candidates relying on both chemical and protein information. In PCM, features are computed to describe small molecules and proteins, and these features directly impact the quality of the predictive models. State-of-the-art protein descriptors, however, are calculated from the protein sequence and neglect the dynamic nature of proteins. This dynamic nature can be computationally simulated with molecular dynamics (MD). Here, novel 3D dynamic protein descriptors (3DDPDs) were designed to be applied in bioactivity prediction tasks with PCM models. As a test case, publicly available G protein-coupled receptor (GPCR) MD data from GPCRmd was used. GPCRs are membrane-bound proteins, which are activated by hormones and neurotransmitters, and constitute an important target family for drug discovery. GPCRs exist in different conformational states that allow the transmission of diverse signals and that can be modified by ligand interactions, among other factors. To translate the MD-encoded protein dynamics, two types of 3DDPDs were considered: one-hot encoded residue-specific (rs) and embedding-like protein-specific (ps) 3DDPDs. The descriptors were developed by calculating distributions of trajectory coordinates and partial charges, applying dimensionality reduction, and subsequently condensing them into vectors per residue or protein, respectively. 3DDPDs were benchmarked on several PCM tasks against state-of-the-art non-dynamic protein descriptors. Our rs- and ps3DDPDs outperformed non-dynamic descriptors in regression tasks using a temporal split and showed comparable performance with a random split and in all classification tasks. Combinations of non-dynamic descriptors with 3DDPDs did not result in increased performance. Finally, the power of 3DDPDs to capture dynamic fluctuations in mutant GPCRs was explored.
The results presented here show the potential of including protein dynamic information on machine learning tasks, specifically bioactivity prediction, and open opportunities for applications in drug discovery, including oncology.
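To make the step of condensing trajectory information into per-residue and per-protein vectors more concrete, here is a deliberately simplified Python sketch. It is not the 3DDPD implementation: it summarizes each residue's coordinate distribution with only a mean and a standard deviation per axis, omits partial charges and the dimensionality-reduction step entirely, and uses plain averaging for the protein-level vector. All function names and the toy trajectory are assumptions.

```python
# Illustrative sketch only; not the 3DDPD implementation.
from statistics import mean, pstdev


def residue_descriptor(frames):
    """Summarize one residue's positions across trajectory frames.

    `frames` is a list of (x, y, z) tuples, one per frame. The result is
    [mean_x, std_x, mean_y, std_y, mean_z, std_z], a crude stand-in for
    a distribution of trajectory coordinates.
    """
    descriptor = []
    for axis in range(3):
        coords = [frame[axis] for frame in frames]
        descriptor.extend([mean(coords), pstdev(coords)])
    return descriptor


def protein_descriptor(residue_descriptors):
    """Pool residue-level vectors into one protein-level vector by averaging."""
    n = len(residue_descriptors)
    dim = len(residue_descriptors[0])
    return [sum(d[j] for d in residue_descriptors) / n for j in range(dim)]


if __name__ == "__main__":
    # Toy trajectory: two residues, three frames each (hypothetical values).
    trajectory = {
        "res1": [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)],
        "res2": [(5.0, 1.0, 0.0), (5.0, 3.0, 0.0), (5.0, 2.0, 0.0)],
    }
    per_residue = [residue_descriptor(f) for f in trajectory.values()]
    print("residue-level:", per_residue)
    print("protein-level:", protein_descriptor(per_residue))
```

The residue-level output keeps one vector per position, whereas the protein-level output collapses them into a single vector, loosely mirroring the residue-specific versus protein-specific distinction in the abstract.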
Affiliation(s)
- Marina Gorostiola González
  - Division of Drug Discovery and Safety, Leiden Academic Centre for Drug Research, Leiden University, Leiden, The Netherlands
  - ONCODE Institute, Leiden, The Netherlands
- Remco L van den Broek
  - Division of Drug Discovery and Safety, Leiden Academic Centre for Drug Research, Leiden University, Leiden, The Netherlands
- Thomas G M Braun
  - Division of Drug Discovery and Safety, Leiden Academic Centre for Drug Research, Leiden University, Leiden, The Netherlands
- Magdalini Chatzopoulou
  - Division of Drug Discovery and Safety, Leiden Academic Centre for Drug Research, Leiden University, Leiden, The Netherlands
- Willem Jespers
  - Division of Drug Discovery and Safety, Leiden Academic Centre for Drug Research, Leiden University, Leiden, The Netherlands
- Adriaan P IJzerman
  - Division of Drug Discovery and Safety, Leiden Academic Centre for Drug Research, Leiden University, Leiden, The Netherlands
- Laura H Heitman
  - Division of Drug Discovery and Safety, Leiden Academic Centre for Drug Research, Leiden University, Leiden, The Netherlands
  - ONCODE Institute, Leiden, The Netherlands
- Gerard J P van Westen
  - Division of Drug Discovery and Safety, Leiden Academic Centre for Drug Research, Leiden University, Leiden, The Netherlands
5
Preeti P, Nath SK, Arambam N, Sharma T, Choudhury PR, Choudhury A, Khanna V, Strych U, Hotez PJ, Bottazzi ME, Rawal K. Vaxi-DL: An Artificial Intelligence-Enabled Platform for Vaccine Development. Methods Mol Biol 2023; 2673:305-316. [PMID: 37258923 DOI: 10.1007/978-1-0716-3239-0_21] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/02/2023]
Abstract
Vaccine development is a complex and long process. It involves several steps, including computational studies, experimental analyses, animal model system studies, and clinical trials. This process can be accelerated by using in silico antigen screening to identify potential vaccine candidates. In this chapter, we describe a deep learning-based technique that uses 18 biological and 9154 physicochemical properties of proteins to find potential vaccine candidates. Using this technique, a new web-based system, named Vaxi-DL, was developed, which helped in finding new vaccine candidates from bacteria, protozoa, viruses, and fungi. Vaxi-DL is available at https://vac.kamalrawal.in/vaxidl/.
Affiliation(s)
- P Preeti
  - Centre for Computational Biology and Bioinformatics, AIB, Amity University, Noida, Uttar Pradesh, India
- Swarsat Kaushik Nath
  - Centre for Computational Biology and Bioinformatics, AIB, Amity University, Noida, Uttar Pradesh, India
- Nevidita Arambam
  - Centre for Computational Biology and Bioinformatics, AIB, Amity University, Noida, Uttar Pradesh, India
- Trapti Sharma
  - Centre for Computational Biology and Bioinformatics, AIB, Amity University, Noida, Uttar Pradesh, India
- Priyanka Ray Choudhury
  - Centre for Computational Biology and Bioinformatics, AIB, Amity University, Noida, Uttar Pradesh, India
- Alakto Choudhury
  - Centre for Computational Biology and Bioinformatics, AIB, Amity University, Noida, Uttar Pradesh, India
- Vrinda Khanna
  - Centre for Computational Biology and Bioinformatics, AIB, Amity University, Noida, Uttar Pradesh, India
- Ulrich Strych
  - Department of Pediatrics, Division of Tropical Medicine, Baylor College of Medicine, Houston, TX, USA
  - Texas Children's Hospital Center for Vaccine Development, Houston, TX, USA
- Peter J Hotez
  - Department of Pediatrics, Division of Tropical Medicine, Baylor College of Medicine, Houston, TX, USA
  - Texas Children's Hospital Center for Vaccine Development, Houston, TX, USA
  - Department of Molecular Virology and Microbiology, Baylor College of Medicine, Houston, TX, USA
  - Department of Biology, Baylor University, Waco, TX, USA
- Maria Elena Bottazzi
  - Department of Pediatrics, Division of Tropical Medicine, Baylor College of Medicine, Houston, TX, USA
  - Texas Children's Hospital Center for Vaccine Development, Houston, TX, USA
  - Department of Molecular Virology and Microbiology, Baylor College of Medicine, Houston, TX, USA
  - Department of Biology, Baylor University, Waco, TX, USA
- Kamal Rawal
  - Centre for Computational Biology and Bioinformatics, AIB, Amity University, Noida, Uttar Pradesh, India