1
|
Basnet BB, Zhou ZY, Wei B, Wang H. Advances in AI-based strategies and tools to facilitate natural product and drug development. Crit Rev Biotechnol 2025:1-32. [PMID: 40159111 DOI: 10.1080/07388551.2025.2478094] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/20/2024] [Revised: 02/11/2025] [Accepted: 02/16/2025] [Indexed: 04/02/2025]
Abstract
Natural products and their derivatives have been important for treating diseases in humans, animals, and plants. However, discovering new structures from natural sources is still challenging. In recent years, artificial intelligence (AI) has greatly aided the discovery and development of natural products and drugs. AI helps to connect genetic data to chemical structures (or vice versa), repurpose known natural products, predict metabolic pathways, and design and optimize metabolite biosynthesis. More recently, the emergence and improvement of neural network methods such as deep learning, together with ensemble, automated, web-based bioinformatics platforms, have sped up the discovery process. Meanwhile, AI also improves the identification and structure elucidation of unknown compounds from raw data such as mass spectrometry and nuclear magnetic resonance spectra. This article reviews these AI-driven methods and tools, highlighting their practical applications and offering guidance for efficient natural product discovery and drug development.
Affiliation(s)
- Buddha Bahadur Basnet
  - College of Pharmaceutical Sciences, Zhejiang University of Technology, Hangzhou, China
  - Central Department of Biotechnology, Tribhuvan University, Kathmandu, Nepal
- Zhen-Yi Zhou
  - College of Pharmaceutical Sciences, Zhejiang University of Technology, Hangzhou, China
- Bin Wei
  - College of Pharmaceutical Sciences, Zhejiang University of Technology, Hangzhou, China
- Hong Wang
  - College of Pharmaceutical Sciences, Zhejiang University of Technology, Hangzhou, China
  - Key Laboratory of Marine Fishery Resources Exploitment, Utilization of Zhejiang Province, Zhejiang University of Technology, Hangzhou, China
|
2
|
Zolkiewicz K, Oklestkova J, Chmielewska B, Gruszka D. Mutations of the brassinosteroid biosynthesis gene HvDWARF5 enable balance between semi-dwarfism and maintenance of grain size in barley. Physiologia Plantarum 2025; 177:e70179. [PMID: 40129050 PMCID: PMC11933512 DOI: 10.1111/ppl.70179] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/21/2024] [Revised: 02/18/2025] [Accepted: 02/28/2025] [Indexed: 03/26/2025]
Abstract
Brassinosteroids (BRs) are phytohormones which regulate various developmental processes in plants. They are exceptional phytohormones, as they do not undergo long-distance transport between plant organs. However, knowledge about the function of the enzymes that catalyse BR biosynthesis (particularly its early stages) in cereal crops remains limited. Therefore, this study identifies and analyses the function of the HvDWARF5 (HvDWF5) gene, involved in the early stage of BR biosynthesis in barley (Hordeum vulgare), an important cereal crop, using the TILLING (Targeting Induced Local Lesions IN Genomes) approach. The detailed functional analysis allowed for the identification of various mutations in different gene fragments. The influence of these mutations on plant architecture, reproduction, and yield was characterised. Moreover, effects of the missense and intron retention mutations on sequence and splicing of the HvDWF5 transcript, sequence and predicted structure of the encoded HvDWF5 enzyme, and accumulation of endogenous BR were determined. Some of the barley mutants identified in this study showed semi-dwarfism, a trait of particular importance for cereal breeding and yield. However, unlike other BR mutants in cereals, this did not negatively affect grain size or weight. It indicated that mutations in this gene allow for a balance between plant height reduction and maintenance of grain size. Thus, the results of this study provide a novel insight into the role of the HvDWF5 gene in the BR biosynthesis-dependent regulation of architecture and reproduction of the important cereal crop - barley.
Affiliation(s)
- Karolina Zolkiewicz
  - Institute of Biology, Biotechnology and Environmental Protection, Faculty of Natural Sciences, University of Silesia, Katowice, Poland
- Jana Oklestkova
  - Laboratory of Growth Regulators, Faculty of Science, Centre of the Region Haná for Biotechnological and Agricultural Research, Institute of Experimental Botany, Czech Academy of Sciences, Palacký University, Olomouc, Czechia
- Beata Chmielewska
  - Institute of Biology, Biotechnology and Environmental Protection, Faculty of Natural Sciences, University of Silesia, Katowice, Poland
- Damian Gruszka
  - Institute of Biology, Biotechnology and Environmental Protection, Faculty of Natural Sciences, University of Silesia, Katowice, Poland
|
3
|
Kister AE. Beta Sandwich-Like Folds: Sequences, Contacts, Classification of Invariant Substructures and Beta Sandwich Protein Grammar. Methods Mol Biol 2025; 2870:51-62. [PMID: 39543030 DOI: 10.1007/978-1-0716-4213-9_4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/17/2024]
Abstract
This chapter addresses the following fundamental question: Do sequences of protein domains with sandwich architecture have common sequence characteristics even though they belong to different superfamilies and folds? The analysis was carried out in two stages: (1) determination of domain substructures shared by all sandwich proteins and (2) detection of common sequence characteristics within the substructures. Analysis of supersecondary structures in domains of proteins revealed two types of four-strand substructures that are common to sandwich proteins. At least one of these common substructures was found in proteins of 42 sandwich-like folds (per structural classification in the CATH database). A comparison of sequence fragments and residue-residue contacts constituting common substructures revealed specific distributions of hydrophobic residues in these chains. The shared sequences and structural characteristics can be conceptualized as the "grammatical rules of beta protein linguistics." Understanding the structural and sequence commonalities of sandwich proteins may prove useful for rational protein design.
|
4
|
Li W, Almirantis Y, Provata A. Range-limited Heaps' law for functional DNA words in the human genome. J Theor Biol 2024; 592:111878. [PMID: 38901778 DOI: 10.1016/j.jtbi.2024.111878] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/12/2023] [Revised: 05/31/2024] [Accepted: 06/10/2024] [Indexed: 06/22/2024]
Abstract
Heaps' (or Herdan-Heaps') law is a linguistic law stating that the vocabulary/dictionary size (types) is a power-law function of the word count (tokens). Whether it holds in genomes under a given definition of DNA words is unclear, partly because the dictionary size in a genome could be much smaller than that in a human language. We define a DNA word as a coding region in a genome that codes for a protein domain. Using human chromosomes and chromosome arms as individual samples, we establish the existence of Heaps' law in the human genome within a limited range. Our definition of words in a genomic or proteomic context differs from other definitions, such as over-represented k-mers, which are much shorter in length. Although an approximate power-law distribution of protein domain sizes due to gene duplication, and the related Zipf's law, are well known, their translation to Heaps' law in DNA words is not automatic. Several other animal genomes are shown herein to also exhibit a range-limited Heaps' law under our definition of DNA words, though with various exponents. When tokens were randomly sampled and sample sizes reached the maximum level, a deviation from Heaps' law was observed, but a quadratic regression in the log-log type-token plot fits the data perfectly. Investigation of the type-token plot and its regression coefficients could provide an alternative narrative of the reuse and redundancy of protein domains, as well as the creation of new protein domains, from a linguistic perspective.
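As a rough illustration of the type-token relationship described in this abstract: in conventional notation (not taken from the abstract), Heaps' law says that the number of distinct types V grows with the number of tokens N as V ≈ K·N^β. The Python sketch below builds a type-token curve from a list of word labels and fits K and β by ordinary least squares in log-log space. The function names and the toy domain labels are hypothetical, and this is not the authors' procedure; it covers only the linear log-log fit, not the quadratic regression the abstract mentions.

```python
import math


def type_token_curve(tokens):
    """Return a list of (tokens_read, distinct_types_seen) pairs."""
    seen = set()
    curve = []
    for count, token in enumerate(tokens, start=1):
        seen.add(token)
        curve.append((count, len(seen)))
    return curve


def fit_heaps(curve):
    """Fit log V = log K + beta * log N by least squares; return (K, beta)."""
    if len(curve) < 2:
        raise ValueError("need at least two points to fit a line")
    xs = [math.log(n) for n, _ in curve]
    ys = [math.log(v) for _, v in curve]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    beta = sxy / sxx
    K = math.exp(mean_y - beta * mean_x)
    return K, beta


# Toy input: each label stands in for one protein-domain "word" (hypothetical data).
toy_words = ["kinase", "SH2", "kinase", "SH3", "zinc_finger", "SH2", "kinase", "PDZ"]
curve = type_token_curve(toy_words)
K, beta = fit_heaps(curve)
print(curve[-1], round(K, 3), round(beta, 3))
```

With real data, each chromosome or chromosome arm would contribute one (tokens, types) point rather than a running curve over a single list; the fitting step would be the same.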
Affiliation(s)
- Wentian Li
  - Department of Applied Mathematics and Statistics, Stony Brook University, Stony Brook, NY, USA; The Robert S. Boas Center for Genomics and Human Genetics, The Feinstein Institutes for Medical Research, Northwell Health, Manhasset, NY, USA.
- Yannis Almirantis
  - Theoretical Biology and Computational Genomics Laboratory, Institute of Bioscience and Applications, National Center for Scientific Research "Demokritos", 15341 Athens, Greece
- Astero Provata
  - Statistical Mechanics and Dynamical Systems Laboratory, Institute of Nanoscience and Nanotechnology, National Center for Scientific Research "Demokritos", 15341 Athens, Greece
|
5
|
Banerjee P, Eulenstein O, Friedberg I. Discovering genomic islands in unannotated bacterial genomes using sequence embedding. Bioinformatics Advances 2024; 4:vbae089. [PMID: 38911822 PMCID: PMC11193100 DOI: 10.1093/bioadv/vbae089] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/19/2024] [Revised: 05/26/2024] [Accepted: 06/11/2024] [Indexed: 06/25/2024]
Abstract
Motivation: Genomic islands (GEIs) are clusters of genes in bacterial genomes that are typically acquired by horizontal gene transfer. GEIs play a crucial role in the evolution of bacteria by rapidly introducing genetic diversity and thus helping them adapt to changing environments. Specifically of interest to human health, many GEIs contain pathogenicity and antimicrobial resistance genes. Detecting GEIs is, therefore, an important problem in biomedical and environmental research. There have been many previous studies for computationally identifying GEIs. Still, most of these studies rely on detecting anomalies in the unannotated nucleotide sequences or on a fixed set of known features on annotated nucleotide sequences. Results: Here, we present TreasureIsland, which uses a new unsupervised representation of DNA sequences to predict GEIs. We developed a high-precision boundary detection method featuring an incremental fine-tuning of GEI borders, and we evaluated the accuracy of this framework using a new comprehensive reference dataset, Benbow. We show that TreasureIsland's accuracy rivals other GEI predictors, enabling efficient and faster identification of GEIs in unannotated bacterial genomes. Availability and implementation: TreasureIsland is available under an MIT license at: https://github.com/FriedbergLab/GenomicIslandPrediction.
Affiliation(s)
- Priyanka Banerjee
  - Department of Computer Science, Iowa State University, Ames, IA 50011, United States
- Oliver Eulenstein
  - Department of Computer Science, Iowa State University, Ames, IA 50011, United States
- Iddo Friedberg
  - Department of Veterinary Microbiology and Preventive Medicine, Iowa State University, Ames, IA 50011, United States
|
6
|
Rajasekaran N, Kaiser CM. Navigating the complexities of multi-domain protein folding. Curr Opin Struct Biol 2024; 86:102790. [PMID: 38432063 DOI: 10.1016/j.sbi.2024.102790] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/27/2023] [Revised: 02/11/2024] [Accepted: 02/12/2024] [Indexed: 03/05/2024]
Abstract
Proteome complexity has expanded tremendously over evolutionary time, enabling biological diversification. Much of this complexity is achieved by combining a limited set of structural units into long polypeptides. This widely used evolutionary strategy poses challenges for folding of the resulting multi-domain proteins. As a consequence, their folding differs from that of small single-domain proteins, which generally fold quickly and reversibly. Co-translational processes and chaperone interactions are important aspects of multi-domain protein folding. In this review, we discuss some of the recent experimental progress toward understanding these processes.
Affiliation(s)
- Christian M Kaiser
  - Department of Biology, Johns Hopkins University, Baltimore, MD, United States; Bijvoet Center for Biomolecular Research, Utrecht University, Utrecht, Netherlands.
|
7
|
Gaschignard G, Millet M, Bruley A, Benzerara K, Dezi M, Skouri-Panet F, Duprat E, Callebaut I. AlphaFold2-guided description of CoBaHMA, a novel family of bacterial domains within the heavy-metal-associated superfamily. Proteins 2024; 92:776-794. [PMID: 38258321 DOI: 10.1002/prot.26668] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/28/2023] [Revised: 12/22/2023] [Accepted: 01/01/2024] [Indexed: 01/24/2024]
Abstract
Three-dimensional (3D) structure information, now available at the proteome scale, may facilitate the detection of remote evolutionary relationships in protein superfamilies. Here, we illustrate this with the identification of a novel family of protein domains related to the ferredoxin-like superfold, by combining (i) transitive sequence similarity searches, (ii) clustering approaches, and (iii) the use of AlphaFold2 3D structure models. Domains of this family were initially identified in relation with the intracellular biomineralization of calcium carbonates by Cyanobacteria. They are part of the large heavy-metal-associated (HMA) superfamily, departing from the latter by specific sequence and structural features. In particular, most of them share conserved basic amino acids (hence their name CoBaHMA for Conserved Basic residues HMA), forming a positively charged surface, which is likely to interact with anionic partners. CoBaHMA domains are found in diverse modular organizations in bacteria, existing in the form of monodomain proteins or as part of larger proteins, some of which are membrane proteins involved in transport or lipid metabolism. This suggests that the CoBaHMA domains may exert a regulatory function, involving interactions with anionic lipids. This hypothesis might have a particular resonance in the context of the compartmentalization observed for cyanobacterial intracellular calcium carbonates.
Affiliation(s)
- Geoffroy Gaschignard
  - Sorbonne Université, Muséum National d'Histoire Naturelle, UMR CNRS 7590, Institut de Minéralogie, de Physique des Matériaux et de Cosmochimie, IMPMC, Paris, France
- Maxime Millet
  - Sorbonne Université, Muséum National d'Histoire Naturelle, UMR CNRS 7590, Institut de Minéralogie, de Physique des Matériaux et de Cosmochimie, IMPMC, Paris, France
- Apolline Bruley
  - Sorbonne Université, Muséum National d'Histoire Naturelle, UMR CNRS 7590, Institut de Minéralogie, de Physique des Matériaux et de Cosmochimie, IMPMC, Paris, France
- Karim Benzerara
  - Sorbonne Université, Muséum National d'Histoire Naturelle, UMR CNRS 7590, Institut de Minéralogie, de Physique des Matériaux et de Cosmochimie, IMPMC, Paris, France
- Manuela Dezi
  - Sorbonne Université, Muséum National d'Histoire Naturelle, UMR CNRS 7590, Institut de Minéralogie, de Physique des Matériaux et de Cosmochimie, IMPMC, Paris, France
- Feriel Skouri-Panet
  - Sorbonne Université, Muséum National d'Histoire Naturelle, UMR CNRS 7590, Institut de Minéralogie, de Physique des Matériaux et de Cosmochimie, IMPMC, Paris, France
- Elodie Duprat
  - Sorbonne Université, Muséum National d'Histoire Naturelle, UMR CNRS 7590, Institut de Minéralogie, de Physique des Matériaux et de Cosmochimie, IMPMC, Paris, France
- Isabelle Callebaut
  - Sorbonne Université, Muséum National d'Histoire Naturelle, UMR CNRS 7590, Institut de Minéralogie, de Physique des Matériaux et de Cosmochimie, IMPMC, Paris, France
|
8
|
Wu Z, Wang C, Li C, Xu N, Cao X, Chen S, Shi Y, He Y, Zhang P, Ji J. Integrated Computational Pipeline for the High-Throughput Discovery of Cell Adhesion Peptides. J Phys Chem Lett 2024; 15:3748-3756. [PMID: 38551401 DOI: 10.1021/acs.jpclett.4c00393] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 04/12/2024]
Abstract
Cell adhesion peptides (CAPs) often play a critical role in tissue engineering research. However, the discovery of novel CAPs for diverse applications remains a challenging and time-intensive process. This study presents an efficient computational pipeline integrating sequence embeddings, binding predictors, and molecular dynamics simulations to expedite the discovery of new CAPs. A Pro2vec model, trained on vast CAP data sets, was built to identify RGD-similar tripeptide candidates. These candidates were further evaluated for their binding affinity with integrin receptors using the Mutabind2 machine learning model. Additionally, molecular dynamics simulations were applied to model receptor-peptide interactions and calculate their binding free energies, providing a quantitative assessment of the binding strength for further screening. The resulting peptide demonstrated performance comparable to that of RGD in endothelial cell adhesion and spreading experimental assays, validating the efficacy of the integrated computational pipeline.
Affiliation(s)
- Zhiyu Wu
  - College of Chemical and Biological Engineering, Zhejiang University, Hangzhou 310058, China
  - Institute of Zhejiang University-Quzhou, Quzhou 324000, China
- Cong Wang
  - MOE Key Laboratory of Macromolecular Synthesis and Functionalization, Department of Polymer Science and Engineering, Zhejiang University, Hangzhou 310058, China
- Chen Li
  - College of Chemical and Biological Engineering, Zhejiang University, Hangzhou 310058, China
  - Institute of Zhejiang University-Quzhou, Quzhou 324000, China
- Nan Xu
  - College of Chemical and Biological Engineering, Zhejiang University, Hangzhou 310058, China
  - Institute of Zhejiang University-Quzhou, Quzhou 324000, China
- Xiaoyong Cao
  - College of Chemical and Biological Engineering, Zhejiang University, Hangzhou 310058, China
  - Institute of Zhejiang University-Quzhou, Quzhou 324000, China
- Shengfu Chen
  - College of Chemical and Biological Engineering, Zhejiang University, Hangzhou 310058, China
- Yao Shi
  - College of Chemical and Biological Engineering, Zhejiang University, Hangzhou 310058, China
  - Key Laboratory of Biomass Chemical Engineering of Ministry of Education, Zhejiang University, Hangzhou 310058, China
- Yi He
  - College of Chemical and Biological Engineering, Zhejiang University, Hangzhou 310058, China
  - Institute of Zhejiang University-Quzhou, Quzhou 324000, China
  - Department of Chemical Engineering, University of Washington, Seattle, Washington 98195, United States
- Peng Zhang
  - MOE Key Laboratory of Macromolecular Synthesis and Functionalization, Department of Polymer Science and Engineering, Zhejiang University, Hangzhou 310058, China
  - State Key Laboratory of Transvascular Implantation Devices, Qidi Road 456, Hangzhou 310058, China
- Jian Ji
  - MOE Key Laboratory of Macromolecular Synthesis and Functionalization, Department of Polymer Science and Engineering, Zhejiang University, Hangzhou 310058, China
  - State Key Laboratory of Transvascular Implantation Devices, Qidi Road 456, Hangzhou 310058, China
|
9
|
Ibtehaz N, Sourav SMSH, Bayzid MS, Rahman MS. Align-gram: Rethinking the Skip-gram Model for Protein Sequence Analysis. Protein J 2023; 42:135-146. [PMID: 36977849 DOI: 10.1007/s10930-023-10096-7] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Accepted: 02/13/2023] [Indexed: 03/29/2023]
Abstract
The inception of next-generation sequencing technologies has exponentially increased the volume of biological sequence data. Protein sequences, often described as the 'language of life', have been analyzed for a multitude of applications and inferences. Owing to the rapid development of deep learning, there have been a number of breakthroughs in Natural Language Processing in recent years. Since these methods can perform different tasks when trained with a sufficient amount of data, off-the-shelf models are used in various biological applications. In this study, we investigated the applicability of the popular Skip-gram model to protein sequence analysis and attempted to incorporate some biological insights into it. We propose a novel k-mer embedding scheme, Align-gram, which maps similar k-mers close to each other in a vector space. Furthermore, we experiment with other sequence-based protein representations and observe that the embeddings derived from Align-gram are better at aiding the modeling and training of deep learning models. Our experiments with a simple baseline LSTM model and the much more complex CNN model of DeepGoPlus show the potential of Align-gram in different types of deep learning applications for protein sequence analysis.
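For orientation, the Skip-gram model named in this abstract is trained on (center, context) pairs taken from a sliding window over a sequence of words. The Python sketch below shows how such pairs could be generated from overlapping protein k-mers. It illustrates plain Skip-gram pairing only, not the Align-gram modification; the function names, k = 3, the window size, and the toy sequence are all illustrative assumptions.

```python
def kmers(sequence, k=3):
    """Split a protein sequence into overlapping k-mers (stride 1)."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def skipgram_pairs(words, window=2):
    """Return (center, context) pairs for every word within `window` positions."""
    pairs = []
    for i, center in enumerate(words):
        start = max(0, i - window)
        stop = min(len(words), i + window + 1)
        for j in range(start, stop):
            if j != i:
                pairs.append((center, words[j]))
    return pairs


# Toy protein fragment (hypothetical); each 3-mer plays the role of a word.
words = kmers("MKVLAAG", k=3)
for center, context in skipgram_pairs(words, window=1):
    print(center, "->", context)
```

A Skip-gram model would then learn k-mer vectors by predicting each context k-mer from its center k-mer.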
|
10
|
MFIDMA: A Multiple Information Integration Model for the Prediction of Drug-miRNA Associations. Biology 2022; 12:biology12010041. [PMID: 36671734 PMCID: PMC9855084 DOI: 10.3390/biology12010041] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 11/20/2022] [Revised: 12/19/2022] [Accepted: 12/22/2022] [Indexed: 12/28/2022]
Abstract
Abnormal microRNA (miRNA) functions play significant roles in various pathological processes. Thus, predicting drug-miRNA associations (DMA) may hold great promise for identifying the potential targets of drugs. However, discovering the associations between drugs and miRNAs through wet experiments is time-consuming and laborious. Therefore, it is significant to develop computational prediction methods to improve the efficiency of identifying DMA on a large scale. In this paper, a multiple features integration model (MFIDMA) is proposed to predict drug-miRNA association. Specifically, we first formulated known DMA as a bipartite graph and utilized structural deep network embedding (SDNE) to learn the topological features from the graph. Second, the Word2vec algorithm was utilized to construct the attribute features of the miRNAs and drugs. Third, two kinds of features were entered into the convolution neural network (CNN) and deep neural network (DNN) to integrate features and predict potential target miRNAs for the drugs. To evaluate the MFIDMA model, it was implemented on three different datasets under a five-fold cross-validation and achieved average AUCs of 0.9407, 0.9444 and 0.8919. In addition, the MFIDMA model showed reliable results in the case studies of Verapamil and hsa-let-7c-5p, confirming that the proposed model can also predict DMA in real-world situations. The model was effective in analyzing the neighbors and topological features of the drug-miRNA network by SDNE. The experimental results indicated that the MFIDMA is an accurate and robust model for predicting potential DMA, which is significant for miRNA therapeutics research and drug discovery.
|
11
|
Li H, Liu X, Jia D, Chen Y, Hou P, Li H. Research on chest radiography recognition model based on deep learning. Mathematical Biosciences and Engineering : MBE 2022; 19:11768-11781. [PMID: 36124613 DOI: 10.3934/mbe.2022548] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/15/2023]
Abstract
With the development of medical informatization and against the background of the spread of a global epidemic, the demand for automated chest X-ray detection by medical personnel and patients continues to increase. Although the rapid development of deep learning technology has made it possible to automatically generate a single conclusive sentence, the results produced by existing methods are not reliable enough due to the complexity of medical images. To solve this problem, this paper proposes an improved RCLN (Recurrent Learning Network) model. The model can generate high-level conclusive impressions and detailed descriptive findings sentence by sentence, and it imitates the doctor's standard tone by combining a convolutional neural network (CNN) with a long short-term memory (LSTM) network through a recurrent structure and adding a multi-head attention mechanism. The proposed algorithm has been experimentally verified on publicly available chest X-ray images from the Open-i image set. The results show that it can effectively solve the problem of automatic generation of colloquial medical reports.
Affiliation(s)
- Hui Li
  - School of Computer Engineering, Jiangsu Ocean University, China
- Xintang Liu
  - School of Computer Engineering, Jiangsu Ocean University, China
- Dongbao Jia
  - School of Computer Engineering, Jiangsu Ocean University, China
- Yanyan Chen
  - School of Computer Engineering, Jiangsu Ocean University, China
- Pengfei Hou
  - School of Computer Engineering, Jiangsu Ocean University, China
- Haining Li
  - Department of Neurology, General Hospital of Ningxia Medical University, China
|
12
|
Chu HY, Wong ASL. Facilitating Machine Learning-Guided Protein Engineering with Smart Library Design and Massively Parallel Assays. Advanced Genetics (Hoboken, N.J.) 2021; 2:2100038. [PMID: 36619853 PMCID: PMC9744531 DOI: 10.1002/ggn2.202100038] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 09/07/2021] [Revised: 11/08/2021] [Indexed: 01/11/2023]
Abstract
Protein design plays an important role in recent medical advances from antibody therapy to vaccine design. Typically, exhaustive mutational screens or directed evolution experiments are used to identify the best design or to improve on the wild-type variant. Even with high-throughput screening of pooled libraries and Next-Generation Sequencing to boost the scale of read-outs, surveying all variants with combinatorial mutations for their empirical fitness scores is still orders of magnitude beyond the capacity of existing experimental settings. To tackle this challenge, in-silico approaches using machine learning to predict the fitness of novel variants based on a subset of empirical measurements are now employed. These machine learning models turn out to be useful in many cases, provided that the experimentally determined fitness scores and the amino-acid descriptors of the models are informative. The machine learning models can guide the search for the highest-fitness variants, resolve complex epistatic relationships, and highlight bio-physical rules for protein folding. Using machine learning-guided approaches, researchers can build more focused libraries, thus relieving themselves of labor-intensive screens and fast-tracking the optimization process. Here, we describe the current advances in massive-scale variant screens, and how machine learning and mutagenesis strategies can be integrated to accelerate protein engineering. More specifically, we examine strategies to make screens more economical, informative, and effective in the discovery of useful variants.
Affiliation(s)
- Hoi Yee Chu
  - Laboratory of Combinatorial Genetics and Synthetic Biology, School of Biomedical Sciences, The University of Hong Kong, Hong Kong 852, China
- Alan S. L. Wong
  - Laboratory of Combinatorial Genetics and Synthetic Biology, School of Biomedical Sciences, The University of Hong Kong, Hong Kong 852, China
  - Electrical and Electronic Engineering, The University of Hong Kong, Pokfulam, Hong Kong 852, China
|
13
|
Monzon V, Lafita A, Bateman A. Discovery of fibrillar adhesins across bacterial species. BMC Genomics 2021; 22:550. [PMID: 34275445 PMCID: PMC8286594 DOI: 10.1186/s12864-021-07586-2] [Citation(s) in RCA: 12] [Impact Index Per Article: 3.0] [Received: 09/21/2020] [Accepted: 04/07/2021] [Indexed: 11/16/2022]
Abstract
BACKGROUND: Fibrillar adhesins are long multidomain proteins that form filamentous structures at the cell surface of bacteria. They are an important yet understudied class of proteins composed of adhesive and stalk domains that mediate interactions of bacteria with their environment. This study aims to characterize fibrillar adhesins in a wide range of bacterial phyla and to identify new fibrillar adhesin-like proteins to improve our understanding of host-bacteria interactions. RESULTS: Through careful literature and computational searches, we identified 82 stalk and 27 adhesive domain families in fibrillar adhesins. Based on the presence of these domains in the UniProt Reference Proteomes database, we identified and analysed 3,542 fibrillar adhesin-like proteins across species of the most common bacterial phyla. We further enumerate the adhesive and stalk domain combinations found in nature and demonstrate that fibrillar adhesins have complex and variable domain architectures, which differ across species. By analysing the domain architecture of fibrillar adhesins, we show that in Gram positive bacteria, adhesive domains are mostly positioned at the N-terminus and cell surface anchors at the C-terminus of the protein, while their positions are more variable in Gram negative bacteria. We provide an open repository of fibrillar adhesin-like proteins and domains to enable further studies of this class of bacterial surface proteins. CONCLUSION: This study provides a domain-based characterization of fibrillar adhesins and demonstrates that they are widely found in species across the main bacterial phyla. We have discovered numerous novel fibrillar adhesins and improved our understanding of pathogenic adhesion and invasion mechanisms.
Affiliation(s)
- Vivian Monzon
  - European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, CB10 1SD, UK.
- Aleix Lafita
  - European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, CB10 1SD, UK
- Alex Bateman
  - European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, CB10 1SD, UK
|
14
|
Iuchi H, Matsutani T, Yamada K, Iwano N, Sumi S, Hosoda S, Zhao S, Fukunaga T, Hamada M. Representation learning applications in biological sequence analysis. Comput Struct Biotechnol J 2021; 19:3198-3208. [PMID: 34141139 PMCID: PMC8190442 DOI: 10.1016/j.csbj.2021.05.039] [Citation(s) in RCA: 35] [Impact Index Per Article: 8.8] [Received: 02/26/2021] [Revised: 05/10/2021] [Accepted: 05/20/2021] [Indexed: 12/16/2022]
Abstract
Although remarkable advances have been reported in high-throughput sequencing, the ability to aptly analyze a substantial amount of rapidly generated biological (DNA/RNA/protein) sequencing data remains a critical hurdle. To tackle this issue, the application of natural language processing (NLP) to biological sequence analysis has received increased attention. In this method, biological sequences are regarded as sentences while the single nucleic acids/amino acids or k-mers in these sequences represent the words. Embedding is an essential step in NLP, which performs the conversion of these words into vectors. Specifically, representation learning is an approach used for this transformation process, which can be applied to biological sequences. Vectorized biological sequences can then be applied for function and structure estimation, or as input for other probabilistic models. Considering the importance and growing trend for the application of representation learning to biological research, in the present study, we have reviewed the existing knowledge in representation learning for biological sequence analysis.
Affiliation(s)
- Hitoshi Iuchi
- Waseda Research Institute for Science and Engineering, Waseda University, Tokyo 169-8555, Japan
- Computational Bio Big-Data Open Innovation Laboratory (CBBD-OIL), National Institute of Advanced Industrial Science and Technology (AIST), Tokyo 169-8555, Japan
| | - Taro Matsutani
- Computational Bio Big-Data Open Innovation Laboratory (CBBD-OIL), National Institute of Advanced Industrial Science and Technology (AIST), Tokyo 169-8555, Japan
- Graduate School of Advanced Science and Engineering, Waseda University, Tokyo 169-8555, Japan
| | - Keisuke Yamada
- School of Advanced Science and Engineering, Waseda University, Tokyo 169-8555, Japan
| | - Natsuki Iwano
- Graduate School of Advanced Science and Engineering, Waseda University, Tokyo 169-8555, Japan
| | - Shunsuke Sumi
- Graduate School of Advanced Science and Engineering, Waseda University, Tokyo 169-8555, Japan
- Department of Life Science Frontiers, Center for iPS Cell Research and Application, Kyoto University, Kyoto 606-8507, Japan
| | - Shion Hosoda
- Computational Bio Big-Data Open Innovation Laboratory (CBBD-OIL), National Institute of Advanced Industrial Science and Technology (AIST), Tokyo 169-8555, Japan
- Graduate School of Advanced Science and Engineering, Waseda University, Tokyo 169-8555, Japan
| | - Shitao Zhao
- Waseda Research Institute for Science and Engineering, Waseda University, Tokyo 169-8555, Japan
| | - Tsukasa Fukunaga
- Waseda Institute for Advanced Study, Waseda University, Tokyo 169-0051, Japan
- Department of Computer Science, Graduate School of Information Science and Technology, The University of Tokyo, Tokyo 113-0032, Japan
| | - Michiaki Hamada
- Computational Bio Big-Data Open Innovation Laboratory (CBBD-OIL), National Institute of Advanced Industrial Science and Technology (AIST), Tokyo 169-8555, Japan
- Graduate School of Advanced Science and Engineering, Waseda University, Tokyo 169-8555, Japan
- School of Advanced Science and Engineering, Waseda University, Tokyo 169-8555, Japan
- Graduate School of Medicine, Nippon Medical School, Tokyo 113-8602, Japan
| |
|
15
|
Kautsar SA, van der Hooft JJJ, de Ridder D, Medema MH. BiG-SLiCE: A highly scalable tool maps the diversity of 1.2 million biosynthetic gene clusters. Gigascience 2021; 10:giaa154. [PMID: 33438731 PMCID: PMC7804863 DOI: 10.1093/gigascience/giaa154] [Citation(s) in RCA: 120] [Impact Index Per Article: 30.0] [Received: 08/26/2020] [Revised: 10/29/2020] [Accepted: 11/29/2020] [Indexed: 12/20/2022] Open
Abstract
BACKGROUND: Genome mining for biosynthetic gene clusters (BGCs) has become an integral part of natural product discovery. The >200,000 microbial genomes now publicly available hold information on abundant novel chemistry. One way to navigate this vast genomic diversity is through comparative analysis of homologous BGCs, which allows identification of cross-species patterns that can be matched to the presence of metabolites or biological activities. However, current tools are hindered by a bottleneck caused by the expensive network-based approach used to group these BGCs into gene cluster families (GCFs). RESULTS: Here, we introduce BiG-SLiCE, a tool designed to cluster massive numbers of BGCs. By representing them in Euclidean space, BiG-SLiCE can group BGCs into GCFs in a non-pairwise, near-linear fashion. We used BiG-SLiCE to analyze 1,225,071 BGCs collected from 209,206 publicly available microbial genomes and metagenome-assembled genomes within 10 days on a typical 36-core CPU server. We demonstrate the utility of such analyses by reconstructing a global map of secondary metabolic diversity across taxonomy to identify uncharted biosynthetic potential. BiG-SLiCE also provides a "query mode" that can efficiently place newly sequenced BGCs into previously computed GCFs, plus a powerful output visualization engine that facilitates user-friendly data exploration. CONCLUSIONS: BiG-SLiCE opens up new possibilities to accelerate natural product discovery and offers a first step towards constructing a global and searchable interconnected network of BGCs. As more genomes are sequenced from understudied taxa, more information can be mined to highlight their potentially novel chemistry. BiG-SLiCE is available via https://github.com/medema-group/bigslice.
Affiliation(s)
- Satria A Kautsar
- Bioinformatics Group, Wageningen University, Droevendaalsesteeg 1, 6708PB, Wageningen, The Netherlands
| | - Justin J J van der Hooft
- Bioinformatics Group, Wageningen University, Droevendaalsesteeg 1, 6708PB, Wageningen, The Netherlands
| | - Dick de Ridder
- Bioinformatics Group, Wageningen University, Droevendaalsesteeg 1, 6708PB, Wageningen, The Netherlands
| | - Marnix H Medema
- Bioinformatics Group, Wageningen University, Droevendaalsesteeg 1, 6708PB, Wageningen, The Netherlands
| |
|