1
Kim HR, Ji H, Kim GB, Lee SY. Enzyme functional classification using artificial intelligence. Trends Biotechnol 2025:S0167-7799(25)00088-5. [PMID: 40155269 DOI: 10.1016/j.tibtech.2025.03.003] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/02/2025] [Revised: 02/27/2025] [Accepted: 03/06/2025] [Indexed: 04/01/2025]
Abstract
Enzymes are essential for cellular metabolism, and elucidating their functions is critical for advancing biochemical research. However, experimental methods are often time consuming and resource intensive. To address this, significant efforts have been directed toward applying artificial intelligence (AI) to enzyme function prediction, enabling high-throughput and scalable approaches. In this review, we discuss advances in AI-driven enzyme functional annotation, transitioning from traditional machine learning (ML) methods to state-of-the-art deep learning approaches. We highlight how deep learning enables models to automatically extract features from raw data without manual intervention, leading to enhanced performance. Finally, we discuss the discovery of novel enzyme functions and generation of de novo enzymes through the integration of generative AIs and bio big data as future research directions.
Affiliation(s)
- Ha Rim Kim
- Metabolic and Biomolecular Engineering National Research Laboratory, Department of Chemical and Biomolecular Engineering (BK21 four), KAIST Institute for BioCentury, Korea Advanced Institute of Science and Technology (KAIST), Daejeon 34141, Republic of Korea; Systems Metabolic Engineering and Systems Healthcare Cross-Generation Collaborative Laboratory, Korea Advanced Institute of Science and Technology (KAIST), Daejeon 34141, Republic of Korea
- Hongkeun Ji
- Metabolic and Biomolecular Engineering National Research Laboratory, Department of Chemical and Biomolecular Engineering (BK21 four), KAIST Institute for BioCentury, Korea Advanced Institute of Science and Technology (KAIST), Daejeon 34141, Republic of Korea; Systems Metabolic Engineering and Systems Healthcare Cross-Generation Collaborative Laboratory, Korea Advanced Institute of Science and Technology (KAIST), Daejeon 34141, Republic of Korea
- Gi Bae Kim
- Metabolic and Biomolecular Engineering National Research Laboratory, Department of Chemical and Biomolecular Engineering (BK21 four), KAIST Institute for BioCentury, Korea Advanced Institute of Science and Technology (KAIST), Daejeon 34141, Republic of Korea; Systems Metabolic Engineering and Systems Healthcare Cross-Generation Collaborative Laboratory, Korea Advanced Institute of Science and Technology (KAIST), Daejeon 34141, Republic of Korea; BioProcess Engineering Research Center, Korea Advanced Institute of Science and Technology (KAIST), Daejeon 34141, Republic of Korea
- Sang Yup Lee
- Metabolic and Biomolecular Engineering National Research Laboratory, Department of Chemical and Biomolecular Engineering (BK21 four), KAIST Institute for BioCentury, Korea Advanced Institute of Science and Technology (KAIST), Daejeon 34141, Republic of Korea; Systems Metabolic Engineering and Systems Healthcare Cross-Generation Collaborative Laboratory, Korea Advanced Institute of Science and Technology (KAIST), Daejeon 34141, Republic of Korea; Graduate School of Engineering Biology, Korea Advanced Institute of Science and Technology (KAIST), Daejeon 34141, Republic of Korea; BioProcess Engineering Research Center, Korea Advanced Institute of Science and Technology (KAIST), Daejeon 34141, Republic of Korea; Center for Synthetic Biology, Korea Advanced Institute of Science and Technology (KAIST), Daejeon 34141, Republic of Korea.
2
Tartici A, Nayar G, Altman RB. Pool PaRTI: A PageRank-Based Pooling Method for Identifying Critical Residues and Enhancing Protein Sequence Representations. bioRxiv 2025:2024.10.04.616701. [PMID: 40166178 PMCID: PMC11956911 DOI: 10.1101/2024.10.04.616701] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 04/02/2025]
Abstract
Motivation: Protein language models produce token-level embeddings for each residue, resulting in an output matrix with dimensions that vary based on sequence length. However, downstream machine learning models typically require fixed-length input vectors, necessitating a pooling method to compress the output matrix into a single vector representation of the entire protein. Traditional pooling methods often result in substantial information loss, impacting downstream task performance. We aim to develop a pooling method that produces more expressive general-purpose protein embedding vectors while offering biological interpretability. Results: We introduce Pool PaRTI, a novel pooling method that leverages internal transformer attention matrices and PageRank to assign token importance weights. Our unsupervised and parameter-free approach consistently prioritizes residues experimentally annotated as critical for function, assigning them higher importance scores. Across four diverse protein machine learning tasks, Pool PaRTI yields significant gains in predictive performance. Additionally, it enhances interpretability by identifying biologically relevant regions without relying on explicit structural data or annotated training. To assess generalizability, we evaluated Pool PaRTI with two encoder-only protein language models, confirming its robustness across different models. Availability and Implementation: Pool PaRTI is implemented in Python with PyTorch and is available at https://github.com/Helix-Research-Lab/Pool_PaRTI.git. The Pool PaRTI sequence embeddings and residue importance values for all human proteins on UniProt are available at https://zenodo.org/records/15036725 for ESM2 and protBERT.
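The pooling idea described in this abstract (attention matrix, then PageRank, then importance-weighted average of residue embeddings) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: it assumes a single, already aggregated, non-negative attention matrix in which entry (i, j) is how strongly residue i attends to residue j, and the function names, the damping factor of 0.85, and the toy data are all hypothetical choices made for illustration.

```python
import numpy as np


def pagerank_weights(attention, damping=0.85, tol=1e-10, max_iter=1000):
    """Token importance via PageRank power iteration on an attention graph.

    Assumption: attention[i, j] >= 0 is the weight of a directed edge i -> j.
    """
    a = np.asarray(attention, dtype=float)
    n = a.shape[0]
    row_sums = a.sum(axis=1, keepdims=True)
    # Row-normalise into a transition matrix; rows with no mass become uniform.
    safe_sums = np.where(row_sums > 0, row_sums, 1.0)
    transition = np.where(row_sums > 0, a / safe_sums, 1.0 / n)
    rank = np.full(n, 1.0 / n)
    for _ in range(max_iter):
        new_rank = damping * (transition.T @ rank) + (1.0 - damping) / n
        converged = np.abs(new_rank - rank).sum() < tol
        rank = new_rank
        if converged:
            break
    return rank / rank.sum()


def pagerank_pool(embeddings, attention):
    """Collapse an (L, d) residue embedding matrix into one length-d vector."""
    weights = pagerank_weights(attention)
    return weights @ np.asarray(embeddings, dtype=float)


# Toy example: 4 residues, 8-dimensional embeddings, every residue
# attends mostly to residue 2, so residue 2 should get the largest weight.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(4, 8))
attention = np.tile(np.array([0.1, 0.1, 0.7, 0.1]), (4, 1))

weights = pagerank_weights(attention)
pooled = pagerank_pool(embeddings, attention)
```

Unlike mean pooling, which weights every residue equally, the weights here concentrate on residues that receive the most attention, and the pooled vector has a fixed length regardless of sequence length.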
3
Ashcroft E, Poma M, Tischler D, Munoz-Munoz J. Mining metagenomes from extremophiles as a resource for novel glycoside hydrolases for industrial applications. Methods Enzymol 2025; 714:45-60. [PMID: 40288852 DOI: 10.1016/bs.mie.2025.02.008] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 04/29/2025]
Abstract
The exploration of metagenomes from extremophiles has emerged as a promising approach for discovering novel glycoside hydrolases (GHs) with potential industrial applications. Extremophiles, which thrive in harsh conditions such as high salinity, extreme temperatures, and acidic or alkaline environments, produce enzymes naturally adapted to function under these conditions. This unique adaptability makes them highly desirable for industrial processes requiring robust and efficient biocatalysts. These biocatalysts reduce reliance on harsh chemicals and energy-intensive processes, contributing to greener industrial operations. This review underscores the power of metagenomics in bypassing the need to culture large libraries of extremophiles in the lab. High-throughput sequencing and bioinformatics enable the identification of novel GH-encoding genes directly from environmental DNA. While metagenomic mining has yielded promising results, challenges such as the expression of extremophile-derived genes in mesophilic hosts, low activity yields, and scalability remain. Advances in synthetic biology and protein engineering could address these bottlenecks, enabling more efficient utilization of GHs. Additionally, integrating machine learning for predictive functional annotation may accelerate the identification of high-value candidates.
Affiliation(s)
- Ellie Ashcroft
- Microbial Enzymology Lab, Department of Applied Sciences, Ellison Building A, Northumbria University, Newcastle Upon Tyne, United Kingdom
- Melissa Poma
- Microbial Enzymology Lab, Department of Applied Sciences, Ellison Building A, Northumbria University, Newcastle Upon Tyne, United Kingdom
- Dirk Tischler
- Microbial Biotechnology, Faculty of Biology and Biotechnology, Ruhr University Bochum, Bochum, Germany
- Jose Munoz-Munoz
- Microbial Enzymology Lab, Department of Applied Sciences, Ellison Building A, Northumbria University, Newcastle Upon Tyne, United Kingdom.
4
Hirota K, Salim F, Yamada T. DeepES: deep learning-based enzyme screening to identify orphan enzyme genes. Bioinformatics 2025; 41:btaf053. [PMID: 39909853 PMCID: PMC11881691 DOI: 10.1093/bioinformatics/btaf053] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/23/2024] [Revised: 11/04/2024] [Accepted: 02/02/2025] [Indexed: 02/07/2025]
Abstract
Motivation: Progress in sequencing technology has led to the determination of large numbers of protein sequences, and large enzyme databases are now available. Although many computational tools for enzyme annotation have been developed, sequence information is unavailable for many enzymes, known as orphan enzymes. These orphan enzymes hinder sequence similarity-based functional annotation, leading to gaps in understanding the association between sequences and enzymatic reactions. Results: Therefore, we developed DeepES, a deep learning-based tool for enzyme screening to identify orphan enzyme genes, focusing on biosynthetic gene clusters and reaction class. DeepES uses protein sequences as inputs and evaluates whether the input genes contain biosynthetic gene clusters of interest by integrating the outputs of the binary classifier for each reaction class. The validation results suggested that DeepES can capture functional similarity between protein sequences, and it can be implemented to explore orphan enzyme genes. By applying DeepES to 4744 metagenome-assembled genomes, we identified candidate genes for 236 orphan enzymes, including those involved in short-chain fatty acid production as a characteristic pathway in human gut bacteria. Availability and Implementation: DeepES is available at https://github.com/yamada-lab/DeepES. Model weights and the candidate genes are available at Zenodo (https://doi.org/10.5281/zenodo.11123900).
Affiliation(s)
- Keisuke Hirota
- School of Life Science and Technology, Institute of Science Tokyo, Tokyo, 152-8550, Japan
- Felix Salim
- School of Life Science and Technology, Institute of Science Tokyo, Tokyo, 152-8550, Japan
- Takuji Yamada
- School of Life Science and Technology, Institute of Science Tokyo, Tokyo, 152-8550, Japan
- Metagen, Inc., Yamagata, 997-0052, Japan
- Metagen Therapeutics, Inc., Yamagata, 997-0052, Japan
- digzyme, Inc., Tokyo, 105-0001, Japan
5
Nguyen VTD, Nguyen ND, Hy TS. ProteinReDiff: Complex-based ligand-binding proteins redesign by equivariant diffusion-based generative models. Struct Dyn 2024; 11:064102. [PMID: 39629167 PMCID: PMC11614476 DOI: 10.1063/4.0000271] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/11/2024] [Accepted: 11/06/2024] [Indexed: 12/07/2024]
Abstract
Proteins, serving as the fundamental architects of biological processes, interact with ligands to perform a myriad of functions essential for life. Designing functional ligand-binding proteins is pivotal for advancing drug development and enhancing therapeutic efficacy. In this study, we introduce ProteinReDiff, a diffusion framework targeting the redesign of ligand-binding proteins. Using equivariant diffusion-based generative models, ProteinReDiff enables the creation of high-affinity ligand-binding proteins without the need for detailed structural information, leveraging instead the potential of initial protein sequences and ligand SMILES strings. Our evaluations across sequence diversity, structural preservation, and ligand binding affinity underscore ProteinReDiff's potential to advance computational drug discovery and protein engineering.
Affiliation(s)
- Nhan D Nguyen
- Pritzker School of Molecular Engineering, University of Chicago, Chicago, Illinois 60637, USA
- Truong Son Hy
- Department of Computer Science, University of Alabama at Birmingham, Birmingham, Alabama 35294, USA
6
Yang J, Li FZ, Arnold FH. Opportunities and Challenges for Machine Learning-Assisted Enzyme Engineering. ACS Cent Sci 2024; 10:226-241. [PMID: 38435522 PMCID: PMC10906252 DOI: 10.1021/acscentsci.3c01275] [Citation(s) in RCA: 25] [Impact Index Per Article: 25.0] [Received: 10/17/2023] [Revised: 12/26/2023] [Accepted: 01/16/2024] [Indexed: 03/05/2024]
Abstract
Enzymes can be engineered at the level of their amino acid sequences to optimize key properties such as expression, stability, substrate range, and catalytic efficiency, or even to unlock new catalytic activities not found in nature. Because the search space of possible proteins is vast, enzyme engineering usually involves discovering an enzyme starting point that has some level of the desired activity followed by directed evolution to improve its "fitness" for a desired application. Recently, machine learning (ML) has emerged as a powerful tool to complement this empirical process. ML models can contribute to (1) starting point discovery by functional annotation of known protein sequences or generating novel protein sequences with desired functions and (2) navigating protein fitness landscapes for fitness optimization by learning mappings between protein sequences and their associated fitness values. In this Outlook, we explain how ML complements enzyme engineering and discuss its future potential to unlock improved engineering outcomes.
Affiliation(s)
- Jason Yang
- Division of Chemistry and Chemical Engineering, California Institute of Technology, Pasadena, California 91125, United States
- Francesca-Zhoufan Li
- Division of Biology and Biological Engineering, California Institute of Technology, Pasadena, California 91125, United States
- Frances H. Arnold
- Division of Chemistry and Chemical Engineering, California Institute of Technology, Pasadena, California 91125, United States
- Division of Biology and Biological Engineering, California Institute of Technology, Pasadena, California 91125, United States
7
Ariaeenejad S, Gharechahi J, Foroozandeh Shahraki M, Fallah Atanaki F, Han JL, Ding XZ, Hildebrand F, Bahram M, Kavousi K, Hosseini Salekdeh G. Precision enzyme discovery through targeted mining of metagenomic data. Nat Prod Bioprospect 2024; 14:7. [PMID: 38200389 PMCID: PMC10781932 DOI: 10.1007/s13659-023-00426-8] [Citation(s) in RCA: 8] [Impact Index Per Article: 8.0] [Received: 10/25/2023] [Accepted: 12/19/2023] [Indexed: 01/12/2024]
Abstract
Metagenomics has opened new avenues for exploring the genetic potential of uncultured microorganisms, which may serve as promising sources of enzymes and natural products for industrial applications. Identifying enzymes with improved catalytic properties from the vast amount of available metagenomic data poses a significant challenge that demands the development of novel computational and functional screening tools. The catalytic properties of all enzymes are primarily dictated by their structures, which are predominantly determined by their amino acid sequences. However, this aspect has not been fully considered in the enzyme bioprospecting processes. With the accumulating number of available enzyme sequences and the increasing demand for discovering novel biocatalysts, structural and functional modeling can be employed to identify potential enzymes with novel catalytic properties. Recent efforts to discover new polysaccharide-degrading enzymes from rumen metagenome data using homology-based searches and machine learning-based models have shown significant promise. Here, we will explore various computational approaches that can be employed to screen and shortlist metagenome-derived enzymes as potential biocatalyst candidates, in conjunction with the wet lab analytical methods traditionally used for enzyme characterization.
Affiliation(s)
- Shohreh Ariaeenejad
- Department of Systems and Synthetic Biology, Agricultural Biotechnology Research Institute of Iran (ABRII), Agricultural Research Education and Extension Organization (AREEO), Karaj, Iran
- Javad Gharechahi
- Human Genetics Research Center, Baqiyatallah University of Medical Sciences, Tehran, Iran
- Mehdi Foroozandeh Shahraki
- Laboratory of Complex Biological Systems and Bioinformatics (CBB), Institute of Biochemistry and Biophysics (IBB), University of Tehran, Tehran, Iran
- Fereshteh Fallah Atanaki
- Laboratory of Complex Biological Systems and Bioinformatics (CBB), Institute of Biochemistry and Biophysics (IBB), University of Tehran, Tehran, Iran
- Jian-Lin Han
- Livestock Genetics Program, International Livestock Research Institute (ILRI), Nairobi, 00100, Kenya
- CAAS-ILRI Joint Laboratory on Livestock and Forage Genetic Resources, Institute of Animal Science, Chinese Academy of Agricultural Sciences (CAAS), Beijing, 100193, China
- Xue-Zhi Ding
- Key Laboratory of Yak Breeding Engineering, Lanzhou Institute of Husbandry and Pharmaceutical Sciences, Chinese Academy of Agricultural Sciences (CAAS), Lanzhou, 730050, China
- Falk Hildebrand
- Gut Microbes and Health, Quadram Institute Bioscience, Norwich, Norfolk, UK
- Digital Biology, Earlham Institute, Norwich, Norfolk, UK
- Mohammad Bahram
- Department of Ecology, Swedish University of Agricultural Sciences, Ulls Väg 16, 756 51, Uppsala, Sweden
- Department of Botany, Institute of Ecology and Earth Sciences, University of Tartu, 40 Lai St, Tartu, Estonia
- Kaveh Kavousi
- Laboratory of Complex Biological Systems and Bioinformatics (CBB), Institute of Biochemistry and Biophysics (IBB), University of Tehran, Tehran, Iran.