1. Zhang Z, Li Z, Wang Q, Wu H, Yang M, Zhao F, Tan M, Han S. A protein fitness predictive framework based on feature combination and intelligent searching. Protein Sci 2024; 33:e5211. [PMID: 39548358 PMCID: PMC11567853 DOI: 10.1002/pro.5211] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/09/2024] [Revised: 09/14/2024] [Accepted: 10/22/2024] [Indexed: 11/17/2024]
Abstract
Machine learning (ML) constructs predictive models by learning the relationship between protein sequences and their functions, enabling efficient identification of protein sequences with high fitness values without becoming trapped in local optima, as directed evolution can. However, extracting the most pertinent functional feature information from a limited number of protein sequences is vital for optimizing the performance of ML models. Here, we propose scut_ProFP (Protein Fitness Predictor), a predictive framework that integrates feature combination and feature selection techniques. Feature combination offers comprehensive sequence information, while feature selection searches for the most beneficial features to enhance model performance, enabling accurate sequence-to-function mapping. Compared to similar frameworks, scut_ProFP demonstrates superior performance and is also competitive with more complex deep learning models (ECNet, EVmutation, and UniRep). In addition, scut_ProFP enables generalization from low-order mutants to high-order mutants. Finally, we utilized scut_ProFP to simulate the engineering of the fluorescent protein CreiLOV and achieved high enrichment of mutants with high fluorescence based on only a small number of low-fluorescence mutants. Essentially, the developed method is advantageous for ML in protein engineering, providing an effective approach to data-driven protein engineering. The code and datasets for scut_ProFP are available at https://github.com/Zhang66-star/scut_ProFP.
Affiliation(s)
- Zhihui Zhang
- Guangdong Key Laboratory of Fermentation and Enzyme Engineering, School of Biology and Biological Engineering, South China University of Technology, Guangzhou, China
- Zhixuan Li
- Guangdong Key Laboratory of Fermentation and Enzyme Engineering, School of Biology and Biological Engineering, South China University of Technology, Guangzhou, China
- Qianyue Wang
- School of Software Engineering, South China University of Technology, Guangzhou, China
- Hanlin Wu
- School of Software Engineering, South China University of Technology, Guangzhou, China
- Manli Yang
- Guangdong Key Laboratory of Fermentation and Enzyme Engineering, School of Biology and Biological Engineering, South China University of Technology, Guangzhou, China
- Fengguang Zhao
- School of Light Industry and Engineering, South China University of Technology, Guangzhou, China
- Mingkui Tan
- School of Software Engineering, South China University of Technology, Guangzhou, China
- Shuangyan Han
- Guangdong Key Laboratory of Fermentation and Enzyme Engineering, School of Biology and Biological Engineering, South China University of Technology, Guangzhou, China
2. Pal J, Ghosh S, Maji B, Bhattacharya DK. MMV method: a new approach to compare protein sequences under binary representation. J Biomol Struct Dyn 2024:1-7. [PMID: 38375605 DOI: 10.1080/07391102.2024.2317982] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/28/2023] [Accepted: 02/07/2024] [Indexed: 02/21/2024]
Abstract
In the present work, a new form of descriptor using the minimal moment vector (MMV) is introduced to compare protein sequences in the frequency domain under their component-wise binary representations. From every sequence, 20 different binary component sequences are formed, each corresponding to one of the 20 amino acids. Each such component sequence is then shifted from the time domain to the frequency domain by applying the Fast Fourier Transform (FFT). Next, the power spectrum calculated from the FFT values for each component sequence is normalized so that the sum of its components equals 1. The descriptor is defined as a 20-component vector composed of the 20 second-order minimal moments calculated from the normalized spectra of the 20 component sequences. Once the descriptor is known, the distance matrix is created by applying the Euclidean distance measure. The phylogenetic tree is generated by applying the unweighted pair group method with arithmetic mean (UPGMA) algorithm using Molecular Evolutionary Genetics Analysis 11 (MEGA11) software. In this work, the datasets used for similarity studies are 9 NADH dehydrogenase 5 (ND5) proteins, 12 baculovirus proteins, 24 transferrin (TF) proteins, and 50 coronavirus spike proteins. A qualitative measure using rationalized perception is used to compare the effectiveness of the proposed method. A quantitative measure based on symmetric distance (SD) is used to compare the phylogenetic trees of the present method with those obtained by other methods. It is observed that the phylogenetic trees generated by the proposed technique are on par with their known biological references, and they produce results better than those of the earlier methods.
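The pipeline described in this abstract (binary component sequences, Fourier transform, normalized power spectrum, second-order moment, Euclidean distance) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation. The abstract does not define the "minimal moment", so the sketch assumes it is the second moment about the spectral mean, which is the smallest second moment about any center. All function names are hypothetical, and a naive DFT from the standard library stands in for the FFT (it yields the same values, only more slowly).

```python
import cmath
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def indicator_sequences(seq):
    """Return 20 binary sequences, one per amino acid (1 where it occurs)."""
    seq = seq.upper()
    return [[1.0 if ch == aa else 0.0 for ch in seq] for aa in AMINO_ACIDS]


def normalized_power_spectrum(signal):
    """Naive DFT power spectrum, scaled so its components sum to 1."""
    n = len(signal)
    power = [
        abs(sum(x * cmath.exp(-2j * cmath.pi * k * t / n)
                for t, x in enumerate(signal))) ** 2
        for k in range(n)
    ]
    total = sum(power)
    # An amino acid that never occurs gives an all-zero spectrum; keep it as is.
    return [p / total for p in power] if total > 0 else power


def second_order_moment(spectrum):
    """ASSUMPTION: second moment about the mean of the normalized spectrum,
    taken over frequency indices rescaled to [0, 1) so that sequences of
    different lengths remain comparable."""
    n = len(spectrum)
    freqs = [k / n for k in range(n)]
    mean = sum(f * w for f, w in zip(freqs, spectrum))
    return sum((f - mean) ** 2 * w for f, w in zip(freqs, spectrum))


def mmv_like_descriptor(seq):
    """20-component descriptor: one moment per amino-acid component sequence."""
    return [second_order_moment(normalized_power_spectrum(s))
            for s in indicator_sequences(seq)]


def distance_matrix(seqs):
    """Pairwise Euclidean distances between descriptors."""
    descs = [mmv_like_descriptor(s) for s in seqs]
    return [[math.dist(a, b) for b in descs] for a in descs]
```

The resulting matrix is what a UPGMA implementation (MEGA11 in the abstract) would take as input to build the tree.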
Affiliation(s)
- Jayanta Pal
- Department of ECE, National Institute of Technology, Durgapur, India
- Department of CSE, Narula Institute of Technology, Kolkata, India
- Soumen Ghosh
- Department of ECE, National Institute of Technology, Durgapur, India
- Department of IT, Narula Institute of Technology, Kolkata, India
- Bansibadan Maji
- Department of ECE, National Institute of Technology, Durgapur, India
3. Li G, Jia L, Wang K, Sun T, Huang J. Prediction of Thermostability of Enzymes Based on the Amino Acid Index (AAindex) Database and Machine Learning. Molecules 2023; 28:8097. [PMID: 38138586 PMCID: PMC10746113 DOI: 10.3390/molecules28248097] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/07/2023] [Revised: 12/06/2023] [Accepted: 12/12/2023] [Indexed: 12/24/2023] Open
Abstract
The combination of wet-lab experimental data on multi-site combinatorial mutations and machine learning is an innovative method in protein engineering. In this study, we used an innovative sequence-activity relationship (innov'SAR) methodology based on novel descriptors and digital signal processing (DSP) to construct a predictive model. In this paper, 21 experimental (R)-selective amine transaminases from Aspergillus terreus (AT-ATA) were used as an input to predict higher thermostability mutants than those predicted using the existing data. We successfully improved the coefficient of determination (R²) of the model from 0.66 to 0.92. In addition, root-mean-squared deviation (RMSD), root-mean-squared fluctuation (RMSF), solvent accessible surface area (SASA), hydrogen bonds, and the radius of gyration were estimated based on molecular dynamics simulations, and the differences between the predicted mutants and the wild-type (WT) were analyzed. The successful application of the innov'SAR algorithm in improving the thermostability of AT-ATA may help in directed evolutionary screening and open up new avenues for protein engineering.
Affiliation(s)
- Gaolin Li
- School of Biological and Chemical Engineering, Zhejiang University of Science and Technology, Hangzhou 310023, China
- Lili Jia
- State Key Laboratory of Rice Biology and Breeding, China National Rice Research Institute, Hangzhou 311400, China
- Kang Wang
- Department of Physics, Zhejiang University of Science and Technology, Hangzhou 310023, China
- Tingting Sun
- Department of Physics, Zhejiang University of Science and Technology, Hangzhou 310023, China
- Jun Huang
- School of Biological and Chemical Engineering, Zhejiang University of Science and Technology, Hangzhou 310023, China
4. Alquran H, Al Fahoum A, Zyout A, Abu Qasmieh I. A comprehensive framework for advanced protein classification and function prediction using synergistic approaches: Integrating bispectral analysis, machine learning, and deep learning. PLoS One 2023; 18:e0295805. [PMID: 38096313 PMCID: PMC10721063 DOI: 10.1371/journal.pone.0295805] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/20/2023] [Accepted: 11/29/2023] [Indexed: 12/17/2023] Open
Abstract
Proteins are fundamental components of diverse cellular systems and play crucial roles in a variety of disease processes. Consequently, it is crucial to comprehend their structure, function, and intricate interconnections. Classifying proteins into families or groups with comparable structural and functional characteristics is a crucial aspect of this comprehension. This classification is crucial for evolutionary research, predicting protein function, and identifying potential therapeutic targets. Sequence alignment and structure-based alignment are frequently ineffective techniques for identifying protein families. This study addresses the need for a more efficient and accurate technique for feature extraction and protein classification. The research proposes a novel method that integrates bispectrum characteristics, deep learning techniques, and machine learning algorithms to overcome the limitations of conventional methods. The proposed method represents protein sequences numerically, applies bispectrum analysis, extracts features with convolutional neural networks of different topologies, and selects robust features to classify protein families. The goal is to outperform existing methods for identifying protein families, thereby enhancing classification metrics. The materials consist of numerous protein datasets, whereas the methods incorporate bispectrum characteristics and deep learning strategies. The results of this study demonstrate that the proposed method for identifying protein families is superior to conventional approaches. Significantly enhanced quality metrics demonstrated the efficacy of the combined bispectrum and deep learning approaches. These findings have the potential to advance the field of protein biology and facilitate pharmaceutical innovation.
In conclusion, this study presents a novel method that employs bispectrum characteristics and deep learning techniques to improve the precision and efficiency of protein family identification. The demonstrated advancements in classification metrics demonstrate this method's applicability to numerous scientific disciplines. This furthers our understanding of protein function and its implications for disease and treatment.
Affiliation(s)
- Hiam Alquran
- Hijjawi Faculty for Engineering Technology, Biomedical Systems and Informatics Engineering Department, Yarmouk University, Irbid, Jordan
- Amjed Al Fahoum
- Hijjawi Faculty for Engineering Technology, Biomedical Systems and Informatics Engineering Department, Yarmouk University, Irbid, Jordan
- Ala’a Zyout
- Hijjawi Faculty for Engineering Technology, Biomedical Systems and Informatics Engineering Department, Yarmouk University, Irbid, Jordan
- Isam Abu Qasmieh
- Hijjawi Faculty for Engineering Technology, Biomedical Systems and Informatics Engineering Department, Yarmouk University, Irbid, Jordan
5. Okagu IU, Aham EC, Ezeorba TPC, Ndefo JC, Aguchem RN, Udenigwe CC. Osteo‐modulatory dietary proteins and peptides: A concise review. J Food Biochem 2022; 46:e14365. [DOI: 10.1111/jfbc.14365] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/14/2022] [Revised: 06/20/2022] [Accepted: 07/18/2022] [Indexed: 11/29/2022]
Affiliation(s)
- Emmanuel Chigozie Aham
- Department of Biochemistry, Faculty of Biological Sciences, University of Nigeria, Nsukka, Nigeria
- Joseph Chinedum Ndefo
- Department of Science Laboratory Technology, Faculty of Physical Sciences, University of Nigeria, Nsukka, Nigeria
- Rita Ngozi Aguchem
- Department of Biochemistry, Faculty of Biological Sciences, University of Nigeria, Nsukka, Nigeria
- Chibuike C. Udenigwe
- School of Nutrition Sciences, Faculty of Health Sciences, University of Ottawa, Ottawa, Ontario, Canada
6. Mckenna A, P N Dubey S. Machine Learning Based Predictive Model for the Analysis of Sequence Activity Relationships Using Protein Spectra and Protein Descriptors. J Biomed Inform 2022; 128:104016. [PMID: 35143999 DOI: 10.1016/j.jbi.2022.104016] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 08/03/2021] [Revised: 12/13/2021] [Accepted: 02/03/2022] [Indexed: 11/26/2022]
Abstract
Accurately establishing the connection between a protein sequence and its function remains a focal point within the field of protein engineering, especially in the context of predicting the effects of mutations. From this, there has been a continued drive to build accurate and reliable predictive models via machine learning that allow for the virtual screening of many protein mutant sequences, measuring the relationship between sequence and 'fitness' or 'activity', commonly known as a Sequence-Activity-Relationship (SAR). An important preliminary stage in the building of these predictive models is the encoding of the chosen sequences. Evaluated in this work is a plethora of encoding strategies using the Amino Acid Index database, where the indices are transformed into their spectral form via Digital Signal Processing (DSP) techniques, as well as numerous protein structural and physicochemical descriptors. The encoding strategies are explored on a dataset curated to measure the thermostability of various mutants from a recombination library, designed from parental cytochrome P450s. In this work it was concluded that the implementation of protein spectra in concatenation with protein descriptors, together with the Partial Least Squares Regression (PLS) algorithm, gave the most noteworthy increase in the quality of the predictive models (as described in Encoding Strategy C), highlighting their utility in identifying an SAR. The accompanying software produced for this paper is termed pySAR (Python Sequence-Activity-Relationship), which allows a user to find the optimal arrangement of structural and/or physicochemical properties to encode their specific mutant library dataset; the source code is available at: https://github.com/amckenna41/pySAR.
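The encoding idea in this abstract, an amino acid index turned into a spectrum and concatenated with a sequence descriptor, can be illustrated with a short sketch. This does not use or reproduce the pySAR API. The function names are hypothetical, the Kyte-Doolittle hydropathy scale is chosen here only as one example index, amino acid composition is an assumed stand-in for the "protein descriptors", and a naive DFT replaces the FFT to keep the example dependency-free. The regression step (PLS in the abstract) is omitted.

```python
import cmath

# Kyte-Doolittle hydropathy values, used here as one example amino acid index.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def encode(seq, index):
    """Replace each residue with its value in the chosen index."""
    return [index[aa] for aa in seq.upper()]


def power_spectrum(signal):
    """Naive DFT power spectrum of a numeric signal."""
    n = len(signal)
    return [
        abs(sum(x * cmath.exp(-2j * cmath.pi * k * t / n)
                for t, x in enumerate(signal))) ** 2
        for k in range(n)
    ]


def composition(seq):
    """Fraction of each amino acid, in a fixed alphabetical order."""
    seq = seq.upper()
    return [seq.count(aa) / len(seq) for aa in sorted(KYTE_DOOLITTLE)]


def feature_vector(seq, index=KYTE_DOOLITTLE):
    """Spectrum of the index-encoded sequence, concatenated with composition.

    Assumes all sequences in a mutant library share one length, so that the
    feature vectors have equal dimension.
    """
    return power_spectrum(encode(seq, index)) + composition(seq)
```

Each row produced by `feature_vector` would then be one training example for whichever regression model is chosen.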
Affiliation(s)
- Adam Mckenna
- School of Electronics, Electrical Engineering and Computer Science, Queen's University of Belfast, University Road, BT7 1NN, Belfast, United Kingdom
- Sandhya P N Dubey
- Department of Data Science and Computer Applications, Manipal Institute of Technology, Manipal Academy of Higher Education (MAHE), Manipal, Karnataka 576104, India
7. Yang ZR. In silico prediction of Severe Acute Respiratory Syndrome Coronavirus 2 main protease cleavage sites. Proteins 2021; 90:791-801. [PMID: 34739145 PMCID: PMC8661936 DOI: 10.1002/prot.26274] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 08/19/2021] [Revised: 10/19/2021] [Accepted: 10/25/2021] [Indexed: 11/07/2022]
Abstract
One of the emerging approaches to combating the SARS-CoV-2 virus is to design accurate and efficient drugs, such as inhibitors against the viral protease, to stop the viral spread. In addition to laboratory investigation of the viral protease, which is fundamental, in silico research on the viral protease, such as protease cleavage site prediction, is critically important and urgent. However, this problem has yet to be addressed. This article has, for the first time, investigated this problem using pattern recognition approaches. The article has shown that pattern recognition approaches incorporating a specially tailored kernel function for dealing with amino acids achieve outstanding accuracy in cleavage site prediction and in the discovery of prototype cleavage peptides.
8. Siedhoff NE, Illig AM, Schwaneberg U, Davari MD. PyPEF-An Integrated Framework for Data-Driven Protein Engineering. J Chem Inf Model 2021; 61:3463-3476. [PMID: 34260225 DOI: 10.1021/acs.jcim.1c00099] [Citation(s) in RCA: 12] [Impact Index Per Article: 3.0] [Indexed: 12/12/2022]
Abstract
Data-driven strategies are gaining increased attention in protein engineering due to recent advances in access to large experimental databanks of proteins, next-generation sequencing (NGS), high-throughput screening (HTS) methods, and the development of artificial intelligence algorithms. However, the reliable prediction of beneficial amino acid substitutions, their combination, and the effect on functional properties remain the most significant challenges in protein engineering, which is applied to develop proteins and enzymes for biocatalysis, biomedicine, and life sciences. Here, we present a general-purpose framework (PyPEF: pythonic protein engineering framework) for performing data-driven protein engineering using machine learning methods combined with techniques from signal processing and statistical physics. PyPEF guides the identification and selection of beneficial proteins of a defined sequence space by systematically or randomly exploring the fitness of variants and by sampling random evolution pathways. The performance of PyPEF was evaluated concerning its predictive accuracy and throughput on four public protein and enzyme data sets using common regression models. It was proved that the program could efficiently predict the fitness of protein sequences for different target properties (predictive models with coefficient of determination values ranging from 0.58 to 0.92). By combining machine learning and protein evolution, PyPEF enabled the screening of proteins with various functions, reaching a screening capacity of more than 500,000 protein sequence variants in the timeframe of only a few minutes on a personal computer. 
PyPEF displayed significant accuracies on four public data sets (different proteins and properties) and underlined the potential of integrating data-driven technologies for covering different philosophies by either predicting the fitness of the variants to the highest accuracy accounting for epistatic effects or capturing the general trend of introduced mutations on the fitness in directed protein evolution campaigns. In essence, PyPEF can provide a powerful solution to current sequence exploration and combinatorial problems faced in protein engineering through exhaustive in silico screening of the sequence space.
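The idea of exploring sequence space by sampling random evolution pathways under a trained predictor can be sketched as a simple in silico walk. This is not PyPEF's code or API. The acceptance rule (keep a single-point mutation if the predicted fitness does not decrease) is an assumption made for illustration, and `toy_fitness` is a hypothetical placeholder for a fitted regression model.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def toy_fitness(seq):
    """Hypothetical stand-in for a trained fitness model: hydrophobic fraction."""
    return sum(seq.count(aa) for aa in "AILMFVW") / len(seq)


def random_pathway(start, predict, steps, rng):
    """Sample one evolution pathway of accepted single-point substitutions.

    ASSUMPTION: a proposed mutation is accepted only when the predicted
    fitness is at least as high as the current one.
    """
    current, score = start, predict(start)
    path = [(current, score)]
    for _ in range(steps):
        pos = rng.randrange(len(current))
        new_aa = rng.choice([aa for aa in AMINO_ACIDS if aa != current[pos]])
        candidate = current[:pos] + new_aa + current[pos + 1:]
        cand_score = predict(candidate)
        if cand_score >= score:
            current, score = candidate, cand_score
            path.append((current, score))
    return path


if __name__ == "__main__":
    pathway = random_pathway("MKTAYIAKQR", toy_fitness, 50, random.Random(0))
    for variant, fitness in pathway:
        print(variant, round(fitness, 2))
```

Passing a seeded `random.Random` instance keeps each sampled pathway reproducible; running the walk from many seeds gives a set of candidate variants to rank.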
Affiliation(s)
- Niklas E Siedhoff
- Institute of Biotechnology, RWTH Aachen University, Worringer Weg 3, 52074 Aachen, Germany
- Ulrich Schwaneberg
- Institute of Biotechnology, RWTH Aachen University, Worringer Weg 3, 52074 Aachen, Germany; DWI-Leibniz Institute for Interactive Materials, Forckenbeckstraße 50, 52074 Aachen, Germany
- Mehdi D Davari
- Institute of Biotechnology, RWTH Aachen University, Worringer Weg 3, 52074 Aachen, Germany
9. Li G, Qin Y, Fontaine NT, Ng Fuk Chong M, Maria‐Solano MA, Feixas F, Cadet XF, Pandjaitan R, Garcia‐Borràs M, Cadet F, Reetz MT. Machine Learning Enables Selection of Epistatic Enzyme Mutants for Stability Against Unfolding and Detrimental Aggregation. Chembiochem 2021; 22:904-914. [PMID: 33094545 PMCID: PMC7984044 DOI: 10.1002/cbic.202000612] [Citation(s) in RCA: 21] [Impact Index Per Article: 5.3] [Received: 08/31/2020] [Revised: 10/22/2020] [Indexed: 12/15/2022]
Abstract
Machine learning (ML) has pervaded most areas of protein engineering, including stability and stereoselectivity. Using limonene epoxide hydrolase as the model enzyme and innov'SAR as the ML platform, which incorporates digital signal processing, we achieved high protein robustness that can resist unfolding with concomitant detrimental aggregation. Fourier transform (FT) allows us to take into account the order of the protein sequence and the nonlinear interactions between positions, and thus to grasp epistatic phenomena. The innov'SAR approach is interpolative and extrapolative, and makes outside-the-box predictions not found in other state-of-the-art ML or deep learning approaches. Equally significant is the finding that our approach to ML in the present context, flanked by advanced molecular dynamics simulations, uncovers the connection between epistatic mutational interactions and protein robustness.
Affiliation(s)
- Guangyue Li
- State Key Laboratory for Biology of Plant Diseases and Insect Pests, Key Laboratory of Control of Biological Hazard Factors (Plant Origin) for Agri-product Quality and Safety, Ministry of Agriculture, Institute of Plant Protection, Chinese Academy of Agricultural Sciences, Beijing 100081, P. R. China
- Youcai Qin
- State Key Laboratory for Biology of Plant Diseases and Insect Pests, Key Laboratory of Control of Biological Hazard Factors (Plant Origin) for Agri-product Quality and Safety, Ministry of Agriculture, Institute of Plant Protection, Chinese Academy of Agricultural Sciences, Beijing 100081, P. R. China
- Nicolas T. Fontaine
- PEACCEL, Artificial Intelligence Department, 6 Square Albin Cachot, Box 42, 75013 Paris, France
- Matthieu Ng Fuk Chong
- PEACCEL, Artificial Intelligence Department, 6 Square Albin Cachot, Box 42, 75013 Paris, France
- Miguel A. Maria‐Solano
- Institut de Química Computacional i Catàlisi and Departament de Química, Universitat de Girona, Campus Montilivi, 17003 Girona, Catalonia, Spain
- Ferran Feixas
- Institut de Química Computacional i Catàlisi and Departament de Química, Universitat de Girona, Campus Montilivi, 17003 Girona, Catalonia, Spain
- Xavier F. Cadet
- PEACCEL, Artificial Intelligence Department, 6 Square Albin Cachot, Box 42, 75013 Paris, France
- Rudy Pandjaitan
- PEACCEL, Artificial Intelligence Department, 6 Square Albin Cachot, Box 42, 75013 Paris, France
- Marc Garcia‐Borràs
- Institut de Química Computacional i Catàlisi and Departament de Química, Universitat de Girona, Campus Montilivi, 17003 Girona, Catalonia, Spain
- Frederic Cadet
- PEACCEL, Artificial Intelligence Department, 6 Square Albin Cachot, Box 42, 75013 Paris, France
- Manfred T. Reetz
- Department of Chemistry, Philipps-Universität, 35032 Marburg, Germany
- Max-Planck-Institut fuer Kohlenforschung, 45470 Mülheim, Germany
- Tianjin Institute of Industrial Biotechnology, Chinese Academy of Sciences, 32 West 7th Avenue, Tianjin Airport Economic Area, 300308 Tianjin, P. R. China