1
Qu Y, Niu Z, Ding Q, Zhao T, Kong T, Bai B, Ma J, Zhao Y, Zheng J. Ensemble Learning with Supervised Methods Based on Large-Scale Protein Language Models for Protein Mutation Effects Prediction. Int J Mol Sci 2023; 24:16496. [PMID: 38003686 PMCID: PMC10671426 DOI: 10.3390/ijms242216496] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/13/2023] [Revised: 11/11/2023] [Accepted: 11/17/2023] [Indexed: 11/26/2023] [Open Access]
Abstract
Machine learning is increasingly used in protein engineering, and research on predicting the effects of protein mutations has attracted growing attention. So far, the best results have been achieved by methods based on protein language models, which are trained on large numbers of unlabeled protein sequences to capture the evolutionary rules hidden in those sequences and can therefore predict fitness from sequence. Although many such models and methods have been successfully applied in practical protein engineering, most studies have focused on constructing more complex language models that capture richer sequence features and on using those features for unsupervised protein fitness prediction. Considerable potential in the existing models remains untapped, such as whether prediction accuracy can be further improved by integrating different models. Furthermore, because the relationship between protein fitness and the quantification of specific functionalities is nonlinear, how to use large-scale models to predict mutational effects on quantifiable protein properties has yet to be explored thoroughly. In this study, we propose an ensemble learning approach for predicting mutational effects of proteins that integrates protein sequence features extracted from multiple large protein language models with evolutionarily coupled features extracted from homologous sequences, and we compare linear regression and deep learning models in mapping these features to quantifiable functional changes. We tested our approach on a dataset of 17 protein deep mutational scans and showed that the integrated approach, combined with linear regression, gives the models higher prediction accuracy and better generalization. Moreover, we further illustrated the reliability of the integrated approach by exploring differences in predictive performance across species and protein sequence lengths, and by visualizing the clustering of ensemble and non-ensemble features.
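The feature-integration idea in this abstract (concatenate per-variant features from several sources, then map them to a measured property with a linear model) can be sketched roughly as follows. This is an illustrative sketch only, not the authors' implementation: the function names, the toy feature vectors, and the small ridge penalty `alpha` are all assumptions, and the abstract itself only states that linear regression was used.

```python
from typing import List, Sequence


def concat_features(*blocks: Sequence[Sequence[float]]) -> List[List[float]]:
    """Concatenate per-variant feature vectors from several sources.

    Each block holds one feature vector per variant, in the same variant order.
    """
    n = len(blocks[0])
    if any(len(block) != n for block in blocks):
        raise ValueError("all feature blocks must describe the same variants")
    return [[x for block in blocks for x in block[i]] for i in range(n)]


def solve_linear_system(a: List[List[float]], b: List[float]) -> List[float]:
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [list(row) + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("singular system; try a larger alpha")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x


def fit_ridge(features: Sequence[Sequence[float]],
              targets: Sequence[float],
              alpha: float = 1e-6) -> List[float]:
    """Fit linear weights plus a bias term (last entry) via the normal equations.

    alpha is an assumed small ridge penalty on the feature weights only.
    """
    rows = [list(f) + [1.0] for f in features]  # append a bias column
    d = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(d)] for i in range(d)]
    for i in range(d - 1):  # leave the bias term unpenalized
        xtx[i][i] += alpha
    xty = [sum(r[i] * y for r, y in zip(rows, targets)) for i in range(d)]
    return solve_linear_system(xtx, xty)


def predict(weights: Sequence[float],
            features: Sequence[Sequence[float]]) -> List[float]:
    """Apply fitted weights (bias last) to new feature vectors."""
    return [sum(w * x for w, x in zip(weights, list(f) + [1.0])) for f in features]


# Toy stand-ins for features from two hypothetical sources (not real embeddings).
source_a = [[0.0], [1.0], [2.0], [3.0]]
source_b = [[1.0], [0.0], [1.0], [0.0]]
measured = [4.0, 3.0, 8.0, 7.0]  # invented values: 2*a + 3*b + 1

combined = concat_features(source_a, source_b)
weights = fit_ridge(combined, measured)
print(predict(weights, combined))
```

In practice the feature blocks would be high-dimensional, so a numerical library and a properly tuned regularization strength would replace the hand-written solver used here.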
Affiliation(s)
- Yang Qu
  - Cixi Biomedical Research Institute, Wenzhou Medical University, Ningbo 315300, China; (Y.Q.); (Z.N.); (Q.D.); (T.Z.)
  - Cixi Institute of Biomedical Engineering, Ningbo Institute of Materials Technology and Engineering, Chinese Academy of Sciences, Ningbo 315300, China; (T.K.); (B.B.); (J.M.)
- Zitong Niu
  - Cixi Biomedical Research Institute, Wenzhou Medical University, Ningbo 315300, China; (Y.Q.); (Z.N.); (Q.D.); (T.Z.)
  - Cixi Institute of Biomedical Engineering, Ningbo Institute of Materials Technology and Engineering, Chinese Academy of Sciences, Ningbo 315300, China; (T.K.); (B.B.); (J.M.)
- Qiaojiao Ding
  - Cixi Biomedical Research Institute, Wenzhou Medical University, Ningbo 315300, China; (Y.Q.); (Z.N.); (Q.D.); (T.Z.)
  - Cixi Institute of Biomedical Engineering, Ningbo Institute of Materials Technology and Engineering, Chinese Academy of Sciences, Ningbo 315300, China; (T.K.); (B.B.); (J.M.)
- Taowa Zhao
  - Cixi Biomedical Research Institute, Wenzhou Medical University, Ningbo 315300, China; (Y.Q.); (Z.N.); (Q.D.); (T.Z.)
  - Cixi Institute of Biomedical Engineering, Ningbo Institute of Materials Technology and Engineering, Chinese Academy of Sciences, Ningbo 315300, China; (T.K.); (B.B.); (J.M.)
- Tong Kong
  - Cixi Institute of Biomedical Engineering, Ningbo Institute of Materials Technology and Engineering, Chinese Academy of Sciences, Ningbo 315300, China; (T.K.); (B.B.); (J.M.)
- Bing Bai
  - Cixi Institute of Biomedical Engineering, Ningbo Institute of Materials Technology and Engineering, Chinese Academy of Sciences, Ningbo 315300, China; (T.K.); (B.B.); (J.M.)
- Jianwei Ma
  - Cixi Institute of Biomedical Engineering, Ningbo Institute of Materials Technology and Engineering, Chinese Academy of Sciences, Ningbo 315300, China; (T.K.); (B.B.); (J.M.)
- Yitian Zhao
  - Cixi Biomedical Research Institute, Wenzhou Medical University, Ningbo 315300, China; (Y.Q.); (Z.N.); (Q.D.); (T.Z.)
  - Cixi Institute of Biomedical Engineering, Ningbo Institute of Materials Technology and Engineering, Chinese Academy of Sciences, Ningbo 315300, China; (T.K.); (B.B.); (J.M.)
- Jianping Zheng
  - Cixi Biomedical Research Institute, Wenzhou Medical University, Ningbo 315300, China; (Y.Q.); (Z.N.); (Q.D.); (T.Z.)
  - Cixi Institute of Biomedical Engineering, Ningbo Institute of Materials Technology and Engineering, Chinese Academy of Sciences, Ningbo 315300, China; (T.K.); (B.B.); (J.M.)
2
Mardikoraem M, Woldring D. Protein Fitness Prediction Is Impacted by the Interplay of Language Models, Ensemble Learning, and Sampling Methods. Pharmaceutics 2023; 15:1337. [PMID: 37242577 PMCID: PMC10224321 DOI: 10.3390/pharmaceutics15051337] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/24/2023] [Revised: 04/19/2023] [Accepted: 04/21/2023] [Indexed: 05/28/2023] [Open Access]
Abstract
Advances in machine learning (ML) and the availability of protein sequences via high-throughput sequencing techniques have transformed the ability to design novel diagnostic and therapeutic proteins. ML allows protein engineers to capture complex trends hidden within protein sequences that would otherwise be difficult to identify in the context of the immense and rugged protein fitness landscape. Despite this potential, there persists a need for guidance during the training and evaluation of ML methods over sequencing data. Two key challenges for training discriminative models and evaluating their performance are handling severely imbalanced datasets (e.g., few high-fitness proteins among an abundance of non-functional proteins) and selecting appropriate protein sequence representations (numerical encodings). Here, we present a framework for applying ML over assay-labeled datasets to elucidate the capacity of sampling techniques and protein encoding methods to improve binding affinity and thermal stability prediction tasks. For protein sequence representations, we incorporate two widely used methods (One-Hot encoding and physicochemical encoding) and two language-based methods (next-token prediction, UniRep; masked-token prediction, ESM). We examine how performance varies with protein fitness, protein size, and sampling technique. In addition, an ensemble of protein representation methods is generated to discover the contribution of distinct representations and improve the final prediction score. We then implement multiple criteria decision analysis (MCDA; TOPSIS with entropy weighting), using multiple metrics well-suited for imbalanced data, to ensure statistical rigor in ranking our methods. Within the context of these datasets, the synthetic minority oversampling technique (SMOTE) outperformed undersampling when sequences were encoded with One-Hot, UniRep, and ESM representations. Moreover, ensemble learning increased predictive performance on the affinity-based dataset by 4% compared to the best single-encoding candidate (F1-score = 97%), while ESM alone was sufficient for stability prediction (F1-score = 92%).
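The ranking step mentioned in this abstract (TOPSIS with entropy weighting) follows a standard recipe that can be sketched as below. This is a generic textbook formulation, not the authors' code: the function names are hypothetical, the metric values are invented purely for illustration, and the sketch assumes every criterion is a "higher is better" metric with positive values, at least two alternatives, and at least one criterion that actually varies between alternatives.

```python
import math
from typing import List, Sequence


def entropy_weights(matrix: Sequence[Sequence[float]]) -> List[float]:
    """Derive criterion weights from the Shannon entropy of each column.

    Columns whose values differ more across alternatives get larger weights.
    """
    m, n = len(matrix), len(matrix[0])
    k = 1.0 / math.log(m)  # scales entropy into [0, 1]
    divergences = []
    for j in range(n):
        column = [row[j] for row in matrix]
        total = sum(column)
        entropy = 0.0
        for value in column:
            p = value / total
            if p > 0:
                entropy -= k * p * math.log(p)
        divergences.append(1.0 - entropy)
    norm = sum(divergences)
    return [d / norm for d in divergences]


def topsis_closeness(matrix: Sequence[Sequence[float]],
                     weights: Sequence[float]) -> List[float]:
    """Return each alternative's relative closeness to the ideal solution."""
    n = len(weights)
    # Vector-normalize each column, then apply the criterion weights.
    norms = [math.sqrt(sum(row[j] ** 2 for row in matrix)) for j in range(n)]
    weighted = [[weights[j] * row[j] / norms[j] for j in range(n)] for row in matrix]
    best = [max(row[j] for row in weighted) for j in range(n)]
    worst = [min(row[j] for row in weighted) for j in range(n)]
    scores = []
    for row in weighted:
        d_best = math.sqrt(sum((row[j] - best[j]) ** 2 for j in range(n)))
        d_worst = math.sqrt(sum((row[j] - worst[j]) ** 2 for j in range(n)))
        scores.append(d_worst / (d_best + d_worst))
    return scores


# Invented metric table: rows are hypothetical methods, columns are two metrics.
methods = ["method_a", "method_b", "method_c"]
metrics = [[0.90, 0.80],
           [0.50, 0.40],
           [0.70, 0.60]]

weights = entropy_weights(metrics)
closeness = topsis_closeness(metrics, weights)
ranking = sorted(zip(methods, closeness), key=lambda pair: pair[1], reverse=True)
print(ranking)
```

A closeness of 1 means an alternative matches the best observed value on every criterion, and 0 means it matches the worst on every criterion.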
Affiliation(s)
- Mehrsa Mardikoraem
  - Department of Chemical Engineering and Materials Science, Michigan State University, East Lansing, MI 48824, USA
  - Institute for Quantitative Health Science and Engineering, Michigan State University, East Lansing, MI 48824, USA
- Daniel Woldring
  - Department of Chemical Engineering and Materials Science, Michigan State University, East Lansing, MI 48824, USA
  - Institute for Quantitative Health Science and Engineering, Michigan State University, East Lansing, MI 48824, USA
3
Pfeifer C, Panzer S, Shea CH. Attentional Demand of a Movement Sequence Guided by Visual-Spatial and by Motor Representations. J Mot Behav 2022; 55:58-67. [PMID: 35878952 DOI: 10.1080/00222895.2022.2101424] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 02/02/2023]
Abstract
The objective of the experiment was to assess the change in attentional demands of a movement sequence guided by visual-spatial and motor representations across practice sessions in a dual-task probe paradigm. Participants were randomly assigned to either a 1-day or 2-day practice group. Following acquisition of the motor sequence task, participants first completed a retention test and then four inter-manual transfer tests under single and dual-task conditions. The probe task was a simple reaction time task. The inter-manual transfer tests, consisting of a mirror and non-mirror test, examined the development of the motor and visual-spatial representation, respectively. The results indicated that both representations guided the movement sequence and required attention. The attentional demands did not change with additional practice.
Affiliation(s)
- Stefan Panzer
  - Saarland University, Saarbrücken, Germany; Texas A&M University, College Station, USA
4
Cui F, Zhang Z, Zou Q. Sequence representation approaches for sequence-based protein prediction tasks that use deep learning. Brief Funct Genomics 2021; 20:61-73. [PMID: 33527980 DOI: 10.1093/bfgp/elaa030] [Citation(s) in RCA: 26] [Impact Index Per Article: 8.7] [Received: 12/01/2020] [Revised: 12/16/2020] [Accepted: 12/18/2020] [Indexed: 11/12/2022] [Open Access]
Abstract
Deep learning has been increasingly used in bioinformatics, especially in sequence-based protein prediction tasks, as large amounts of biological data are available and deep learning techniques have been developed rapidly in recent years. For sequence-based protein prediction tasks, the selection of a suitable model architecture is essential, whereas sequence data representation is a major factor in controlling model performance. Here, we summarized all the main approaches that are used to represent protein sequence data (amino acid sequence encoding or embedding), which include end-to-end embedding methods, non-contextual embedding methods and embedding methods that use transfer learning and others that are applied for some specific tasks (such as protein sequence embedding based on extracted features for protein structure predictions and graph convolutional network-based embedding for drug discovery tasks). We have also reviewed the architectures of various types of embedding models theoretically and the development of these types of sequence embedding approaches to facilitate researchers and users in selecting the model that best suits their requirements.
Affiliation(s)
- Feifei Cui
  - University of Electronic Science and Technology of China, Chengdu, Sichuan, China
- Zilong Zhang
  - University of Electronic Science and Technology of China, Chengdu, Sichuan, China
- Quan Zou
  - Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, Sichuan, China
5
Hyafil A, Giraud AL, Fontolan L, Gutkin B. Neural Cross-Frequency Coupling: Connecting Architectures, Mechanisms, and Functions. Trends Neurosci 2016; 38:725-740. [PMID: 26549886 DOI: 10.1016/j.tins.2015.09.001] [Citation(s) in RCA: 223] [Impact Index Per Article: 27.9] [Received: 06/25/2015] [Revised: 08/14/2015] [Accepted: 09/01/2015] [Indexed: 10/22/2022]
Abstract
Neural oscillations are ubiquitously observed in the mammalian brain, but it has proven difficult to tie oscillatory patterns to specific cognitive operations. Notably, the coupling between neural oscillations at different timescales has recently received much attention, both from experimentalists and theoreticians. We review the mechanisms underlying various forms of this cross-frequency coupling. We show that different types of neural oscillators and cross-frequency interactions yield distinct signatures in neural dynamics. Finally, we associate these mechanisms with several putative functions of cross-frequency coupling, including neural representations of multiple environmental items, communication over distant areas, internal clocking of neural processes, and modulation of neural processing based on temporal predictions.
Affiliation(s)
- Alexandre Hyafil
  - Universitat Pompeu Fabra, Theoretical and Computational Neuroscience, Roc Boronat 138, 08018 Barcelona, Spain; Research Unit, Parc Sanitari Sant Joan de Déu and Universitat de Barcelona, Esplugues de Llobregat, Barcelona, Spain
- Anne-Lise Giraud
  - Department of Neuroscience, University of Geneva, Campus Biotech, 9 chemin des Mines, 1211 Geneva, Switzerland
- Lorenzo Fontolan
  - Department of Neuroscience, University of Geneva, Campus Biotech, 9 chemin des Mines, 1211 Geneva, Switzerland
- Boris Gutkin
  - Group for Neural Theory, Institut National de la Santé et de la Recherche Médicale (INSERM) Unité 960, Département d'Etudes Cognitives, Ecole Normale Supérieure, 29 rue d'Ulm, 75005 Paris, France; Centre for Cognition and Decision Making, National Research University Higher School of Economics, Myasnitskaya Street 20, Moscow 101000, Russia
6
Abstract
PURPOSE: We sought to compare the effects of physical practice (PP) and mental practice (MP) on the immediate and long-term learning of the finger-to-thumb opposition sequence task (FOS) in children; in addition, we investigated the transfer of this learning to an untrained sequence of movements and to the contralateral untrained hand. METHOD: This study included thirty-six 9- and 10-year-old children who were randomly allocated into 3 groups: MP, PP, and no practice (NP). The MP and PP groups were subjected to a single session of training with the dominant trained hand. MP participants were trained by mentally rehearsing the movements, PP participants were trained by executing the movements, and the NP group had no training. The performance of the trained sequence (TS) and untrained reverse sequence (URS) by each of the 3 groups was evaluated under identical conditions before training, after 5 min, and at 4 days, 7 days, and 28 days after training. RESULTS: Whereas both trained groups (MP and PP) showed statistically significant improvement in TS using the trained hand at all assessment points after the training, only MP participants were able to transfer the performance gains from the TS to the URS and from the trained hand to the untrained opposite hand. CONCLUSION: Children were able to learn the FOS through MP or PP with a similar level of performance. Unlike PP, MP allowed for the transfer of performance gain to the URS and to the opposite hand, suggesting that the internal representations developed by MP were effector-independent.