1
Fu X, Suo H, Zhang J, Chen D. Machine-learning-guided Directed Evolution for AAV Capsid Engineering. Curr Pharm Des 2024; 30:811-824. [PMID: 38445704 DOI: 10.2174/0113816128286593240226060318] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/09/2023] [Revised: 02/07/2024] [Accepted: 02/13/2024] [Indexed: 03/07/2024]
Abstract
Target gene delivery is crucial to gene therapy. Adeno-associated virus (AAV) has emerged as a primary gene therapy vector due to its broad host range, long-term expression, and low pathogenicity. However, AAV vectors have some limitations, such as immunogenicity and insufficient targeting. Designing or modifying capsids is a potential method of improving the efficacy of gene delivery, but it is hindered by the weak biological basis of AAV, the complexity of the capsids, and the limitations of current screening methods. Artificial intelligence (AI), especially machine learning (ML), has great potential to accelerate and improve the optimization of capsid properties as well as to decrease their development time and manufacturing costs. This review introduces the traditional methods of designing AAV capsids and the general steps of building a sequence-function ML model, highlights the applications of ML in the development workflow, and summarizes its advantages and challenges.
Affiliation(s)
- Xianrong Fu
- School of Artificial Intelligence, Hangzhou Dianzi University, Hangzhou 310018, China
- Hairui Suo
- School of Artificial Intelligence, Hangzhou Dianzi University, Hangzhou 310018, China
- Jiachen Zhang
- School of Artificial Intelligence, Hangzhou Dianzi University, Hangzhou 310018, China
- Dongmei Chen
- School of Artificial Intelligence, Hangzhou Dianzi University, Hangzhou 310018, China
2
Xu H, Wu W, Zhao Y, Liu Z, Bao D, Li L, Lin M, Zhang Y, Zhao X, Luo D. Analysis of preoperative computed tomography radiomics and clinical factors for predicting postsurgical recurrence of papillary thyroid carcinoma. Cancer Imaging 2023; 23:118. [PMID: 38098119 PMCID: PMC10722708 DOI: 10.1186/s40644-023-00629-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/17/2023] [Accepted: 10/19/2023] [Indexed: 12/17/2023]
Abstract
BACKGROUND Postsurgical recurrence is of great concern for papillary thyroid carcinoma (PTC). We aim to investigate the value of computed tomography (CT)-based radiomics features and conventional clinical factors in predicting the recurrence of PTC. METHODS Two-hundred and eighty patients with PTC were retrospectively enrolled and divided into training and validation cohorts at a 6:4 ratio. Recurrence was defined as cytology/pathology-proven disease or morphological evidence of lesions on imaging examinations within 5 years after surgery. Radiomics features were extracted from manually segmented tumor on CT images and were then selected using four different feature selection methods sequentially. Multivariate logistic regression analysis was conducted to identify clinical features associated with recurrence. Radiomics, clinical, and combined models were constructed separately using logistic regression (LR), support vector machine (SVM), k-nearest neighbor (KNN), and neural network (NN), respectively. Receiver operating characteristic analysis was performed to evaluate the model performance in predicting recurrence. A nomogram was established based on all relevant features, with its reliability and reproducibility verified using calibration curves and decision curve analysis (DCA). RESULTS Eighty-nine patients with PTC experienced recurrence. A total of 1218 radiomics features were extracted from each segmentation. Five radiomics and six clinical features were related to recurrence. Among the 4 radiomics models, the LR-based and SVM-based radiomics models outperformed the NN-based radiomics model (P = 0.032 and 0.026, respectively). Among the 4 clinical models, only the difference between the area under the curve (AUC) of the LR-based and NN-based clinical model was statistically significant (P = 0.035). 
The combined models had higher AUCs than the corresponding radiomics and clinical models based on the same classifier, although most differences were not statistically significant. In the validation cohort, the combined models based on the LR, SVM, KNN, and NN classifiers had AUCs of 0.746, 0.754, 0.669, and 0.711, respectively. However, the AUCs of these combined models did not differ significantly (all P > 0.05). Calibration curves and DCA indicated that the nomogram has potential clinical utility. CONCLUSIONS The combined model may have potential for better prediction of PTC recurrence than radiomics and clinical models alone. Further testing with a larger cohort may help reach statistical significance.
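The abstract above compares classifiers by their area under the ROC curve (AUC). As a rough illustration of what that metric measures, and not the authors' code, the sketch below computes AUC as the fraction of positive/negative pairs that a model ranks correctly. The function name `roc_auc` and the toy labels and scores are hypothetical and are not data from the study.

```python
def roc_auc(labels, scores):
    """Rank-based AUC: the probability that a randomly chosen positive case
    receives a higher score than a randomly chosen negative case.
    Tied scores count as half a correct ranking."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative case")
    correct = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(positives) * len(negatives))


# Hypothetical toy example: 1 = recurrence, 0 = no recurrence.
labels = [1, 0, 1, 0, 0, 1]
scores = [0.9, 0.4, 0.6, 0.7, 0.2, 0.8]
auc = roc_auc(labels, scores)
print(round(auc, 3))
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation, which is the scale on which values such as 0.746 or 0.754 should be read.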
Affiliation(s)
- Haijun Xu
- Department of Radiology, National Cancer Center/National Clinical Research Center for Cancer/Cancer Hospital, Chinese Academy of Medical Sciences and Peking Union Medical College, Beijing, 100021, China
- Wenli Wu
- Medical Imaging Center, Liaocheng Tumor Hospital, Liaocheng, 252000, China
- Yanfeng Zhao
- Department of Radiology, National Cancer Center/National Clinical Research Center for Cancer/Cancer Hospital, Chinese Academy of Medical Sciences and Peking Union Medical College, Beijing, 100021, China.
- Zhou Liu
- Department of Radiology, National Cancer Center/National Clinical Research Center for Cancer/Cancer Hospital & Shenzhen Hospital, Chinese Academy of Medical Sciences and Peking Union Medical College, Shenzhen, 518116, China
- Dan Bao
- Department of Radiology, National Cancer Center/National Clinical Research Center for Cancer/Cancer Hospital, Chinese Academy of Medical Sciences and Peking Union Medical College, Beijing, 100021, China
- Lin Li
- Department of Radiology, National Cancer Center/National Clinical Research Center for Cancer/Cancer Hospital, Chinese Academy of Medical Sciences and Peking Union Medical College, Beijing, 100021, China
- Meng Lin
- Department of Radiology, National Cancer Center/National Clinical Research Center for Cancer/Cancer Hospital, Chinese Academy of Medical Sciences and Peking Union Medical College, Beijing, 100021, China
- Ya Zhang
- Department of Radiology, National Cancer Center/National Clinical Research Center for Cancer/Cancer Hospital & Shenzhen Hospital, Chinese Academy of Medical Sciences and Peking Union Medical College, Shenzhen, 518116, China
- Xinming Zhao
- Department of Radiology, National Cancer Center/National Clinical Research Center for Cancer/Cancer Hospital, Chinese Academy of Medical Sciences and Peking Union Medical College, Beijing, 100021, China
- Dehong Luo
- Department of Radiology, National Cancer Center/National Clinical Research Center for Cancer/Cancer Hospital, Chinese Academy of Medical Sciences and Peking Union Medical College, Beijing, 100021, China.
- Department of Radiology, National Cancer Center/National Clinical Research Center for Cancer/Cancer Hospital & Shenzhen Hospital, Chinese Academy of Medical Sciences and Peking Union Medical College, Shenzhen, 518116, China.
3
Johnston KE, Fannjiang C, Wittmann BJ, Hie BL, Yang KK, Wu Z. Machine Learning for Protein Engineering. arXiv 2023; arXiv:2305.16634v1. [PMID: 37292483 PMCID: PMC10246115] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/10/2023]
Abstract
Directed evolution of proteins has been the most effective method for protein engineering. However, a new paradigm is emerging, fusing the library generation and screening approaches of traditional directed evolution with computation through the training of machine learning models on protein sequence fitness data. This chapter highlights successful applications of machine learning to protein engineering and directed evolution, organized by the improvements that have been made with respect to each step of the directed evolution cycle. Additionally, we provide an outlook for the future based on the current direction of the field, namely in the development of calibrated models and in incorporating other modalities, such as protein structure.
Affiliation(s)
- Bruce J Wittmann
- Work done while at California Institute of Technology, now at Microsoft
4
Kuntz CP, Woods H, McKee AG, Zelt NB, Mendenhall JL, Meiler J, Schlebach JP. Towards generalizable predictions for G protein-coupled receptor variant expression. Biophys J 2022; 121:2712-2720. [PMID: 35715957 DOI: 10.1016/j.bpj.2022.06.018] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 01/04/2022] [Revised: 05/31/2022] [Accepted: 06/13/2022] [Indexed: 11/30/2022]
Abstract
Missense mutations that compromise the plasma membrane expression (PME) of integral membrane proteins are the root cause of numerous genetic diseases. Differentiation of this class of mutations from those that specifically modify the activity of the folded protein has proven useful for the development and targeting of precision therapeutics. Nevertheless, it remains challenging to predict the effects of mutations on the stability and/or expression of membrane proteins. In this work, we utilize deep mutational scanning data to train a series of artificial neural networks to predict the PME of transmembrane domain variants of G protein-coupled receptors from structural and/or evolutionary features. We show that our best-performing network, which we term the PME predictor, can recapitulate mutagenic trends within rhodopsin and can differentiate pathogenic transmembrane domain variants that cause it to misfold from those that compromise its signaling. This network also generates statistically significant predictions for the relative PME of transmembrane domain variants for another class A G protein-coupled receptor (β2 adrenergic receptor) but not for an unrelated voltage-gated potassium channel (KCNQ1). Notably, our analyses of these networks suggest structural features alone are generally sufficient to recapitulate the observed mutagenic trends. Moreover, our findings imply that networks trained in this manner may be generalizable to proteins that share a common fold. Implications of our findings for the design of mechanistically specific genetic predictors are discussed.
Affiliation(s)
- Charles P Kuntz
- Department of Chemistry, Indiana University, Bloomington, Indiana
- Hope Woods
- Department of Chemistry, Vanderbilt University, Nashville, Tennessee; Chemical and Physical Biology Program, Vanderbilt University, Nashville, Tennessee
- Andrew G McKee
- Department of Chemistry, Indiana University, Bloomington, Indiana
- Nathan B Zelt
- Department of Chemistry, Indiana University, Bloomington, Indiana
- Jeffrey L Mendenhall
- Department of Chemistry, Vanderbilt University, Nashville, Tennessee; Chemical and Physical Biology Program, Vanderbilt University, Nashville, Tennessee
- Jens Meiler
- Department of Chemistry, Vanderbilt University, Nashville, Tennessee; Institute for Drug Discovery, Leipzig University Medical School, Leipzig, Saxony, Germany
5
Wang Y, Xue P, Cao M, Yu T, Lane ST, Zhao H. Directed Evolution: Methodologies and Applications. Chem Rev 2021; 121:12384-12444. [PMID: 34297541 DOI: 10.1021/acs.chemrev.1c00260] [Citation(s) in RCA: 280] [Impact Index Per Article: 70.0] [Indexed: 02/06/2023]
Abstract
Directed evolution aims to expedite the natural evolution process of biological molecules and systems in a test tube through iterative rounds of gene diversifications and library screening/selection. It has become one of the most powerful and widespread tools for engineering improved or novel functions in proteins, metabolic pathways, and even whole genomes. This review describes the commonly used gene diversification strategies, screening/selection methods, and recently developed continuous evolution strategies for directed evolution. Moreover, we highlight some representative applications of directed evolution in engineering nucleic acids, proteins, pathways, genetic circuits, viruses, and whole cells. Finally, we discuss the challenges and future perspectives in directed evolution.
Affiliation(s)
- Yajie Wang
- Department of Chemical and Biomolecular Engineering, University of Illinois at Urbana-Champaign, Urbana, Illinois 61801, United States; Carl R. Woese Institute for Genomic Biology, University of Illinois at Urbana-Champaign, Urbana, Illinois 61801, United States; DOE Center for Advanced Bioenergy and Bioproducts Innovation, University of Illinois at Urbana-Champaign, Urbana, Illinois 61801, United States
- Pu Xue
- Department of Chemical and Biomolecular Engineering, University of Illinois at Urbana-Champaign, Urbana, Illinois 61801, United States; Carl R. Woese Institute for Genomic Biology, University of Illinois at Urbana-Champaign, Urbana, Illinois 61801, United States; DOE Center for Advanced Bioenergy and Bioproducts Innovation, University of Illinois at Urbana-Champaign, Urbana, Illinois 61801, United States
- Mingfeng Cao
- DOE Center for Advanced Bioenergy and Bioproducts Innovation, University of Illinois at Urbana-Champaign, Urbana, Illinois 61801, United States
- Tianhao Yu
- Department of Chemical and Biomolecular Engineering, University of Illinois at Urbana-Champaign, Urbana, Illinois 61801, United States; Carl R. Woese Institute for Genomic Biology, University of Illinois at Urbana-Champaign, Urbana, Illinois 61801, United States; DOE Center for Advanced Bioenergy and Bioproducts Innovation, University of Illinois at Urbana-Champaign, Urbana, Illinois 61801, United States
- Stephan T Lane
- Carl R. Woese Institute for Genomic Biology, University of Illinois at Urbana-Champaign, Urbana, Illinois 61801, United States; DOE Center for Advanced Bioenergy and Bioproducts Innovation, University of Illinois at Urbana-Champaign, Urbana, Illinois 61801, United States
- Huimin Zhao
- Department of Chemical and Biomolecular Engineering, University of Illinois at Urbana-Champaign, Urbana, Illinois 61801, United States; Carl R. Woese Institute for Genomic Biology, University of Illinois at Urbana-Champaign, Urbana, Illinois 61801, United States; DOE Center for Advanced Bioenergy and Bioproducts Innovation, University of Illinois at Urbana-Champaign, Urbana, Illinois 61801, United States; Department of Chemistry, University of Illinois at Urbana-Champaign, Urbana, Illinois 61801, United States
6
Volk MJ, Lourentzou I, Mishra S, Vo LT, Zhai C, Zhao H. Biosystems Design by Machine Learning. ACS Synth Biol 2020; 9:1514-1533. [PMID: 32485108 DOI: 10.1021/acssynbio.0c00129] [Citation(s) in RCA: 56] [Impact Index Per Article: 11.2] [Indexed: 12/16/2022]
Abstract
Biosystems such as enzymes, pathways, and whole cells have been increasingly explored for biotechnological applications. However, the intricate connectivity and resulting complexity of biosystems pose a major hurdle in designing biosystems with desirable features. As -omics and other high-throughput technologies have been rapidly developed, the promise of applying machine learning (ML) techniques in biosystems design has started to become a reality. ML models enable the identification of patterns within complicated biological data across multiple scales of analysis and can augment biosystems design applications by predicting new candidates for optimized performance. ML is being used at every stage of biosystems design to help find nonobvious engineering solutions with fewer design iterations. In this review, we first describe commonly used models and modeling paradigms within ML. We then discuss some applications of these models that have already shown success in biotechnological applications. Moreover, we discuss successful applications at all scales of biosystems design, including nucleic acids, genetic circuits, proteins, pathways, genomes, and bioprocesses. Finally, we discuss some limitations of these methods and potential solutions as well as prospects of the combination of ML and biosystems design.
7
Application of artificial intelligence to the in silico assessment of antimicrobial resistance and risks to human and animal health presented by priority enteric bacterial pathogens. Can Commun Dis Rep 2020; 46:180-185. [PMID: 32673383 DOI: 10.14745/ccdr.v46i06a05] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Indexed: 11/08/2022]
Abstract
Each year, approximately one in eight Canadians are affected by foodborne illness, either through outbreaks or sporadic illness, with animals being the major reservoir for the pathogens. Whole genome sequence analyses are now routinely implemented by public and animal health laboratories to define epidemiological disease clusters and to identify potential sources of infection. Similarly, a number of bioinformatics tools can be used to identify virulence and antimicrobial resistance (AMR) determinants in the genomes of pathogenic strains. Many important clinical and phenotypic characteristics of these pathogens can now be predicted using machine learning algorithms applied to whole genome sequence data. In this overview, we compare the ability of support vector machines, gradient-boosted decision trees and artificial neural networks to predict the levels of AMR within Salmonella enterica and extended-spectrum β-lactamase (ESBL) producing Escherichia coli. We show that minimum inhibitory concentrations (MIC) for each of 13 antimicrobials for S. enterica strains can be accurately determined, and that ESBL-producing E. coli strains can be accurately classified as susceptible, intermediate or resistant for each of seven antimicrobials. In addition to AMR and bacterial populations of greatest risk to human health, artificial intelligence algorithms hold promise as tools to predict other clinically and epidemiologically important phenotypes of enteric pathogens.
8
Yang KK, Wu Z, Arnold FH. Machine-learning-guided directed evolution for protein engineering. Nat Methods 2019; 16:687-694. [PMID: 31308553 DOI: 10.1038/s41592-019-0496-6] [Citation(s) in RCA: 519] [Impact Index Per Article: 86.5] [Received: 10/25/2018] [Accepted: 06/17/2019] [Indexed: 02/06/2023]
Abstract
Protein engineering through machine-learning-guided directed evolution enables the optimization of protein functions. Machine-learning approaches predict how sequence maps to function in a data-driven manner without requiring a detailed model of the underlying physics or biological pathways. Such methods accelerate directed evolution by learning from the properties of characterized variants and using that information to select sequences that are likely to exhibit improved properties. Here we introduce the steps required to build machine-learning sequence-function models and to use those models to guide engineering, making recommendations at each stage. This review covers basic concepts relevant to the use of machine learning for protein engineering, as well as the current literature and applications of this engineering paradigm. We illustrate the process with two case studies. Finally, we look to future opportunities for machine learning to enable the discovery of unknown protein functions and uncover the relationship between protein sequence and function.
Affiliation(s)
- Kevin K Yang
- Division of Chemistry and Chemical Engineering, California Institute of Technology, Pasadena, CA, USA
- Zachary Wu
- Division of Chemistry and Chemical Engineering, California Institute of Technology, Pasadena, CA, USA
- Frances H Arnold
- Division of Chemistry and Chemical Engineering, California Institute of Technology, Pasadena, CA, USA
9
Pedro AQ, Queiroz JA, Passarinha LA. Smoothing membrane protein structure determination by initial upstream stage improvements. Appl Microbiol Biotechnol 2019; 103:5483-5500. [PMID: 31127356 PMCID: PMC7079970 DOI: 10.1007/s00253-019-09873-1] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.8] [Received: 01/15/2019] [Revised: 04/25/2019] [Accepted: 04/26/2019] [Indexed: 12/14/2022]
Abstract
Membrane proteins (MP) constitute 20–30% of all proteins encoded by the genomes of various organisms and perform a wide range of essential biological functions. However, although they represent the largest class of protein drug targets, relatively few high-resolution 3D structures have been obtained so far. Membrane protein biogenesis is more complex than that of soluble proteins, and recombinant biosynthesis has been a major obstacle, delaying their structural characterization. Indeed, the major limitation in structure determination of MP is the low yield achieved in recombinant expression, usually coupled with low functionality, which pinpoints the optimization target in recombinant MP research. Recently, the growing attention that has been dedicated to the upstream stage of MP bioprocesses has allowed great advances, increasing the number of solved MP structures. In this review, we analyse and discuss effective solutions and technical advances at the upstream stage using prokaryotic and eukaryotic organisms, with a view to increasing expression yields of correctly folded MP and facilitating the determination of their three-dimensional structure. A section on techniques used for protein quality control and further structure determination of MP is also included. Lastly, a critical assessment of the major factors contributing to a good decision-making process related to the upstream stage of MP is presented.
Affiliation(s)
- Augusto Quaresma Pedro
- CICS-UBI - Centro de Investigação em Ciências da Saúde, Universidade da Beira Interior, 6201-001, Covilhã, Portugal
- CICECO - Aveiro Institute of Materials, Department of Chemistry, Universidade de Aveiro, 3810-193, Aveiro, Portugal
- João António Queiroz
- CICS-UBI - Centro de Investigação em Ciências da Saúde, Universidade da Beira Interior, 6201-001, Covilhã, Portugal
- Luís António Passarinha
- CICS-UBI - Centro de Investigação em Ciências da Saúde, Universidade da Beira Interior, 6201-001, Covilhã, Portugal
- UCIBIO@REQUIMTE, Departamento de Química, Faculdade de Ciências e Tecnologia, Universidade Nova de Lisboa, 2829-516, Caparica, Portugal
10
Varga JK, Tusnády GE. TMCrys: predict propensity of success for transmembrane protein crystallization. Bioinformatics 2018; 34:3126-3130. [PMID: 29718100 PMCID: PMC6137969 DOI: 10.1093/bioinformatics/bty342] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.1] [Received: 10/13/2017] [Revised: 03/10/2018] [Accepted: 04/25/2018] [Indexed: 11/30/2022]
Abstract
Motivation: Transmembrane proteins (TMPs) are crucial in the life of cells. Because of their special properties, their structures are hard to determine: TMPs make up only 2% of the PDB, even though they are predicted to constitute up to 25% of the human proteome. Crystallization prediction methods were developed to aid target selection for structure determination; however, there is a need for a TMP-specific service. Results: Here, we present TMCrys, a crystallization prediction method that surpasses existing prediction methods in performance thanks to its specialization for TMPs. We expect TMCrys to improve target selection of TMPs. Availability and implementation: https://github.com/brgenzim/tmcrys. Supplementary information: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Julia K Varga
- ‘Momentum’ Membrane Protein Bioinformatics Research Group, Institute of Enzymology, Research Center of Natural Sciences, Hungarian Academy of Sciences, Budapest, Hungary
- Gábor E Tusnády
- ‘Momentum’ Membrane Protein Bioinformatics Research Group, Institute of Enzymology, Research Center of Natural Sciences, Hungarian Academy of Sciences, Budapest, Hungary
11
12
Yang KK, Wu Z, Bedbrook CN, Arnold FH. Learned protein embeddings for machine learning. Bioinformatics 2018; 34:2642-2648. [PMID: 29584811 PMCID: PMC6061698 DOI: 10.1093/bioinformatics/bty178] [Citation(s) in RCA: 152] [Impact Index Per Article: 21.7] [Received: 11/30/2017] [Revised: 03/20/2018] [Accepted: 03/22/2018] [Indexed: 12/26/2022]
Abstract
Motivation: Machine-learning models trained on protein sequences and their measured functions can infer biological properties of unseen sequences without requiring an understanding of the underlying physical or biological mechanisms. Such models enable the prediction and discovery of sequences with optimal properties. Machine-learning models generally require that their inputs be vectors, and the conversion from a protein sequence to a vector representation affects the model's ability to learn. We propose to learn embedded representations of protein sequences that take advantage of the vast quantity of unmeasured protein sequence data available. These embeddings are low-dimensional and can greatly simplify downstream modeling. Results: The predictive power of Gaussian process models trained using embeddings is comparable to that of models trained on existing representations, which suggests that embeddings enable accurate predictions despite having orders of magnitude fewer dimensions. Moreover, embeddings are simpler to obtain because they do not require alignments, structural data, or selection of informative amino-acid properties. Visualizing the embedding vectors shows that meaningful relationships between the embedded proteins are captured. Availability and implementation: The embedding vectors and code to reproduce the results are available at https://github.com/fhalab/embeddings_reproduction/. Supplementary information: Supplementary data are available at Bioinformatics online.
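To make the sequence-to-vector step mentioned in this abstract concrete, the sketch below shows one-hot encoding, a common baseline representation. It is not the learned embedding the authors propose, and it is not taken from their repository. The names `AMINO_ACIDS` and `one_hot` and the three-residue example sequence are hypothetical.

```python
# The 20 standard amino acids in a fixed (alphabetical) order.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def one_hot(sequence):
    """Encode a protein sequence as a flat 0/1 vector of length
    len(sequence) * 20, with exactly one 1 per residue position."""
    vector = [0] * (len(sequence) * len(AMINO_ACIDS))
    for position, residue in enumerate(sequence.upper()):
        if residue not in INDEX:
            raise ValueError(f"unsupported residue {residue!r} at position {position}")
        vector[position * len(AMINO_ACIDS) + INDEX[residue]] = 1
    return vector


# Hypothetical three-residue example.
encoded = one_hot("MKV")
print(len(encoded), sum(encoded))
```

The vector length grows with sequence length (20 entries per residue), which illustrates why a low-dimensional learned embedding can simplify downstream modeling.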
Affiliation(s)
- Kevin K Yang
- Division of Chemistry and Chemical Engineering, California Institute of Technology, Pasadena, CA, USA
- Zachary Wu
- Division of Chemistry and Chemical Engineering, California Institute of Technology, Pasadena, CA, USA
- Claire N Bedbrook
- Division of Biology and Biological Engineering, California Institute of Technology, Pasadena, CA, USA
- Frances H Arnold
- Division of Chemistry and Chemical Engineering, California Institute of Technology, Pasadena, CA, USA
- Division of Biology and Biological Engineering, California Institute of Technology, Pasadena, CA, USA