1
Ding W, Nakai K, Gong H. Protein design via deep learning. Brief Bioinform 2022; 23:bbac102. [PMID: 35348602 PMCID: PMC9116377 DOI: 10.1093/bib/bbac102] [Citation(s) in RCA: 34] [Impact Index Per Article: 11.3] [Received: 12/16/2021] [Revised: 02/26/2022] [Accepted: 03/01/2022] [Indexed: 12/11/2022] Open
Abstract
Proteins with desired functions and properties are important in fields like nanotechnology and biomedicine. De novo protein design enables the production of previously unseen proteins from the ground up and is believed to be key to addressing real societal challenges. The recent introduction of deep learning into design methods has had a transformative influence and is expected to represent a promising and exciting future direction. In this review, we survey the major aspects of current advances in deep-learning-based design procedures and illustrate their novelty in comparison with conventional knowledge-based approaches through notable cases. We not only describe deep learning developments in structure-based protein design and direct sequence design, but also highlight recent applications of deep reinforcement learning in protein design. The future perspectives on design goals, challenges and opportunities are also comprehensively discussed.
Affiliation(s)
- Wenze Ding
  - School of Artificial Intelligence, Nanjing University of Information Science and Technology, Nanjing 210044, China
  - School of Future Technology, Nanjing University of Information Science and Technology, Nanjing 210044, China
  - MOE Key Laboratory of Bioinformatics, School of Life Sciences, Tsinghua University, Beijing 100084, China
  - Beijing Advanced Innovation Center for Structural Biology, Tsinghua University, Beijing 100084, China
- Kenta Nakai
  - Institute of Medical Science, the University of Tokyo, Tokyo 1088639, Japan
- Haipeng Gong
  - MOE Key Laboratory of Bioinformatics, School of Life Sciences, Tsinghua University, Beijing 100084, China
  - Beijing Advanced Innovation Center for Structural Biology, Tsinghua University, Beijing 100084, China

2
Boral A, Khamaru M, Mitra D. Designing synthetic transcription factors: A structural perspective. Advances in Protein Chemistry and Structural Biology 2022; 130:245-287. [PMID: 35534109 DOI: 10.1016/bs.apcsb.2021.12.003] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/14/2023]
Abstract
In this chapter, we discuss different design strategies of synthetic proteins, especially synthetic transcription factors. Design and engineering of synthetic transcription factors is particularly relevant for the need-based manipulation of gene expression. With recent advances in structural biology techniques and with the emergence of other precision biochemical/physical tools, accurate knowledge of structure-function relations is increasingly becoming available. Besides discussing the underlying principles of design, we go through individual cases, especially those involving four major groups of transcription factors: basic leucine zippers, zinc fingers, helix-turn-helix and homeodomains. We further discuss how synthetic biology can come together with structural biology to alter the genetic blueprint of life.
Affiliation(s)
- Aparna Boral
  - Department of Life Sciences, Presidency University, Kolkata, West Bengal, India
- Madhurima Khamaru
  - Department of Life Sciences, Presidency University, Kolkata, West Bengal, India
- Devrani Mitra
  - Department of Life Sciences, Presidency University, Kolkata, West Bengal, India

3
Pan X, Kortemme T. Recent advances in de novo protein design: Principles, methods, and applications. J Biol Chem 2021; 296:100558. [PMID: 33744284 PMCID: PMC8065224 DOI: 10.1016/j.jbc.2021.100558] [Citation(s) in RCA: 117] [Impact Index Per Article: 29.3] [Received: 01/13/2021] [Revised: 03/12/2021] [Accepted: 03/16/2021] [Indexed: 02/06/2023] Open
Abstract
Computational de novo protein design is increasingly applied to address a number of key challenges in biomedicine and biological engineering. Successes in expanding applications are driven by advances in design principles and methods over several decades. Here, we review recent innovations in major aspects of de novo protein design and describe how these advances were informed by principles of protein architecture and interactions derived from the wealth of structures in the Protein Data Bank. We describe developments in de novo generation of designable backbone structures, optimization of sequences, design scoring functions, and the design of function. The advances not only highlight design goals reachable now but also point to the challenges and opportunities for the future of the field.
Affiliation(s)
- Xingjie Pan
  - Department of Bioengineering and Therapeutic Sciences, University of California San Francisco, San Francisco, California, USA
  - UC Berkeley - UCSF Graduate Program in Bioengineering, University of California San Francisco, San Francisco, California, USA
- Tanja Kortemme
  - Department of Bioengineering and Therapeutic Sciences, University of California San Francisco, San Francisco, California, USA
  - UC Berkeley - UCSF Graduate Program in Bioengineering, University of California San Francisco, San Francisco, California, USA
  - Quantitative Biosciences Institute (QBI), University of California San Francisco, San Francisco, California, USA

4
Maguire JB, Haddox HK, Strickland D, Halabiya SF, Coventry B, Griffin JR, Pulavarti SVSRK, Cummins M, Thieker DF, Klavins E, Szyperski T, DiMaio F, Baker D, Kuhlman B. Perturbing the energy landscape for improved packing during computational protein design. Proteins 2020; 89:436-449. [PMID: 33249652 DOI: 10.1002/prot.26030] [Citation(s) in RCA: 101] [Impact Index Per Article: 20.2] [Received: 05/18/2020] [Revised: 10/04/2020] [Accepted: 11/21/2020] [Indexed: 01/04/2023]
Abstract
The FastDesign protocol in the molecular modeling program Rosetta iterates between sequence optimization and structure refinement to stabilize de novo designed protein structures and complexes. FastDesign has been used previously to design novel protein folds and assemblies with important applications in research and medicine. To promote sampling of alternative conformations and sequences, FastDesign includes stages where the energy landscape is smoothened by reducing repulsive forces. Here, we discover that this process disfavors larger amino acids in the protein core because the protein compresses in the early stages of refinement. By testing alternative ramping strategies for the repulsive weight, we arrive at a scheme that produces lower energy designs with more native-like sequence composition in the protein core. We further validate the protocol by designing and experimentally characterizing over 4000 proteins and show that the new protocol produces higher stability proteins.
Affiliation(s)
- Jack B Maguire
  - Program in Bioinformatics and Computational Biology, University of North Carolina at Chapel Hill, Chapel Hill, North Carolina, USA
- Hugh K Haddox
  - Department of Biochemistry, University of Washington, Seattle, Washington, USA
  - Institute for Protein Design, University of Washington, Seattle, Washington, USA
- Devin Strickland
  - Department of Electrical and Computer Engineering, University of Washington, Seattle, Washington, USA
- Samer F Halabiya
  - Department of Electrical and Computer Engineering, University of Washington, Seattle, Washington, USA
- Brian Coventry
  - Institute for Protein Design, University of Washington, Seattle, Washington, USA
  - Molecular Engineering PhD Program, University of Washington, Seattle, Washington, USA
- Jermel R Griffin
  - Department of Chemistry, State University of New York at Buffalo, Buffalo, New York, USA
- Matthew Cummins
  - Department of Pharmacology, University of North Carolina at Chapel Hill, Chapel Hill, North Carolina, USA
- David F Thieker
  - Department of Biochemistry and Biophysics, University of North Carolina at Chapel Hill, Chapel Hill, North Carolina, USA
  - Lineberger Comprehensive Cancer Center, University of North Carolina at Chapel Hill, Chapel Hill, North Carolina, USA
- Eric Klavins
  - Department of Electrical and Computer Engineering, University of Washington, Seattle, Washington, USA
- Thomas Szyperski
  - Department of Chemistry, State University of New York at Buffalo, Buffalo, New York, USA
- Frank DiMaio
  - Department of Biochemistry, University of Washington, Seattle, Washington, USA
  - Institute for Protein Design, University of Washington, Seattle, Washington, USA
- David Baker
  - Department of Biochemistry, University of Washington, Seattle, Washington, USA
  - Institute for Protein Design, University of Washington, Seattle, Washington, USA
  - Howard Hughes Medical Institute, University of Washington, Seattle, Washington, USA
- Brian Kuhlman
  - Department of Biochemistry and Biophysics, University of North Carolina at Chapel Hill, Chapel Hill, North Carolina, USA
  - Lineberger Comprehensive Cancer Center, University of North Carolina at Chapel Hill, Chapel Hill, North Carolina, USA

5
Lowegard AU, Frenkel MS, Holt GT, Jou JD, Ojewole AA, Donald BR. Novel, provable algorithms for efficient ensemble-based computational protein design and their application to the redesign of the c-Raf-RBD:KRas protein-protein interface. PLoS Comput Biol 2020; 16:e1007447. [PMID: 32511232 PMCID: PMC7329130 DOI: 10.1371/journal.pcbi.1007447] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.0] [Received: 09/26/2019] [Revised: 07/01/2020] [Accepted: 05/13/2020] [Indexed: 11/25/2022] Open
Abstract
The K* algorithm provably approximates partition functions for a set of states (e.g., protein, ligand, and protein-ligand complex) to a user-specified accuracy ε. Often, reaching an ε-approximation for a particular set of partition functions takes a prohibitive amount of time and space. To alleviate some of this cost, we introduce two new algorithms into the OSPREY suite for protein design: FRIES, a Fast Removal of Inadequately Energied Sequences, and EWAK*, an Energy Window Approximation to K*. FRIES pre-processes the sequence space to limit a design to only the most stable, energetically favorable sequence possibilities. EWAK* then takes this pruned sequence space as input and, using a user-specified energy window, calculates K* scores using the lowest energy conformations. We expect FRIES/EWAK* to be most useful in cases where there are many unstable sequences in the design sequence space and when users are satisfied with enumerating the low-energy ensemble of conformations. In combination, these algorithms provably retain calculational accuracy while limiting the input sequence space and the conformations included in each partition function calculation to only the most energetically favorable, effectively reducing runtime while still enriching for desirable sequences. This combined approach led to significant speed-ups compared to the previous state-of-the-art multi-sequence algorithm, BBK*, while maintaining its efficiency and accuracy, which we show across 40 different protein systems and a total of 2,826 protein design problems. Additionally, as a proof of concept, we used these new algorithms to redesign the protein-protein interface (PPI) of the c-Raf-RBD:KRas complex. The Ras-binding domain of the protein kinase c-Raf (c-Raf-RBD) is the tightest known binder of KRas, a protein implicated in difficult-to-treat cancers. FRIES/EWAK* accurately retrospectively predicted the effect of 41 different sets of mutations in the PPI of the c-Raf-RBD:KRas complex. Notably, these mutations include mutations whose effect had previously been incorrectly predicted using other computational methods. Next, we used FRIES/EWAK* for prospective design and discovered a novel point mutation that improves binding of c-Raf-RBD to KRas in its active, GTP-bound state (KRasGTP). We combined this new mutation with two previously reported mutations (which were highly-ranked by OSPREY) to create a new variant of c-Raf-RBD, c-Raf-RBD(RKY). FRIES/EWAK* in OSPREY computationally predicted that this new variant binds even more tightly than the previous best-binding variant, c-Raf-RBD(RK). We measured the binding affinity of c-Raf-RBD(RKY) using a bio-layer interferometry (BLI) assay, and found that this new variant exhibits single-digit nanomolar affinity for KRasGTP, confirming the computational predictions made with FRIES/EWAK*. This new variant binds roughly five times more tightly than the previous best known binder and roughly 36 times more tightly than the design starting point (wild-type c-Raf-RBD). This study steps through the advancement and development of computational protein design by presenting theory, new algorithms, accurate retrospective designs, new prospective designs, and biochemical validation.

Computational structure-based protein design is an innovative tool for redesigning proteins to introduce a particular or novel function. One such function is improving the binding of one protein to another, which can increase our understanding of important protein systems. Herein we introduce two novel, provable algorithms, FRIES and EWAK*, for more efficient computational structure-based protein design as well as their application to the redesign of the c-Raf-RBD:KRas protein-protein interface. These new algorithms speed up computational structure-based protein design while maintaining accurate calculations, allowing for larger, previously infeasible protein designs. Additionally, using FRIES and EWAK* within the OSPREY suite, we designed the tightest known binder of KRas, a heavily studied cancer target that interacts with a number of different proteins. This previously undiscovered variant of a KRas-binding domain, c-Raf-RBD, has potential to serve as a tool to further probe the protein-protein interface of KRas with its effectors and its discovery alone emphasizes the potential for more successful applications of computational structure-based protein design.
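The idea of a K* score computed from partition functions over an energy window can be illustrated with a small sketch. This is not the OSPREY implementation and carries none of its provable guarantees; it assumes the common textbook form of the score (the complex partition function divided by the product of the unbound protein and ligand partition functions), and the function names, the `window` parameter, and the energy units (kcal/mol at about 298 K) are all hypothetical choices made for illustration.

```python
import math

# Assumption: energies are in kcal/mol and the temperature is ~298 K.
R_KCAL = 1.987e-3          # gas constant, kcal/(mol*K)
RT = R_KCAL * 298.15       # thermal energy, kcal/mol


def partition_function(energies, window=None):
    """Boltzmann-weighted sum over a list of conformation energies.

    If `window` is given, only conformations within `window` kcal/mol of
    the lowest-energy conformation are included (a crude stand-in for an
    energy-window approximation).
    """
    if window is not None:
        e_min = min(energies)
        energies = [e for e in energies if e - e_min <= window]
    return sum(math.exp(-e / RT) for e in energies)


def kstar_score(complex_e, protein_e, ligand_e, window=None):
    """Ratio of bound to unbound partition functions (hypothetical helper)."""
    z_complex = partition_function(complex_e, window)
    z_protein = partition_function(protein_e, window)
    z_ligand = partition_function(ligand_e, window)
    return z_complex / (z_protein * z_ligand)


if __name__ == "__main__":
    # Made-up energies, purely to show the call pattern.
    score = kstar_score([-10.0, -9.5, -2.0], [-4.0, -3.8], [-3.0], window=1.0)
    print(f"log10 K* score: {math.log10(score):.2f}")
```

A higher score in this sketch corresponds to a predicted tighter binder; narrowing `window` trades accuracy for fewer conformations in each sum.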
Affiliation(s)
- Anna U. Lowegard
  - Program in Computational Biology and Bioinformatics, Duke University Medical Center, Durham, North Carolina, United States of America
  - Department of Computer Science, Duke University, Durham, North Carolina, United States of America
- Marcel S. Frenkel
  - Department of Biochemistry, Duke University Medical Center, Durham, North Carolina, United States of America
- Graham T. Holt
  - Program in Computational Biology and Bioinformatics, Duke University Medical Center, Durham, North Carolina, United States of America
  - Department of Computer Science, Duke University, Durham, North Carolina, United States of America
- Jonathan D. Jou
  - Department of Computer Science, Duke University, Durham, North Carolina, United States of America
- Adegoke A. Ojewole
  - Program in Computational Biology and Bioinformatics, Duke University Medical Center, Durham, North Carolina, United States of America
  - Department of Computer Science, Duke University, Durham, North Carolina, United States of America
- Bruce R. Donald
  - Department of Computer Science, Duke University, Durham, North Carolina, United States of America
  - Department of Biochemistry, Duke University Medical Center, Durham, North Carolina, United States of America

6
Jou JD, Holt GT, Lowegard AU, Donald BR. Minimization-Aware Recursive K*: A Novel, Provable Algorithm that Accelerates Ensemble-Based Protein Design and Provably Approximates the Energy Landscape. J Comput Biol 2019; 27:550-564. [PMID: 31855059 DOI: 10.1089/cmb.2019.0315] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.7] [Indexed: 01/09/2023] Open
Abstract
Protein design algorithms that model continuous sidechain flexibility and conformational ensembles better approximate the in vitro and in vivo behavior of proteins. The previous state of the art, iMinDEE-A*-K*, computes provable ɛ-approximations to partition functions of protein states (e.g., bound vs. unbound) by computing provable, admissible pairwise-minimized energy lower bounds on protein conformations, and using the A* enumeration algorithm to return a gap-free list of lowest-energy conformations. iMinDEE-A*-K* runs in time sublinear in the number of conformations, but can be trapped in loosely-bounded, low-energy conformational wells containing many conformations with highly similar energies. That is, iMinDEE-A*-K* is unable to exploit the correlation between protein conformation and energy: similar conformations often have similar energy. We introduce two new concepts that exploit this correlation: Minimization-Aware Enumeration and Recursive K*. We combine these two insights into a novel algorithm, Minimization-Aware Recursive K* (MARK*), which tightens bounds not on single conformations, but instead on distinct regions of the conformation space. We compare the performance of iMinDEE-A*-K* versus MARK* by running the Branch and Bound over K* (BBK*) algorithm, which provably returns sequences in order of decreasing K* score, using either iMinDEE-A*-K* or MARK* to approximate partition functions. We show on 200 design problems that MARK* not only enumerates and minimizes vastly fewer conformations than the previous state of the art, but also runs up to 2 orders of magnitude faster. Finally, we show that MARK* not only efficiently approximates the partition function, but also provably approximates the energy landscape. To our knowledge, MARK* is the first algorithm to do so. 
We use MARK* to analyze the change in energy landscape of the bound and unbound states of an HIV-1 capsid protein C-terminal domain in complex with a camelid VHH, and measure the change in conformational entropy induced by binding. Thus, MARK* both accelerates existing designs and offers new capabilities not possible with previous algorithms.
Affiliation(s)
- Jonathan D Jou
  - Department of Computer Science, Duke University, Durham, North Carolina
- Graham T Holt
  - Department of Computer Science, Duke University, Durham, North Carolina
  - Computational Biology and Bioinformatics Program, Duke University, Durham, North Carolina
- Anna U Lowegard
  - Department of Computer Science, Duke University, Durham, North Carolina
  - Computational Biology and Bioinformatics Program, Duke University, Durham, North Carolina
- Bruce R Donald
  - Department of Computer Science, Duke University, Durham, North Carolina
  - Department of Biochemistry, Duke University Medical Center, Durham, North Carolina
  - Department of Chemistry, Duke University, Durham, North Carolina

7
Xu G, Ma T, Du J, Wang Q, Ma J. OPUS-Rota2: An Improved Fast and Accurate Side-Chain Modeling Method. J Chem Theory Comput 2019; 15:5154-5160. [PMID: 31412199 DOI: 10.1021/acs.jctc.9b00309] [Citation(s) in RCA: 14] [Impact Index Per Article: 2.3] [Indexed: 12/20/2022]
Abstract
Side-chain modeling plays a critical role in protein structure prediction. However, in many current methods, balancing the speed and accuracy is still challenging. In this paper, on the basis of our previous work OPUS-Rota (Protein Sci. 2008, 17, 1576-1585), we introduce a new side-chain modeling method, OPUS-Rota2, which is tested on both a 65-protein test set (DB65) in the OPUS-Rota paper and a 379-protein test set (DB379) in the SCWRL4 paper. If the main chain is native, OPUS-Rota2 is more accurate than OPUS-Rota, SCWRL4, and OSCAR-star but slightly less accurate than OSCAR-o. Also, if the main chain is non-native, OPUS-Rota2 is more accurate than any other method. Moreover, OPUS-Rota2 is significantly faster than any other method, in particular, 2 orders of magnitude faster than OSCAR-o. Thus, the combination of higher accuracy and speed of OPUS-Rota2 in modeling side chains on both the native and non-native main chains makes OPUS-Rota2 a very useful tool in protein structure modeling.
Affiliation(s)
- Gang Xu
  - Multiscale Research Institute of Complex Systems, Fudan University, Shanghai 200433, China
  - School of Life Sciences, Tsinghua University, Beijing 100084, China
- Junqing Du
  - Verna and Marrs Mclean Department of Biochemistry and Molecular Biology, Baylor College of Medicine, One Baylor Plaza, BCM-125, Houston, Texas 77030, United States
- Qinghua Wang
  - Verna and Marrs Mclean Department of Biochemistry and Molecular Biology, Baylor College of Medicine, One Baylor Plaza, BCM-125, Houston, Texas 77030, United States
- Jianpeng Ma
  - Multiscale Research Institute of Complex Systems, Fudan University, Shanghai 200433, China
  - School of Life Sciences, Tsinghua University, Beijing 100084, China
  - Verna and Marrs Mclean Department of Biochemistry and Molecular Biology, Baylor College of Medicine, One Baylor Plaza, BCM-125, Houston, Texas 77030, United States
  - School of Life Sciences, Fudan University, Shanghai 200433, China

8
Abstract
Motivation: Multistate protein design addresses real-world challenges, such as multi-specificity design and backbone flexibility, by considering both positive and negative protein states with an ensemble of substates for each. It also presents an enormous challenge to exact algorithms that guarantee the optimal solutions and enable a direct test of mechanistic hypotheses behind models. However, efficient exact algorithms are lacking for multistate protein design.
Results: We have developed an efficient exact algorithm called interconnected cost function networks (iCFN) for multistate protein design. Its generic formulation allows for a wide array of applications such as stability, affinity and specificity designs while addressing concerns such as global flexibility of protein backbones. iCFN treats each substate design as a weighted constraint satisfaction problem (WCSP) modeled through a CFN; and it solves the coupled WCSPs using novel bounds and a depth-first branch-and-bound search over a tree structure of sequences, substates, and conformations. When iCFN is applied to specificity design of a T-cell receptor, a problem of unprecedented size to exact methods, it drastically reduces search space and running time to make the problem tractable. Moreover, iCFN generates experimentally-agreeing receptor designs with improved accuracy compared with state-of-the-art methods, highlights the importance of modeling backbone flexibility in protein design, and reveals molecular mechanisms underlying binding specificity.
Availability and implementation: https://shen-lab.github.io/software/iCFN.
Supplementary information: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Mostafa Karimi
  - Department of Electrical and Computer Engineering and TEES-AgriLife Center for Bioinformatics and Genomic Systems Engineering, Texas A&M University, College Station, USA
- Yang Shen
  - Department of Electrical and Computer Engineering and TEES-AgriLife Center for Bioinformatics and Genomic Systems Engineering, Texas A&M University, College Station, USA

9
Hallen MA, Donald BR. CATS (Coordinates of Atoms by Taylor Series): protein design with backbone flexibility in all locally feasible directions. Bioinformatics 2018; 33:i5-i12. [PMID: 28882005 PMCID: PMC5870559 DOI: 10.1093/bioinformatics/btx277] [Citation(s) in RCA: 15] [Impact Index Per Article: 2.1] [Indexed: 11/12/2022] Open
Abstract
Motivation: When proteins mutate or bind to ligands, their backbones often move significantly, especially in loop regions. Computational protein design algorithms must model these motions in order to accurately optimize protein stability and binding affinity. However, methods for backbone conformational search in design have been much more limited than for sidechain conformational search. This is especially true for combinatorial protein design algorithms, which aim to search a large sequence space efficiently and thus cannot rely on temporal simulation of each candidate sequence.
Results: We alleviate this difficulty with a new parameterization of backbone conformational space, which represents all degrees of freedom of a specified segment of protein chain that maintain valid bonding geometry (by maintaining the original bond lengths and angles and ω dihedrals). In order to search this space, we present an efficient algorithm, CATS, for computing atomic coordinates as a function of our new continuous backbone internal coordinates. CATS generalizes the iMinDEE and EPIC protein design algorithms, which model continuous flexibility in sidechain dihedrals, to model continuous, appropriately localized flexibility in the backbone dihedrals ϕ and ψ as well. We show using 81 test cases based on 29 different protein structures that CATS finds sequences and conformations that are significantly lower in energy than methods with less or no backbone flexibility do. In particular, we show that CATS can model the viability of an antibody mutation known experimentally to increase affinity, but that appears sterically infeasible when modeled with less or no backbone flexibility.
Availability and implementation: Our code is available as free software at https://github.com/donaldlab/OSPREY_refactor.
Supplementary information: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Mark A Hallen
  - Department of Computer Science, Duke University, Durham, NC, USA
  - Toyota Technological Institute at Chicago, Chicago, IL, USA
- Bruce R Donald
  - Department of Computer Science, Duke University, Durham, NC, USA
  - Department of Chemistry, Duke University, Durham, NC, USA
  - Department of Biochemistry, Duke University Medical Center, Durham, NC, USA

10
Ojewole AA, Jou JD, Fowler VG, Donald BR. BBK* (Branch and Bound Over K*): A Provable and Efficient Ensemble-Based Protein Design Algorithm to Optimize Stability and Binding Affinity Over Large Sequence Spaces. J Comput Biol 2018; 25:726-739. [PMID: 29641249 DOI: 10.1089/cmb.2017.0267] [Citation(s) in RCA: 18] [Impact Index Per Article: 2.6] [Indexed: 01/26/2023] Open
Abstract
Computational protein design (CPD) algorithms that compute binding affinity, Ka, search for sequences with an energetically favorable free energy of binding. Recent work shows that three principles improve the biological accuracy of CPD: ensemble-based design, continuous flexibility of backbone and side-chain conformations, and provable guarantees of accuracy with respect to the input. However, previous methods that use all three design principles are single-sequence (SS) algorithms, which are very costly: linear in the number of sequences and thus exponential in the number of simultaneously mutable residues. To address this computational challenge, we introduce BBK*, a new CPD algorithm whose key innovation is the multisequence (MS) bound: BBK* efficiently computes a single provable upper bound to approximate Ka for a combinatorial number of sequences, and avoids SS computation for all provably suboptimal sequences. Thus, to our knowledge, BBK* is the first provable, ensemble-based CPD algorithm to run in time sublinear in the number of sequences. Computational experiments on 204 protein design problems show that BBK* finds the tightest binding sequences while approximating Ka for up to 10^5-fold fewer sequences than the previous state-of-the-art algorithms, which require exhaustive enumeration of sequences. Furthermore, for 51 protein-ligand design problems, BBK* provably approximates Ka up to 1982-fold faster than the previous state-of-the-art iMinDEE/A*/K* algorithm. Therefore, BBK* not only accelerates protein designs that are possible with previous provable algorithms, but also efficiently performs designs that are too large for previous methods.
Affiliation(s)
- Adegoke A Ojewole
  - Department of Computer Science, Duke University, Durham, North Carolina
  - Computational Biology and Bioinformatics Program, Duke University, Durham, North Carolina
- Jonathan D Jou
  - Department of Computer Science, Duke University, Durham, North Carolina
- Vance G Fowler
  - Division of Infectious Diseases, Duke University Medical Center, Durham, North Carolina
- Bruce R Donald
  - Department of Computer Science, Duke University, Durham, North Carolina
  - Department of Biochemistry, Duke University Medical Center, Durham, North Carolina

11
In silico methods for design of biological therapeutics. Methods 2017; 131:33-65. [PMID: 28958951 DOI: 10.1016/j.ymeth.2017.09.008] [Citation(s) in RCA: 41] [Impact Index Per Article: 5.1] [Received: 06/15/2017] [Revised: 09/21/2017] [Accepted: 09/23/2017] [Indexed: 12/18/2022] Open
Abstract
It has been twenty years since the first rationally designed small molecule drug was introduced into the market. Since then, we have progressed from designing small molecules to designing biotherapeutics. This class of therapeutics includes designed proteins, peptides and nucleic acids that could more effectively combat drug resistance and even act in cases where the disease is caused because of a molecular deficiency. Computational methods are crucial in this design exercise and this review discusses the various elements of designing biotherapeutic proteins and peptides. Many of the techniques discussed here, such as the deterministic and stochastic design methods, are generally used in protein design. We have devoted special attention to the design of antibodies and vaccines. In addition to the methods for designing these molecules, we have included a comprehensive list of all biotherapeutics approved for clinical use. Also included is an overview of methods that predict the binding affinity, cell penetration ability, half-life, solubility, immunogenicity and toxicity of the designed therapeutics. Biotherapeutics are only going to grow in clinical importance and are set to herald a new generation of disease management and cure.

12
Xianwei T, Diannan L, Boxiong W. Substrate transport pathway inside outward open conformation of EmrD: a molecular dynamics simulation study. Molecular BioSystems 2017; 12:2634-41. [PMID: 27327574 DOI: 10.1039/c6mb00348f] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Indexed: 11/21/2022]
Abstract
The EmrD transporter, which is a classical major facilitator superfamily (MFS) protein, can extrude a range of drug molecules out of E. coli. How drug molecules are transported through the channel of an MFS protein in the outward-open state is an important issue in research on bacterial drug resistance, but it remains unknown. In this paper, we construct a starting outward-open model of the EmrD transporter using a state transition method. The starting model is refined by a conventional molecular dynamics simulation. Locally enhanced sampling (LES) simulation is used to validate the outward-open model of EmrD. In the LES simulation, ten substrates are placed along the channel of the outward-open EmrD, and these substrates are sampled in the outward-open center cavity. It is found that the translocation pathway of these substrates from the inside to the outside of the cell through the EmrD transporter is composed of two sub-pathways, one including H2, H4, and H5 and the other including H8, H10, and H11. The results give further insight into the ways of substrate translocation of an MFS protein. The modeling method is based on common features of MFS proteins, so it can be used to construct models of various MFS proteins in a desired state of the alternating-access mechanism for which the conformation is not known.
Affiliation(s)
- Tan Xianwei
- School of Life Sciences, Tsinghua University, Beijing, China.
| | - Lu Diannan
- Department of Chemical Engineering, Tsinghua University, Beijing, China
| | - Wang Boxiong
- Department of Precision Instrument, Tsinghua University, Beijing, China
| |
|
13
|
Jain S, Jou JD, Georgiev IS, Donald BR. A critical analysis of computational protein design with sparse residue interaction graphs. PLoS Comput Biol 2017; 13:e1005346. [PMID: 28358804 PMCID: PMC5391103 DOI: 10.1371/journal.pcbi.1005346] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 12/02/2015] [Revised: 04/13/2017] [Accepted: 01/03/2017] [Indexed: 11/19/2022]
Abstract
Protein design algorithms enumerate a combinatorial number of candidate structures to compute the Global Minimum Energy Conformation (GMEC). To efficiently find the GMEC, protein design algorithms must methodically reduce the conformational search space. By applying distance and energy cutoffs, the protein system to be designed can thus be represented using a sparse residue interaction graph, where the number of interacting residue pairs is less than all pairs of mutable residues, and the corresponding GMEC is called the sparse GMEC. However, ignoring some pairwise residue interactions can lead to a change in the energy, conformation, or sequence of the sparse GMEC vs. the original or the full GMEC. Despite the widespread use of sparse residue interaction graphs in protein design, the above mentioned effects of their use have not been previously analyzed. To analyze the costs and benefits of designing with sparse residue interaction graphs, we computed the GMECs for 136 different protein design problems both with and without distance and energy cutoffs, and compared their energies, conformations, and sequences. Our analysis shows that the differences between the GMECs depend critically on whether or not the design includes core, boundary, or surface residues. Moreover, neglecting long-range interactions can alter local interactions and introduce large sequence differences, both of which can result in significant structural and functional changes. Designs on proteins with experimentally measured thermostability show it is beneficial to compute both the full and the sparse GMEC accurately and efficiently. To this end, we show that a provable, ensemble-based algorithm can efficiently compute both GMECs by enumerating a small number of conformations, usually fewer than 1000. 
This provides a novel way to combine sparse residue interaction graphs with provable, ensemble-based algorithms to reap the benefits of sparse residue interaction graphs while avoiding their potential inaccuracies. Computational structure-based protein design algorithms have successfully redesigned proteins to fold and bind target substrates in vitro, and even in vivo. Because the complexity of a computational design increases dramatically with the number of mutable residues, many design algorithms employ cutoffs (distance or energy) to neglect some pairwise residue interactions, thereby reducing the effective search space and computational cost. However, the energies neglected by such cutoffs can add up, which may have nontrivial effects on the designed sequence and its function. To study the effects of using cutoffs on protein design, we computed the optimal sequence both with and without cutoffs, and showed that neglecting long-range interactions can significantly change the computed conformation and sequence. Designs on proteins with experimentally measured thermostability showed the benefits of computing the optimal sequences (and their conformations), both with and without cutoffs, efficiently and accurately. Therefore, we also showed that a provable, ensemble-based algorithm can efficiently compute the optimal conformation and sequence, both with and without applying cutoffs, by enumerating a small number of conformations, usually fewer than 1000. This provides a novel way to combine cutoffs with provable, ensemble-based algorithms to reap the computational efficiency of cutoffs while avoiding their potential inaccuracies.
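The effect described in this abstract can be illustrated with a minimal sketch. It assumes a toy pairwise energy model with made-up numbers and an energy cutoff only (the paper also uses distance cutoffs); the names `sparse_pairs`, `energy` and `gmec` are hypothetical helpers, not part of any published software. The GMEC is found by brute-force enumeration, which is feasible only for tiny examples.

```python
from itertools import product

# Toy data (made up for illustration): 3 residues, 2 rotamers each.
E_SELF = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
E_PAIR = {
    (0, 1): [[-1.0, 0.0], [0.0, -0.9]],
    (1, 2): [[-0.5, 0.0], [0.0, -0.5]],
    (0, 2): [[0.2, 0.0], [0.0, -0.2]],  # weak pair, dropped by the cutoff
}

def sparse_pairs(e_pair, cutoff):
    """Keep a residue pair only if some rotamer pair exceeds the energy cutoff."""
    return [p for p, m in e_pair.items()
            if max(abs(x) for row in m for x in row) > cutoff]

def energy(conf, e_self, e_pair, pairs):
    """Energy of one conformation, counting only the listed residue pairs."""
    e = sum(e_self[i][r] for i, r in enumerate(conf))
    return e + sum(e_pair[(i, j)][conf[i]][conf[j]] for i, j in pairs)

def gmec(e_self, e_pair, pairs):
    """Brute-force global minimum energy conformation as (energy, conformation)."""
    confs = product(*(range(len(e)) for e in e_self))
    return min((energy(c, e_self, e_pair, pairs), c) for c in confs)

all_pairs = list(E_PAIR)
full = gmec(E_SELF, E_PAIR, all_pairs)
kept = sparse_pairs(E_PAIR, cutoff=0.25)
sparse = gmec(E_SELF, E_PAIR, kept)
# Cost of the approximation: full-model energy of the sparse GMEC
# minus the energy of the full GMEC (never negative).
penalty = energy(sparse[1], E_SELF, E_PAIR, all_pairs) - full[0]
```

In this toy case the neglected pair is individually weak, yet removing it changes which conformation is optimal, which is the kind of discrepancy the abstract reports.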
Affiliation(s)
- Swati Jain
- Computational Biology and Bioinformatics Program, Duke University, Durham, North Carolina, United States of America
- Department of Computer Science, Duke University, Durham, North Carolina, United States of America
- Department of Biochemistry, Duke University Medical Center, Durham, North Carolina, United States of America
| | - Jonathan D. Jou
- Department of Computer Science, Duke University, Durham, North Carolina, United States of America
| | - Ivelin S. Georgiev
- Department of Computer Science, Duke University, Durham, North Carolina, United States of America
| | - Bruce R. Donald
- Department of Computer Science, Duke University, Durham, North Carolina, United States of America
- Department of Biochemistry, Duke University Medical Center, Durham, North Carolina, United States of America
- Department of Chemistry, Duke University, Durham, North Carolina, United States of America
| |
|
14
|
Watkins AM, Bonneau R, Arora PS. Modeling and Design of Peptidomimetics to Modulate Protein-Protein Interactions. Methods Mol Biol 2017; 1561:291-307. [PMID: 28236245 DOI: 10.1007/978-1-4939-6798-8_17] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.5] [Indexed: 12/28/2022]
Abstract
We describe a modular approach to identify and inhibit protein-protein interactions (PPIs) that are mediated by protein secondary and tertiary structures with rationally designed peptidomimetics. Our analysis begins with entries of high-resolution complexes in the Protein Data Bank and utilizes conformational sampling, scoring, and design capabilities of advanced biomolecular modeling software to develop peptidomimetics.
Affiliation(s)
| | - Richard Bonneau
- Department of Biology, Center for Genomics and Systems Biology, New York University, New York, NY, USA
- Computer Science Department, Courant Institute of Mathematical Sciences, New York University, New York, NY, USA
| | - Paramjit S Arora
- Department of Chemistry, New York University, 29 Washington Place, Brown Bldg., Room 360, New York, NY, USA.
| |
|
15
|
Traoré S, Allouche D, André I, Schiex T, Barbe S. Deterministic Search Methods for Computational Protein Design. Methods Mol Biol 2017; 1529:107-123. [PMID: 27914047 DOI: 10.1007/978-1-4939-6637-0_4] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.5] [Indexed: 06/06/2023]
Abstract
One main challenge in Computational Protein Design (CPD) lies in the exploration of the amino-acid sequence space, while considering, to some extent, side chain flexibility. The exorbitant size of the search space calls for the development of efficient exact deterministic search methods enabling identification of low-energy sequence-conformation models, corresponding either to the global minimum energy conformation (GMEC) or an ensemble of guaranteed near-optimal solutions. In contrast to stochastic local search methods that are not guaranteed to find the GMEC, exact deterministic approaches always identify the GMEC and prove its optimality in finite but exponential worst-case time. After a brief overview of these two classes of methods, we discuss the grounds and merits of four deterministic methods that have been applied to solve CPD problems. These approaches are based either on the Dead-End-Elimination theorem combined with the A* algorithm (DEE/A*), on Cost Function Networks algorithms (CFN), on Integer Linear Programming solvers (ILP) or on Markov Random Fields solvers (MRF). The way two of these methods (DEE/A* and CFN) can be used in practice to identify low-energy sequence-conformation models starting from a pairwise decomposed energy matrix is detailed in this review.
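As a rough illustration of the dead-end elimination step on a pairwise decomposed energy matrix, the sketch below applies the Goldstein form of the DEE criterion repeatedly until no rotamer can be pruned. The abstract does not say which criterion the review uses, so this choice is an assumption, as are the function name `goldstein_dee`, the data layout and the toy energies.

```python
# Toy pairwise-decomposed energy matrix (made-up numbers):
# position 0 has 3 rotamers, position 1 has 2.
E_SELF = [[0.0, 2.0, 0.3], [0.0, 1.0]]
E_PAIR = {(0, 1): [[-0.5, 0.1], [0.2, 0.3], [-1.0, 0.4]]}

def goldstein_dee(e_self, e_pair):
    """Return, per position, the set of rotamers that survive Goldstein DEE."""
    n = len(e_self)
    alive = [set(range(len(e))) for e in e_self]

    def pair(i, r, j, s):
        # e_pair stores each residue pair once, keyed (low, high).
        return e_pair[(i, j)][r][s] if i < j else e_pair[(j, i)][s][r]

    changed = True
    while changed:
        changed = False
        for i in range(n):
            for r in sorted(alive[i]):
                for t in sorted(alive[i]):
                    if t == r:
                        continue
                    # r is a dead end if t is better in every surviving context.
                    gap = e_self[i][r] - e_self[i][t] + sum(
                        min(pair(i, r, j, s) - pair(i, t, j, s) for s in alive[j])
                        for j in range(n) if j != i)
                    if gap > 0:
                        alive[i].discard(r)
                        changed = True
                        break
    return alive

survivors = goldstein_dee(E_SELF, E_PAIR)
```

Pruning never removes a rotamer that belongs to the GMEC, so a later exact search such as A* only has to explore the surviving rotamers.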
Affiliation(s)
- Seydou Traoré
- INSA, UPS, INP, Université de Toulouse, 135 Avenue de Rangueil, 31077, Toulouse, France
- Laboratoire d'Ingénierie des Systèmes Biologiques et des Procédés - INSA, INRA, UMR792, 31400, Toulouse, France
- CNRS, UMR5504, 31400, Toulouse, France
| | - David Allouche
- Unité de Mathématiques et Informatique de Toulouse, UR 875, INRA, 31320, Castanet Tolosan, France
| | - Isabelle André
- INSA, UPS, INP, Université de Toulouse, 135 Avenue de Rangueil, 31077, Toulouse, France
- Laboratoire d'Ingénierie des Systèmes Biologiques et des Procédés - INSA, INRA, UMR792, 31400, Toulouse, France
- CNRS, UMR5504, 31400, Toulouse, France
| | - Thomas Schiex
- Unité de Mathématiques et Informatique de Toulouse, UR 875, INRA, 31320, Castanet Tolosan, France
| | - Sophie Barbe
- INSA, UPS, INP, Université de Toulouse, 135 Avenue de Rangueil, 31077, Toulouse, France.
- Laboratoire d'Ingénierie des Systèmes Biologiques et des Procédés - INSA, INRA, UMR792, 31400, Toulouse, France.
- CNRS, UMR5504, 31400, Toulouse, France.
| |
|
16
|
Romero-Rivera A, Garcia-Borràs M, Osuna S. Computational tools for the evaluation of laboratory-engineered biocatalysts. Chem Commun (Camb) 2016; 53:284-297. [PMID: 27812570 PMCID: PMC5310519 DOI: 10.1039/c6cc06055b] [Citation(s) in RCA: 82] [Impact Index Per Article: 9.1] [Received: 07/22/2016] [Accepted: 09/06/2016] [Indexed: 12/18/2022]
Abstract
Biocatalysis is based on the application of natural catalysts for new purposes, for which enzymes were not designed. Although the first examples of biocatalysis were reported more than a century ago, biocatalysis was revolutionized after the discovery of an in vitro version of Darwinian evolution called Directed Evolution (DE). Despite the recent advances in the field, major challenges remain to be addressed. Currently, the best experimental approach consists of creating multiple mutations simultaneously while limiting the choices using statistical methods. Still, tens of thousands of variants need to be tested experimentally, and little information is available on how these mutations lead to enhanced enzyme proficiency. This review aims to provide a brief description of the available computational techniques to unveil the molecular basis of improved catalysis achieved by DE. An overview of the strengths and weaknesses of current computational strategies is explored with some recent representative examples. The understanding of how this powerful technique is able to obtain highly active variants is important for the future development of more robust computational methods to predict amino-acid changes needed for activity.
Affiliation(s)
- Adrian Romero-Rivera
- Institut de Química Computacional i Catàlisi and Departament de Química Universitat de Girona, Campus Montilivi, 17071 Girona, Catalonia, Spain.
| | - Marc Garcia-Borràs
- Department of Chemistry and Biochemistry, University of California, 607 Charles E. Young Drive, Los Angeles, California 90095, USA
| | - Sílvia Osuna
- Institut de Química Computacional i Catàlisi and Departament de Química Universitat de Girona, Campus Montilivi, 17071 Girona, Catalonia, Spain.
| |
|
17
|
Druart K, Bigot J, Audit E, Simonson T. A Hybrid Monte Carlo Scheme for Multibackbone Protein Design. J Chem Theory Comput 2016; 12:6035-6048. [DOI: 10.1021/acs.jctc.6b00421] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.3] [Indexed: 11/30/2022]
Affiliation(s)
- Karen Druart
- Laboratoire de Biochimie (CNRS UMR7654), Ecole Polytechnique, Palaiseau, France
- Maison de la Simulation, CEA, CNRS, Univ. Paris-Sud, UVSQ, Université Paris-Saclay, 91191 Gif-sur-Yvette, France
| | - Julien Bigot
- Maison de la Simulation, CEA, CNRS, Univ. Paris-Sud, UVSQ, Université Paris-Saclay, 91191 Gif-sur-Yvette, France
| | - Edouard Audit
- Maison de la Simulation, CEA, CNRS, Univ. Paris-Sud, UVSQ, Université Paris-Saclay, 91191 Gif-sur-Yvette, France
| | - Thomas Simonson
- Laboratoire de Biochimie (CNRS UMR7654), Ecole Polytechnique, Palaiseau, France
| |
|
18
|
Computational protein design with backbone plasticity. Biochem Soc Trans 2016; 44:1523-1529. [PMID: 27911735 PMCID: PMC5264498 DOI: 10.1042/bst20160155] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.7] [Received: 06/03/2016] [Revised: 08/01/2016] [Accepted: 08/03/2016] [Indexed: 11/17/2022]
Abstract
The computational algorithms used in the design of artificial proteins have become increasingly sophisticated in recent years, producing a series of remarkable successes. The most dramatic of these is the de novo design of artificial enzymes. The majority of these designs have reused naturally occurring protein structures as ‘scaffolds’ onto which novel functionality can be grafted without having to redesign the backbone structure. The incorporation of backbone flexibility into protein design is a much more computationally challenging problem due to the greatly increased search space, but promises to remove the limitations of reusing natural protein scaffolds. In this review, we outline the principles of computational protein design methods and discuss recent efforts to consider backbone plasticity in the design process.
|
19
|
Au L, Green DF. Direct Calculation of Protein Fitness Landscapes through Computational Protein Design. Biophys J 2016; 110:75-84. [PMID: 26745411 DOI: 10.1016/j.bpj.2015.11.029] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.7] [Received: 06/25/2015] [Revised: 11/03/2015] [Accepted: 11/16/2015] [Indexed: 11/24/2022]
Abstract
Naturally selected amino-acid sequences or experimentally derived ones are often the basis for understanding how protein three-dimensional conformation and function are determined by primary structure. Such sequences for a protein family comprise only a small fraction of all possible variants, however, representing the fitness landscape with limited scope. Explicitly sampling and characterizing alternative, unexplored protein sequences would directly identify fundamental reasons for sequence robustness (or variability), and we demonstrate that computational methods offer an efficient mechanism toward this end, on a large scale. The dead-end elimination and A* search algorithms were used here to find all low-energy single mutant variants, and corresponding structures of a G-protein heterotrimer, to measure changes in structural stability and binding interactions to define a protein fitness landscape. We established consistency between the results of these algorithms and known biophysical and evolutionary trends for amino-acid substitutions, and could thus recapitulate known protein side-chain interactions and predict novel ones.
Affiliation(s)
- Loretta Au
- Department of Statistics, The University of Chicago, Chicago, Illinois.
| | - David F Green
- Department of Applied Mathematics and Statistics, Stony Brook University, Stony Brook, New York
| |
|
20
|
Hallen MA, Jou JD, Donald BR. LUTE (Local Unpruned Tuple Expansion): Accurate Continuously Flexible Protein Design with General Energy Functions and Rigid Rotamer-Like Efficiency. J Comput Biol 2016; 24:536-546. [PMID: 27681371 DOI: 10.1089/cmb.2016.0136] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.7] [Indexed: 01/03/2023]
Abstract
Most protein design algorithms search over discrete conformations and an energy function that is residue-pairwise, that is, a sum of terms that depend on the sequence and conformation of at most two residues. Although modeling of continuous flexibility and of non-residue-pairwise energies significantly increases the accuracy of protein design, previous methods to model these phenomena add a significant asymptotic cost to design calculations. We now remove this cost by modeling continuous flexibility and non-residue-pairwise energies in a form suitable for direct input to highly efficient, discrete combinatorial optimization algorithms such as DEE/A* or branch-width minimization. Our novel algorithm performs a local unpruned tuple expansion (LUTE), which can efficiently represent both continuous flexibility and general, possibly nonpairwise energy functions to an arbitrary level of accuracy using a discrete energy matrix. We show using 47 design calculation test cases that LUTE provides a dramatic speedup in both single-state and multistate continuously flexible designs.
Affiliation(s)
- Mark A Hallen
- Department of Computer Science, Levine Science Research Center, Duke University, Durham, North Carolina
| | - Jonathan D Jou
- Department of Computer Science, Levine Science Research Center, Duke University, Durham, North Carolina
| | - Bruce R Donald
- Department of Computer Science, Levine Science Research Center, Duke University, Durham, North Carolina
- Department of Chemistry, Duke University, Durham, North Carolina
- Department of Biochemistry, Duke University Medical Center, Durham, North Carolina
| |
|
21
|
Hallen MA, Gainza P, Donald BR. Compact Representation of Continuous Energy Surfaces for More Efficient Protein Design. J Chem Theory Comput 2016; 11:2292-306. [PMID: 26089744 DOI: 10.1021/ct501031m] [Citation(s) in RCA: 22] [Impact Index Per Article: 2.4] [Indexed: 11/30/2022]
Abstract
In macromolecular design, conformational energies are sensitive to small changes in atom coordinates; thus, modeling the small, continuous motions of atoms around low-energy wells confers a substantial advantage in structural accuracy. However, modeling these motions comes at the cost of a very large number of energy function calls, which form the bottleneck in the design calculations. In this work, we remove this bottleneck by consolidating all conformational energy evaluations into the pre-computation of a local polynomial expansion of the energy about the "ideal" conformation for each low-energy, "rotameric" state of each residue pair. This expansion is called "energy as polynomials in internal coordinates" (EPIC), where the internal coordinates can be side-chain dihedrals, backrub angles, and/or any other continuous degrees of freedom of a macromolecule, and any energy function can be used without adding any asymptotic complexity to the design. We demonstrate that EPIC efficiently represents the energy surface for both molecular-mechanics and quantum-mechanical energy functions, and apply it specifically to protein design for modeling both side chain and backbone degrees of freedom.
|
22
|
Hallen MA, Donald BR. comets (Constrained Optimization of Multistate Energies by Tree Search): A Provable and Efficient Protein Design Algorithm to Optimize Binding Affinity and Specificity with Respect to Sequence. J Comput Biol 2016; 23:311-21. [PMID: 26761641 DOI: 10.1089/cmb.2015.0188] [Citation(s) in RCA: 27] [Impact Index Per Article: 3.0] [Indexed: 11/12/2022]
Abstract
Practical protein design problems require designing sequences with a combination of affinity, stability, and specificity requirements. Multistate protein design algorithms model multiple structural or binding "states" of a protein to address these requirements. comets provides a new level of versatile, efficient, and provable multistate design. It provably returns the minimum with respect to sequence of any desired linear combination of the energies of multiple protein states, subject to constraints on other linear combinations. Thus, it can target nearly any combination of affinity (to one or multiple ligands), specificity, and stability (for multiple states if needed). Empirical calculations on 52 protein design problems showed comets is far more efficient than the previous state of the art for provable multistate design (exhaustive search over sequences). comets can handle a very wide range of protein flexibility and can enumerate a gap-free list of the best constraint-satisfying sequences in order of objective function value.
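The optimization problem stated in this abstract (minimize one linear combination of state energies over sequences, subject to constraints on other linear combinations) can be written down in a few lines. The sketch below is the exhaustive search over sequences that the abstract names as the previous baseline, not the comets tree search itself. The sequences, state names and energies are made up, and `best_sequences` is a hypothetical helper.

```python
# Hypothetical per-sequence energies of two states (made-up numbers).
STATE_ENERGY = {
    "AV": {"bound": -12.0, "unbound": -7.0},
    "LV": {"bound": -14.0, "unbound": -5.0},
    "LI": {"bound": -13.5, "unbound": -6.5},
    "AI": {"bound": -10.0, "unbound": -8.0},
}

def linear(coeffs, energies):
    """Linear combination of state energies, e.g. bound minus unbound."""
    return sum(c * energies[state] for state, c in coeffs.items())

def best_sequences(state_energy, objective, constraints):
    """Rank constraint-satisfying sequences by objective value (lowest first)."""
    ranked = []
    for seq, energies in state_energy.items():
        if all(linear(c, energies) <= bound for c, bound in constraints):
            ranked.append((linear(objective, energies), seq))
    return sorted(ranked)

# Objective: binding energy. Constraint: unbound state at least as stable as -6.0.
ranking = best_sequences(
    STATE_ENERGY,
    objective={"bound": 1.0, "unbound": -1.0},
    constraints=[({"unbound": 1.0}, -6.0)],
)
```

Here "LV" has the best binding energy but violates the stability constraint, so it is excluded from the ranking. A provable multistate algorithm would return the same ordered list without evaluating every sequence.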
Affiliation(s)
- Mark A Hallen
- Department of Computer Science, Levine Science Research Center, Duke University, North Carolina
- Department of Biochemistry, Duke University Medical Center, Durham, North Carolina
| | - Bruce R Donald
- Department of Computer Science, Levine Science Research Center, Duke University, North Carolina
- Department of Biochemistry, Duke University Medical Center, Durham, North Carolina
- Department of Chemistry, Duke University, Durham, North Carolina
| |
|
23
|
Jou JD, Jain S, Georgiev IS, Donald BR. BWM*: A Novel, Provable, Ensemble-based Dynamic Programming Algorithm for Sparse Approximations of Computational Protein Design. J Comput Biol 2016; 23:413-24. [PMID: 26744898 DOI: 10.1089/cmb.2015.0194] [Citation(s) in RCA: 20] [Impact Index Per Article: 2.2] [Indexed: 11/13/2022]
Abstract
Sparse energy functions that ignore long range interactions between residue pairs are frequently used by protein design algorithms to reduce computational cost. Current dynamic programming algorithms that fully exploit the optimal substructure produced by these energy functions only compute the GMEC. This disproportionately favors the sequence of a single, static conformation and overlooks better binding sequences with multiple low-energy conformations. Provable, ensemble-based algorithms such as A* avoid this problem, but A* cannot guarantee better performance than exhaustive enumeration. We propose a novel, provable, dynamic programming algorithm called Branch-Width Minimization* (BWM*) to enumerate a gap-free ensemble of conformations in order of increasing energy. Given a branch-decomposition of branch-width w for an n-residue protein design with at most q discrete side-chain conformations per residue, BWM* returns the sparse GMEC in O([Formula: see text]) time and enumerates each additional conformation in merely O([Formula: see text]) time. We define a new measure, Total Effective Search Space (TESS), which can be computed efficiently a priori before BWM* or A* is run. We ran BWM* on 67 protein design problems and found that TESS discriminated between BWM*-efficient and A*-efficient cases with 100% accuracy. As predicted by TESS and validated experimentally, BWM* outperforms A* in 73% of the cases and computes the full ensemble or a close approximation faster than A*, enumerating each additional conformation in milliseconds. Unlike A*, the performance of BWM* can be predicted in polynomial time before running the algorithm, which gives protein designers the power to choose the most efficient algorithm for their particular design problem.
Affiliation(s)
- Jonathan D Jou
- Department of Computer Science, Duke University, Durham, North Carolina
| | - Swati Jain
- Department of Computer Science, Duke University, Durham, North Carolina
- Department of Biochemistry, Duke University Medical Center, Durham, North Carolina
- Department of Computational Biology and Bioinformatics Program, Duke University, Durham, North Carolina
| | - Ivelin S Georgiev
- Department of Computer Science, Duke University, Durham, North Carolina
| | - Bruce R Donald
- Department of Computer Science, Duke University, Durham, North Carolina
- Department of Biochemistry, Duke University Medical Center, Durham, North Carolina
- Department of Chemistry, Duke University, Durham, North Carolina
| |
|
24
|
Roberts KE, Gainza P, Hallen MA, Donald BR. Fast gap-free enumeration of conformations and sequences for protein design. Proteins 2015; 83:1859-1877. [PMID: 26235965 DOI: 10.1002/prot.24870] [Citation(s) in RCA: 18] [Impact Index Per Article: 1.8] [Received: 03/24/2015] [Revised: 07/14/2015] [Accepted: 07/21/2015] [Indexed: 12/12/2022]
Abstract
Despite significant successes in structure-based computational protein design in recent years, protein design algorithms must be improved to increase the biological accuracy of new designs. Protein design algorithms search through an exponential number of protein conformations, protein ensembles, and amino acid sequences in an attempt to find globally optimal structures with a desired biological function. To improve the biological accuracy of protein designs, it is necessary to increase both the amount of protein flexibility allowed during the search and the overall size of the design, while guaranteeing that the lowest-energy structures and sequences are found. DEE/A*-based algorithms are the most prevalent provable algorithms in the field of protein design and can provably enumerate a gap-free list of low-energy protein conformations, which is necessary for ensemble-based algorithms that predict protein binding. We present two classes of algorithmic improvements to the A* algorithm that greatly increase the efficiency of A*. First, we analyze the effect of ordering the expansion of mutable residue positions within the A* tree and present a dynamic residue ordering that reduces the number of A* nodes that must be visited during the search. Second, we propose new methods to improve the conformational bounds used to estimate the energies of partial conformations during the A* search. The residue ordering techniques and improved bounds can be combined for additional increases in A* efficiency. Our enhancements enable all A*-based methods to more fully search protein conformation space, which will ultimately improve the accuracy of complex biomedically relevant designs.
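To make the idea of gap-free A* enumeration concrete, the following sketch searches a tree of partial rotamer assignments and yields complete conformations in order of increasing energy. It assumes a fixed residue ordering and a simple admissible lower bound, which is the baseline setting the paper sets out to improve, not its dynamic ordering or tighter bounds. The function name and toy energies are illustrative only.

```python
import heapq

# Toy pairwise energies (made up): 3 residues, 2 rotamers each.
E_SELF = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
E_PAIR = {
    (0, 1): [[-1.0, 0.0], [0.0, -0.9]],
    (1, 2): [[-0.5, 0.0], [0.0, -0.5]],
    (0, 2): [[0.2, 0.0], [0.0, -0.2]],
}

def astar_enumerate(e_self, e_pair):
    """Yield (energy, conformation) for all conformations, lowest energy first."""
    n = len(e_self)

    def g(conf):
        # Exact energy of the residues assigned so far.
        e = sum(e_self[i][r] for i, r in enumerate(conf))
        return e + sum(e_pair[(i, j)][conf[i]][conf[j]]
                       for j in range(len(conf)) for i in range(j))

    def h(conf):
        # Lower bound on the energy contributed by unassigned residues.
        k = len(conf)
        total = 0.0
        for j in range(k, n):
            total += min(
                e_self[j][s]
                + sum(e_pair[(i, j)][conf[i]][s] for i in range(k))
                + sum(min(e_pair[(j, m)][s]) for m in range(j + 1, n))
                for s in range(len(e_self[j])))
        return total

    heap = [(h(()), ())]
    while heap:
        f, conf = heapq.heappop(heap)
        if len(conf) == n:
            yield f, conf  # complete conformation: f is its exact energy
            continue
        for r in range(len(e_self[len(conf)])):
            child = conf + (r,)
            heapq.heappush(heap, (g(child) + h(child), child))

ensemble = list(astar_enumerate(E_SELF, E_PAIR))
```

Because the bound never overestimates, the first conformation returned is the GMEC and every later one follows in energy order with no gaps. The search can therefore be stopped as soon as an ensemble of the desired size or energy window has been produced.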
Affiliation(s)
- Kyle E Roberts
- Department of Computer Science, Duke University, Durham, NC
| | - Pablo Gainza
- Department of Computer Science, Duke University, Durham, NC
| | - Mark A Hallen
- Department of Computer Science, Duke University, Durham, NC
| | - Bruce R Donald
- Department of Computer Science, Duke University, Durham, NC
- Department of Biochemistry, Duke University Medical Center, Durham, NC
- Department of Chemistry, Duke University, Durham, NC
| |
|
25
|
Jou JD, Jain S, Georgiev I, Donald BR. BWM*: A Novel, Provable, Ensemble-Based Dynamic Programming Algorithm for Sparse Approximations of Computational Protein Design. Lecture Notes in Computer Science 2015. [DOI: 10.1007/978-3-319-16706-0_16] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Indexed: 02/11/2023]
|
26
|
Francis-Lyon P, Koehl P. Protein side-chain modeling with a protein-dependent optimized rotamer library. Proteins 2014; 82:2000-17. [PMID: 24623614 DOI: 10.1002/prot.24555] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.0] [Received: 09/29/2013] [Revised: 02/28/2014] [Accepted: 03/07/2014] [Indexed: 12/16/2022]
Abstract
Despite years of effort, the problem of predicting the conformations of protein side chains remains a subject of inquiry. This problem has three major issues, namely defining the conformations that a side chain may adopt within a protein, developing a sampling procedure for generating possible side-chain packings, and defining a scoring function that can rank these possible packings. To solve the first of these issues, most procedures rely on a rotamer library derived from databases of known protein structures. We introduce an alternative method that is free of statistics. We begin with a rotamer library that is based only on stereochemical considerations; this rotamer library is then optimized independently for each protein under study. We show that this optimization step restores the diversity of conformations observed in native proteins. We combine this protein-dependent rotamer library (PDRL) method with the self-consistent mean field (SCMF) sampling approach and a physics-based scoring function into a new side-chain prediction method, SCMF-PDRL. Using two large test sets of 831 and 378 proteins, respectively, we show that this new method compares favorably with competing methods such as SCAP, OPUS-Rota, and SCWRL4 for energy-minimized structures.
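The self-consistent mean field sampling mentioned above can be sketched generically. Each residue keeps a probability for every rotamer, and those probabilities are updated from the Boltzmann weights of the mean-field energies until they stop changing. This is a minimal illustration of the general SCMF idea with made-up energies; the temperature `kT`, the damping factor and the function name `scmf` are assumptions, and the protein-dependent rotamer library optimization of the paper is not modeled.

```python
import math

# Toy energies (made up): position 0 has 3 rotamers, position 1 has 2.
E_SELF = [[0.0, 2.0, 0.3], [0.0, 1.0]]
E_PAIR = {(0, 1): [[-0.5, 0.1], [0.2, 0.3], [-1.0, 0.4]]}

def scmf(e_self, e_pair, kT=0.6, iters=200, damping=0.5):
    """Iterate damped mean-field updates; return rotamer probabilities per residue."""
    n = len(e_self)
    p = [[1.0 / len(e)] * len(e) for e in e_self]  # start from uniform weights

    def pair(i, r, j, s):
        # e_pair stores each residue pair once, keyed (low, high).
        return e_pair[(i, j)][r][s] if i < j else e_pair[(j, i)][s][r]

    for _ in range(iters):
        new_p = []
        for i in range(n):
            # Mean-field energy of each rotamer given the other residues' weights.
            w = [e_self[i][r] + sum(p[j][s] * pair(i, r, j, s)
                                    for j in range(n) if j != i
                                    for s in range(len(e_self[j])))
                 for r in range(len(e_self[i]))]
            w_min = min(w)  # shift for numerical stability
            boltz = [math.exp(-(x - w_min) / kT) for x in w]
            z = sum(boltz)
            new_p.append([damping * p[i][r] + (1.0 - damping) * b / z
                          for r, b in enumerate(boltz)])
        p = new_p
    return p

probs = scmf(E_SELF, E_PAIR)
# Predicted packing: the most probable rotamer at each position.
prediction = [max(range(len(row)), key=row.__getitem__) for row in probs]
```

The damping term mixes old and new probabilities to reduce oscillation between iterations. The final prediction simply takes the highest-probability rotamer at each position.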
Affiliation(s)
- Patricia Francis-Lyon
- Department of Computer Science, University of San Francisco, San Francisco, California, 94117
| | | |
|
27
|
Traoré S, Allouche D, André I, de Givry S, Katsirelos G, Schiex T, Barbe S. A new framework for computational protein design through cost function network optimization. Bioinformatics 2013; 29:2129-36. [DOI: 10.1093/bioinformatics/btt374] [Citation(s) in RCA: 54] [Impact Index Per Article: 4.5] [Indexed: 11/13/2022]
|
28
|
Dunn BJ, Khosla C. Engineering the acyltransferase substrate specificity of assembly line polyketide synthases. J R Soc Interface 2013; 10:20130297. [PMID: 23720536 DOI: 10.1098/rsif.2013.0297] [Citation(s) in RCA: 86] [Impact Index Per Article: 7.2] [Indexed: 01/01/2023]
Abstract
Polyketide natural products act as a broad range of therapeutics, including antibiotics, immunosuppressants and anti-cancer agents. This therapeutic diversity stems from the structural diversity of these small molecules, many of which are produced in an assembly line manner by modular polyketide synthases. The acyltransferase (AT) domains of these megasynthases are responsible for selection and incorporation of simple monomeric building blocks, and are thus responsible for a large amount of the resulting polyketide structural diversity. The substrate specificity of these domains is often targeted for engineering in the generation of novel, therapeutically active natural products. This review outlines recent developments that can be used in the successful engineering of these domains, including AT sequence and structural data, mechanistic insights and the production of a diverse pool of extender units. It also provides an overview of previous AT domain engineering attempts, and concludes with proposed engineering approaches that take advantage of current knowledge. These approaches may lead to successful production of biologically active 'unnatural' natural products.
Affiliation(s)
- Briana J Dunn
- Department of Chemical Engineering, Stanford University, Stanford, CA, USA
| | | |
|
29
|
Gainza P, Roberts KE, Georgiev I, Lilien RH, Keedy DA, Chen CY, Reza F, Anderson AC, Richardson DC, Richardson JS, Donald BR. OSPREY: protein design with ensembles, flexibility, and provable algorithms. Methods Enzymol 2013; 523:87-107. [PMID: 23422427 DOI: 10.1016/b978-0-12-394292-0.00005-9] [Citation(s) in RCA: 96] [Impact Index Per Article: 8.0] [Indexed: 12/03/2022]
Abstract
We have developed a suite of protein redesign algorithms that improves realistic in silico modeling of proteins. These algorithms are based on three characteristics that make them unique: (1) improved flexibility of the protein backbone, protein side-chains, and ligand to accurately capture the conformational changes that are induced by mutations to the protein sequence; (2) modeling of proteins and ligands as ensembles of low-energy structures to better approximate binding affinity; and (3) a globally optimal protein design search, guaranteeing that the computational predictions are optimal with respect to the input model. Here, we illustrate the importance of these three characteristics. We then describe OSPREY, a protein redesign suite that implements our protein design algorithms. OSPREY has been used prospectively, with experimental validation, in several biomedically relevant settings. We show in detail how OSPREY has been used to predict resistance mutations and explain why improved flexibility, ensembles, and provability are essential for this application. Availability: OSPREY is free and open source under a Lesser GPL license. The latest version is OSPREY 2.0. The program, user manual, and source code are available at www.cs.duke.edu/donaldlab/software.php. Contact: osprey@cs.duke.edu.
Affiliation(s)
- Pablo Gainza
- Department of Computer Science, Duke University, Durham, North Carolina, USA
30
Hallen MA, Keedy DA, Donald BR. Dead-end elimination with perturbations (DEEPer): a provable protein design algorithm with continuous sidechain and backbone flexibility. Proteins 2012; 81:18-39. [PMID: 22821798 DOI: 10.1002/prot.24150] [Citation(s) in RCA: 63] [Impact Index Per Article: 4.8] [Received: 05/07/2012] [Revised: 07/01/2012] [Accepted: 07/11/2012] [Indexed: 11/12/2022]
Abstract
Computational protein and drug design generally require accurate modeling of protein conformations. This modeling typically starts with an experimentally determined protein structure and considers possible conformational changes due to mutations or new ligands. The DEE/A* algorithm provably finds the global minimum-energy conformation (GMEC) of a protein assuming that the backbone does not move and the sidechains take on conformations from a set of discrete, experimentally observed conformations called rotamers. DEE/A* can efficiently find the overall GMEC for exponentially many mutant sequences. Previous improvements to DEE/A* include modeling ensembles of sidechain conformations and either continuous sidechain or backbone flexibility. We present a new algorithm, DEEPer (Dead-End Elimination with Perturbations), that combines these advantages and can also handle much more extensive backbone flexibility and backbone ensembles. DEEPer provably finds the GMEC or, if desired by the user, all conformations and sequences within a specified energy window of the GMEC. It includes the new abilities to handle arbitrarily large backbone perturbations and to generate ensembles of backbone conformations. It also incorporates the shear, an experimentally observed local backbone motion never before used in design. Additionally, we derive a new method to accelerate DEE/A*-based calculations, indirect pruning, that is particularly useful for DEEPer. In 67 benchmark tests on 64 proteins, DEEPer consistently identified lower-energy conformations than previous methods did, indicating more accurate modeling. Additional tests demonstrated its ability to incorporate larger, experimentally observed backbone conformational changes and to model realistic conformational ensembles. These capabilities provide significant advantages for modeling protein mutations and protein-ligand interactions.
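The rigid-rotamer pruning idea behind DEE can be illustrated on a toy problem. The sketch below applies the Goldstein elimination criterion (a rotamer r is pruned if some competitor t at the same position is better even in r's most favorable surroundings) and checks by brute force that the GMEC survives. It is not DEEPer or OSPREY code: the function names, the energy tables `SELF_E` and `PAIR_E`, and their values are hypothetical and chosen only for illustration, and no backbone or continuous flexibility is modeled.

```python
from itertools import product

# Hypothetical toy energies (arbitrary units), for illustration only.
# SELF_E[i][r]: self energy of rotamer r at position i.
# PAIR_E[(i, j)][r][s]: pairwise energy of rotamer r at i and s at j (i < j).
SELF_E = [[0.0, 1.0, 3.0], [0.5, 0.0], [2.0, 0.0]]
PAIR_E = {
    (0, 1): [[0.0, 0.2], [-0.5, 0.1], [0.3, 0.0]],
    (0, 2): [[0.1, 0.0], [0.0, -0.2], [0.5, 0.4]],
    (1, 2): [[0.0, 0.3], [0.2, -0.1]],
}


def pair_energy(pair_e, i, r, j, s):
    """Look up a pairwise energy regardless of the order of the positions."""
    return pair_e[(i, j)][r][s] if i < j else pair_e[(j, i)][s][r]


def goldstein_prune(self_e, pair_e):
    """Repeatedly remove rotamers that satisfy the Goldstein DEE criterion.

    Rotamer r at position i is pruned if, for some other surviving rotamer t,
    E(i_r) - E(i_t) + sum_j min_s [E(i_r, j_s) - E(i_t, j_s)] > 0.
    Returns one set of surviving rotamer indices per position.
    """
    n = len(self_e)
    alive = [set(range(len(e))) for e in self_e]
    changed = True
    while changed:
        changed = False
        for i in range(n):
            for r in list(alive[i]):
                for t in alive[i] - {r}:
                    gap = self_e[i][r] - self_e[i][t]
                    for j in range(n):
                        if j != i:
                            gap += min(
                                pair_energy(pair_e, i, r, j, s)
                                - pair_energy(pair_e, i, t, j, s)
                                for s in alive[j]
                            )
                    if gap > 0:
                        alive[i].discard(r)
                        changed = True
                        break
    return alive


def brute_force_gmec(self_e, pair_e):
    """Enumerate every rotamer assignment and return (energy, assignment)."""
    n = len(self_e)
    best = None
    for conf in product(*(range(len(e)) for e in self_e)):
        energy = sum(self_e[i][conf[i]] for i in range(n))
        energy += sum(
            pair_e[(i, j)][conf[i]][conf[j]]
            for i in range(n)
            for j in range(i + 1, n)
        )
        if best is None or energy < best[0]:
            best = (energy, conf)
    return best


if __name__ == "__main__":
    print("surviving rotamers:", goldstein_prune(SELF_E, PAIR_E))
    print("brute-force GMEC:", brute_force_gmec(SELF_E, PAIR_E))
```

Because the criterion only removes rotamers that provably cannot be part of the lowest-energy assignment, a search such as A* run on the survivors still finds the same GMEC as the full enumeration, only over a smaller space.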
Affiliation(s)
- Mark A Hallen
- Department of Biochemistry, Duke University Medical Center, Durham, North Carolina, USA
31
Sereda YV, Singharoy AB, Jarrold MF, Ortoleva PJ. Discovering free energy basins for macromolecular systems via guided multiscale simulation. J Phys Chem B 2012; 116:8534-44. [PMID: 22423635 PMCID: PMC3408247 DOI: 10.1021/jp2126174] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.5] [Indexed: 12/21/2022]
Abstract
An approach for the automated discovery of low free energy states of macromolecular systems is presented. The method does not involve delineating the entire free energy landscape but proceeds in a sequential free energy minimizing state discovery; i.e., it first discovers one low free energy state and then automatically seeks a distinct neighboring one. These states and the associated ensembles of atomistic configurations are characterized by coarse-grained variables capturing the large-scale structure of the system. A key facet of our approach is the identification of such coarse-grained variables. Evolution of these variables is governed by Langevin dynamics driven by thermal-average forces and mediated by diffusivities, both of which are constructed by an ensemble of short molecular dynamics runs. In the present approach, the thermal-average forces are modified to account for the entropy changes following from our knowledge of the free energy basins already discovered. Such forces guide the system away from the known free energy minima, over free energy barriers, and to a new one. The theory is demonstrated for lactoferrin, known to have multiple energy-minimizing structures. The approach is validated using experimental structures and traditional molecular dynamics. The method can be generalized to enable the interpretation of nanocharacterization data (e.g., ion mobility-mass spectrometry, atomic force microscopy, chemical labeling, and nanopore measurements).
Affiliation(s)
- Yuriy V. Sereda
- Center for Cell and Virus Theory, Department of Chemistry, Indiana University, 800 E. Kirkwood Ave, Bloomington, IN 47405
- Abhishek B. Singharoy
- Center for Cell and Virus Theory, Department of Chemistry, Indiana University, 800 E. Kirkwood Ave, Bloomington, IN 47405
- Martin F. Jarrold
- Center for Cell and Virus Theory, Department of Chemistry, Indiana University, 800 E. Kirkwood Ave, Bloomington, IN 47405
- Peter J. Ortoleva
- Center for Cell and Virus Theory, Department of Chemistry, Indiana University, 800 E. Kirkwood Ave, Bloomington, IN 47405
32
Murphy GS, Mills JL, Miley MJ, Machius M, Szyperski T, Kuhlman B. Increasing sequence diversity with flexible backbone protein design: the complete redesign of a protein hydrophobic core. Structure 2012; 20:1086-96. [PMID: 22632833 PMCID: PMC3372604 DOI: 10.1016/j.str.2012.03.026] [Citation(s) in RCA: 49] [Impact Index Per Article: 3.8] [Received: 10/26/2011] [Revised: 02/15/2012] [Accepted: 03/30/2012] [Indexed: 01/07/2023]
Abstract
Protein design tests our understanding of protein stability and structure. Successful design methods should allow the exploration of sequence space not found in nature. However, when redesigning naturally occurring protein structures, most fixed backbone design algorithms return amino acid sequences that share strong sequence identity with wild-type sequences, especially in the protein core. This behavior places a restriction on functional space that can be explored and is not consistent with observations from nature, where sequences of low identity have similar structures. Here, we allow backbone flexibility during design to mutate every position in the core (38 residues) of a four-helix bundle protein. Only small perturbations to the backbone, 1-2 Å, were needed to entirely mutate the core. The redesigned protein, DRNN, is exceptionally stable (melting point >140°C). An NMR and X-ray crystal structure show that the side chains and backbone were accurately modeled (all-atom RMSD = 1.3 Å).
Affiliation(s)
- Grant S. Murphy
- Department of Chemistry, University of North Carolina at Chapel Hill, Chapel Hill, NC, 27599-3290, USA
- Jeffrey L. Mills
- Department of Chemistry, State University of New York at Buffalo, Buffalo, NY, 14260, USA
- Northeast Structural Genomics Consortium
- Michael J. Miley
- Center for Structural Biology, University of North Carolina at Chapel Hill, Chapel Hill, NC, 27599, USA
- Mischa Machius
- Center for Structural Biology, University of North Carolina at Chapel Hill, Chapel Hill, NC, 27599, USA
- Thomas Szyperski
- Department of Chemistry, State University of New York at Buffalo, Buffalo, NY, 14260, USA
- Northeast Structural Genomics Consortium
- Brian Kuhlman
- Department of Biochemistry and Biophysics, University of North Carolina at Chapel Hill, Chapel Hill, NC, 27599-7260, USA
- Lineberger Comprehensive Cancer Center, University of North Carolina at Chapel Hill, Chapel Hill, NC, 27599, USA
- Corresponding author. Phone: 919-843-0188, Fax: 919-966-2852
33
34
Abstract
Optimizing amino acid conformation and identity is a central problem in computational protein design. Protein design algorithms must allow realistic protein flexibility to occur during this optimization, or they may fail to find the best sequence with the lowest energy. Most design algorithms implement side-chain flexibility by allowing the side chains to move between a small set of discrete, low-energy states, which we call rigid rotamers. In this work we show that allowing continuous side-chain flexibility (which we call continuous rotamers) greatly improves protein flexibility modeling. We present a large-scale study that compares the sequences and best energy conformations in 69 protein-core redesigns using a rigid-rotamer model versus a continuous-rotamer model. We show that in nearly all of our redesigns the sequence found by the continuous-rotamer model is different and has a lower energy than the one found by the rigid-rotamer model. Moreover, the sequences found by the continuous-rotamer model are more similar to the native sequences. We then show that the seemingly easy solution of sampling more rigid rotamers within the continuous region is not a practical alternative to a continuous-rotamer model: at computationally feasible resolutions, using more rigid rotamers was never better than a continuous-rotamer model and almost always resulted in higher energies. Finally, we present a new protein design algorithm based on the dead-end elimination (DEE) algorithm, which we call iMinDEE, that makes the use of continuous rotamers feasible in larger systems. iMinDEE guarantees finding the optimal answer while pruning the search space with close to the same efficiency as DEE. Availability: Software is available under the Lesser GNU Public License v3. Contact the authors for source code.
35
Zeng J, Roberts KE, Zhou P, Donald BR. A Bayesian approach for determining protein side-chain rotamer conformations using unassigned NOE data. J Comput Biol 2011; 18:1661-79. [PMID: 21970619 DOI: 10.1089/cmb.2011.0172] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.3] [Indexed: 11/12/2022]
Abstract
A major bottleneck in protein structure determination via nuclear magnetic resonance (NMR) is the lengthy and laborious process of assigning resonances and nuclear Overhauser effect (NOE) cross peaks. Recent studies have shown that accurate backbone folds can be determined using sparse NMR data, such as residual dipolar couplings (RDCs) or backbone chemical shifts. This raises the question of whether we can also determine accurate protein side-chain conformations using sparse or unassigned NMR data. We attack this question by using unassigned nuclear Overhauser effect spectroscopy (NOESY) data, which records the through-space dipolar interactions between protons nearby in three-dimensional (3D) space. We propose a Bayesian approach with a Markov random field (MRF) model to integrate the likelihood function derived from observed experimental data with prior information (i.e., empirical molecular mechanics energies) about the protein structures. We unify the side-chain structure prediction problem with the side-chain structure determination problem using unassigned NMR data, and apply the deterministic dead-end elimination (DEE) and A* search algorithms to provably find the global optimum solution that maximizes the posterior probability. We employ a Hausdorff-based measure to derive the likelihood of a rotamer or a pairwise rotamer interaction from unassigned NOESY data. In addition, we apply a systematic and rigorous approach to estimate the experimental noise in NMR data, which also determines the weighting factor of the data term in the scoring function derived from the Bayesian framework. We tested our approach on real NMR data of three proteins: the FF Domain 2 of human transcription elongation factor CA150 (FF2), the B1 domain of Protein G (GB1), and human ubiquitin. The promising results indicate that our algorithm can be applied in high-resolution protein structure determination. Since our approach does not require any NOE assignment, it can accelerate the NMR structure determination process.
Affiliation(s)
- Jianyang Zeng
- Department of Computer Science, Duke University, Durham, NC 27708, USA
36
Moughon SE, Samudrala R. LoCo: a novel main chain scoring function for protein structure prediction based on local coordinates. BMC Bioinformatics 2011; 12:368. [PMID: 21920038 PMCID: PMC3184297 DOI: 10.1186/1471-2105-12-368] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.4] [Received: 02/28/2011] [Accepted: 09/15/2011] [Indexed: 11/10/2022]
Abstract
BACKGROUND: Successful protein structure prediction requires accurate low-resolution scoring functions so that protein main chain conformations that are close to the native can be identified. Once that is accomplished, a more detailed and time-consuming treatment to produce all-atom models can be undertaken. The earliest low-resolution scoring used simple distance-based "contact potentials," but more recently, the relative orientations of interacting amino acids have been taken into account to improve performance. RESULTS: We developed a new knowledge-based scoring function, LoCo, that locates the interaction partners of each individual residue within a local coordinate system based only on the position of its main chain N, Cα and C atoms. LoCo was trained on a large set of experimentally determined structures and optimized using standard sets of modeled structures, or "decoys." No structure used to train or optimize the function was included among those used to test it. When tested against 29 other published main chain functions on a group of 77 commonly used decoy sets, our function outperformed all others in Cα RMSD rank of the best-scoring decoy, with statistically significant p-values < 0.05 for 26 out of the 29 other functions considered. LoCo is fast, requiring on average less than 6 microseconds per residue for interaction and scoring on commonly-used computer hardware. CONCLUSIONS: Our function demonstrates an unmatched combination of accuracy, speed, and simplicity and shows excellent promise for protein structure prediction. Broader applications may include protein-protein interactions and protein design.
Affiliation(s)
- Stewart E Moughon
- Department of Microbiology, University of Washington, Box 357735, Seattle, Washington 98195-7242, USA.
37
Smith CA, Kortemme T. Predicting the tolerated sequences for proteins and protein interfaces using RosettaBackrub flexible backbone design. PLoS One 2011; 6:e20451. [PMID: 21789164 PMCID: PMC3138746 DOI: 10.1371/journal.pone.0020451] [Citation(s) in RCA: 77] [Impact Index Per Article: 5.5] [Received: 02/15/2011] [Accepted: 04/20/2011] [Indexed: 11/18/2022]
Abstract
Predicting the set of sequences that are tolerated by a protein or protein interface, while maintaining a desired function, is useful for characterizing protein interaction specificity and for computationally designing sequence libraries to engineer proteins with new functions. Here we provide a general method, a detailed set of protocols, and several benchmarks and analyses for estimating tolerated sequences using flexible backbone protein design implemented in the Rosetta molecular modeling software suite. The input to the method is at least one experimentally determined three-dimensional protein structure or high-quality model. The starting structure(s) are expanded or refined into a conformational ensemble using Monte Carlo simulations consisting of backrub backbone and side chain moves in Rosetta. The method then uses a combination of simulated annealing and genetic algorithm optimization methods to enrich for low-energy sequences for the individual members of the ensemble. To emphasize certain functional requirements (e.g. forming a binding interface), interactions between and within parts of the structure (e.g. domains) can be reweighted in the scoring function. Results from each backbone structure are merged together to create a single estimate for the tolerated sequence space. We provide an extensive description of the protocol and its parameters, all source code, example analysis scripts and three tests applying this method to finding sequences predicted to stabilize proteins or protein interfaces. The generality of this method makes many other applications possible, for example stabilizing interactions with small molecules, DNA, or RNA. Through the use of within-domain reweighting and/or multistate design, it may also be possible to use this method to find sequences that stabilize particular protein conformations or binding interactions over others.
Affiliation(s)
- Colin A. Smith
- Graduate Program in Biological and Medical Informatics, University of California San Francisco, San Francisco, California, United States of America
- California Institute for Quantitative Biosciences, San Francisco, California, United States of America
- Department of Bioengineering and Therapeutic Sciences, University of California San Francisco, San Francisco, California, United States of America
- Tanja Kortemme
- Graduate Program in Biological and Medical Informatics, University of California San Francisco, San Francisco, California, United States of America
- California Institute for Quantitative Biosciences, San Francisco, California, United States of America
- Department of Bioengineering and Therapeutic Sciences, University of California San Francisco, San Francisco, California, United States of America
38
Samish I, MacDermaid CM, Perez-Aguilar JM, Saven JG. Theoretical and Computational Protein Design. Annu Rev Phys Chem 2011; 62:129-49. [DOI: 10.1146/annurev-physchem-032210-103509] [Citation(s) in RCA: 119] [Impact Index Per Article: 8.5] [Indexed: 11/09/2022]
Affiliation(s)
- Jeffery G. Saven
- Department of Chemistry, University of Pennsylvania, Philadelphia, Pennsylvania 19104
39
Babor M, Mandell DJ, Kortemme T. Assessment of flexible backbone protein design methods for sequence library prediction in the therapeutic antibody Herceptin-HER2 interface. Protein Sci 2011; 20:1082-9. [PMID: 21465611 DOI: 10.1002/pro.632] [Citation(s) in RCA: 29] [Impact Index Per Article: 2.1] [Received: 12/30/2010] [Revised: 03/15/2011] [Accepted: 03/16/2011] [Indexed: 01/28/2023]
Abstract
Computational protein design methods can complement experimental screening and selection techniques by predicting libraries of low-energy sequences compatible with a desired structure and function. Incorporating backbone flexibility in computational design allows conformational adjustments that should broaden the range of predicted low-energy sequences. Here, we evaluate computational predictions of sequence libraries from different protocols for modeling backbone flexibility using the complex between the therapeutic antibody Herceptin and its target human epidermal growth factor receptor 2 (HER2) as a model system. Within the program RosettaDesign, three methods are compared: The first two use ensembles of structures generated by Monte Carlo protocols for near-native conformational sampling: kinematic closure (KIC) and backrub, and the third method uses snapshots from molecular dynamics (MD) simulations. KIC or backrub methods were better able to identify the amino acid residues experimentally observed by phage display in the Herceptin-HER2 interface than MD snapshots, which generated much larger conformational and sequence diversity. KIC and backrub, as well as fixed backbone simulations, captured the key mutation Asp98Trp in Herceptin, which leads to a further threefold affinity improvement of the already subnanomolar parental Herceptin-HER2 interface. Modeling subtle backbone conformational changes may assist in the design of sequence libraries for improving the affinity of antibody-antigen interfaces and could be suitable for other protein complexes for which structural information is available.
Affiliation(s)
- Mariana Babor
- California Institute for Quantitative Biomedical Research, University of California, San Francisco, San Francisco, California 94158-2330, USA
40
Sammond DW, Bosch DE, Butterfoss GL, Purbeck C, Machius M, Siderovski DP, Kuhlman B. Computational design of the sequence and structure of a protein-binding peptide. J Am Chem Soc 2011; 133:4190-2. [PMID: 21388199 PMCID: PMC3081598 DOI: 10.1021/ja110296z] [Citation(s) in RCA: 40] [Impact Index Per Article: 2.9] [Indexed: 11/29/2022]
Abstract
The de novo design of protein-binding peptides is challenging because it requires the identification of both a sequence and a backbone conformation favorable for binding. We used a computational strategy that iterates between structure and sequence optimization to redesign the C-terminal portion of the RGS14 GoLoco motif peptide so that it adopts a new conformation when bound to Gα(i1). An X-ray crystal structure of the redesigned complex closely matches the computational model, with a backbone root-mean-square deviation of 1.1 Å.
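For readers unfamiliar with the metric, a root-mean-square deviation such as the 1.1 Å quoted above is the square root of the mean squared distance between matched atoms. The sketch below is a minimal illustration that assumes the two coordinate sets are already superimposed and listed in the same atom order; published values are normally computed after an optimal superposition, which is not performed here. The function name and the example coordinates are hypothetical.

```python
import math


def rmsd(coords_a, coords_b):
    """RMSD (same units as the input) between two matched lists of (x, y, z).

    Assumes both structures are already superimposed and that atom k in
    coords_a corresponds to atom k in coords_b.
    """
    if not coords_a or len(coords_a) != len(coords_b):
        raise ValueError("coordinate lists must be non-empty and equal length")
    squared = sum(
        (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
        for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b)
    )
    return math.sqrt(squared / len(coords_a))


if __name__ == "__main__":
    # Made-up coordinates in angstroms: the model is the reference
    # shifted by 1 angstrom along x.
    reference = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0)]
    model = [(1.0, 0.0, 0.0), (2.5, 0.0, 0.0)]
    print(rmsd(reference, model))
```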
Affiliation(s)
- Deanne W. Sammond
- Department of Biochemistry and Biophysics, University of North Carolina, Chapel Hill, North Carolina 27599-7260, USA
- Dustin E. Bosch
- Department of Pharmacology, University of North Carolina, Chapel Hill, North Carolina 27599-7365, USA
- Glenn L. Butterfoss
- Department of Biochemistry and Biophysics, University of North Carolina, Chapel Hill, North Carolina 27599-7260, USA
- Carrie Purbeck
- Department of Biochemistry and Biophysics, University of North Carolina, Chapel Hill, North Carolina 27599-7260, USA
- Mischa Machius
- Department of Pharmacology, University of North Carolina, Chapel Hill, North Carolina 27599-7365, USA
- David P. Siderovski
- Department of Pharmacology, University of North Carolina, Chapel Hill, North Carolina 27599-7365, USA
- Brian Kuhlman
- Department of Biochemistry and Biophysics, University of North Carolina, Chapel Hill, North Carolina 27599-7260, USA
41
Bellows ML, Taylor MS, Cole PA, Shen L, Siliciano RF, Fung HK, Floudas CA. Discovery of entry inhibitors for HIV-1 via a new de novo protein design framework. Biophys J 2011; 99:3445-53. [PMID: 21081094 DOI: 10.1016/j.bpj.2010.09.050] [Citation(s) in RCA: 31] [Impact Index Per Article: 2.2] [Received: 02/17/2010] [Revised: 09/23/2010] [Accepted: 09/27/2010] [Indexed: 12/11/2022]
Abstract
A new (to our knowledge) de novo design framework with a ranking metric based on approximate binding affinity calculations is introduced and applied to the discovery of what we believe are novel HIV-1 entry inhibitors. The framework consists of two stages: a sequence selection stage and a validation stage. The sequence selection stage produces a rank-ordered list of amino-acid sequences by solving an integer programming sequence selection model. The validation stage consists of fold specificity and approximate binding affinity calculations. The designed peptidic inhibitors are 12-amino-acids-long and target the hydrophobic core of gp41. A number of the best-predicted sequences were synthesized and their inhibition of HIV-1 was tested in cell culture. All peptides examined showed inhibitory activity when compared with no drug present, and the novel peptide sequences outperformed the native template sequence used for the design. The best sequence showed micromolar inhibition, which is a 3-15-fold improvement over the native sequence, depending on the donor. In addition, the best sequence equally inhibited wild-type and Enfuvirtide-resistant virus strains.
Affiliation(s)
- M L Bellows
- Department of Chemical and Biological Engineering, Princeton University, Princeton, NJ, USA
42
Abstract
Drug resistance resulting from mutations to the target is an unfortunately common phenomenon that limits the lifetime of many of the most successful drugs. In contrast to the investigation of mutations after clinical exposure, it would be powerful to be able to incorporate strategies early in the development process to predict and overcome the effects of possible resistance mutations. Here we present a unique prospective application of an ensemble-based protein design algorithm, K*, to predict potential resistance mutations in dihydrofolate reductase from Staphylococcus aureus using positive design to maintain catalytic function and negative design to interfere with binding of a lead inhibitor. Enzyme inhibition assays show that three of the four highly ranked predicted mutants are active yet display lower affinity (18-, 9-, and 13-fold) for the inhibitor. A crystal structure of the top-ranked mutant enzyme validates the predicted conformations of the mutated residues and the structural basis of the loss of potency. The use of protein design algorithms to predict resistance mutations could be incorporated in a lead design strategy against any target that is susceptible to mutational resistance.
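The abstract does not spell out how an ensemble-based score is formed. As general background, and stated here as an assumption rather than as the authors' implementation, K*-style scores are usually described as a ratio of Boltzmann-weighted partition functions over the conformational ensembles of the bound complex, the free protein, and the free ligand. The sketch below computes the logarithm of such a ratio from lists of conformation energies; the function names, the `RT` value, and the example energies are hypothetical.

```python
import math

# Gas constant times temperature in kcal/mol near 298 K (approximate).
RT = 0.592


def log_partition_function(energies, rt=RT):
    """Return ln(sum_c exp(-E_c / RT)) for a list of conformation energies.

    The lowest energy is factored out before exponentiating so that very
    negative energies do not overflow.
    """
    e_min = min(energies)
    shifted = sum(math.exp(-(e - e_min) / rt) for e in energies)
    return -e_min / rt + math.log(shifted)


def log_kstar(bound, protein, ligand, rt=RT):
    """ln of q_bound / (q_protein * q_ligand): higher means tighter binding."""
    return (
        log_partition_function(bound, rt)
        - log_partition_function(protein, rt)
        - log_partition_function(ligand, rt)
    )


if __name__ == "__main__":
    # Made-up ensemble energies (kcal/mol) for two sequences binding one ligand.
    wild_type = log_kstar([-30.0, -29.4, -28.8], [-18.0, -17.5], [-5.0])
    mutant = log_kstar([-27.0, -26.1], [-18.2, -17.9], [-5.0])
    print("wild type ln K*:", wild_type)
    print("mutant ln K*:", mutant)
```

In a resistance-prediction setting of the kind described above, one would look for mutants whose score for the inhibitor drops while the score for the natural substrate is maintained; the example values here carry no experimental meaning.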
43
Friedland GD, Kortemme T. Designing ensembles in conformational and sequence space to characterize and engineer proteins. Curr Opin Struct Biol 2010; 20:377-84. [DOI: 10.1016/j.sbi.2010.02.004] [Citation(s) in RCA: 24] [Impact Index Per Article: 1.6] [Received: 02/10/2010] [Accepted: 02/19/2010] [Indexed: 11/16/2022]
44
Guiding the Search for Native-like Protein Conformations with an Ab-initio Tree-based Exploration. Int J Rob Res 2010. [DOI: 10.1177/0278364910371527] [Citation(s) in RCA: 45] [Impact Index Per Article: 3.0] [Indexed: 11/15/2022]
Abstract
In this paper we propose a robotics-inspired method to enhance sampling of native-like conformations when employing only amino acid sequence information for a protein at hand. Computing such conformations, essential to associating structural and functional information with gene sequences, is challenging due to the high dimensionality and the rugged energy surface of the protein conformational space. The contribution of this paper is a novel two-layered method to enhance the sampling of geometrically distinct low-energy conformations at a coarse-grained level of detail. The method grows a tree in conformational space reconciling two goals: (i) guiding the tree towards lower energies; and (ii) not oversampling geometrically similar conformations. Discretizations of the energy surface and a low-dimensional projection space are employed to preferentially select for expansion low-energy conformations in under-explored regions of the conformational space. The tree is expanded with low-energy conformations through a Metropolis Monte Carlo framework that uses a move set of physical fragment configurations. Testing on sequences of eight small-to-medium structurally diverse proteins shows that the method rapidly samples native-like conformations in a few hours on a single CPU. Analysis shows that computed conformations are good candidates for further detailed energetic refinements by larger studies in protein engineering and design.
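The Metropolis Monte Carlo step mentioned above can be sketched in a few lines. The example below is a generic illustration, not the tree-based method of the paper: it walks a one-dimensional toy "conformation" over a rugged energy function, always accepting downhill moves and accepting uphill moves with probability exp(-ΔE/T). The energy function, the move proposal, and all names are hypothetical, and the temperature is expressed directly in energy units.

```python
import math
import random


def metropolis_accept(delta_e, temperature, rng=random):
    """Metropolis criterion: accept downhill moves, uphill with exp(-dE/T)."""
    if delta_e <= 0:
        return True
    return rng.random() < math.exp(-delta_e / temperature)


def metropolis_walk(energy, propose, start, steps, temperature, seed=0):
    """Run a Metropolis walk and return the lowest-energy state visited."""
    rng = random.Random(seed)
    current, e_current = start, energy(start)
    best, e_best = current, e_current
    for _ in range(steps):
        candidate = propose(current, rng)
        e_candidate = energy(candidate)
        if metropolis_accept(e_candidate - e_current, temperature, rng):
            current, e_current = candidate, e_candidate
            if e_current < e_best:
                best, e_best = current, e_current
    return best, e_best


def toy_energy(x):
    """Hypothetical rugged 1D landscape: a broad well with ripples."""
    return (x - 2.0) ** 2 + math.sin(5.0 * x)


def toy_move(x, rng):
    """Hypothetical move set: a small random displacement."""
    return x + rng.uniform(-0.5, 0.5)


if __name__ == "__main__":
    state, e = metropolis_walk(toy_energy, toy_move, start=-3.0,
                               steps=2000, temperature=0.5)
    print("lowest-energy state found:", state, "energy:", e)
```

The occasional acceptance of uphill moves is what lets the walk escape the ripples of a rugged surface; lowering the temperature over time turns the same loop into simulated annealing.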
45
Fromer M, Yanover C, Linial M. Design of multispecific protein sequences using probabilistic graphical modeling. Proteins 2010; 78:530-47. [PMID: 19842166 DOI: 10.1002/prot.22575] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.0] [Indexed: 11/10/2022]
Abstract
In nature, proteins partake in numerous protein-protein interactions that mediate their functions. Moreover, proteins have been shown to be physically stable in multiple structures, induced by cellular conditions, small ligands, or covalent modifications. Understanding how protein sequences achieve this structural promiscuity at the atomic level is a fundamental step in the drug design pipeline and a critical question in protein physics. One way to investigate this subject is to computationally predict protein sequences that are compatible with multiple states, i.e., multiple target structures or binding to distinct partners. The goal of engineering such proteins has been termed multispecific protein design. We develop a novel computational framework to efficiently and accurately perform multispecific protein design. This framework utilizes recent advances in probabilistic graphical modeling to predict sequences with low energies in multiple target states. Furthermore, it is also geared to specifically yield positional amino acid probability profiles compatible with these target states. Such profiles can be used as input to randomly bias high-throughput experimental sequence screening techniques, such as phage display, thus providing an alternative avenue for elucidating the multispecificity of natural proteins and the synthesis of novel proteins with specific functionalities. We prove the utility of such multispecific design techniques in better recovering amino acid sequence diversities similar to those resulting from millions of years of evolution. We then compare the approaches of prediction of low energy ensembles and of amino acid profiles and demonstrate their complementarity in providing more robust predictions for protein design.
Affiliation(s)
- Menachem Fromer
- School of Computer Science and Engineering, The Hebrew University of Jerusalem, Israel.
46
Abstract
Predictive methods for the computational design of proteins search for amino acid sequences adopting desired structures that perform specific functions. Typically, design of 'function' is formulated as engineering new and altered binding activities into proteins. Progress in the design of functional protein-protein interactions is directed toward engineering proteins to precisely control biological processes by specifically recognizing desired interaction partners while avoiding competitors. The field is aiming for strategies to harness recent advances in high-resolution computational modeling, particularly those exploiting protein conformational variability, to engineer new functions and incorporate many functional requirements simultaneously.
Affiliation(s)
- Daniel J Mandell
- Graduate Program in Bioinformatics and Computational Biology, California Institute for Quantitative Biosciences, and Department of Bioengineering and Therapeutic Sciences, University of California San Francisco, USA
|
47
|
Tian P. Computational protein design, from single domain soluble proteins to membrane proteins. Chem Soc Rev 2010; 39:2071-82. [DOI: 10.1039/b810924a] [Citation(s) in RCA: 22] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/21/2022]
|
48
|
Fromer M, Yanover C. Accurate prediction for atomic-level protein design and its application in diversifying the near-optimal sequence space. Proteins 2009; 75:682-705. [PMID: 19003998 DOI: 10.1002/prot.22280] [Citation(s) in RCA: 15] [Impact Index Per Article: 0.9] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/11/2022]
Abstract
The task of engineering a protein to assume a target three-dimensional structure is known as protein design. Computational search algorithms are devised to predict a minimal energy amino acid sequence for a particular structure. In practice, however, an ensemble of low-energy sequences is often sought. Primarily, this is performed because an individual predicted low-energy sequence may not necessarily fold to the target structure because of both inaccuracies in modeling protein energetics and the nonoptimal nature of search algorithms employed. Additionally, some low-energy sequences may be overly stable and thus lack the dynamic flexibility required for biological functionality. Furthermore, the investigation of low-energy sequence ensembles will provide crucial insights into the pseudo-physical energy force fields that have been derived to describe structural energetics for protein design. Significantly, numerous studies have predicted low-energy sequences, which were subsequently synthesized and demonstrated to fold to desired structures. However, the characterization of the sequence space defined by such energy functions as compatible with a target structure has not been performed in full detail. This issue is critical for protein design scientists to successfully continue using these force fields at an ever-increasing pace and scale. In this paper, we present a conceptually novel algorithm that rapidly predicts the set of lowest energy sequences for a given structure. Based on the theory of probabilistic graphical models, it performs efficient inspection and partitioning of the near-optimal sequence space, without making any assumptions of positional independence. We benchmark its performance on a diverse set of relevant protein design examples and show that it consistently yields sequences of lower energy than those derived from state-of-the-art techniques. 
Thus, we find that previously presented search techniques do not fully depict the low-energy space as precisely. Examination of the predicted ensembles indicates that, for each structure, the amino acid identity at a majority of positions must be chosen extremely selectively so as to not incur significant energetic penalties. We investigate this high degree of similarity and demonstrate how more diverse near-optimal sequences can be predicted in order to systematically overcome this bottleneck for computational design. Finally, we exploit this in-depth analysis of a collection of the lowest energy sequences to suggest an explanation for previously observed experimental design results. The novel methodologies introduced here accurately portray the sequence space compatible with a protein structure and further supply a scheme to yield heterogeneous low-energy sequences, thus providing a powerful instrument for future work on protein design.
Affiliation(s)
- Menachem Fromer
- School of Computer Science and Engineering, The Hebrew University of Jerusalem, Jerusalem, Israel.
|
49
|
Ma J. Explicit orientation dependence in empirical potentials and its significance to side-chain modeling. Acc Chem Res 2009; 42:1087-96. [PMID: 19445451 PMCID: PMC2728797 DOI: 10.1021/ar900009e] [Citation(s) in RCA: 12] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/28/2022]
Abstract
Protein structure modeling and prediction have important applications throughout the biological sciences, from the design of pharmaceuticals to the elucidation of enzyme mechanisms. At the core of most protein modeling is an energy function, the minimum of which represents the free energy "cost" for forming a correct protein structure. The most commonly used energy functions are knowledge-based statistical potential functions; that is, they are empirically derived from statistical analysis of a set of high-resolution protein structures. When that kind of potential function is constructed, the anisotropic orientation dependence between the interacting groups is a critical component for accurately representing key molecular interactions, such as those involved in protein side-chain packing. In the literature, however, many potential functions are limited in their ability to describe orientation dependence. In all-atom potentials, they typically ignore heterogeneous chemical-bond connectivity. In coarse-grained potentials, such as (semi)-residue-based potentials, the simplified representation of residues often reduces the sensitivity of the potential to side-chain orientation. Recently, in an effort to maximally capture the orientation dependence in side-chain interactions, a new type of all-atom statistical potential was developed: OPUS-PSP (potential derived from side-chain packing). The key feature of this potential is its explicit description of orientation dependence in molecular interactions, which is achieved with a basis set of 19 rigid-body blocks extracted from the chemical structures of 20 amino acid residues. This basis set is specifically designed to maximally capture the essential elements of orientation dependence in molecular packing interactions. The potential is constructed from the orientation-specific packing statistics of pairs of those blocks in a nonredundant structural database. 
On decoy set tests, OPUS-PSP significantly outperforms most of the existing knowledge-based potentials in terms of both its ability to recognize native structures and its consistency in achieving high Z scores across decoy sets. The application of OPUS-PSP to conformational modeling of side chains has led to another method, called OPUS-Rota. In terms of combined speed and accuracy, OPUS-Rota outperforms all of the other methods in modeling side-chain conformation. In this Account, we briefly outline the basic scheme of the OPUS-PSP potential and its application to side-chain modeling via OPUS-Rota. Future perspectives on the modeling of orientation dependence are also discussed. The computer programs for OPUS-PSP and OPUS-Rota can be downloaded at http://sigler.bioch.bcm.tmc.edu/MaLab . They are free for academic users.
Affiliation(s)
- Jianpeng Ma
- Department of Biochemistry and Molecular Biology, Baylor College of Medicine, One Baylor Plaza, Houston, Texas 77030, USA.
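The abstract above describes knowledge-based statistical potentials as empirically derived from statistics over known structures. A common generic way to do this is the inverse Boltzmann relation, E(a, b) = -kT ln(N_obs(a, b) / N_exp(a, b)). The sketch below applies it to a toy list of residue contacts. It is not OPUS-PSP, which works on orientation-specific statistics of rigid-body blocks; the function name `contact_potential`, the independent-frequency reference state, the pseudocount, and the contact data are assumptions made for illustration.

```python
import math
from collections import Counter


def contact_potential(contacts, pseudocount=1.0, kT=1.0):
    """Inverse-Boltzmann contact potential from observed unordered
    pairs. The expected count assumes the two partners are drawn
    independently from the overall type frequencies (hypothetical
    reference state)."""
    pair_counts = Counter(tuple(sorted(p)) for p in contacts)
    type_counts = Counter(t for p in contacts for t in p)
    total_pairs = len(contacts)
    total_types = sum(type_counts.values())
    types = sorted(type_counts)
    potential = {}
    for i, a in enumerate(types):
        for b in types[i:]:
            fa = type_counts[a] / total_types
            fb = type_counts[b] / total_types
            # Unordered pairs of distinct types can arise in two orders.
            multiplicity = 1 if a == b else 2
            expected = total_pairs * fa * fb * multiplicity
            observed = pair_counts[(a, b)]
            ratio = (observed + pseudocount) / (expected + pseudocount)
            potential[(a, b)] = -kT * math.log(ratio)
    return potential


# Made-up contact list: hydrophobic L-L and salt-bridge-like K-E
# contacts are over-represented; L-K and L-E contacts are rare.
toy_contacts = (
    [("L", "L")] * 6 + [("K", "E")] * 6 + [("L", "K")] + [("L", "E")]
)
potential = contact_potential(toy_contacts)
for pair, energy in sorted(potential.items()):
    print(pair, round(energy, 3))
```

Pairs seen more often than the reference state predicts receive negative (favorable) energies and under-represented pairs receive positive ones. Keys are stored as alphabetically sorted tuples, so the K-E contact appears under `("E", "K")`.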
|
50
|
Backbone flexibility in computational protein design. Curr Opin Biotechnol 2009; 20:420-8. [DOI: 10.1016/j.copbio.2009.07.006] [Citation(s) in RCA: 84] [Impact Index Per Article: 5.3] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/14/2009] [Revised: 07/17/2009] [Accepted: 07/25/2009] [Indexed: 11/22/2022]
|