1
Huang L, Duan Q, Liu Y, Wu Y, Li Z, Guo Z, Liu M, Lu X, Wang P, Liu F, Ren F, Li C, Wang J, Huang Y, Yan B, Kioumourtzoglou MA, Kinney PL. Artificial intelligence: A key fulcrum for addressing complex environmental health issues. Environment International 2025; 198:109389. [PMID: 40121790 DOI: 10.1016/j.envint.2025.109389] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/02/2024] [Revised: 02/16/2025] [Accepted: 03/15/2025] [Indexed: 03/25/2025]
Abstract
Environmental health (EH) is a complex and interdisciplinary field dedicated to the examination of environmental behaviours, toxicological effects, health risks, and strategies for mitigating harmful environmental factors. Traditional EH research investigates correlations between risk factors and health outcomes through control variables, but this route is difficult to address complex EH issue. Artificial intelligence (AI) technology not only has accelerated the innovation of the scientific research paradigm but also has become an important tool for solving complex EH problems. However, the in-depth and comprehensive implementation of AI in the field of EH still faces many barriers, such as model generalizability, data privacy protection, algorithm transparency, and regulatory and ethical issues. This review focuses on the compound exposures of EH and explores the potential, challenges, and development directions of AI in four key phases of EH research: (1) data collection, fusion, and management, (2) hazard identification and screening, (3) risk modeling and assessment and (4) EH management. It is not difficult to see that in the future, artificial intelligence technology will inevitably carry out multidimensional simulation of complex exposure factors through multi-mode data fusion, so as to achieve accurate identification of environmental health risks, and eventually become an efficient tool for global environmental health management. This review will help researchers re-examine this strategy and provide a reference for AI to solve complex exposure problems.
Affiliation(s)
- Lei Huang
  - State Key Laboratory of Water Pollution Control and Green Resource Recycling, School of the Environment, Nanjing University, Nanjing 210023, China; Basic Science Center for Energy and Climate Change, Beijing 100081, China.
- Qiannan Duan
  - Shaanxi Key Laboratory of Earth Surface System and Environmental Carrying Capacity, College of Urban and Environmental Sciences, Northwest University, Xi'an 710127, China.
- Yuxin Liu
  - State Key Laboratory of Water Pollution Control and Green Resource Recycling, School of the Environment, Nanjing University, Nanjing 210023, China
- Yangyang Wu
  - State Key Laboratory of Water Pollution Control and Green Resource Recycling, School of the Environment, Nanjing University, Nanjing 210023, China
- Zenghui Li
  - State Key Laboratory of Water Pollution Control and Green Resource Recycling, School of the Environment, Nanjing University, Nanjing 210023, China
- Zhao Guo
  - State Key Laboratory of Water Pollution Control and Green Resource Recycling, School of the Environment, Nanjing University, Nanjing 210023, China
- Mingliang Liu
  - State Key Laboratory of Water Pollution Control and Green Resource Recycling, School of the Environment, Nanjing University, Nanjing 210023, China
- Xiaowei Lu
  - State Key Laboratory of Water Pollution Control and Green Resource Recycling, School of the Environment, Nanjing University, Nanjing 210023, China
- Peng Wang
  - Faculty of Civil Engineering and Mechanics, Jiangsu University, Zhenjiang 212013, China
- Fan Liu
  - State Key Laboratory of Water Pollution Control and Green Resource Recycling, School of the Environment, Nanjing University, Nanjing 210023, China
- Futian Ren
  - State Key Laboratory of Water Pollution Control and Green Resource Recycling, School of the Environment, Nanjing University, Nanjing 210023, China
- Chen Li
  - State Key Laboratory of Water Pollution Control and Green Resource Recycling, School of the Environment, Nanjing University, Nanjing 210023, China; Medical School, Nanjing University, Nanjing 210093, China
- Jiaming Wang
  - State Key Laboratory of Water Pollution Control and Green Resource Recycling, School of the Environment, Nanjing University, Nanjing 210023, China
- Yujia Huang
  - State Key Laboratory of Water Pollution Control and Green Resource Recycling, School of the Environment, Nanjing University, Nanjing 210023, China
- Beizhan Yan
  - Lamont-Doherty Earth Observatory, Columbia University, New York, USA
2
Aranganathan A, Gu X, Wang D, Vani BP, Tiwary P. Modeling Boltzmann-weighted structural ensembles of proteins using artificial intelligence-based methods. Curr Opin Struct Biol 2025; 91:103000. [PMID: 39923288 PMCID: PMC12011212 DOI: 10.1016/j.sbi.2025.103000] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/23/2024] [Revised: 01/09/2025] [Accepted: 01/20/2025] [Indexed: 02/11/2025]
Abstract
This review highlights recent advances in AI-driven methods for generating Boltzmann-weighted structural ensembles, which are crucial for understanding biomolecular dynamics and drug discovery. With the rise of deep learning models such as AlphaFold2, there has been a shift toward more accurate and efficient sampling of structural ensembles. The review discusses the integration of AI with traditional molecular dynamics techniques as well as experiments, the challenges of conformational sampling, and future directions for AI-driven research in structural biology, particularly in drug discovery and protein dynamics.
Affiliation(s)
- Akashnathan Aranganathan
  - Biophysics Program, University of Maryland, College Park, 20742, MD, USA; Institute of Physical Science and Technology, University of Maryland, College Park, 20742, MD, USA
- Xinyu Gu
  - Institute of Physical Science and Technology, University of Maryland, College Park, 20742, MD, USA; University of Maryland Institute for Health Computing, Bethesda, 20852, MD, USA.
- Dedi Wang
  - Genentech, 1 DNA Way, South San Francisco, 94080, CA, USA
- Bodhi P Vani
  - Genentech, 1 DNA Way, South San Francisco, 94080, CA, USA
- Pratyush Tiwary
  - Institute of Physical Science and Technology, University of Maryland, College Park, 20742, MD, USA; University of Maryland Institute for Health Computing, Bethesda, 20852, MD, USA; Department of Chemistry and Biochemistry, University of Maryland, College Park, 20742, MD, USA.
3
Phillips M, Muthukumar M, Ghosh K. Beyond monopole electrostatics in regulating conformations of intrinsically disordered proteins. PNAS Nexus 2024; 3:pgae367. [PMID: 39253398 PMCID: PMC11382291 DOI: 10.1093/pnasnexus/pgae367] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/26/2024] [Accepted: 08/13/2024] [Indexed: 09/11/2024]
Abstract
Conformations and dynamics of an intrinsically disordered protein (IDP) depend on its composition of charged and uncharged amino acids, and their specific placement in the protein sequence. In general, the charge (positive or negative) on an amino acid residue in the protein is not a fixed quantity. Each of the ionizable groups can exist in an equilibrated distribution of fully ionized state (monopole) and an ion-pair (dipole) state formed between the ionizing group and its counterion from the background electrolyte solution. The dipole formation (counterion condensation) depends on the protein conformation, which in turn depends on the distribution of charges and dipoles on the molecule. Consequently, effective charges of ionizable groups in the IDP backbone may differ from their chemical charges in isolation, a phenomenon termed charge-regulation. Accounting for the inevitable dipolar interactions, that have so far been ignored, and using a self-consistent procedure, we present a theory of charge-regulation as a function of sequence, temperature, and ionic strength. The theory quantitatively agrees with both charge reduction and salt-dependent conformation data of Prothymosin-alpha and makes several testable predictions. We predict charged groups are less ionized in sequences where opposite charges are well mixed compared to sequences where they are strongly segregated. Emergence of dipolar interactions from charge-regulation allows spontaneous coexistence of two phases having different conformations and charge states, sensitively depending on the charge patterning. These findings highlight sequence dependent charge-regulation and its potential exploitation by biological regulators such as phosphorylation and mutations in controlling protein conformation and function.
Affiliation(s)
- Michael Phillips
  - Department of Physics and Astronomy, University of Denver, Denver, CO 80208, USA
- Murugappan Muthukumar
  - Department of Polymer Science and Engineering, University of Massachusetts, Amherst, MA 01003, USA
- Kingshuk Ghosh
  - Department of Physics and Astronomy, University of Denver, Denver, CO 80208, USA
  - Molecular and Cellular Biophysics, University of Denver, Denver, CO 80208, USA
4
Lotthammer JM, Ginell GM, Griffith D, Emenecker RJ, Holehouse AS. Direct prediction of intrinsically disordered protein conformational properties from sequence. Nat Methods 2024; 21:465-476. [PMID: 38297184 PMCID: PMC10927563 DOI: 10.1038/s41592-023-02159-5] [Citation(s) in RCA: 26] [Impact Index Per Article: 26.0] [Received: 05/28/2023] [Accepted: 12/20/2023] [Indexed: 02/02/2024]
Abstract
Intrinsically disordered regions (IDRs) are ubiquitous across all domains of life and play a range of functional roles. While folded domains are generally well described by a stable three-dimensional structure, IDRs exist in a collection of interconverting states known as an ensemble. This structural heterogeneity means that IDRs are largely absent from the Protein Data Bank, contributing to a lack of computational approaches to predict ensemble conformational properties from sequence. Here we combine rational sequence design, large-scale molecular simulations and deep learning to develop ALBATROSS, a deep-learning model for predicting ensemble dimensions of IDRs, including the radius of gyration, end-to-end distance, polymer-scaling exponent and ensemble asphericity, directly from sequences at a proteome-wide scale. ALBATROSS is lightweight, easy to use and accessible as both a locally installable software package and a point-and-click-style interface via Google Colab notebooks. We first demonstrate the applicability of our predictors by examining the generalizability of sequence-ensemble relationships in IDRs. Then, we leverage the high-throughput nature of ALBATROSS to characterize the sequence-specific biophysical behavior of IDRs within and between proteomes.
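To make the ensemble quantities named in this abstract concrete, the following minimal sketch computes a radius of gyration and an end-to-end distance for toy coordinates and averages them over a small ensemble. It is not ALBATROSS and does not use its API: ALBATROSS predicts these values from sequence, whereas this sketch only illustrates the standard geometric definitions. The function names, the unweighted (equal-mass) centroid, and the three-bead toy conformations are all assumptions made for illustration.

```python
import math

def radius_of_gyration(coords):
    """Root-mean-square distance of points from their centroid (equal masses assumed)."""
    n = len(coords)
    centroid = tuple(sum(p[k] for p in coords) / n for k in range(3))
    mean_sq = sum(math.dist(p, centroid) ** 2 for p in coords) / n
    return math.sqrt(mean_sq)

def end_to_end_distance(coords):
    """Distance between the first and last point of the chain."""
    return math.dist(coords[0], coords[-1])

def ensemble_average(conformations, observable):
    """Plain (unweighted) average of an observable over conformations."""
    values = [observable(c) for c in conformations]
    return sum(values) / len(values)

# Hypothetical toy ensemble: two conformations of a three-bead chain.
toy_ensemble = [
    [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)],  # extended
    [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (1.0, 1.0, 0.0)],  # bent
]

mean_rg = ensemble_average(toy_ensemble, radius_of_gyration)
mean_ree = ensemble_average(toy_ensemble, end_to_end_distance)
print(f"mean Rg = {mean_rg:.3f}, mean end-to-end = {mean_ree:.3f}")
```

A real ensemble would come from simulation frames with one coordinate per residue or atom, and a mass-weighted centroid may be preferred depending on the convention in use.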
Affiliation(s)
- Jeffrey M Lotthammer
  - Department of Biochemistry and Molecular Biophysics, Washington University School of Medicine, St. Louis, MO, USA
  - Center for Biomolecular Condensates, Washington University in St. Louis, St. Louis, MO, USA
- Garrett M Ginell
  - Department of Biochemistry and Molecular Biophysics, Washington University School of Medicine, St. Louis, MO, USA
  - Center for Biomolecular Condensates, Washington University in St. Louis, St. Louis, MO, USA
- Daniel Griffith
  - Department of Biochemistry and Molecular Biophysics, Washington University School of Medicine, St. Louis, MO, USA
  - Center for Biomolecular Condensates, Washington University in St. Louis, St. Louis, MO, USA
- Ryan J Emenecker
  - Department of Biochemistry and Molecular Biophysics, Washington University School of Medicine, St. Louis, MO, USA
  - Center for Biomolecular Condensates, Washington University in St. Louis, St. Louis, MO, USA
- Alex S Holehouse
  - Department of Biochemistry and Molecular Biophysics, Washington University School of Medicine, St. Louis, MO, USA.
  - Center for Biomolecular Condensates, Washington University in St. Louis, St. Louis, MO, USA.
5
An easy-to-use computational tool for predicting 3D properties of disordered proteins. Nat Methods 2024; 21:385-386. [PMID: 38297185 DOI: 10.1038/s41592-023-02160-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 02/02/2024]
6
Ektefaie Y, Shen A, Bykova D, Marin M, Zitnik M, Farhat M. Evaluating generalizability of artificial intelligence models for molecular datasets. bioRxiv 2024:2024.02.25.581982. [PMID: 38464295 PMCID: PMC10925170 DOI: 10.1101/2024.02.25.581982] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 03/12/2024]
Abstract
Deep learning has made rapid advances in modeling molecular sequencing data. Despite achieving high performance on benchmarks, it remains unclear to what extent deep learning models learn general principles and generalize to previously unseen sequences. Benchmarks traditionally interrogate model generalizability by generating metadata based (MB) or sequence-similarity based (SB) train and test splits of input data before assessing model performance. Here, we show that this approach mischaracterizes model generalizability by failing to consider the full spectrum of cross-split overlap, i.e., similarity between train and test splits. We introduce Spectra, a spectral framework for comprehensive model evaluation. For a given model and input data, Spectra plots model performance as a function of decreasing cross-split overlap and reports the area under this curve as a measure of generalizability. We apply Spectra to 18 sequencing datasets with associated phenotypes ranging from antibiotic resistance in tuberculosis to protein-ligand binding to evaluate the generalizability of 19 state-of-the-art deep learning models, including large language models, graph neural networks, diffusion models, and convolutional neural networks. We show that SB and MB splits provide an incomplete assessment of model generalizability. With Spectra, we find as cross-split overlap decreases, deep learning models consistently exhibit a reduction in performance in a task- and model-dependent manner. Although no model consistently achieved the highest performance across all tasks, we show that deep learning models can generalize to previously unseen sequences on specific tasks. Spectra paves the way toward a better understanding of how foundation models generalize in biology.
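The evaluation idea described in this abstract, plotting performance against cross-split overlap and reporting the area under that curve, can be sketched as below. This is not the Spectra software or its API. The abstract does not say how cross-split overlap is computed or whether the area is normalized, so the overlap values, the scores, and the function name here are hypothetical, and the area is a plain trapezoidal integral.

```python
def area_under_performance_curve(overlaps, scores):
    """Trapezoidal area under a score-versus-overlap curve.

    overlaps: cross-split overlap for each train/test split (assumed in [0, 1])
    scores:   model performance measured on the matching split
    """
    if len(overlaps) != len(scores) or len(overlaps) < 2:
        raise ValueError("need at least two (overlap, score) pairs of equal length")
    points = sorted(zip(overlaps, scores))  # integrate along increasing overlap
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area

# Hypothetical results: performance drops as train/test overlap decreases.
overlaps = [1.0, 0.75, 0.5, 0.25, 0.0]
scores = [0.9, 0.8, 0.7, 0.6, 0.5]

auc = area_under_performance_curve(overlaps, scores)
print(f"area under the performance curve: {auc:.3f}")
```

Because overlap spans 0 to 1 in this toy case, the area can be read as an average performance across the overlap range; a model that degrades sharply at low overlap would score lower than one that stays flat.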
Affiliation(s)
- Yasha Ektefaie
  - Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA
- Andrew Shen
  - Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA
  - Department of Computer Science, Northwestern University, Evanston, IL, USA
- Daria Bykova
  - Department of Biological Sciences, Columbia University, New York, NY, USA
- Maximillian Marin
  - Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA
- Marinka Zitnik
  - Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA
  - Kempner Institute for the Study of Natural and Artificial Intelligence, Harvard University, MA, USA
  - Broad Institute of MIT and Harvard, Cambridge, MA, USA
  - Harvard Data Science Initiative, Cambridge, MA, USA
- Maha Farhat
  - Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA
  - Division of Pulmonary and Critical Care, Department of Medicine, Massachusetts General Hospital, Boston, MA, USA
7
Tučs A, Ito T, Kurumida Y, Kawada S, Nakazawa H, Saito Y, Umetsu M, Tsuda K. Extensive antibody search with whole spectrum black-box optimization. Sci Rep 2024; 14:552. [PMID: 38177656 PMCID: PMC10767033 DOI: 10.1038/s41598-023-51095-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/29/2023] [Accepted: 12/30/2023] [Indexed: 01/06/2024]
Abstract
In designing functional biological sequences with machine learning, the activity predictor tends to be inaccurate due to shortage of data. Top ranked sequences are thus unlikely to contain effective ones. This paper proposes to take prediction stability into account to provide domain experts with a reasonable list of sequences to choose from. In our approach, multiple prediction models are trained by subsampling the training set and the multi-objective optimization problem, where one objective is the average activity and the other is the standard deviation, is solved. The Pareto front represents a list of sequences with the whole spectrum of activity and stability. Using this method, we designed VHH (Variable domain of Heavy chain of Heavy chain) antibodies based on the dataset obtained from deep mutational screening. To solve multi-objective optimization, we employed our sequence design software MOQA that uses quantum annealing. By applying several selection criteria to 19,778 designed sequences, five sequences were selected for wet-lab validation. One sequence, 16 mutations away from the closest training sequence, was successfully expressed and found to possess desired binding specificity. Our whole spectrum approach provides a balanced way of dealing with the prediction uncertainty, and can possibly be applied to extensive search of functional sequences.
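The two-objective selection described in this abstract can be illustrated with a small sketch that takes predictions from several subsampled models, computes the mean and standard deviation per candidate, and keeps the non-dominated candidates. This is not MOQA and involves no quantum annealing: it swaps in a brute-force Pareto filter over a fixed candidate list. The direction of the second objective (lower standard deviation treated as better), the function name, and the toy predictions are assumptions.

```python
from statistics import mean, pstdev

def pareto_front(predictions):
    """Return candidates not dominated in (higher mean, lower std).

    predictions: dict mapping a candidate name to the list of activity
    predictions produced by models trained on different subsamples.
    """
    stats = {name: (mean(p), pstdev(p)) for name, p in predictions.items()}
    front = []
    for name, (m, s) in stats.items():
        dominated = any(
            m2 >= m and s2 <= s and (m2 > m or s2 < s)
            for other, (m2, s2) in stats.items()
            if other != name
        )
        if not dominated:
            front.append(name)
    return sorted(front), stats

# Hypothetical predictions from three subsampled models per candidate.
predictions = {
    "seq_a": [0.9, 0.5, 0.7],  # highest mean, least stable
    "seq_b": [0.6, 0.6, 0.6],  # lower mean, perfectly stable
    "seq_c": [0.5, 0.3, 0.4],  # worse than seq_b on both objectives
}

front, stats = pareto_front(predictions)
print(front)
```

In this toy case `seq_c` is dropped because `seq_b` has both a higher mean and a lower spread, while `seq_a` and `seq_b` both remain since neither beats the other on both objectives. The quadratic pairwise comparison is fine for a short candidate list but would not scale to a search over sequence space.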
Affiliation(s)
- Andrejs Tučs
  - Graduate School of Frontier Sciences, The University of Tokyo, Kashiwa, Japan
- Tomoyuki Ito
  - Department of Biomolecular Engineering, Graduate School of Engineering, Tohoku University, Sendai, Japan
- Yoichi Kurumida
  - Artificial Intelligence Research Center, National Institute of Advanced Industrial Science and Technology (AIST), Tokyo, Japan
  - Department of Data Science, School of Frontier Engineering, Kitasato University, Sagamihara, Japan
- Sakiya Kawada
  - Department of Biomolecular Engineering, Graduate School of Engineering, Tohoku University, Sendai, Japan
- Hikaru Nakazawa
  - Department of Biomolecular Engineering, Graduate School of Engineering, Tohoku University, Sendai, Japan
- Yutaka Saito
  - Graduate School of Frontier Sciences, The University of Tokyo, Kashiwa, Japan
  - Artificial Intelligence Research Center, National Institute of Advanced Industrial Science and Technology (AIST), Tokyo, Japan
  - RIKEN Center for Advanced Intelligence Project, RIKEN, Tokyo, 103-0027, Japan
  - AIST-Waseda University Computational Bio Big-Data Open Innovation Laboratory (CBBD-OIL), Tokyo, Japan
  - Department of Data Science, School of Frontier Engineering, Kitasato University, Sagamihara, Japan
- Mitsuo Umetsu
  - Department of Biomolecular Engineering, Graduate School of Engineering, Tohoku University, Sendai, Japan.
  - RIKEN Center for Advanced Intelligence Project, RIKEN, Tokyo, 103-0027, Japan.
- Koji Tsuda
  - Graduate School of Frontier Sciences, The University of Tokyo, Kashiwa, Japan.
  - RIKEN Center for Advanced Intelligence Project, RIKEN, Tokyo, 103-0027, Japan.
  - Center for Basic Research on Materials, National Institute for Materials Science (NIMS), Tsukuba, Japan.
8
Dou Z, Sun Y, Jiang X, Wu X, Li Y, Gong B, Wang L. Data-driven strategies for the computational design of enzyme thermal stability: trends, perspectives, and prospects. Acta Biochim Biophys Sin (Shanghai) 2023; 55:343-355. [PMID: 37143326 PMCID: PMC10160227 DOI: 10.3724/abbs.2023033] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/06/2022] [Accepted: 11/23/2022] [Indexed: 03/05/2023]
Abstract
Thermal stability is one of the most important properties of enzymes, which sustains life and determines the potential for the industrial application of biocatalysts. Although traditional methods such as directed evolution and classical rational design contribute greatly to this field, the enormous sequence space of proteins implies costly and arduous experiments. The development of enzyme engineering focuses on automated and efficient strategies because of the breakthrough of high-throughput DNA sequencing and machine learning models. In this review, we propose a data-driven architecture for enzyme thermostability engineering and summarize some widely adopted datasets, as well as machine learning-driven approaches for designing the thermal stability of enzymes. In addition, we present a series of existing challenges while applying machine learning in enzyme thermostability design, such as the data dilemma, model training, and use of the proposed models. Additionally, a few promising directions for enhancing the performance of the models are discussed. We anticipate that the efficient incorporation of machine learning can provide more insights and solutions for the design of enzyme thermostability in the coming years.
Affiliation(s)
- Zhixin Dou
  - State Key Laboratory of Microbial Technology, Shandong University, Qingdao 266237, China
- Yuqing Sun
  - School of Software, Shandong University, Jinan 250101, China
- Xukai Jiang
  - National Glycoengineering Research Center, Shandong University, Qingdao 266237, China
- Xiuyun Wu
  - State Key Laboratory of Microbial Technology, Shandong University, Qingdao 266237, China
- Yingjie Li
  - State Key Laboratory of Microbial Technology, Shandong University, Qingdao 266237, China
- Bin Gong
  - School of Software, Shandong University, Jinan 250101, China
- Lushan Wang
  - State Key Laboratory of Microbial Technology, Shandong University, Qingdao 266237, China
9
Hay IM, Shamin M, Caroe ER, Mohammed ASA, Svergun DI, Jeffries CM, Graham SC, Sharpe HJ, Deane JE. Determinants of receptor tyrosine phosphatase homophilic adhesion: Structural comparison of PTPRK and PTPRM extracellular domains. J Biol Chem 2022; 299:102750. [PMID: 36436563 PMCID: PMC9800333 DOI: 10.1016/j.jbc.2022.102750] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 08/31/2022] [Revised: 11/21/2022] [Accepted: 11/22/2022] [Indexed: 11/27/2022]
Abstract
Type IIB receptor protein tyrosine phosphatases are cell surface transmembrane proteins that engage in cell adhesion via their extracellular domains (ECDs) and cell signaling via their cytoplasmic phosphatase domains. The ECDs of type IIB receptor protein tyrosine phosphatases form stable, homophilic, and trans interactions between adjacent cell membranes. Previous work has demonstrated how one family member, PTPRM, forms head-to-tail homodimers. However, as the interface was composed of residues conserved across the family, the determinants of homophilic specificity remain unknown. Here, we have solved the X-ray crystal structure of the membrane-distal N-terminal domains of PTPRK that form a head-to-tail dimer consistent with intermembrane adhesion. Comparison with the PTPRM structure demonstrates interdomain conformational differences that may define homophilic specificity. Using small-angle X-ray scattering, we determined the solution structures of the full-length ECDs of PTPRM and PTPRK, identifying that both are rigid extended molecules that differ in their overall long-range conformation. Furthermore, we identified one residue, W351, within the interaction interface that differs between PTPRM and PTPRK and showed that mutation to glycine, the equivalent residue in PTPRM, abolishes PTPRK dimer formation in vitro. This comparison of two members of the receptor tyrosine phosphatase family suggests that homophilic specificity is driven by a combination of shape complementarity and specific but limited sequence differences.
Affiliation(s)
- Iain M Hay
  - Cambridge Institute for Medical Research, University of Cambridge, Cambridge, United Kingdom; Signalling Programme, Babraham Institute, Babraham Research Campus, Cambridge, United Kingdom
- Maria Shamin
  - Cambridge Institute for Medical Research, University of Cambridge, Cambridge, United Kingdom
- Eve R Caroe
  - Cambridge Institute for Medical Research, University of Cambridge, Cambridge, United Kingdom
- Ahmed S A Mohammed
  - European Molecular Biology Laboratory (EMBL) Hamburg Site, Hamburg, Germany
- Dmitri I Svergun
  - European Molecular Biology Laboratory (EMBL) Hamburg Site, Hamburg, Germany
- Cy M Jeffries
  - European Molecular Biology Laboratory (EMBL) Hamburg Site, Hamburg, Germany
- Stephen C Graham
  - Department of Pathology, University of Cambridge, Cambridge, United Kingdom
- Hayley J Sharpe
  - Signalling Programme, Babraham Institute, Babraham Research Campus, Cambridge, United Kingdom.
- Janet E Deane
  - Cambridge Institute for Medical Research, University of Cambridge, Cambridge, United Kingdom.
10
Freschlin CR, Fahlberg SA, Romero PA. Machine learning to navigate fitness landscapes for protein engineering. Curr Opin Biotechnol 2022; 75:102713. [PMID: 35413604 PMCID: PMC9177649 DOI: 10.1016/j.copbio.2022.102713] [Citation(s) in RCA: 46] [Impact Index Per Article: 15.3] [Received: 10/19/2021] [Revised: 01/05/2022] [Accepted: 02/28/2022] [Indexed: 11/19/2022]
Abstract
Machine learning (ML) is revolutionizing our ability to understand and predict the complex relationships between protein sequence, structure, and function. Predictive sequence-function models are enabling protein engineers to efficiently search the sequence space for useful proteins with broad applications in biotechnology. In this review, we highlight the recent advances in applying ML to protein engineering. We discuss supervised learning methods that infer the sequence-function mapping from experimental data and new sequence representation strategies for data-efficient modeling. We then describe the various ways in which ML can be incorporated into protein engineering workflows, including purely in silico searches, ML-assisted directed evolution, and generative models that can learn the underlying distribution of the protein function in a sequence space. ML-driven protein engineering will become increasingly powerful with continued advances in high-throughput data generation, data science, and deep learning.
Affiliation(s)
- Chase R Freschlin
  - Department of Biochemistry, University of Wisconsin-Madison, Madison, WI, USA
- Sarah A Fahlberg
  - Department of Biochemistry, University of Wisconsin-Madison, Madison, WI, USA
- Philip A Romero
  - Department of Biochemistry, University of Wisconsin-Madison, Madison, WI, USA; Department of Chemical & Biological Engineering, University of Wisconsin-Madison, Madison, WI, USA.
11
Emenecker RJ, Griffith D, Holehouse AS. Metapredict: a fast, accurate, and easy-to-use predictor of consensus disorder and structure. Biophys J 2021; 120:4312-4319. [PMID: 34480923 PMCID: PMC8553642 DOI: 10.1016/j.bpj.2021.08.039] [Citation(s) in RCA: 128] [Impact Index Per Article: 32.0] [Received: 05/27/2021] [Revised: 08/08/2021] [Accepted: 08/30/2021] [Indexed: 01/02/2023]
Abstract
Intrinsically disordered proteins and protein regions make up a substantial fraction of many proteomes in which they play a wide variety of essential roles. A critical first step in understanding the role of disordered protein regions in biological function is to identify those disordered regions correctly. Computational methods for disorder prediction have emerged as a core set of tools to guide experiments, interpret results, and develop hypotheses. Given the multiple different predictors available, consensus scores have emerged as a popular approach to mitigate biases or limitations of any single method. Consensus scores integrate the outcome of multiple independent disorder predictors and provide a per-residue value that reflects the number of tools that predict a residue to be disordered. Although consensus scores help mitigate the inherent problems of using any single disorder predictor, they are computationally expensive to generate. They also necessitate the installation of multiple different software tools, which can be prohibitively difficult. To address this challenge, we developed a deep-learning-based predictor of consensus disorder scores. Our predictor, metapredict, utilizes a bidirectional recurrent neural network trained on the consensus disorder scores from 12 proteomes. By benchmarking metapredict using two orthogonal approaches, we found that metapredict is among the most accurate disorder predictors currently available. Metapredict is also remarkably fast, enabling proteome-scale disorder prediction in minutes. Importantly, metapredict is a fully open source and is distributed as a Python package, a collection of command-line tools, and a web server, maximizing the potential practical utility of the predictor. We believe metapredict offers a convenient, accessible, accurate, and high-performance predictor for single-proteins and proteomes alike.
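The consensus score this abstract describes, a per-residue value reflecting how many tools call a residue disordered, can be sketched as follows. This is not the metapredict package or its API, and it does not reproduce the neural network the authors trained. It only shows the underlying consensus calculation, with a hypothetical function name and made-up binary calls from three unnamed predictors, expressed here as a fraction rather than a raw count.

```python
def consensus_disorder(calls):
    """Per-residue fraction of predictors that call a residue disordered.

    calls: list of per-predictor lists, one 0/1 call per residue
           (1 = predicted disordered), all of the same length.
    """
    if not calls:
        raise ValueError("need at least one predictor")
    length = len(calls[0])
    if any(len(c) != length for c in calls):
        raise ValueError("all predictors must cover the same residues")
    n = len(calls)
    return [sum(c[i] for c in calls) / n for i in range(length)]

# Hypothetical binary calls from three predictors on a five-residue sequence.
calls = [
    [1, 1, 0, 0, 1],
    [1, 0, 0, 0, 1],
    [1, 1, 0, 1, 1],
]

scores = consensus_disorder(calls)
print([round(s, 2) for s in scores])
```

Producing the input calls is the expensive step the abstract points to, since each predictor has to be installed and run separately; a learned predictor of the consensus avoids that by going from sequence to score directly.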
Affiliation(s)
- Ryan J Emenecker
  - Department of Biochemistry and Molecular Biophysics, Washington University School of Medicine, St. Louis, Missouri; Center for Science and Engineering Living Systems (CSELS), St. Louis, Missouri; Center for Engineering Mechanobiology, Washington University, St. Louis, Missouri
- Daniel Griffith
  - Department of Biochemistry and Molecular Biophysics, Washington University School of Medicine, St. Louis, Missouri; Center for Science and Engineering Living Systems (CSELS), St. Louis, Missouri
- Alex S Holehouse
  - Department of Biochemistry and Molecular Biophysics, Washington University School of Medicine, St. Louis, Missouri; Center for Science and Engineering Living Systems (CSELS), St. Louis, Missouri.