1. Refahi M, Sokhansanj BA, Mell JC, Brown JR, Yoo H, Hearne G, Rosen GL. Enhancing nucleotide sequence representations in genomic analysis with contrastive optimization. Commun Biol 2025; 8:517. [PMID: 40155693 PMCID: PMC11953366 DOI: 10.1038/s42003-025-07902-6] [Received: 07/30/2024] [Accepted: 03/07/2025] [Indexed: 04/01/2025] Open
Abstract
Analysis of genomic and metagenomic sequences is inherently more challenging than that of amino acid sequences due to the higher divergence among evolutionarily related nucleotide sequences, variable k-mer and codon usage within and among genomes of diverse species, and poorly understood selective constraints. We introduce Scorpio (Sequence Contrastive Optimization for Representation and Predictive Inference on DNA), a versatile framework designed for nucleotide sequences that employs contrastive learning to improve embeddings. By leveraging pre-trained genomic language models and k-mer frequency embeddings, Scorpio demonstrates competitive performance in diverse applications, including taxonomic and gene classification, antimicrobial resistance (AMR) gene identification, and promoter detection. A key strength of Scorpio is its ability to generalize to novel DNA sequences and taxa, addressing a significant limitation of alignment-based methods. Scorpio has been tested on multiple datasets with DNA sequences of varying lengths (long and short) and shows robust inference capabilities. Additionally, we provide an analysis of the biological information underlying this representation, including correlations between codon adaptation index as a gene expression factor, sequence similarity, and taxonomy, as well as the functional and structural information of genes.
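To make the phrase "k-mer frequency embeddings" concrete, the following is a minimal sketch of a normalized k-mer frequency vector for a DNA sequence. It is an illustration only and is not taken from Scorpio: the function name `kmer_frequency_vector`, the default `k=3`, and the choice to skip k-mers containing non-ACGT characters are all assumptions.

```python
from collections import Counter
from itertools import product


def kmer_frequency_vector(seq, k=3):
    """Return normalized k-mer frequencies in lexicographic ACGT order.

    Hypothetical helper for illustration; k-mers containing characters
    other than A, C, G, T (for example N) are ignored.
    """
    seq = seq.upper()
    # All 4**k possible k-mers, in a fixed order so vectors are comparable.
    vocabulary = ["".join(p) for p in product("ACGT", repeat=k)]
    # Count every overlapping window of length k.
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts[kmer] for kmer in vocabulary)
    if total == 0:
        return [0.0] * len(vocabulary)
    return [counts[kmer] / total for kmer in vocabulary]


vec = kmer_frequency_vector("ACGTACGT", k=2)
```

A vector like this has a fixed length of 4**k regardless of sequence length, which is why such features are convenient inputs to a learned embedding model.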
Affiliation(s)
- Bahrad A Sokhansanj
  - Electrical and Computer Engineering, Drexel University, Philadelphia, PA, USA
- Joshua C Mell
  - College of Medicine, Drexel University, Philadelphia, PA, USA
- James R Brown
  - Electrical and Computer Engineering, Drexel University, Philadelphia, PA, USA
- Hyunwoo Yoo
  - Electrical and Computer Engineering, Drexel University, Philadelphia, PA, USA
- Gavin Hearne
  - Electrical and Computer Engineering, Drexel University, Philadelphia, PA, USA
- Gail L Rosen
  - Electrical and Computer Engineering, Drexel University, Philadelphia, PA, USA

2. von Bülow S, Tesei G, Lindorff-Larsen K. Machine learning methods to study sequence-ensemble-function relationships in disordered proteins. Curr Opin Struct Biol 2025; 92:103028. [PMID: 40081192 DOI: 10.1016/j.sbi.2025.103028] [Received: 10/21/2024] [Revised: 02/13/2025] [Accepted: 02/14/2025] [Indexed: 03/15/2025]
Abstract
Recent years have seen tremendous developments in the use of machine learning models to link amino-acid sequence, structure, and function of folded proteins. These methods are, however, rarely applicable to the wide range of proteins and sequences that comprise intrinsically disordered regions. We here review developments in the study of sequence-ensemble-function relationships of disordered proteins that exploit or are used to train machine learning models. These include methods for generating conformational ensembles and designing new sequences, and for linking sequences to biophysical properties and biological functions. We highlight how these developments are built on a tight integration between experiment, theory and simulations, and account for evolutionary constraints, which operate on sequences of disordered regions differently than on those of folded domains.
Affiliation(s)
- Sören von Bülow
  - Structural Biology and NMR Laboratory & the Linderstrøm-Lang Centre for Protein Science, Department of Biology, University of Copenhagen, Ole Maaløes Vej 5, 2200, Copenhagen, Denmark
- Giulio Tesei
  - Structural Biology and NMR Laboratory & the Linderstrøm-Lang Centre for Protein Science, Department of Biology, University of Copenhagen, Ole Maaløes Vej 5, 2200, Copenhagen, Denmark
- Kresten Lindorff-Larsen
  - Structural Biology and NMR Laboratory & the Linderstrøm-Lang Centre for Protein Science, Department of Biology, University of Copenhagen, Ole Maaløes Vej 5, 2200, Copenhagen, Denmark

3. Zhang F, Kurgan L. Evaluation of predictions of disordered binding regions in the CAID2 experiment. Comput Struct Biotechnol J 2024; 27:78-88. [PMID: 39811792 PMCID: PMC11732247 DOI: 10.1016/j.csbj.2024.12.009] [Received: 10/21/2024] [Revised: 12/12/2024] [Accepted: 12/13/2024] [Indexed: 01/16/2025] Open
Abstract
A large portion of the intrinsically disordered regions (IDRs) in protein sequences interact with proteins, nucleic acids, and other types of ligands. Correspondingly, dozens of sequence-based predictors of binding IDRs have been developed. The recently completed second community-based Critical Assessment of protein Intrinsic Disorder prediction (CAID2) evaluated 32 predictors of binding IDRs. However, CAID2 considered a rather narrow scenario by testing on 78 proteins with binding IDRs and not differentiating between ligands, even though virtually all predictors target IDRs that interact with specific types of ligands. In that scenario, several intrinsic disorder predictors predict binding IDRs with accuracy equivalent to the best predictors of binding IDRs, since the large majority of IDRs in the 78 test proteins are binding. We substantially extended the CAID2 evaluation by using the entire CAID2 dataset of 348 proteins and considering several arguably more practical scenarios. We assessed whether predictors accurately differentiate binding IDRs from other types of IDRs and how they perform when predicting IDRs that interact with different ligand types. We found that intrinsic disorder predictors cannot accurately identify binding IDRs among other disordered regions, that the majority of the predictors of binding IDRs are ligand-type agnostic (i.e., they cross-predict binding in IDRs that interact with ligands that they do not cover), and that only a handful of predictors of binding IDRs perform relatively well and generate reasonably low amounts of cross-predictions. We also suggest a number of future research directions that would move this active field of research forward.
Affiliation(s)
- Fuhao Zhang
  - College of Information Engineering, Northwest A & F University, China
- Lukasz Kurgan
  - Department of Computer Science, Virginia Commonwealth University, Richmond, VA 23284, USA

4. Chow CFW, Ghosh S, Hadarovich A, Toth-Petroczy A. SHARK enables sensitive detection of evolutionary homologs and functional analogs in unalignable and disordered sequences. Proc Natl Acad Sci U S A 2024; 121:e2401622121. [PMID: 39383002 PMCID: PMC11494347 DOI: 10.1073/pnas.2401622121] [Received: 01/24/2024] [Accepted: 08/30/2024] [Indexed: 10/11/2024] Open
Abstract
Intrinsically disordered regions (IDRs) are structurally flexible protein segments with regulatory functions in multiple contexts, such as in the assembly of biomolecular condensates. Since IDRs undergo more rapid evolution than ordered regions, identifying homology of such poorly conserved regions remains challenging for state-of-the-art alignment-based methods that rely on position-specific conservation of residues. Thus, systematic functional annotation and evolutionary analysis of IDRs have been limited, despite them comprising ~21% of proteins. To accurately assess homology between unalignable sequences, we developed an alignment-free sequence comparison algorithm, SHARK (Similarity/Homology Assessment by Relating K-mers). We trained SHARK-dive, a machine learning homology classifier, which achieved superior performance to standard alignment-based approaches in assessing evolutionary homology in unalignable sequences. Furthermore, it correctly identified dissimilar but functionally analogous IDRs in IDR-replacement experiments reported in the literature, whereas alignment-based tools were incapable of detecting such functional relationships. SHARK-dive not only predicts functionally similar IDRs at a proteome-wide scale but also identifies cryptic sequence properties and motifs that drive remote homology and analogy, thereby providing interpretable and experimentally verifiable hypotheses of the sequence determinants that underlie such relationships. SHARK-dive acts as an alternative to alignment to facilitate systematic analysis and functional annotation of the unalignable protein universe.
Affiliation(s)
- Chi Fung Willis Chow
  - Max Planck Institute of Molecular Cell Biology and Genetics, Dresden 01307, Germany
  - Center for Systems Biology Dresden, Dresden 01307, Germany
  - Cluster of Excellence Physics of Life, Technische Universität Dresden, Dresden 01062, Germany
- Soumyadeep Ghosh
  - Max Planck Institute of Molecular Cell Biology and Genetics, Dresden 01307, Germany
  - Center for Systems Biology Dresden, Dresden 01307, Germany
- Anna Hadarovich
  - Max Planck Institute of Molecular Cell Biology and Genetics, Dresden 01307, Germany
  - Center for Systems Biology Dresden, Dresden 01307, Germany
- Agnes Toth-Petroczy
  - Max Planck Institute of Molecular Cell Biology and Genetics, Dresden 01307, Germany
  - Center for Systems Biology Dresden, Dresden 01307, Germany
  - Cluster of Excellence Physics of Life, Technische Universität Dresden, Dresden 01062, Germany

5. Phillips M, Muthukumar M, Ghosh K. Beyond monopole electrostatics in regulating conformations of intrinsically disordered proteins. PNAS Nexus 2024; 3:pgae367. [PMID: 39253398 PMCID: PMC11382291 DOI: 10.1093/pnasnexus/pgae367] [Received: 03/26/2024] [Accepted: 08/13/2024] [Indexed: 09/11/2024]
Abstract
Conformations and dynamics of an intrinsically disordered protein (IDP) depend on its composition of charged and uncharged amino acids, and their specific placement in the protein sequence. In general, the charge (positive or negative) on an amino acid residue in the protein is not a fixed quantity. Each of the ionizable groups can exist in an equilibrated distribution of a fully ionized state (monopole) and an ion-pair (dipole) state formed between the ionizing group and its counterion from the background electrolyte solution. The dipole formation (counterion condensation) depends on the protein conformation, which in turn depends on the distribution of charges and dipoles on the molecule. Consequently, effective charges of ionizable groups in the IDP backbone may differ from their chemical charges in isolation, a phenomenon termed charge-regulation. Accounting for the inevitable dipolar interactions, which have so far been ignored, and using a self-consistent procedure, we present a theory of charge-regulation as a function of sequence, temperature, and ionic strength. The theory quantitatively agrees with both charge reduction and salt-dependent conformation data of Prothymosin-alpha and makes several testable predictions. We predict that charged groups are less ionized in sequences where opposite charges are well mixed compared to sequences where they are strongly segregated. Emergence of dipolar interactions from charge-regulation allows spontaneous coexistence of two phases having different conformations and charge states, sensitively depending on the charge patterning. These findings highlight sequence-dependent charge-regulation and its potential exploitation by biological regulators such as phosphorylation and mutations in controlling protein conformation and function.
Affiliation(s)
- Michael Phillips
  - Department of Physics and Astronomy, University of Denver, Denver, CO 80208, USA
- Murugappan Muthukumar
  - Department of Polymer Science and Engineering, University of Massachusetts, Amherst, MA 01003, USA
- Kingshuk Ghosh
  - Department of Physics and Astronomy, University of Denver, Denver, CO 80208, USA
  - Molecular and Cellular Biophysics, University of Denver, Denver, CO 80208, USA

6. Halpin JC, Keating AE. PairK: Pairwise k-mer alignment for quantifying protein motif conservation in disordered regions. bioRxiv 2024:2024.07.23.604860. [PMID: 39091826 PMCID: PMC11291154 DOI: 10.1101/2024.07.23.604860] [Indexed: 08/04/2024]
Abstract
Protein-protein interactions are often mediated by a modular peptide recognition domain binding to a short linear motif (SLiM) in the disordered region of another protein. The ability to predict domain-SLiM interactions would allow researchers to map protein interaction networks, predict the effects of perturbations to those networks, and develop biologically meaningful hypotheses. Unfortunately, sequence database searches for SLiMs generally yield mostly biologically irrelevant motif matches or false positives. To improve the prediction of novel SLiM interactions, researchers employ filters to discriminate between biologically relevant and improbable motif matches. One promising criterion for identifying biologically relevant SLiMs is the sequence conservation of the motif, exploiting the fact that functional motifs are more likely to be conserved than spurious motif matches. However, the difficulty of aligning disordered regions has significantly hampered the utility of this approach. We present PairK (pairwise k-mer alignment), an MSA-free method to quantify motif conservation in disordered regions. PairK outperforms both standard MSA-based conservation scores and a modern LLM-based conservation score predictor on the task of identifying biologically important motif instances. PairK can quantify conservation over wider phylogenetic distances than MSAs, indicating that SLiMs may be more conserved than is implied by MSA-based metrics. PairK is available as open-source code at https://github.com/jacksonh1/pairk.
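The general idea of MSA-free, pairwise k-mer matching can be sketched as follows: for a motif-containing k-mer in a query disordered region, find the best gapless match among all k-mers of each homologous region, then average the match quality across homologs. This is a toy illustration under stated assumptions, not PairK's implementation or API: the function names are hypothetical, and the scoring here is plain residue identity rather than whatever scoring the tool actually uses.

```python
def kmers(seq, k):
    """All overlapping windows of length k in seq."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def identity_score(a, b):
    """Number of positions at which two equal-length k-mers agree."""
    return sum(x == y for x, y in zip(a, b))


def best_match(query_kmer, homolog_seq):
    """Best gapless match to query_kmer in homolog_seq (assumed scoring: identity)."""
    candidates = kmers(homolog_seq, len(query_kmer))
    if not candidates:
        return None, 0
    best = max(candidates, key=lambda c: identity_score(query_kmer, c))
    return best, identity_score(query_kmer, best)


def motif_conservation(query_kmer, homolog_seqs):
    """Mean fractional identity of the best match in each homolog."""
    k = len(query_kmer)
    scores = [best_match(query_kmer, h)[1] / k for h in homolog_seqs]
    return sum(scores) / len(scores) if scores else 0.0


match, score = best_match("PPLP", "AAPPLPAA")
conservation = motif_conservation("PPLP", ["AAPPLPAA", "GGPALPGG"])
```

Because each homolog is searched independently, the motif can sit at a different position in every sequence, which is the situation in which column-based MSA conservation scores tend to break down.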
Affiliation(s)
- Jackson C. Halpin
  - MIT Department of Biology, 77 Massachusetts Ave., Cambridge, MA 02139
- Amy E. Keating
  - MIT Department of Biology, 77 Massachusetts Ave., Cambridge, MA 02139
  - MIT Department of Biological Engineering, 77 Massachusetts Ave., Cambridge, MA 02139
  - Koch Institute for Integrative Cancer Research, 77 Massachusetts Ave., Cambridge, MA 02139

7. Pastic A, Nosella ML, Kochhar A, Liu ZH, Forman-Kay JD, D'Amours D. Chromosome compaction is triggered by an autonomous DNA-binding module within condensin. Cell Rep 2024; 43:114419. [PMID: 38985672 DOI: 10.1016/j.celrep.2024.114419] [Received: 10/06/2023] [Revised: 04/16/2024] [Accepted: 06/14/2024] [Indexed: 07/12/2024] Open
Abstract
The compaction of chromatin into mitotic chromosomes is essential for faithful transmission of the genome during cell division. In eukaryotes, chromosome morphogenesis is regulated by the condensin complex, though the exact mechanism used to target condensin to chromatin and initiate condensation is not understood. Here, we reveal that condensin contains an intrinsically disordered region (IDR) that modulates its association with chromatin in early mitosis and exhibits phase separation. We describe DNA-binding motifs within the IDR that, upon deletion, inflict striking defects in chromosome condensation and segregation, ill-timed condensin turnover on chromatin, and cell death. Importantly, we demonstrate that the condensin IDR can impart cell cycle regulatory functions when transferred to other subunits within the complex, indicating its autonomous nature. Collectively, our study unveils the molecular basis for the initiation of chromosome condensation in early mitosis and how this process ultimately promotes genomic stability and faultless cell division.
Affiliation(s)
- Alyssa Pastic
  - Ottawa Institute of Systems Biology, Department of Cellular and Molecular Medicine, University of Ottawa, Ottawa, ON K1H 8M5, Canada
- Michael L Nosella
  - Molecular Medicine Program, The Hospital for Sick Children, Toronto, ON M5G 0A4, Canada; Department of Biochemistry, University of Toronto, Toronto, ON M5S 1A8, Canada
- Annahat Kochhar
  - Ottawa Institute of Systems Biology, Department of Cellular and Molecular Medicine, University of Ottawa, Ottawa, ON K1H 8M5, Canada
- Zi Hao Liu
  - Molecular Medicine Program, The Hospital for Sick Children, Toronto, ON M5G 0A4, Canada; Department of Biochemistry, University of Toronto, Toronto, ON M5S 1A8, Canada
- Julie D Forman-Kay
  - Molecular Medicine Program, The Hospital for Sick Children, Toronto, ON M5G 0A4, Canada; Department of Biochemistry, University of Toronto, Toronto, ON M5S 1A8, Canada
- Damien D'Amours
  - Ottawa Institute of Systems Biology, Department of Cellular and Molecular Medicine, University of Ottawa, Ottawa, ON K1H 8M5, Canada

8. Xiao YX, Lee SY, Aguilera-Uribe M, Samson R, Au A, Khanna Y, Liu Z, Cheng R, Aulakh K, Wei J, Farias AG, Reilly T, Birkadze S, Habsid A, Brown KR, Chan K, Mero P, Huang JQ, Billmann M, Rahman M, Myers C, Andrews BJ, Youn JY, Yip CM, Rotin D, Derry WB, Forman-Kay JD, Moses AM, Pritišanac I, Gingras AC, Moffat J. The TSC22D, WNK, and NRBP gene families exhibit functional buffering and evolved with Metazoa for cell volume regulation. Cell Rep 2024; 43:114417. [PMID: 38980795 DOI: 10.1016/j.celrep.2024.114417] [Received: 02/22/2024] [Revised: 05/08/2024] [Accepted: 06/13/2024] [Indexed: 07/11/2024] Open
Abstract
The ability to sense and respond to osmotic fluctuations is critical for the maintenance of cellular integrity. We used gene co-essentiality analysis to identify an unappreciated relationship between TSC22D2, WNK1, and NRBP1 in regulating cell volume homeostasis. All of these genes have paralogs and are functionally buffered for osmo-sensing and cell volume control. Within seconds of hyperosmotic stress, TSC22D, WNK, and NRBP family members physically associate into biomolecular condensates, a process that is dependent on intrinsically disordered regions (IDRs). A close examination of these protein families across metazoans revealed that TSC22D genes evolved alongside a domain in NRBPs that specifically binds to TSC22D proteins, which we have termed NbrT (NRBP binding region with TSC22D), and this co-evolution is accompanied by rapid IDR length expansion in WNK-family kinases. Our study reveals that TSC22D, WNK, and NRBP genes evolved in metazoans to co-regulate rapid cell volume changes in response to osmolarity.
Affiliation(s)
- Yu-Xi Xiao
  - Program in Genetics and Genome Biology, The Hospital for Sick Children, Toronto, ON, Canada; Department of Molecular Genetics, University of Toronto, Toronto, ON, Canada
- Seon Yong Lee
  - Program in Genetics and Genome Biology, The Hospital for Sick Children, Toronto, ON, Canada
- Magali Aguilera-Uribe
  - Program in Genetics and Genome Biology, The Hospital for Sick Children, Toronto, ON, Canada; Department of Molecular Genetics, University of Toronto, Toronto, ON, Canada
- Reuben Samson
  - Department of Molecular Genetics, University of Toronto, Toronto, ON, Canada; The Lunenfeld-Tanenbaum Research Institute, Mount Sinai Hospital, Sinai Health, Toronto, ON, Canada
- Aaron Au
  - Institute for Biomedical Engineering, University of Toronto, Toronto, ON, Canada; Department of Cell and Systems Biology, University of Toronto, Toronto, ON, Canada; Donnelly Centre, University of Toronto, Toronto, ON, Canada
- Yukti Khanna
  - Otto-Loewi Research Center, Division of Medicinal Chemistry, Medical University of Graz, Neue Stiftingtalstraße 6, 8010, Graz, Austria
- Zetao Liu
  - Program in Cell Biology, The Hospital for Sick Children, Toronto, ON, Canada; Department of Biochemistry, University of Toronto, Toronto, ON, Canada
- Ran Cheng
  - Department of Molecular Genetics, University of Toronto, Toronto, ON, Canada; Program in Developmental and Stem Cell Biology, The Hospital for Sick Children, Toronto, ON, Canada
- Kamaldeep Aulakh
  - Program in Genetics and Genome Biology, The Hospital for Sick Children, Toronto, ON, Canada
- Jiarun Wei
  - Program in Genetics and Genome Biology, The Hospital for Sick Children, Toronto, ON, Canada
- Adrian Granda Farias
  - Program in Genetics and Genome Biology, The Hospital for Sick Children, Toronto, ON, Canada; Department of Molecular Genetics, University of Toronto, Toronto, ON, Canada
- Taylor Reilly
  - Program in Genetics and Genome Biology, The Hospital for Sick Children, Toronto, ON, Canada; Department of Molecular Genetics, University of Toronto, Toronto, ON, Canada
- Saba Birkadze
  - Program in Genetics and Genome Biology, The Hospital for Sick Children, Toronto, ON, Canada; Department of Molecular Genetics, University of Toronto, Toronto, ON, Canada
- Andrea Habsid
  - Program in Genetics and Genome Biology, The Hospital for Sick Children, Toronto, ON, Canada
- Kevin R Brown
  - Program in Genetics and Genome Biology, The Hospital for Sick Children, Toronto, ON, Canada
- Katherine Chan
  - Program in Genetics and Genome Biology, The Hospital for Sick Children, Toronto, ON, Canada
- Patricia Mero
  - Program in Genetics and Genome Biology, The Hospital for Sick Children, Toronto, ON, Canada
- Jie Qi Huang
  - Department of Molecular Genetics, University of Toronto, Toronto, ON, Canada; Program in Molecular Medicine, The Hospital for Sick Children, Toronto, ON, Canada
- Maximilian Billmann
  - Institute of Human Genetics, School of Medicine and University Hospital Bonn, University of Bonn, 53127 Bonn, Germany
- Mahfuzur Rahman
  - Department of Computer Science and Engineering, University of Minnesota, Minneapolis, MN, USA
- Chad Myers
  - Department of Computer Science and Engineering, University of Minnesota, Minneapolis, MN, USA
- Brenda J Andrews
  - Department of Molecular Genetics, University of Toronto, Toronto, ON, Canada; Donnelly Centre, University of Toronto, Toronto, ON, Canada
- Ji-Young Youn
  - Department of Molecular Genetics, University of Toronto, Toronto, ON, Canada; Program in Molecular Medicine, The Hospital for Sick Children, Toronto, ON, Canada
- Christopher M Yip
  - Institute for Biomedical Engineering, University of Toronto, Toronto, ON, Canada; Donnelly Centre, University of Toronto, Toronto, ON, Canada
- Daniela Rotin
  - Program in Cell Biology, The Hospital for Sick Children, Toronto, ON, Canada; Department of Biochemistry, University of Toronto, Toronto, ON, Canada
- W Brent Derry
  - Department of Molecular Genetics, University of Toronto, Toronto, ON, Canada; Program in Developmental and Stem Cell Biology, The Hospital for Sick Children, Toronto, ON, Canada
- Julie D Forman-Kay
  - Department of Biochemistry, University of Toronto, Toronto, ON, Canada; Program in Molecular Medicine, The Hospital for Sick Children, Toronto, ON, Canada
- Alan M Moses
  - Department of Cell and Systems Biology, University of Toronto, Toronto, ON, Canada
- Iva Pritišanac
  - Otto-Loewi Research Center, Division of Medicinal Chemistry, Medical University of Graz, Neue Stiftingtalstraße 6, 8010, Graz, Austria
- Anne-Claude Gingras
  - Department of Molecular Genetics, University of Toronto, Toronto, ON, Canada; The Lunenfeld-Tanenbaum Research Institute, Mount Sinai Hospital, Sinai Health, Toronto, ON, Canada
- Jason Moffat
  - Program in Genetics and Genome Biology, The Hospital for Sick Children, Toronto, ON, Canada; Department of Molecular Genetics, University of Toronto, Toronto, ON, Canada; Institute for Biomedical Engineering, University of Toronto, Toronto, ON, Canada

9. Sood A, Zhang B. Preserving condensate structure and composition by lowering sequence complexity. Biophys J 2024; 123:1815-1826. [PMID: 38824391 PMCID: PMC11267431 DOI: 10.1016/j.bpj.2024.05.026] [Received: 01/26/2024] [Revised: 04/25/2024] [Accepted: 05/28/2024] [Indexed: 06/03/2024] Open
Abstract
Biomolecular condensates play a vital role in organizing cellular chemistry. They selectively partition biomolecules, preventing unwanted cross talk and buffering against chemical noise. Intrinsically disordered proteins (IDPs) serve as primary components of these condensates due to their flexibility and ability to engage in multivalent interactions, leading to spontaneous aggregation. Theoretical advancements are critical for connecting IDP sequences with condensate emergent properties to establish the so-called molecular grammar. We proposed an extension to the stickers and spacers model, incorporating heterogeneous, nonspecific pairwise interactions between spacers alongside specific interactions among stickers. Our investigation revealed that although spacer interactions contribute to phase separation and co-condensation, their nonspecific nature leads to disorganized condensates. Specific sticker-sticker interactions drive the formation of condensates with well-defined networked structures and molecular composition. We discussed how evolutionary pressures might emerge to affect these interactions, leading to the prevalence of low-complexity domains in IDP sequences. These domains suppress spurious interactions and facilitate the formation of biologically meaningful condensates.
Affiliation(s)
- Amogh Sood
  - Department of Chemistry, Massachusetts Institute of Technology, Cambridge, Massachusetts
- Bin Zhang
  - Department of Chemistry, Massachusetts Institute of Technology, Cambridge, Massachusetts

10. Ginell GM, Emenecker RJ, Lotthammer JM, Usher ET, Holehouse AS. Direct prediction of intermolecular interactions driven by disordered regions. bioRxiv 2024:2024.06.03.597104. [PMID: 38895487 PMCID: PMC11185574 DOI: 10.1101/2024.06.03.597104] [Indexed: 06/21/2024]
Abstract
Intrinsically disordered regions (IDRs) are critical for a wide variety of cellular functions, many of which involve interactions with partner proteins. Molecular recognition is typically considered through the lens of sequence-specific binding events. However, a growing body of work has shown that IDRs often interact with partners in a manner that does not depend on the precise order of the amino acids and is instead driven by complementary chemical interactions leading to disordered bound-state complexes. Despite this emerging paradigm, we lack tools to describe, quantify, predict, and interpret these types of structurally heterogeneous interactions from the underlying amino acid sequences. Here, we repurpose the chemical physics developed originally for molecular simulations to develop an approach for predicting intermolecular interactions between IDRs and partner proteins. Our approach enables the direct prediction of phase diagrams, the identification of chemically specific interaction hotspots on IDRs, and a route to develop and test mechanistic hypotheses regarding IDR function in the context of molecular recognition. We use our approach to examine a range of systems and questions to highlight its versatility and applicability.
Affiliation(s)
- Garrett M. Ginell
  - Department of Biochemistry and Molecular Biophysics, Washington University School of Medicine, St. Louis, MO
  - Center for Biomolecular Condensates (CBC), Washington University in St. Louis, St. Louis, MO
- Ryan J. Emenecker
  - Department of Biochemistry and Molecular Biophysics, Washington University School of Medicine, St. Louis, MO
  - Center for Biomolecular Condensates (CBC), Washington University in St. Louis, St. Louis, MO
- Jeffrey M. Lotthammer
  - Department of Biochemistry and Molecular Biophysics, Washington University School of Medicine, St. Louis, MO
  - Center for Biomolecular Condensates (CBC), Washington University in St. Louis, St. Louis, MO
- Emery T. Usher
  - Department of Biochemistry and Molecular Biophysics, Washington University School of Medicine, St. Louis, MO
  - Center for Biomolecular Condensates (CBC), Washington University in St. Louis, St. Louis, MO
- Alex S. Holehouse
  - Department of Biochemistry and Molecular Biophysics, Washington University School of Medicine, St. Louis, MO
  - Center for Biomolecular Condensates (CBC), Washington University in St. Louis, St. Louis, MO

11. Alston JJ, Soranno A, Holehouse AS. Conserved molecular recognition by an intrinsically disordered region in the absence of sequence conservation. Research Square 2024:rs.3.rs-4477977. [PMID: 38883712 PMCID: PMC11177979 DOI: 10.21203/rs.3.rs-4477977/v1] [Indexed: 06/18/2024]
Abstract
Intrinsically disordered regions (IDRs) are critical for cellular function yet often appear to lack sequence conservation when assessed by multiple sequence alignments. This raises the question of whether and how function can be encoded and preserved in these regions despite massive sequence variation. To address this question, we have applied coarse-grained molecular dynamics simulations to investigate non-specific RNA binding of coronavirus nucleocapsid proteins. Coronavirus nucleocapsid proteins consist of multiple interspersed disordered and folded domains that bind RNA. Here, we focus on the first two domains of coronavirus nucleocapsid proteins: the disordered N-terminal domain (NTD) and the folded RNA binding domain (RBD). While the NTD is highly variable across evolution, the RBD is structurally conserved. This combination makes the NTD-RBD a convenient model system for exploring the interplay between an IDR adjacent to a folded domain and how changes in IDR sequence can influence molecular recognition of a partner. Our results reveal a surprising degree of sequence-specificity encoded by both the composition and the precise order of the amino acids in the NTD. Depending on its sequence, the presence of an NTD can either suppress or enhance RNA binding. Despite this sensitivity, large-scale variation in NTD sequences is possible while certain sequence features are retained. Consequently, a conformationally conserved dynamic and disordered RNA:protein complex is found across nucleocapsid protein orthologs despite large-scale changes in both NTD sequence and RBD surface chemistry. Taken together, these insights shed light on the ability of disordered regions to preserve functional characteristics despite their sequence variability.
Affiliation(s)
- Jhullian J. Alston
- Department of Biochemistry and Molecular Biophysics, Washington University School of Medicine, St. Louis, MO 63110, USA
- Center for Biomolecular Condensates, Washington University in St. Louis, St. Louis, MO, USA
- Present address: Program in Cellular and Molecular Medicine (PCMM), Boston Children’s Hospital, Boston, MA, USA
| | - Andrea Soranno
- Department of Biochemistry and Molecular Biophysics, Washington University School of Medicine, St. Louis, MO 63110, USA
- Center for Biomolecular Condensates, Washington University in St. Louis, St. Louis, MO, USA
| | - Alex S. Holehouse
- Department of Biochemistry and Molecular Biophysics, Washington University School of Medicine, St. Louis, MO 63110, USA
- Center for Biomolecular Condensates, Washington University in St. Louis, St. Louis, MO, USA
| |
|
12
|
Singleton MD, Eisen MB. Evolutionary analyses of intrinsically disordered regions reveal widespread signals of conservation. PLoS Comput Biol 2024; 20:e1012028. [PMID: 38662765 PMCID: PMC11075841 DOI: 10.1371/journal.pcbi.1012028] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/08/2024] [Revised: 05/07/2024] [Accepted: 03/28/2024] [Indexed: 05/08/2024] Open
Abstract
Intrinsically disordered regions (IDRs) are segments of proteins without stable three-dimensional structures. As this flexibility allows them to interact with diverse binding partners, IDRs play key roles in cell signaling and gene expression. Despite the prevalence and importance of IDRs in eukaryotic proteomes and various biological processes, associating them with specific molecular functions remains a significant challenge due to their high rates of sequence evolution. However, by comparing the observed values of various IDR-associated properties against those generated under a simulated model of evolution, a recent study found most IDRs across the entire yeast proteome contain conserved features. Furthermore, it showed clusters of IDRs with common "evolutionary signatures," i.e. patterns of conserved features, were associated with specific biological functions. To determine if similar patterns of conservation are found in the IDRs of other systems, in this work we applied a series of phylogenetic models to over 7,500 orthologous IDRs identified in the Drosophila genome to dissect the forces driving their evolution. By comparing models of unconstrained and constrained continuous trait evolution using the Brownian motion and Ornstein-Uhlenbeck models, respectively, we identified signals of widespread constraint, indicating that conservation of distributed features is a mechanism of IDR evolution common to multiple biological systems. In contrast to the previous study in yeast, however, we observed limited evidence of IDR clusters with specific biological functions, which suggests a more complex relationship between evolutionary constraints and function in the IDRs of multicellular organisms.
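The contrast between the two trait-evolution models named in this abstract can be illustrated with a small simulation. The sketch below is not the authors' phylogenetic analysis: it ignores tree structure entirely and only simulates independent lineages, and the function names, step count, and parameter values are assumptions chosen for illustration. It shows that under Brownian motion the variance of a trait keeps growing with time, whereas under an Ornstein-Uhlenbeck process the pull toward an optimum keeps the variance bounded.

```python
import random
import statistics

def simulate_trait(steps, sigma, alpha=0.0, theta=0.0, x0=0.0, rng=None):
    """Discrete-time simulation of dX = alpha * (theta - X) dt + sigma dW with dt = 1.

    alpha = 0 gives Brownian motion (no constraint);
    alpha > 0 gives an Ornstein-Uhlenbeck process pulled toward theta.
    """
    rng = rng or random.Random()
    x = x0
    for _ in range(steps):
        x += alpha * (theta - x) + sigma * rng.gauss(0.0, 1.0)
    return x

def endpoint_variance(n_lineages, seed=0, **params):
    """Variance of the final trait value across independent simulated lineages."""
    rng = random.Random(seed)
    finals = [simulate_trait(rng=rng, **params) for _ in range(n_lineages)]
    return statistics.pvariance(finals)

# Illustrative parameter values only.
bm_var = endpoint_variance(2000, steps=200, sigma=0.1)             # Brownian motion
ou_var = endpoint_variance(2000, steps=200, sigma=0.1, alpha=0.2)  # Ornstein-Uhlenbeck
print(f"BM variance: {bm_var:.3f}, OU variance: {ou_var:.3f}")
```

In this toy setting the Brownian-motion variance approaches sigma² × steps, while the OU variance settles near a much smaller stationary value; a bounded variance of this kind is the signature of constraint that model comparison looks for.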
Affiliation(s)
- Marc D. Singleton
- Howard Hughes Medical Institute, UC Berkeley, Berkeley, California, United States of America
| | - Michael B. Eisen
- Howard Hughes Medical Institute, UC Berkeley, Berkeley, California, United States of America
- Department of Molecular and Cell Biology, UC Berkeley, Berkeley, California, United States of America
| |
|
13
|
Duncan AG, Mitchell JA, Moses AM. Improving the performance of supervised deep learning for regulatory genomics using phylogenetic augmentation. Bioinformatics 2024; 40:btae190. [PMID: 38588559 PMCID: PMC11042905 DOI: 10.1093/bioinformatics/btae190] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/18/2023] [Revised: 01/12/2024] [Accepted: 04/05/2024] [Indexed: 04/10/2024] Open
Abstract
MOTIVATION Supervised deep learning is used to model the complex relationship between genomic sequence and regulatory function. Understanding how these models make predictions can provide biological insight into regulatory functions. Given the complexity of the sequence-to-regulatory-function mapping (the cis-regulatory code), it has been suggested that the genome contains insufficient sequence variation to train models with suitable complexity. Data augmentation is a widely used approach to increase the data variation available for model training; however, current data augmentation methods for genomic sequence data are limited. RESULTS Inspired by the success of comparative genomics, we show that augmenting genomic sequences with evolutionarily related sequences from other species, which we term phylogenetic augmentation, improves the performance of deep learning models trained on regulatory genomic sequences to predict high-throughput functional assay measurements. Additionally, we show that phylogenetic augmentation can rescue model performance when the training set is down-sampled and permits deep learning on a real-world small dataset, demonstrating that this approach improves data efficiency. Overall, this data augmentation method represents a solution for improving model performance that is applicable to many supervised deep-learning problems in genomics. AVAILABILITY AND IMPLEMENTATION The open-source GitHub repository agduncan94/phylogenetic_augmentation_paper includes the code for rerunning the analyses here and recreating the figures.
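The idea of phylogenetic augmentation described in this abstract can be sketched in a few lines. This is a minimal illustration, not the code from the authors' repository: the data layout (a list of homologous sequences per training example), the swap probability `p_swap`, and the function names are all assumptions. Each time a training example is drawn, the reference sequence may be replaced by an evolutionarily related sequence from another species, which inherits the label measured for the reference.

```python
import random

def phylo_augment(seq, homologs, p_swap=0.5, rng=random):
    """Return the reference sequence or, with probability p_swap, a random homolog."""
    if homologs and rng.random() < p_swap:
        return rng.choice(homologs)
    return seq

def augmented_batch(examples, p_swap=0.5, rng=random):
    """Build one batch from (sequence, homologs, label) triples.

    The label always comes from the reference sequence; a swapped-in
    homolog is assumed to share the same regulatory activity.
    """
    return [(phylo_augment(seq, homologs, p_swap, rng), label)
            for seq, homologs, label in examples]

# Toy data: the second example has no known homologs and is never swapped.
examples = [
    ("ACGTACGT", ["ACGTACGA", "TCGTACGT"], 1.0),
    ("GGCCGGCC", [], 0.0),
]
print(augmented_batch(examples, rng=random.Random(0)))
```

Calling `augmented_batch` anew in every epoch means the model sees a different mix of species over training while the label set stays unchanged.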
Affiliation(s)
- Andrew G Duncan
- Cell & Systems Biology, University of Toronto, Toronto, ON M5S 3G5, Canada
| | | | - Alan M Moses
- Cell & Systems Biology, University of Toronto, Toronto, ON M5S 3G5, Canada
| |
|
14
|
Yu Y, Muthukumar S, Koo PK. EvoAug-TF: extending evolution-inspired data augmentations for genomic deep learning to TensorFlow. Bioinformatics 2024; 40:btae092. [PMID: 38366935 PMCID: PMC10918628 DOI: 10.1093/bioinformatics/btae092] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/15/2023] [Revised: 02/01/2024] [Accepted: 02/14/2024] [Indexed: 02/19/2024] Open
Abstract
SUMMARY Deep neural networks (DNNs) have been widely applied to predict the molecular functions of the non-coding genome. DNNs are data hungry and thus require many training examples to fit data well. However, functional genomics experiments typically generate limited amounts of data, constrained by the activity levels of the molecular function under study inside the cell. Recently, EvoAug was introduced to train a genomic DNN with evolution-inspired augmentations. EvoAug-trained DNNs have demonstrated improved generalization and interpretability with attribution analysis. However, EvoAug only supports PyTorch-based models, which limits its applications to a broad class of genomic DNNs based in TensorFlow. Here, we extend EvoAug's functionality to TensorFlow in a new package we call EvoAug-TF. Through a systematic benchmark, we find that EvoAug-TF yields comparable performance with the original EvoAug package. AVAILABILITY AND IMPLEMENTATION EvoAug-TF is freely available for users and is distributed under an open-source MIT license. Researchers can access the open-source code on GitHub (https://github.com/p-koo/evoaug-tf). The pre-compiled package is provided via PyPI (https://pypi.org/project/evoaug-tf) with in-depth documentation on ReadTheDocs (https://evoaug-tf.readthedocs.io). The scripts for reproducing the results are available at (https://github.com/p-koo/evoaug-tf_analysis).
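To make "evolution-inspired augmentations" concrete, the sketch below applies two simple sequence perturbations, random point mutation and reverse-complementation, to a DNA string. It is a framework-independent illustration in plain Python and does not use or reproduce the EvoAug or EvoAug-TF API; the function names, the choice of these two perturbations, and the default rates are assumptions made for this example.

```python
import random

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Return the reverse complement of a DNA string over A, C, G, T."""
    return seq.translate(COMPLEMENT)[::-1]

def random_mutation(seq, rate, rng):
    """Replace each base by a different base with probability `rate`."""
    mutated = []
    for base in seq:
        if rng.random() < rate:
            mutated.append(rng.choice([b for b in "ACGT" if b != base]))
        else:
            mutated.append(base)
    return "".join(mutated)

def augment(seq, rng, mutation_rate=0.05, p_revcomp=0.5):
    """Apply point mutations, then reverse-complement with probability p_revcomp."""
    seq = random_mutation(seq, mutation_rate, rng)
    if rng.random() < p_revcomp:
        seq = reverse_complement(seq)
    return seq

rng = random.Random(0)
print(augment("ACGTTGCAACGTAGCT", rng))
```

Both perturbations preserve sequence length, so the augmented string can be one-hot encoded and fed to a fixed-input model without padding; augmentations that change length would need extra handling.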
Affiliation(s)
- Yiyang Yu
- Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, Cold Spring Harbor, NY 11724, United States
| | | | - Peter K Koo
- Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, Cold Spring Harbor, NY 11724, United States
| |
|
15
|
Tesei G, Trolle AI, Jonsson N, Betz J, Knudsen FE, Pesce F, Johansson KE, Lindorff-Larsen K. Conformational ensembles of the human intrinsically disordered proteome. Nature 2024; 626:897-904. [PMID: 38297118 DOI: 10.1038/s41586-023-07004-5] [Citation(s) in RCA: 8] [Impact Index Per Article: 8.0] [Received: 05/12/2023] [Accepted: 12/19/2023] [Indexed: 02/02/2024]
Abstract
Intrinsically disordered proteins and regions (collectively, IDRs) are pervasive across proteomes in all kingdoms of life, help to shape biological functions and are involved in numerous diseases. IDRs populate a diverse set of transiently formed structures and defy conventional sequence-structure-function relationships [1]. Developments in protein science have made it possible to predict the three-dimensional structures of folded proteins at the proteome scale [2]. By contrast, there is a lack of knowledge about the conformational properties of IDRs, partly because the sequences of disordered proteins are poorly conserved and also because only a few of these proteins have been characterized experimentally. The inability to predict structural properties of IDRs across the proteome has limited our understanding of the functional roles of IDRs and how evolution shapes them. As a supplement to previous structural studies of individual IDRs [3], we developed an efficient molecular model to generate conformational ensembles of IDRs and thereby to predict their conformational properties from sequences [4,5]. Here we use this model to simulate nearly all of the IDRs in the human proteome. Examining conformational ensembles of 28,058 IDRs, we show how chain compaction is correlated with cellular function and localization. We provide insights into how sequence features relate to chain compaction and, using a machine-learning model trained on our simulation data, show the conservation of conformational properties across orthologues. Our results recapitulate observations from previous studies of individual protein systems and exemplify how to link, at the proteome scale, conformational ensembles with cellular function and localization, amino acid sequence, evolutionary conservation and disease variants. Our freely available database of conformational properties will encourage further experimental investigation and enable the generation of hypotheses about the biological roles and evolution of IDRs.
Affiliation(s)
- Giulio Tesei
- Structural Biology and NMR Laboratory, Linderstrøm-Lang Centre for Protein Science, Department of Biology, University of Copenhagen, Copenhagen, Denmark.
| | - Anna Ida Trolle
- Structural Biology and NMR Laboratory, Linderstrøm-Lang Centre for Protein Science, Department of Biology, University of Copenhagen, Copenhagen, Denmark
| | - Nicolas Jonsson
- Structural Biology and NMR Laboratory, Linderstrøm-Lang Centre for Protein Science, Department of Biology, University of Copenhagen, Copenhagen, Denmark
| | - Johannes Betz
- Structural Biology and NMR Laboratory, Linderstrøm-Lang Centre for Protein Science, Department of Biology, University of Copenhagen, Copenhagen, Denmark
| | - Frederik E Knudsen
- Structural Biology and NMR Laboratory, Linderstrøm-Lang Centre for Protein Science, Department of Biology, University of Copenhagen, Copenhagen, Denmark
| | - Francesco Pesce
- Structural Biology and NMR Laboratory, Linderstrøm-Lang Centre for Protein Science, Department of Biology, University of Copenhagen, Copenhagen, Denmark
| | - Kristoffer E Johansson
- Structural Biology and NMR Laboratory, Linderstrøm-Lang Centre for Protein Science, Department of Biology, University of Copenhagen, Copenhagen, Denmark
| | - Kresten Lindorff-Larsen
- Structural Biology and NMR Laboratory, Linderstrøm-Lang Centre for Protein Science, Department of Biology, University of Copenhagen, Copenhagen, Denmark.
| |
|
16
|
Yu Y, Muthukumar S, Koo PK. EvoAug-TF: Extending evolution-inspired data augmentations for genomic deep learning to TensorFlow. bioRxiv 2024:2024.01.17.575961. [PMID: 38293144 PMCID: PMC10827165 DOI: 10.1101/2024.01.17.575961] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 02/01/2024]
Abstract
Deep neural networks (DNNs) have been widely applied to predict the molecular functions of regulatory regions in the non-coding genome. DNNs are data hungry and thus require many training examples to fit data well. However, functional genomics experiments typically generate limited amounts of data, constrained by the activity levels of the molecular function under study inside the cell. Recently, EvoAug was introduced to train a genomic DNN with evolution-inspired augmentations. EvoAug-trained DNNs have demonstrated improved generalization and interpretability with attribution analysis. However, EvoAug only supports PyTorch-based models, which limits its applications to a broad class of genomic DNNs based in TensorFlow. Here, we extend EvoAug's functionality to TensorFlow in a new package we call EvoAug-TF. Through a systematic benchmark, we find that EvoAug-TF yields comparable performance with the original EvoAug package. Availability EvoAug-TF is freely available for users and is distributed under an open-source MIT license. Researchers can access the open-source code on GitHub (https://github.com/p-koo/evoaug-tf). The pre-compiled package is provided via PyPI (https://pypi.org/project/evoaug-tf) with in-depth documentation on ReadTheDocs (https://evoaug-tf.readthedocs.io). The scripts for reproducing the results are available at (https://github.com/p-koo/evoaug-tf_analysis).
Affiliation(s)
- Yiyang Yu
- Columbia University, New York, NY, USA
| | | | - Peter K Koo
- Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, NY, USA
| |
|
17
|
Sood A, Zhang B. Preserving condensate structure and composition by lowering sequence complexity. bioRxiv 2023:2023.11.29.569249. [PMID: 38076908 PMCID: PMC10705451 DOI: 10.1101/2023.11.29.569249] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/22/2023]
Abstract
Biological condensates play a vital role in organizing cellular chemistry. They selectively partition biomolecules, preventing unwanted cross-talk and buffering against chemical noise. Intrinsically disordered proteins (IDPs) serve as primary components of these condensates due to their flexibility and ability to engage in multivalent, non-specific interactions, leading to spontaneous aggregation. Theoretical advancements are critical for connecting IDP sequences with the emergent properties of condensates to establish the so-called molecular grammar. We proposed an extension to the stickers and spacers model, incorporating non-specific pairwise interactions between spacers alongside specific interactions among stickers. Our investigation revealed that while spacer interactions contribute to phase separation and co-condensation, their non-specific nature leads to disorganized condensates. Specific sticker-sticker interactions drive the formation of condensates with well-defined structures and molecular composition. We discussed how evolutionary pressures might emerge to affect these interactions, leading to the prevalence of low complexity domains in IDP sequences. These domains suppress spurious interactions and facilitate the formation of biologically meaningful condensates.
Affiliation(s)
- Amogh Sood
- Department of Chemistry, Massachusetts Institute of Technology, Cambridge, MA, USA
| | - Bin Zhang
- Department of Chemistry, Massachusetts Institute of Technology, Cambridge, MA, USA
| |
|
18
|
Alderson TR, Pritišanac I, Kolarić Đ, Moses AM, Forman-Kay JD. Systematic identification of conditionally folded intrinsically disordered regions by AlphaFold2. Proc Natl Acad Sci U S A 2023; 120:e2304302120. [PMID: 37878721 PMCID: PMC10622901 DOI: 10.1073/pnas.2304302120] [Citation(s) in RCA: 65] [Impact Index Per Article: 32.5] [Received: 03/22/2023] [Accepted: 08/30/2023] [Indexed: 10/27/2023] Open
Abstract
The AlphaFold Protein Structure Database contains predicted structures for millions of proteins. For the majority of human proteins that contain intrinsically disordered regions (IDRs), which do not adopt a stable structure, it is generally assumed that these regions have low AlphaFold2 confidence scores that reflect low-confidence structural predictions. Here, we show that AlphaFold2 assigns confident structures to nearly 15% of human IDRs. By comparison to experimental NMR data for a subset of IDRs that are known to conditionally fold (i.e., upon binding or under other specific conditions), we find that AlphaFold2 often predicts the structure of the conditionally folded state. Based on databases of IDRs that are known to conditionally fold, we estimate that AlphaFold2 can identify conditionally folding IDRs at a precision as high as 88% at a 10% false positive rate, which is remarkable considering that conditionally folded IDR structures were minimally represented in its training data. We find that human disease mutations are nearly fivefold enriched in conditionally folded IDRs over IDRs in general and that up to 80% of IDRs in prokaryotes are predicted to conditionally fold, compared to less than 20% of eukaryotic IDRs. These results indicate that a large majority of IDRs in the proteomes of human and other eukaryotes function in the absence of conditional folding, but the regions that do acquire folds are more sensitive to mutations. We emphasize that the AlphaFold2 predictions do not reveal functionally relevant structural plasticity within IDRs and cannot offer realistic ensemble representations of conditionally folded IDRs.
Affiliation(s)
- T. Reid Alderson
- Department of Biochemistry, University of Toronto, Toronto, ON M5S 1A8, Canada
- Department of Molecular Genetics, University of Toronto, Toronto, ON M5S 1A8, Canada
| | - Iva Pritišanac
- Department of Cell and Systems Biology, University of Toronto, Toronto, ON M5S 3G5, Canada
- Molecular Medicine Program, The Hospital for Sick Children, Toronto, ON M5G 0A4, Canada
- Department of Molecular Biology and Biochemistry, Gottfried Schatz Research Center for Cell Signaling, Metabolism and Aging, Medical University of Graz, Graz 8010, Austria
| | - Đesika Kolarić
- Department of Molecular Biology and Biochemistry, Gottfried Schatz Research Center for Cell Signaling, Metabolism and Aging, Medical University of Graz, Graz 8010, Austria
| | - Alan M. Moses
- Department of Cell and Systems Biology, University of Toronto, Toronto, ON M5S 3G5, Canada
| | - Julie D. Forman-Kay
- Department of Biochemistry, University of Toronto, Toronto, ON M5S 1A8, Canada
- Molecular Medicine Program, The Hospital for Sick Children, Toronto, ON M5G 0A4, Canada
| |
|
19
|
Ho W, Huang H, Huang J. IFF: Identifying key residues in intrinsically disordered regions of proteins using machine learning. Protein Sci 2023; 32:e4739. [PMID: 37498545 PMCID: PMC10443345 DOI: 10.1002/pro.4739] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/02/2023] [Revised: 06/21/2023] [Accepted: 07/25/2023] [Indexed: 07/28/2023]
Abstract
Conserved residues in protein homolog sequence alignments are structurally or functionally important. For intrinsically disordered proteins or proteins with intrinsically disordered regions (IDRs), however, alignment often fails because they lack a steric structure to constrain evolution. Although sequences vary, the physicochemical features of IDRs may be preserved to maintain function. Therefore, a method to retrieve common IDR features may help identify functionally important residues. We applied unsupervised contrastive learning to train a model with self-attention neural networks on human IDR orthologs. Parameters in the model were trained to match sequences in ortholog pairs but not in other IDRs. The trained model successfully identifies previously reported critical residues from experimental studies, especially those with an overall pattern (e.g., multiple aromatic residues or charged blocks) rather than short motifs. This predictive model can be used to identify potentially important residues in other proteins, improving our understanding of their functions. The trained model can be run directly from the Jupyter Notebook in the GitHub repository using Binder (mybinder.org). The only required input is the primary sequence. The training scripts are available on GitHub (https://github.com/allmwh/IFF). The training datasets have been deposited in an Open Science Framework repository (https://osf.io/jk29b).
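The training objective described in this abstract, matching sequences within ortholog pairs but not with other IDRs, is the general shape of a contrastive loss. The abstract does not state which loss the authors used, so the sketch below assumes an InfoNCE-style objective over cosine similarities as one common choice; the function names, the temperature value, and the toy two-dimensional embeddings are illustrative assumptions, and a real model would produce the embeddings with a trained encoder.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def info_nce(anchors, positives, temperature=0.1):
    """Mean InfoNCE loss over a batch of embedding pairs.

    anchors[i] and positives[i] are embeddings of an ortholog pair;
    positives[j] for j != i act as negatives (other IDRs in the batch).
    """
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [cosine(anchor, p) / temperature for p in positives]
        peak = max(logits)  # subtract the maximum for numerical stability
        log_denominator = peak + math.log(sum(math.exp(x - peak) for x in logits))
        total += log_denominator - logits[i]
    return total / len(anchors)

# Toy embeddings: each anchor lies close to its own ortholog.
anchors = [[1.0, 0.0], [0.0, 1.0]]
orthologs = [[0.9, 0.1], [0.1, 0.9]]
print(info_nce(anchors, orthologs))        # small loss: pairs are matched
print(info_nce(anchors, orthologs[::-1]))  # large loss: pairs are mismatched
```

Minimizing a loss of this form pulls the embeddings of orthologous IDRs together and pushes unrelated IDRs apart, without requiring the sequences to be alignable.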
Affiliation(s)
- Wen‐Lin Ho
- Institute of Biochemistry and Molecular Biology, National Yang Ming Chiao Tung University, Taipei, Taiwan
| | - Hsuan‐Cheng Huang
- Institute of Biomedical Informatics, National Yang Ming Chiao Tung University, Taipei, Taiwan
| | - Jie‐rong Huang
- Institute of Biochemistry and Molecular Biology, National Yang Ming Chiao Tung University, Taipei, Taiwan
- Institute of Biomedical Informatics, National Yang Ming Chiao Tung University, Taipei, Taiwan
- Department of Life Sciences and Institute of Genome Sciences, National Yang Ming Chiao Tung University, Taipei, Taiwan
| |
|
20
|
Alston JJ, Soranno A, Holehouse AS. Conserved molecular recognition by an intrinsically disordered region in the absence of sequence conservation. bioRxiv 2023:2023.08.06.552128. [PMID: 37609146 PMCID: PMC10441348 DOI: 10.1101/2023.08.06.552128] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Indexed: 08/24/2023]
Abstract
Intrinsically disordered regions (IDRs) are critical for cellular function, yet often appear to lack sequence conservation when assessed by multiple sequence alignments. This raises the question of if and how function can be encoded and preserved in these regions despite massive sequence variation. To address this question, we have applied coarse-grained molecular dynamics simulations to investigate non-specific RNA binding of coronavirus nucleocapsid proteins. Coronavirus nucleocapsid proteins consist of multiple interspersed disordered and folded domains that bind RNA. We focussed here on the first two domains of coronavirus nucleocapsid proteins, the disordered N-terminal domain (NTD) followed by the folded RNA binding domain (RBD). While the NTD is highly variable across evolution, the RBD is structurally conserved. This combination makes the NTD-RBD a convenient model system to explore the interplay between an IDR adjacent to a folded domain, and how changes in IDR sequence can influence molecular recognition of a partner. Our results reveal a surprising degree of sequence-specificity encoded by both the composition and the precise order of the amino acids in the NTD. The presence of an NTD can - depending on the sequence - either suppress or enhance RNA binding. Despite this sensitivity, large-scale variation in NTD sequences is possible while certain sequence features are retained. Consequently, a conformationally-conserved fuzzy RNA:protein complex is found across nucleocapsid protein orthologs, despite large-scale changes in both NTD sequence and RBD surface chemistry. Taken together, these insights shed light on the ability of disordered regions to preserve functional characteristics despite their sequence variability.
|
21
|
Intrinsically Disordered Proteins: An Overview. Int J Mol Sci 2022; 23:ijms232214050. [PMID: 36430530 PMCID: PMC9693201 DOI: 10.3390/ijms232214050] [Citation(s) in RCA: 51] [Impact Index Per Article: 17.0] [Received: 09/04/2022] [Revised: 11/07/2022] [Accepted: 11/08/2022] [Indexed: 11/16/2022] Open
Abstract
Many proteins and protein segments cannot attain a single stable three-dimensional structure under physiological conditions; instead, they adopt multiple interconverting conformational states. Such intrinsically disordered proteins or protein segments are highly abundant across proteomes, and are involved in various effector functions. This review focuses on different aspects of disordered proteins and disordered protein regions, which form the basis of the so-called "Disorder-function paradigm" of proteins. Additionally, various experimental approaches and computational tools used for characterizing disordered regions in proteins are discussed. Finally, the role of disordered proteins in diseases and their utility as potential drug targets are explored.
|