1
Shulgina Y, Trinidad MI, Langeberg CJ, Nisonoff H, Chithrananda S, Skopintsev P, Nissley AJ, Patel J, Boger RS, Shi H, Yoon PH, Doherty EE, Pande T, Iyer AM, Doudna JA, Cate JHD. RNA language models predict mutations that improve RNA function. Nat Commun 2024; 15:10627. [PMID: 39638800 PMCID: PMC11621547 DOI: 10.1038/s41467-024-54812-y] [Received: 05/18/2024] [Accepted: 11/20/2024] [Indexed: 12/07/2024]
Abstract
Structured RNA lies at the heart of many central biological processes, from gene expression to catalysis. RNA structure prediction is not yet possible due to a lack of high-quality reference data associated with organismal phenotypes that could inform RNA function. We present GARNET (Gtdb Acquired RNa with Environmental Temperatures), a new database for RNA structural and functional analysis anchored to the Genome Taxonomy Database (GTDB). GARNET links RNA sequences to experimental and predicted optimal growth temperatures of GTDB reference organisms. Using GARNET, we develop sequence- and structure-aware RNA generative models, with overlapping triplet tokenization providing optimal encoding for a GPT-like model. Leveraging hyperthermophilic RNAs in GARNET and these RNA generative models, we identify mutations in ribosomal RNA that confer increased thermostability to the Escherichia coli ribosome. The GTDB-derived data and deep learning models presented here provide a foundation for understanding the connections between RNA sequence, structure, and function.
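The overlapping triplet tokenization named in the abstract can be illustrated with a minimal sketch. It assumes the common reading of the term: a window of three nucleotides slid along the sequence one position at a time. The function name `overlapping_triplets` and the example sequence are hypothetical, and the abstract does not specify the authors' actual vocabulary, padding, or special tokens.

```python
from typing import List


def overlapping_triplets(seq: str) -> List[str]:
    """Split a sequence into overlapping 3-mers using a stride of 1.

    Illustrative assumption only; not the authors' published tokenizer.
    """
    seq = seq.upper()
    # A sequence of length n yields n - 2 tokens; shorter inputs yield none.
    return [seq[i:i + 3] for i in range(len(seq) - 2)]


tokens = overlapping_triplets("GGCUAGC")
print(tokens)  # ['GGC', 'GCU', 'CUA', 'UAG', 'AGC']
```

Under this assumption each interior nucleotide appears in three consecutive tokens, so every token carries its immediate sequence context.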
Affiliation(s)
- Yekaterina Shulgina
  - Innovative Genomics Institute, University of California, Berkeley, CA, USA
  - Department of Molecular and Cell Biology, University of California, Berkeley, CA, USA
  - California Institute for Quantitative Biosciences, University of California, Berkeley, CA, USA
- Marena I Trinidad
  - Innovative Genomics Institute, University of California, Berkeley, CA, USA
  - Howard Hughes Medical Institute, University of California, Berkeley, CA, USA
- Conner J Langeberg
  - Innovative Genomics Institute, University of California, Berkeley, CA, USA
  - Department of Molecular and Cell Biology, University of California, Berkeley, CA, USA
  - California Institute for Quantitative Biosciences, University of California, Berkeley, CA, USA
- Hunter Nisonoff
  - Center for Computational Biology, University of California, Berkeley, CA, USA
- Seyone Chithrananda
  - Innovative Genomics Institute, University of California, Berkeley, CA, USA
  - Department of Electrical Engineering and Computer Sciences, University of California, Berkeley, CA, USA
- Petr Skopintsev
  - Innovative Genomics Institute, University of California, Berkeley, CA, USA
  - California Institute for Quantitative Biosciences, University of California, Berkeley, CA, USA
- Amos J Nissley
  - Department of Chemistry, University of California, Berkeley, CA, USA
- Jaymin Patel
  - Innovative Genomics Institute, University of California, Berkeley, CA, USA
- Ron S Boger
  - Innovative Genomics Institute, University of California, Berkeley, CA, USA
  - Biophysics Graduate Program, University of California, Berkeley, CA, USA
- Honglue Shi
  - Innovative Genomics Institute, University of California, Berkeley, CA, USA
  - Howard Hughes Medical Institute, University of California, Berkeley, CA, USA
- Peter H Yoon
  - Innovative Genomics Institute, University of California, Berkeley, CA, USA
  - Department of Molecular and Cell Biology, University of California, Berkeley, CA, USA
- Erin E Doherty
  - Innovative Genomics Institute, University of California, Berkeley, CA, USA
  - California Institute for Quantitative Biosciences, University of California, Berkeley, CA, USA
- Tara Pande
  - Department of Electrical Engineering and Computer Sciences, University of California, Berkeley, CA, USA
- Aditya M Iyer
  - Department of Physics, University of California, Berkeley, CA, USA
- Jennifer A Doudna
  - Innovative Genomics Institute, University of California, Berkeley, CA, USA
  - Department of Molecular and Cell Biology, University of California, Berkeley, CA, USA
  - California Institute for Quantitative Biosciences, University of California, Berkeley, CA, USA
  - Howard Hughes Medical Institute, University of California, Berkeley, CA, USA
  - Department of Chemistry, University of California, Berkeley, CA, USA
  - MBIB Division, Lawrence Berkeley National Laboratory, Berkeley, CA, USA
  - Gladstone Institutes, University of California, San Francisco, CA, USA
- Jamie H D Cate
  - Innovative Genomics Institute, University of California, Berkeley, CA, USA
  - Department of Molecular and Cell Biology, University of California, Berkeley, CA, USA
  - California Institute for Quantitative Biosciences, University of California, Berkeley, CA, USA
  - Department of Chemistry, University of California, Berkeley, CA, USA
  - MBIB Division, Lawrence Berkeley National Laboratory, Berkeley, CA, USA
2
Shulgina Y, Trinidad MI, Langeberg CJ, Nisonoff H, Chithrananda S, Skopintsev P, Nissley AJ, Patel J, Boger RS, Shi H, Yoon PH, Doherty EE, Pande T, Iyer AM, Doudna JA, Cate JHD. RNA language models predict mutations that improve RNA function. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2024:2024.04.05.588317. [PMID: 38617247 PMCID: PMC11014562 DOI: 10.1101/2024.04.05.588317] [Indexed: 04/16/2024]
Abstract
Structured RNA lies at the heart of many central biological processes, from gene expression to catalysis. While advances in deep learning enable the prediction of accurate protein structural models, RNA structure prediction is not possible at present due to a lack of abundant high-quality reference data [1]. Furthermore, available sequence data are generally not associated with organismal phenotypes that could inform RNA function [2-4]. We created GARNET (Gtdb Acquired RNa with Environmental Temperatures), a new database for RNA structural and functional analysis anchored to the Genome Taxonomy Database (GTDB) [5]. GARNET links RNA sequences derived from GTDB genomes to experimental and predicted optimal growth temperatures of GTDB reference organisms. This enables construction of deep and diverse RNA sequence alignments to be used for machine learning. Using GARNET, we define the minimal requirements for a sequence- and structure-aware RNA generative model. We also develop a GPT-like language model for RNA in which overlapping triplet tokenization provides optimal encoding. Leveraging hyperthermophilic RNAs in GARNET and these RNA generative models, we identified mutations in ribosomal RNA that confer increased thermostability to the Escherichia coli ribosome. The GTDB-derived data and deep learning models presented here provide a foundation for understanding the connections between RNA sequence, structure, and function.
Affiliation(s)
- Yekaterina Shulgina
  - Innovative Genomics Institute, University of California, Berkeley, CA, USA
  - Department of Molecular and Cell Biology, University of California, Berkeley, CA, USA
  - California Institute for Quantitative Biosciences, University of California, Berkeley, CA, USA
- Marena I Trinidad
  - Innovative Genomics Institute, University of California, Berkeley, CA, USA
  - Howard Hughes Medical Institute, University of California, Berkeley, CA, USA
- Conner J Langeberg
  - Innovative Genomics Institute, University of California, Berkeley, CA, USA
  - Department of Molecular and Cell Biology, University of California, Berkeley, CA, USA
  - California Institute for Quantitative Biosciences, University of California, Berkeley, CA, USA
- Hunter Nisonoff
  - Center for Computational Biology, University of California, Berkeley, CA, USA
- Seyone Chithrananda
  - Innovative Genomics Institute, University of California, Berkeley, CA, USA
  - Department of Electrical Engineering and Computer Sciences, University of California, Berkeley, CA, USA
- Petr Skopintsev
  - Innovative Genomics Institute, University of California, Berkeley, CA, USA
  - California Institute for Quantitative Biosciences, University of California, Berkeley, CA, USA
- Amos J Nissley
  - Department of Chemistry, University of California, Berkeley, CA, USA
- Jaymin Patel
  - Innovative Genomics Institute, University of California, Berkeley, CA, USA
- Ron S Boger
  - Innovative Genomics Institute, University of California, Berkeley, CA, USA
  - Biophysics Graduate Program, University of California, Berkeley, CA, USA
- Honglue Shi
  - Innovative Genomics Institute, University of California, Berkeley, CA, USA
  - Howard Hughes Medical Institute, University of California, Berkeley, CA, USA
- Peter H Yoon
  - Innovative Genomics Institute, University of California, Berkeley, CA, USA
  - Department of Molecular and Cell Biology, University of California, Berkeley, CA, USA
  - Department of Chemistry, University of California, Berkeley, CA, USA
- Erin E Doherty
  - Innovative Genomics Institute, University of California, Berkeley, CA, USA
  - California Institute for Quantitative Biosciences, University of California, Berkeley, CA, USA
- Tara Pande
  - Department of Electrical Engineering and Computer Sciences, University of California, Berkeley, CA, USA
- Aditya M Iyer
  - Department of Physics, University of California, Berkeley, CA, USA
- Jennifer A Doudna
  - Innovative Genomics Institute, University of California, Berkeley, CA, USA
  - Department of Molecular and Cell Biology, University of California, Berkeley, CA, USA
  - California Institute for Quantitative Biosciences, University of California, Berkeley, CA, USA
  - Howard Hughes Medical Institute, University of California, Berkeley, CA, USA
  - Department of Chemistry, University of California, Berkeley, CA, USA
  - MBIB Division, Lawrence Berkeley National Laboratory, Berkeley, CA, USA
  - Gladstone Institutes, University of California, San Francisco, CA, USA
- Jamie H D Cate
  - Innovative Genomics Institute, University of California, Berkeley, CA, USA
  - Department of Molecular and Cell Biology, University of California, Berkeley, CA, USA
  - California Institute for Quantitative Biosciences, University of California, Berkeley, CA, USA
  - Department of Chemistry, University of California, Berkeley, CA, USA
  - MBIB Division, Lawrence Berkeley National Laboratory, Berkeley, CA, USA