1. Zhang Y, Lang M, Jiang J, Gao Z, Xu F, Litfin T, Chen K, Singh J, Huang X, Song G, Tian Y, Zhan J, Chen J, Zhou Y. Multiple sequence alignment-based RNA language model and its application to structural inference. Nucleic Acids Res 2024; 52:e3. [PMID: 37941140 PMCID: PMC10783488 DOI: 10.1093/nar/gkad1031] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/10/2023] [Accepted: 10/21/2023] [Indexed: 11/10/2023] Open
Abstract
Compared with proteins, DNA and RNA are more difficult languages to interpret because four-letter coded DNA/RNA sequences have less information content than 20-letter coded protein sequences. While BERT (Bidirectional Encoder Representations from Transformers)-like language models have been developed for RNA, they are ineffective at capturing the evolutionary information from homologous sequences because unlike proteins, RNA sequences are less conserved. Here, we have developed an unsupervised multiple sequence alignment-based RNA language model (RNA-MSM) by utilizing homologous sequences from an automatic pipeline, RNAcmap, as it can provide significantly more homologous sequences than manually annotated Rfam. We demonstrate that the resulting unsupervised, two-dimensional attention maps and one-dimensional embeddings from RNA-MSM contain structural information. In fact, they can be directly mapped with high accuracy to 2D base pairing probabilities and 1D solvent accessibilities, respectively. Further fine-tuning led to significantly improved performance on these two downstream tasks compared with existing state-of-the-art techniques including SPOT-RNA2 and RNAsnap2. By comparison, RNA-FM, a BERT-based RNA language model, performs worse than one-hot encoding with its embedding in base pair and solvent-accessible surface area prediction. We anticipate that the pre-trained RNA-MSM model can be fine-tuned on many other tasks related to RNA structure and function.
Affiliation(s)
- Yikun Zhang
  - School of Electronic and Computer Engineering, Peking University, Shenzhen 518055, China
  - AI for Science (AI4S)-Preferred Program, Peking University Shenzhen Graduate School, Shenzhen 518055, China
- Mei Lang
  - Institute of Systems and Physical Biology, Shenzhen Bay Laboratory, Shenzhen 518107, China
- Jiuhong Jiang
  - Institute of Systems and Physical Biology, Shenzhen Bay Laboratory, Shenzhen 518107, China
- Zhiqiang Gao
  - Shanghai Artificial Intelligence Laboratory, Shanghai 200232, China
  - Peng Cheng Laboratory, Shenzhen 518066, China
- Fan Xu
  - Peng Cheng Laboratory, Shenzhen 518066, China
- Thomas Litfin
  - Institute for Glycomics, Griffith University, Parklands Dr, Southport, QLD 4215, Australia
- Ke Chen
  - Institute of Systems and Physical Biology, Shenzhen Bay Laboratory, Shenzhen 518107, China
- Jaswinder Singh
  - Institute of Systems and Physical Biology, Shenzhen Bay Laboratory, Shenzhen 518107, China
- Guoli Song
  - Peng Cheng Laboratory, Shenzhen 518066, China
- Jian Zhan
  - Institute of Systems and Physical Biology, Shenzhen Bay Laboratory, Shenzhen 518107, China
- Jie Chen
  - School of Electronic and Computer Engineering, Peking University, Shenzhen 518055, China
  - Peng Cheng Laboratory, Shenzhen 518066, China
- Yaoqi Zhou
  - Institute of Systems and Physical Biology, Shenzhen Bay Laboratory, Shenzhen 518107, China
  - Institute for Glycomics, Griffith University, Parklands Dr, Southport, QLD 4215, Australia
2. Zhou Y, Litfin T, Zhan J. 3 = 1 + 2: how the divide conquered de novo protein structure prediction and what is next? Natl Sci Rev 2023; 10:nwad259. [PMID: 38033736 PMCID: PMC10684263 DOI: 10.1093/nsr/nwad259] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/29/2023] [Revised: 09/18/2023] [Indexed: 12/02/2023] Open
Affiliation(s)
- Yaoqi Zhou
  - Institute of Systems and Physical Biology, Shenzhen Bay Laboratory, China
  - Institute for Glycomics, Griffith University, Australia
- Thomas Litfin
  - Institute for Glycomics, Griffith University, Australia
- Jian Zhan
  - Institute of Systems and Physical Biology, Shenzhen Bay Laboratory, China
3. Singh J, Paliwal K, Litfin T, Singh J, Zhou Y. Predicting RNA distance-based contact maps by integrated deep learning on physics-inferred secondary structure and evolutionary-derived mutational coupling. Bioinformatics 2022; 38:3900-3910. [PMID: 35751593 PMCID: PMC9364379 DOI: 10.1093/bioinformatics/btac421] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 11/24/2021] [Revised: 04/30/2022] [Accepted: 06/28/2022] [Indexed: 12/24/2022] Open
Abstract
MOTIVATION Recently, AlphaFold2 achieved high experimental accuracy for the majority of proteins in Critical Assessment of Structure Prediction (CASP 14). This raises the hope that one day, we may achieve the same feat for RNA structure prediction for those structured RNAs, which is as fundamentally and practically important as protein structure prediction. One major factor in the recent advancement of protein structure prediction is the highly accurate prediction of distance-based contact maps of proteins. RESULTS Here, we showed that by integrated deep learning with physics-inferred secondary structures, co-evolutionary information and multiple sequence-alignment sampling, we can achieve RNA contact-map prediction at a level of accuracy similar to that in protein contact-map prediction. More importantly, highly accurate prediction for top L long-range contacts can be assured for those RNAs with a high effective number of homologous sequences (Neff > 50). The initial use of the predicted contact map as distance-based restraints confirmed its usefulness in 3D structure prediction. AVAILABILITY AND IMPLEMENTATION SPOT-RNA-2D is available as a web server at https://sparks-lab.org/server/spot-rna-2d/ and as a standalone program at https://github.com/jaswindersingh2/SPOT-RNA-2D. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Thomas Litfin
  - Institute for Glycomics, Griffith University, Parklands Dr. Southport, QLD 4222, Australia
- Jaspreet Singh
  - Signal Processing Laboratory, School of Engineering and Built Environment, Griffith University, Brisbane, QLD 4111, Australia
- Yaoqi Zhou
  - To whom correspondence should be addressed.
4. Solayman M, Litfin T, Singh J, Paliwal K, Zhou Y, Zhan J. Probing RNA structures and functions by solvent accessibility: an overview from experimental and computational perspectives. Brief Bioinform 2022; 23:6554125. [PMID: 35348613 PMCID: PMC9116373 DOI: 10.1093/bib/bbac112] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 12/13/2021] [Revised: 03/03/2022] [Accepted: 03/04/2022] [Indexed: 12/30/2022] Open
Abstract
Characterizing RNA structures and functions has mostly focused on 2D (secondary) and 3D (tertiary) structures. Recent advances in experimental and computational techniques for probing or predicting RNA solvent accessibility make this 1D representation of tertiary structures an increasingly attractive feature to explore. Here, we provide a survey of these recent developments, which indicate the emergence of solvent accessibility as a simple 1D property, adding to secondary and tertiary structures for investigating complex structure–function relations of RNAs.
Affiliation(s)
- Md Solayman
  - Institute for Glycomics, Griffith University, Parklands Dr. Southport, QLD 4222, Australia
- Thomas Litfin
  - Institute for Glycomics, Griffith University, Parklands Dr. Southport, QLD 4222, Australia
- Jaswinder Singh
  - Signal Processing Laboratory, School of Engineering and Built Environment, Griffith University, Brisbane, QLD 4111, Australia
- Kuldip Paliwal
  - Signal Processing Laboratory, School of Engineering and Built Environment, Griffith University, Brisbane, QLD 4111, Australia
- Yaoqi Zhou
  - Institute for Glycomics, Griffith University, Parklands Dr. Southport, QLD 4222, Australia
  - Institute for Systems and Physical Biology, Shenzhen Bay Laboratory, Shenzhen 518055, China
  - Peking University Shenzhen Graduate School, Shenzhen 518055, China
- Jian Zhan
  - Institute for Systems and Physical Biology, Shenzhen Bay Laboratory, Shenzhen 518055, China
5. Zhou K, Litfin T, Solayman M, Zhao H, Zhou Y, Zhan J. High-throughput split-protein profiling by combining transposon mutagenesis and regulated protein-protein interactions with deep sequencing. Int J Biol Macromol 2022; 203:543-552. [PMID: 35120933 DOI: 10.1016/j.ijbiomac.2022.01.173] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/02/2021] [Revised: 01/27/2022] [Accepted: 01/28/2022] [Indexed: 11/05/2022]
Abstract
Splitting a protein at a position may lead to self- or assisted-complementary fragments depending on whether the two resulting fragments can reconstitute to maintain the native function spontaneously or require assistance from two interacting molecules. Assisted-complementary fragments with high contrast are an important tool for probing biological interactions. However, only a small number of assisted-complementary split-variants have been identified due to manual, labour-intensive optimization of a candidate gene. Here, we introduce a technique for high-throughput split-protein profiling (HiTS) that allows fast identification of self- and assisted-complementary positions by transposon mutagenesis, a rapamycin-regulated FRB-FKBP protein interaction pair, and deep sequencing. We test this technique by profiling three antibiotic-resistant genes (fosfomycin-resistant gene, fosA3, erythromycin-resistant gene, ermB, and chloramphenicol-resistant gene, catI). Self- and assisted-complementary fragments discovered by the high-throughput technique were subsequently confirmed by low-throughput testing of individual split positions. Thus, the HiTS technique provides a quicker alternative for discovering proteins with suitable self- and assisted-complementary split positions when combined with a readout such as fluorescence, bioluminescence, cell survival, gene transcription or genome editing.
Affiliation(s)
- Kai Zhou
  - Institute for Glycomics and School of Information and Communication Technology, Griffith University, Parklands Dr Southport, QLD 4222, Australia
  - Institute for Systems and Physical Biology, Shenzhen Bay Laboratory, Shenzhen 518055, China
- Thomas Litfin
  - Institute for Glycomics and School of Information and Communication Technology, Griffith University, Parklands Dr Southport, QLD 4222, Australia
- Md Solayman
  - Institute for Glycomics and School of Information and Communication Technology, Griffith University, Parklands Dr Southport, QLD 4222, Australia
- Huijun Zhao
  - Centre for Clean Environment and Energy, Griffith University, Gold Coast Campus, Queensland 4222, Australia
- Yaoqi Zhou
  - Institute for Glycomics and School of Information and Communication Technology, Griffith University, Parklands Dr Southport, QLD 4222, Australia
  - Institute for Systems and Physical Biology, Shenzhen Bay Laboratory, Shenzhen 518055, China
- Jian Zhan
  - Institute for Glycomics and School of Information and Communication Technology, Griffith University, Parklands Dr Southport, QLD 4222, Australia
  - Institute for Systems and Physical Biology, Shenzhen Bay Laboratory, Shenzhen 518055, China
6. Singh J, Litfin T, Singh J, Paliwal K, Zhou Y. SPOT-Contact-LM: improving single-sequence-based prediction of protein contact map using a transformer language model. Bioinformatics 2022; 38:1888-1894. [PMID: 35104320 PMCID: PMC9113311 DOI: 10.1093/bioinformatics/btac053] [Citation(s) in RCA: 12] [Impact Index Per Article: 6.0] [Received: 07/02/2021] [Revised: 11/21/2021] [Accepted: 01/26/2022] [Indexed: 02/03/2023] Open
Abstract
MOTIVATION Accurate prediction of protein contact-map is essential for accurate protein structure and function prediction. As a result, many methods have been developed for protein contact map prediction. However, most methods rely on protein-sequence-evolutionary information, which may not exist for many proteins due to lack of naturally occurring homologous sequences. Moreover, generating evolutionary profiles is computationally intensive. Here, we developed a contact-map predictor utilizing the output of a pre-trained language model ESM-1b as an input along with a large training set and an ensemble of residual neural networks. RESULTS We showed that the proposed method makes a significant improvement over a single-sequence-based predictor SSCpred with 15% improvement in the F1-score for the independent CASP14-FM test set. It also outperforms evolutionary-profile-based methods trRosetta and SPOT-Contact with 48.7% and 48.5% respective improvement in the F1-score on the proteins without homologs (Neff = 1) in the independent SPOT-2018 set. The new method provides a much faster and reasonably accurate alternative to evolution-based methods, useful for large-scale prediction. AVAILABILITY AND IMPLEMENTATION Stand-alone-version of SPOT-Contact-LM is available at https://github.com/jas-preet/SPOT-Contact-Single. Direct prediction can also be made at https://sparks-lab.org/server/spot-contact-single. The datasets used in this research can also be downloaded from the GitHub. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Thomas Litfin
  - Signal Processing Laboratory, School of Engineering and Built Environment, Griffith University, Brisbane, QLD 4111, Australia
- Jaswinder Singh
  - Signal Processing Laboratory, School of Engineering and Built Environment, Griffith University, Brisbane, QLD 4111, Australia
- Yaoqi Zhou
  - To whom correspondence should be addressed.
7. Solayman M, Litfin T, Zhou Y, Zhan J. High-throughput mapping of RNA solvent accessibility at the single-nucleotide resolution by RtcB ligation between a fixed 5'-OH-end linker and unique 3'-P-end fragments from hydroxyl radical cleavage. RNA Biol 2022; 19:1179-1189. [PMID: 36369947 PMCID: PMC9662193 DOI: 10.1080/15476286.2022.2145098] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/14/2022] Open
Abstract
Given the challenges for the experimental determination of RNA tertiary structures, probing solvent accessibility has become increasingly important to gain functional insights. Among various chemical probes developed, backbone-cleaving hydroxyl radical is the only one that can provide unbiased detection of all accessible nucleotides. However, the readouts have been based on reverse transcription (RT) stop at the cleaving sites, which are prone to false positives due to PCR amplification bias, early drop-off of reverse transcriptase, and the use of random primers in RT reaction. Here, we introduced a fixed-primer method called RL-Seq by performing RtcB Ligation (RL) between a fixed 5'-OH-end linker and unique 3'-P-end fragments from hydroxyl radical cleavage prior to high-throughput sequencing. The application of this method to E. coli ribosomes confirmed its ability to accurately probe solvent accessibility with high sensitivity (low required sequencing depth) and accuracy (strong correlation to structure-derived values) at the single-nucleotide resolution. Moreover, a near-perfect correlation was found between the experiments with and without using unique molecular identifiers, indicating negligible PCR biases in RL-Seq. Further improvement of RL-Seq and its potential transcriptome-wide applications are discussed.
Affiliation(s)
- Md Solayman
  - Institute for Glycomics, Griffith University, Parklands Dr, Southport, QLD, Australia
- Thomas Litfin
  - Institute for Glycomics, Griffith University, Parklands Dr, Southport, QLD, Australia
- Yaoqi Zhou
  - Institute for Glycomics, Griffith University, Parklands Dr, Southport, QLD, Australia
  - Institute for Systems and Physical Biology, Shenzhen Bay Laboratory, Shenzhen, China
  - Contact: Yaoqi Zhou, Institute for Systems and Physical Biology, Shenzhen Bay Laboratory, Shenzhen 518055, China
- Jian Zhan
  - Institute for Glycomics, Griffith University, Parklands Dr, Southport, QLD, Australia
  - Institute for Systems and Physical Biology, Shenzhen Bay Laboratory, Shenzhen, China
  - Contact: Jian Zhan, Institute for Systems and Physical Biology, Shenzhen Bay Laboratory, Shenzhen 518055, China
8. Zhang T, Singh J, Litfin T, Zhan J, Paliwal K, Zhou Y. RNAcmap: A Fully Automatic Pipeline for Predicting Contact Maps of RNAs by Evolutionary Coupling Analysis. Bioinformatics 2021; 37:3494-3500. [PMID: 34021744 DOI: 10.1093/bioinformatics/btab391] [Citation(s) in RCA: 9] [Impact Index Per Article: 3.0] [Received: 02/17/2020] [Revised: 03/27/2021] [Accepted: 05/18/2021] [Indexed: 11/13/2022] Open
Abstract
MOTIVATION The accuracy of RNA secondary and tertiary structure prediction can be significantly improved by using structural restraints derived from evolutionary coupling or direct coupling analysis. Currently, these coupling analyses rely on manually curated multiple sequence alignments collected in the Rfam database, which contains 3016 families. By comparison, millions of non-coding RNA sequences are known. Here, we established RNAcmap, a fully automatic pipeline that enables evolutionary coupling analysis for any RNA sequence. The homology search was based on the covariance model built by INFERNAL according to two secondary structure predictors: a folding-based algorithm RNAfold and the latest deep-learning method SPOT-RNA. RESULTS We showed that the performance of RNAcmap is less dependent on the specific evolutionary coupling tool but is more dependent on the accuracy of the secondary structure predictor, with the best performance given by RNAcmap (SPOT-RNA). The performance of RNAcmap (SPOT-RNA) is comparable to that based on Rfam-supplied alignment and consistent for those sequences that are not in Rfam collections. Further improvement can be made with a simple meta predictor RNAcmap (SPOT-RNA/RNAfold) depending on which secondary structure predictor can find more homologous sequences. Reliable base-pairing information generated from RNAcmap, in particular for RNAs with a high effective number of homologous sequences, will be useful for aiding RNA structure prediction. AVAILABILITY RNAcmap is available as a web server at https://sparks-lab.org/server/rnacmap/ and as a standalone application along with the datasets at https://github.com/sparks-lab-org/RNAcmap_standalone. A platform-independent and fully configured docker image of RNAcmap is also provided at https://hub.docker.com/r/jaswindersingh2/rnacmap.
Affiliation(s)
- Tongchuan Zhang
  - Institute for Glycomics and School of Information and Communication Technology, Griffith University, Parklands Dr. Southport, QLD 4222, Australia
- Jaswinder Singh
  - Signal Processing Laboratory, School of Engineering and Built Environment, Griffith University, Brisbane, QLD 4111, Australia
- Thomas Litfin
  - Institute for Glycomics and School of Information and Communication Technology, Griffith University, Parklands Dr. Southport, QLD 4222, Australia
- Jian Zhan
  - Institute for Glycomics and School of Information and Communication Technology, Griffith University, Parklands Dr. Southport, QLD 4222, Australia
- Kuldip Paliwal
  - Signal Processing Laboratory, School of Engineering and Built Environment, Griffith University, Brisbane, QLD 4111, Australia
- Yaoqi Zhou
  - Institute for Glycomics and School of Information and Communication Technology, Griffith University, Parklands Dr. Southport, QLD 4222, Australia
  - Institute for Systems and Physical Biology, Shenzhen Bay Laboratory, Shenzhen 518055, China
9. Singh J, Litfin T, Paliwal K, Singh J, Hanumanthappa AK, Zhou Y. SPOT-1D-Single: Improving the Single-Sequence-Based Prediction of Protein Secondary Structure, Backbone Angles, Solvent Accessibility and Half-Sphere Exposures using a Large Training Set and Ensembled Deep Learning. Bioinformatics 2021; 37:3464-3472. [PMID: 33983382 DOI: 10.1093/bioinformatics/btab316] [Citation(s) in RCA: 11] [Impact Index Per Article: 3.7] [Received: 11/06/2020] [Revised: 04/06/2021] [Accepted: 04/26/2021] [Indexed: 02/01/2023] Open
Abstract
MOTIVATION Knowing protein secondary structure and other one-dimensional structural properties is essential for accurate protein structure and function prediction. As a result, many methods have been developed for predicting these one-dimensional structural properties. However, most methods rely on evolutionary information that may not exist for many proteins due to a lack of sequence homologs. Moreover, obtaining evolutionary information is computationally intensive as the library of protein sequences continues to expand exponentially. Here we developed a new single-sequence method called SPOT-1D-Single based on a large training dataset of 39120 proteins deposited prior to 2016 and an ensemble of hybrid Long-Short-Term-Memory bidirectional neural network and convolutional neural network. RESULTS We showed that SPOT-1D-Single consistently improves over SPIDER3-Single and ProteinUnet for secondary structure, solvent accessibility, contact number, and backbone angles prediction for all seven independent test sets (TEST2018, SPOT-2016, SPOT-2016-HQ, SPOT-2018, SPOT-2018-HQ, CASP12, and CASP13 free-modeling targets). For example, the accuracy of the predicted three-state secondary structure is 72.12-74.28% for SPOT-1D-Single, compared to 69.1-72.6% for SPIDER3-Single and 70.6-73% for ProteinUnet. SPOT-1D-Single also predicts SS3 and SS8 with 6.24% and 6.98% better accuracy, respectively, than SPOT-1D on SPOT-2018 proteins with no homologs (Neff=1). The new method's improvement over existing techniques is due to a larger training set combined with ensembled learning. AVAILABILITY Standalone-version of SPOT-1D-Single is available at https://github.com/jas-preet/SPOT-1D-Single. Direct prediction can also be made at https://sparks-lab.org/server/spot-1d-single. The datasets used in this research can also be downloaded from GitHub.
Affiliation(s)
- Jaspreet Singh
  - Signal Processing Laboratory, School of Engineering and Built Environment, Griffith University, Brisbane, QLD 4111, Australia
- Thomas Litfin
  - School of Information and Communication Technology, Griffith University, Parklands Dr. Southport, QLD 4222, Australia
- Kuldip Paliwal
  - Signal Processing Laboratory, School of Engineering and Built Environment, Griffith University, Brisbane, QLD 4111, Australia
- Jaswinder Singh
  - Signal Processing Laboratory, School of Engineering and Built Environment, Griffith University, Brisbane, QLD 4111, Australia
- Anil Kumar Hanumanthappa
  - Signal Processing Laboratory, School of Engineering and Built Environment, Griffith University, Brisbane, QLD 4111, Australia
- Yaoqi Zhou
  - School of Information and Communication Technology, Griffith University, Parklands Dr. Southport, QLD 4222, Australia
  - Institute for Glycomics, Griffith University, Parklands Dr. Southport, QLD 4222, Australia
  - Institute for Systems and Physical Biology, Shenzhen Bay Laboratory, Shenzhen 518055, China
10. Singh J, Paliwal K, Zhang T, Singh J, Litfin T, Zhou Y. Improved RNA Secondary Structure and Tertiary Base-pairing Prediction Using Evolutionary Profile, Mutational Coupling and Two-dimensional Transfer Learning. Bioinformatics 2021; 37:2589-2600. [PMID: 33704363 DOI: 10.1093/bioinformatics/btab165] [Citation(s) in RCA: 17] [Impact Index Per Article: 5.7] [Received: 12/01/2020] [Revised: 02/05/2021] [Accepted: 03/08/2021] [Indexed: 11/12/2022] Open
Abstract
MOTIVATION The recent discovery of numerous non-coding RNAs (long non-coding RNAs, in particular) has transformed our perception about the roles of RNAs in living organisms. Our ability to understand them, however, is hampered by our inability to solve their secondary and tertiary structures in high resolution efficiently by existing experimental techniques. Computational prediction of RNA secondary structure, on the other hand, has received much-needed improvement, recently, through deep learning of a large approximate data, followed by transfer learning with gold-standard base-pairing structures from high-resolution 3-D structures. Here, we expand this single-sequence-based learning to the use of evolutionary profiles and mutational coupling. RESULTS The new method allows large improvement not only in canonical base-pairs (RNA secondary structures) but more so in base-pairing associated with tertiary interactions such as pseudoknots, noncanonical and lone base-pairs. In particular, it is highly accurate for those RNAs of more than 1000 homologous sequences by achieving >0.8 F1-score (harmonic mean of sensitivity and precision) for 14/16 RNAs tested. The method can also significantly improve base-pairing prediction by incorporating artificial but functional homologous sequences generated from deep mutational scanning without any modification. The fully automatic method (publicly available as server and standalone software) should provide the scientific community a new powerful tool to capture not only the secondary structure but also tertiary base-pairing information for building three-dimensional models. It also highlights the future of accurately solving the base-pairing structure by using a large number of natural and/or artificial homologous sequences. AVAILABILITY Standalone-version of SPOT-RNA2 is available at https://github.com/jaswindersingh2/SPOT-RNA2. Direct prediction can also be made at https://sparks-lab.org/server/spot-rna2/. 
The datasets used in this research can also be downloaded from GitHub and the web server mentioned above.
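The abstract above describes the F1-score as the harmonic mean of sensitivity and precision. A minimal sketch of that calculation for base-pair prediction follows, assuming base pairs are given as (i, j) index tuples; the function name `f1_score` and this representation are illustrative choices and are not taken from SPOT-RNA2.

```python
def f1_score(true_pairs, predicted_pairs):
    """Harmonic mean of sensitivity and precision over sets of base pairs.

    Assumption: each base pair is an (i, j) tuple of nucleotide indices;
    (i, j) and (j, i) are treated as the same pair.
    """
    true_set = {tuple(sorted(p)) for p in true_pairs}
    pred_set = {tuple(sorted(p)) for p in predicted_pairs}

    true_positives = len(true_set & pred_set)
    if true_positives == 0:
        # No correct pairs (or an empty input): define F1 as 0.
        return 0.0

    precision = true_positives / len(pred_set)    # correct / predicted
    sensitivity = true_positives / len(true_set)  # correct / actual
    return 2 * precision * sensitivity / (precision + sensitivity)


# Hypothetical example: 3 of 4 predicted pairs are correct.
native = [(1, 10), (2, 9), (3, 8), (4, 7)]
predicted = [(1, 10), (9, 2), (3, 8), (5, 6)]
print(f1_score(native, predicted))
```

In this toy case precision and sensitivity are both 3/4, so their harmonic mean is also 3/4.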
Affiliation(s)
- Jaswinder Singh
  - Signal Processing Laboratory, School of Engineering and Built Environment, Griffith University, Brisbane, QLD 4111, Australia
- Kuldip Paliwal
  - Signal Processing Laboratory, School of Engineering and Built Environment, Griffith University, Brisbane, QLD 4111, Australia
- Tongchuan Zhang
  - Institute for Glycomics and School of Information and Communication Technology, Griffith University, Parklands Dr. Southport, QLD 4222, Australia
- Jaspreet Singh
  - Signal Processing Laboratory, School of Engineering and Built Environment, Griffith University, Brisbane, QLD 4111, Australia
- Thomas Litfin
  - Institute for Glycomics and School of Information and Communication Technology, Griffith University, Parklands Dr. Southport, QLD 4222, Australia
- Yaoqi Zhou
  - Institute for Glycomics and School of Information and Communication Technology, Griffith University, Parklands Dr. Southport, QLD 4222, Australia
11. Hanson J, Litfin T, Paliwal K, Zhou Y. Identifying molecular recognition features in intrinsically disordered regions of proteins by transfer learning. Bioinformatics 2020; 36:1107-1113. [PMID: 31504193 DOI: 10.1093/bioinformatics/btz691] [Citation(s) in RCA: 19] [Impact Index Per Article: 4.8] [Received: 03/05/2019] [Revised: 07/24/2019] [Accepted: 08/31/2019] [Indexed: 11/12/2022] Open
Abstract
MOTIVATION Protein intrinsic disorder describes the tendency of sequence residues to not fold into a rigid three-dimensional shape by themselves. However, some of these disordered regions can transition from disorder to order when interacting with another molecule in segments known as molecular recognition features (MoRFs). Previous analysis has shown that these MoRF regions are indirectly encoded within the prediction of residue disorder as low-confidence predictions [i.e. in a semi-disordered state P(D)≈0.5]. Thus, what has been learned for disorder prediction may be transferable to MoRF prediction. Transferring the internal characterization of protein disorder for the prediction of MoRF residues would allow us to take advantage of the large training set available for disorder prediction, enabling the training of larger analytical models than is currently feasible on the small number of currently available annotated MoRF proteins. In this paper, we propose a new method for MoRF prediction by transfer learning from the SPOT-Disorder2 ensemble models built for disorder prediction. RESULTS We confirm that directly training on the MoRF set with a randomly initialized model yields substantially poorer performance on independent test sets than by using the transfer-learning-based method SPOT-MoRF, for both deep and simple networks. Its comparison to current state-of-the-art techniques reveals its superior performance in identifying MoRF binding regions in proteins across two independent testing sets, including our new dataset of >800 protein chains. These test chains share <30% sequence similarity to all training and validation proteins used in SPOT-Disorder2 and SPOT-MoRF, and provide a much-needed large-scale update on the performance of current MoRF predictors. The method is expected to be useful in locating functional disordered regions in proteins. 
AVAILABILITY AND IMPLEMENTATION SPOT-MoRF and its data are available as a web server and as a standalone program at: http://sparks-lab.org/jack/server/SPOT-MoRF/index.php. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Jack Hanson
  - Signal Processing Laboratory, Griffith University, Brisbane, QLD 4122, Australia
- Thomas Litfin
  - Institute for Glycomics, School of Information and Communication Technology, Griffith University, Southport, QLD 4222, Australia
- Kuldip Paliwal
  - Signal Processing Laboratory, Griffith University, Brisbane, QLD 4122, Australia
- Yaoqi Zhou
  - Institute for Glycomics, School of Information and Communication Technology, Griffith University, Southport, QLD 4222, Australia
12
|
Hanson J, Paliwal KK, Litfin T, Zhou Y. SPOT-Disorder2: Improved Protein Intrinsic Disorder Prediction by Ensembled Deep Learning. Genomics Proteomics Bioinformatics 2020; 17:645-656. [PMID: 32173600 PMCID: PMC7212484 DOI: 10.1016/j.gpb.2019.01.004] [Citation(s) in RCA: 84] [Impact Index Per Article: 21.0] [Received: 10/26/2018] [Revised: 01/18/2019] [Accepted: 02/15/2019] [Indexed: 01/13/2023]
Abstract
Intrinsically disordered or unstructured proteins (or regions in proteins) have been found to be important in a wide range of biological functions and implicated in many diseases. Due to the high cost and low efficiency of experimental determination of intrinsic disorder and the exponential increase of unannotated protein sequences, developing complementary computational prediction methods has been an active area of research for several decades. Here, we employed an ensemble of deep Squeeze-and-Excitation residual inception and long short-term memory (LSTM) networks for predicting protein intrinsic disorder with input from evolutionary information and predicted one-dimensional structural properties. The method, called SPOT-Disorder2, offers substantial and consistent improvement not only over our previous technique based on LSTM networks alone, but also over other state-of-the-art techniques in three independent tests with different ratios of disordered to ordered amino acid residues, and for sequences with either rich or limited evolutionary information. More importantly, semi-disordered regions predicted in SPOT-Disorder2 are more accurate in identifying molecular recognition features (MoRFs) than methods directly designed for MoRFs prediction. SPOT-Disorder2 is available as a web server and as a standalone program at https://sparks-lab.org/server/spot-disorder2/.
Affiliation(s)
- Jack Hanson
  - Signal Processing Laboratory, Griffith University, Brisbane 4111, Australia
- Kuldip K Paliwal
  - Signal Processing Laboratory, Griffith University, Brisbane 4111, Australia
- Thomas Litfin
  - School of Information and Communication Technology, Griffith University, Gold Coast 4222, Australia
- Yaoqi Zhou
  - School of Information and Communication Technology, Griffith University, Gold Coast 4222, Australia; Institute for Glycomics, Griffith University, Gold Coast 4222, Australia
|
13
|
Cai Y, Li X, Sun Z, Lu Y, Zhao H, Hanson J, Paliwal K, Litfin T, Zhou Y, Yang Y. SPOT-Fold: Fragment-Free Protein Structure Prediction Guided by Predicted Backbone Structure and Contact Map. J Comput Chem 2019; 41:745-750. [PMID: 31845383 DOI: 10.1002/jcc.26132] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.6] [Received: 03/22/2019] [Revised: 10/07/2019] [Accepted: 12/01/2019] [Indexed: 02/01/2023]
Abstract
Protein structure determination has been one of the most challenging problems in molecular biology for the past 60 years. Here we present an ab initio protein tertiary-structure prediction method assisted by predicted contact maps from SPOT-Contact and predicted dihedral angles from SPIDER 3. These predicted properties are fed to the Crystallography and NMR System (CNS) for restrained structure modeling. The resulting structures are first evaluated by the potential energy calculated by CNS, followed by the dDFIRE energy function for model selection. The method, called SPOT-Fold, has been tested on 241 CASP targets between 67 and 670 amino acid residues and on 60 randomly selected globular proteins under 100 amino acids. The method has comparable accuracy to other contact-map-based modeling techniques. © 2019 Wiley Periodicals, Inc.
Affiliation(s)
- Yufeng Cai
  - School of Data and Computer Science, Sun Yat-Sen University, 132 East Circle at University City, Guangzhou, 510006, China
- Xiongjun Li
  - School of Data and Computer Science, Sun Yat-Sen University, 132 East Circle at University City, Guangzhou, 510006, China
- Zhe Sun
  - School of Data and Computer Science, Sun Yat-Sen University, 132 East Circle at University City, Guangzhou, 510006, China
- Yutong Lu
  - School of Data and Computer Science, Sun Yat-Sen University, 132 East Circle at University City, Guangzhou, 510006, China
- Huiying Zhao
  - Sun Yat-sen Memorial Hospital, Sun Yat-sen University, Guangzhou, 510000, China
- Jack Hanson
  - Signal Processing Laboratory, Griffith University, Brisbane, Queensland, 4122, Australia
- Kuldip Paliwal
  - Signal Processing Laboratory, Griffith University, Brisbane, Queensland, 4122, Australia
- Thomas Litfin
  - Institute for Glycomics and School of Information and Communication Technology, Griffith University, Southport, Queensland, 4222, Australia
- Yaoqi Zhou
  - Institute for Glycomics and School of Information and Communication Technology, Griffith University, Southport, Queensland, 4222, Australia
- Yuedong Yang
  - School of Data and Computer Science, Sun Yat-Sen University, 132 East Circle at University City, Guangzhou, 510006, China
|
14
|
Hanson J, Paliwal K, Litfin T, Yang Y, Zhou Y. Accurate prediction of protein contact maps by coupling residual two-dimensional bidirectional long short-term memory with convolutional neural networks. Bioinformatics 2019; 34:4039-4045. [PMID: 29931279 DOI: 10.1093/bioinformatics/bty481] [Citation(s) in RCA: 60] [Impact Index Per Article: 12.0] [Received: 02/09/2018] [Accepted: 06/13/2018] [Indexed: 11/12/2022] Open
Abstract
Motivation Accurate prediction of a protein contact map depends greatly on capturing as much contextual information as possible from surrounding residues for a target residue pair. Recently, ultra-deep residual convolutional networks were found to be state-of-the-art in the latest Critical Assessment of Structure Prediction techniques (CASP12) for protein contact map prediction by attempting to provide a protein-wide context at each residue pair. Recurrent neural networks, especially those built from Long Short-Term Memory (LSTM) cells, have seen great success in recent protein residue classification problems due to their ability to propagate information through long protein sequences. Here, we propose a novel protein contact map prediction method by stacking residual convolutional networks with two-dimensional residual bidirectional recurrent LSTM networks, and using both one-dimensional sequence-based and two-dimensional evolutionary coupling-based information. Results We show that the proposed method achieves a robust performance over validation and independent test sets with the Area Under the receiver operating characteristic Curve (AUC) > 0.95 in all tests. When compared to several state-of-the-art methods for independent testing of 228 proteins, the method yields an AUC value of 0.958, whereas the next-best method obtains an AUC of 0.909. More importantly, the improvement is over contacts at all sequence-position separations. Specifically, increases in precision of 8.95%, 5.65% and 2.84% were observed for the top L∕10 predictions over the next best for short, medium and long-range contacts, respectively. This confirms the usefulness of ResNets to congregate the short-range relations and 2D-BRLSTM to propagate the long-range dependencies throughout the entire protein contact map 'image'. Availability and implementation SPOT-Contact server url: http://sparks-lab.org/jack/server/SPOT-Contact/. Supplementary information Supplementary data are available at Bioinformatics online.
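The top L∕10 precision quoted above can be sketched as follows, where L is the sequence length. The function name is hypothetical, and the separation ranges used in the usage note (6 to 11 for short, 12 to 23 for medium, 24 and above for long range) are a commonly used convention assumed here; the abstract does not state the ranges the authors used.

```python
import numpy as np

def top_precision(prob, native, min_sep, max_sep=None, frac=10):
    """Precision of the top L/frac predicted contacts whose sequence
    separation |i - j| lies in [min_sep, max_sep] (max_sep=None: no upper bound).

    prob   : (L, L) array of predicted contact probabilities
    native : (L, L) boolean array of true contacts
    """
    L = prob.shape[0]
    # upper-triangle residue pairs with j - i >= min_sep
    i, j = np.triu_indices(L, k=min_sep)
    if max_sep is not None:
        keep = (j - i) <= max_sep
        i, j = i[keep], j[keep]
    n_top = max(1, L // frac)
    # indices of the n_top highest-scoring pairs in this separation range
    order = np.argsort(-prob[i, j], kind="stable")[:n_top]
    return float(native[i[order], j[order]].mean())
```

Under the assumed convention, short-, medium- and long-range precision would be obtained with `top_precision(prob, native, 6, 11)`, `top_precision(prob, native, 12, 23)` and `top_precision(prob, native, 24)`.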
Affiliation(s)
- Jack Hanson
  - Signal Processing Laboratory, Griffith University, Brisbane, Australia
- Kuldip Paliwal
  - Signal Processing Laboratory, Griffith University, Brisbane, Australia
- Thomas Litfin
  - Institute for Glycomics and School of Information and Communication Technology, Griffith University, Southport, Australia
- Yuedong Yang
  - Institute for Glycomics and School of Information and Communication Technology, Griffith University, Southport, Australia
  - School of Data and Computer Science, Sun-Yat Sen University, Guangzhou, Guangdong, China
- Yaoqi Zhou
  - Institute for Glycomics and School of Information and Communication Technology, Griffith University, Southport, Australia
|
15
|
Hadley B, Litfin T, Day CJ, Haselhorst T, Zhou Y, Tiralongo J. Nucleotide Sugar Transporter SLC35 Family Structure and Function. Comput Struct Biotechnol J 2019; 17:1123-1134. [PMID: 31462968 PMCID: PMC6709370 DOI: 10.1016/j.csbj.2019.08.002] [Citation(s) in RCA: 44] [Impact Index Per Article: 8.8] [Received: 04/12/2019] [Revised: 08/05/2019] [Accepted: 08/05/2019] [Indexed: 12/22/2022] Open
Abstract
The covalent attachment of sugars to growing glycan chains is heavily reliant on a specific family of solute transporters (SLC35), the nucleotide sugar transporters (NSTs), which connect the synthesis of activated sugars in the nucleus or cytosol to glycosyltransferases that reside in the lumen of the endoplasmic reticulum (ER) and/or Golgi apparatus. This review provides a timely update on recent progress in the NST field. Specifically, we explore several NSTs of the SLC35 family whose substrate specificity and function have been poorly understood, but where significant progress has recently been made. These include SLC35 A4, A5 and D3. We also cover progress towards understanding the association of SLC35A2 with SLC35A3 and how this relates to their potential regulation, and how disruption of the dilysine motif in SLC35B4 causes mislocalisation, calling into question multisubstrate NSTs and their subcellular localisation and function. We also report on the recently described first crystal structure of an NST, the SLC35D2 homolog Vrg-4 from yeast. Using this crystal structure, we have generated a new model of SLC35A1 (CMP-sialic acid transporter, CST), with structural and mechanistic predictions based on all known CST-related data, and we include an overview of reported mutations that alter transport and/or substrate recognition (both de novo and site-directed). We also present a model of the CST-del177 isoform that potentially explains why the human CST isoform remains active while the hamster CST isoform is inactive, and we provide a possible alternate access mechanism that accounts for the CST being functional as either a monomer or a homodimer. Finally, we provide an update on two NST crystal structures that were published subsequent to the submission and during review of this report.
Affiliation(s)
- Barbara Hadley
  - Institute for Glycomics, Griffith University, Gold Coast Campus, Queensland 4222, Australia
- Thomas Litfin
  - School of Information and Communication Technology, Griffith University, Gold Coast Campus, Queensland 4212, Australia
- Chris J. Day
  - Institute for Glycomics, Griffith University, Gold Coast Campus, Queensland 4222, Australia
- Thomas Haselhorst
  - Institute for Glycomics, Griffith University, Gold Coast Campus, Queensland 4222, Australia
- Yaoqi Zhou
  - Institute for Glycomics, Griffith University, Gold Coast Campus, Queensland 4222, Australia
  - School of Information and Communication Technology, Griffith University, Gold Coast Campus, Queensland 4212, Australia
- Joe Tiralongo
  - Institute for Glycomics, Griffith University, Gold Coast Campus, Queensland 4222, Australia
  - Corresponding author.
|
16
|
Affiliation(s)
- Thomas Litfin
  - School of Information and Communication Technology, Griffith University, Southport, QLD 4222, Australia
- Yuedong Yang
  - School of Data and Computer Science, Sun-Yat Sen University, Guangzhou, Guangdong 510006, China
- Yaoqi Zhou
  - School of Information and Communication Technology, Griffith University, Southport, QLD 4222, Australia
  - Institute for Glycomics, Griffith University, Southport, QLD 4222, Australia
|
17
|
Hanson J, Paliwal K, Litfin T, Yang Y, Zhou Y. Improving prediction of protein secondary structure, backbone angles, solvent accessibility and contact numbers by using predicted contact maps and an ensemble of recurrent and residual convolutional neural networks. Bioinformatics 2018; 35:2403-2410. [DOI: 10.1093/bioinformatics/bty1006] [Citation(s) in RCA: 115] [Impact Index Per Article: 19.2] [Received: 08/13/2018] [Revised: 11/02/2018] [Accepted: 12/06/2018] [Indexed: 11/14/2022] Open
Abstract
Motivation
Sequence-based prediction of one dimensional structural properties of proteins has been a long-standing subproblem of protein structure prediction. Recently, prediction accuracy has been significantly improved due to the rapid expansion of protein sequence and structure libraries and advances in deep learning techniques, such as residual convolutional networks (ResNets) and Long-Short-Term Memory Cells in Bidirectional Recurrent Neural Networks (LSTM-BRNNs). Here we leverage an ensemble of LSTM-BRNN and ResNet models, together with predicted residue-residue contact maps, to continue the push towards the attainable limit of prediction for 3- and 8-state secondary structure, backbone angles (θ, τ, ϕ and ψ), half-sphere exposure, contact numbers and solvent accessible surface area (ASA).
Results
The new method, named SPOT-1D, achieves similar, high performance on a large validation set and test set (≈1000 proteins in each set), suggesting robust performance for unseen data. For the large test set, it achieves 87% and 77% in 3- and 8-state secondary structure prediction and 0.82 and 0.86 in correlation coefficients between predicted and measured ASA and contact numbers, respectively. Comparison to current state-of-the-art techniques reveals substantial improvement in secondary structure and backbone angle prediction. In particular, 44% of 40-residue fragment structures constructed from predicted backbone Cα-based θ and τ angles are less than 6 Å root-mean-squared-distance from their native conformations, nearly 20% better than the next best. The method is expected to be useful for advancing protein structure and function prediction.
Availability and implementation
SPOT-1D and its data are available at: http://sparks-lab.org/.
Supplementary information
Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Jack Hanson
  - Signal Processing Laboratory, Griffith University, Brisbane, QLD, Australia
- Kuldip Paliwal
  - Signal Processing Laboratory, Griffith University, Brisbane, QLD, Australia
- Thomas Litfin
  - School of Information and Communication Technology, Griffith University, Gold Coast, QLD, Australia
- Yuedong Yang
  - School of Data and Computer Science, Sun-Yat Sen University, Guangzhou, Guangdong, China
- Yaoqi Zhou
  - School of Information and Communication Technology, Griffith University, Gold Coast, QLD, Australia
  - Institute for Glycomics, Griffith University, Gold Coast, QLD, Australia
|
18
|
Litfin T, Zhou Y, Yang Y. SPOT-ligand 2: improving structure-based virtual screening by binding-homology search on an expanded structural template library. Bioinformatics 2017; 33:1238-1240. [PMID: 28057679 DOI: 10.1093/bioinformatics/btw829] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.4] [Received: 10/09/2016] [Accepted: 12/27/2016] [Indexed: 11/12/2022] Open
Abstract
Motivation The high cost of drug discovery motivates the development of accurate virtual screening tools. Binding-homology, which takes advantage of known protein-ligand binding pairs, has emerged as a powerful discrimination technique. In order to exploit all available binding data, modelled structures of ligand-binding sequences may be used to create an expanded structural binding template library. Results SPOT-Ligand 2 demonstrated significantly improved screening performance over its previous version by expanding the template library 15-fold. It also performed better than or similar to other binding-homology approaches on the DUD and DUD-E benchmarks. Availability and Implementation The server is available online at http://sparks-lab.org. Contacts yaoqi.zhou@griffith.edu.au or yuedong.yang@griffith.edu.au. Supplementary information Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Yaoqi Zhou
  - School of Information and Communication Technology, Institute for Glycomics, Griffith University, Southport, Queensland 4215, Australia
- Yuedong Yang
  - School of Information and Communication Technology, Institute for Glycomics, Griffith University, Southport, Queensland 4215, Australia
|
19
|
Abstract
BACKGROUND Analgesic nephropathy (AN) is chronic renal impairment as a direct consequence of chronic heavy analgesia ingestion. An association between non-steroidal anti-inflammatory agents and chronic kidney disease (CKD) has long been suspected. Despite ample observational data obtained in recent decades the relationship remains uncertain. This systematic review intends to summarize the available literature and to define the role of non-steroidal anti-inflammatories in the natural history of AN. METHODS We conducted a systematic literature search for articles describing the association between non-steroidal anti-inflammatory abuse and renal insufficiency. No restriction was placed on publication date, but papers were limited to English language, observational design, and human studies. RESULTS Nine articles met our inclusion criteria and were discussed in this review. This includes 5 cohort studies and 4 case-control trials, with a combined population of 12,418 study subjects and 23,877 controls. Eight of the nine reports failed to identify any increased risk of chronic renal impairment with heavy non-steroidal anti-inflammatory consumption. Study methods were heterogeneous and the overall quality of data was relatively poor. CONCLUSION A relationship between non-steroidal anti-inflammatory medicines and the development of CKD has never been proven. Based on the available scientific evidence non-steroidal anti-inflammatory agents do not appear to be implicated in the pathogenesis of AN.
Affiliation(s)
- Julian Yaxley
  - Department of Nephrology, Townsville Hospital, Townsville, QLD, Australia
- Thomas Litfin
  - Faculty of Science, University of Melbourne, Melbourne, VIC, Australia
|