1
Li H, Jin Z, Gao S, Kuang S, Lei C, Nie Z. Precise detection of G-quadruplexs in living systems: principles, applications, and perspectives. Chem Sci 2025:d5sc00918a. [PMID: 40417301 PMCID: PMC12096178 DOI: 10.1039/d5sc00918a] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/05/2025] [Accepted: 05/15/2025] [Indexed: 05/27/2025] Open
Abstract
G-quadruplexes (G4s) are non-canonical nucleic acid secondary structures that play a crucial role in regulating essential cellular processes such as replication, transcription, and translation. The formation of G4s is dynamically controlled by the physiological state of the cell. Accurate detection of G4 structures in live cells, as well as studies of their dynamic changes and the kinetics of specific G4s, are essential for understanding their biological roles, exploring potential links between aberrant G4 expression and disease, and developing G4-targeted diagnostic and therapeutic strategies. This perspective briefly overviews G4 formation mechanisms and their known biological functions. We then summarize the leading techniques and methodologies available for G4 detection, discussing the principles and applications of each approach. In addition, we outline strategies for the global detection of intracellular G4s, methods for conformational recognition, and approaches for targeting specific sequences. Finally, we discuss the technical limitations and challenges currently facing the field of G4 detection and offer perspectives on potential future directions. We hope this review will inspire further research into the biological functions of G4s and their applications in disease diagnosis and therapy.
Affiliation(s)
- Huanhuan Li
- State Key Laboratory of Chemo and Biosensing, Hunan Provincial Key Laboratory of Biomacromolecular Chemical Biology, Hunan University Changsha 410082 People's Republic of China
- Zelong Jin
- State Key Laboratory of Chemo and Biosensing, Hunan Provincial Key Laboratory of Biomacromolecular Chemical Biology, Hunan University Changsha 410082 People's Republic of China
- Shuxin Gao
- State Key Laboratory of Chemo and Biosensing, Hunan Provincial Key Laboratory of Biomacromolecular Chemical Biology, Hunan University Changsha 410082 People's Republic of China
- Shi Kuang
- State Key Laboratory of Chemo and Biosensing, Hunan Provincial Key Laboratory of Biomacromolecular Chemical Biology, Hunan University Changsha 410082 People's Republic of China
- Chunyang Lei
- State Key Laboratory of Chemo and Biosensing, Hunan Provincial Key Laboratory of Biomacromolecular Chemical Biology, Hunan University Changsha 410082 People's Republic of China
- Zhou Nie
- State Key Laboratory of Chemo and Biosensing, Hunan Provincial Key Laboratory of Biomacromolecular Chemical Biology, Hunan University Changsha 410082 People's Republic of China
2
Yamada K, Suga K, Abe N, Hashimoto K, Tsutsumi S, Inagaki M, Hashiya F, Abe H, Hamada M. Multi-objective computational optimization of human 5' UTR sequences. Brief Bioinform 2025; 26:bbaf225. [PMID: 40413870 DOI: 10.1093/bib/bbaf225] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/09/2024] [Revised: 03/26/2025] [Accepted: 04/07/2025] [Indexed: 05/27/2025] Open
Abstract
The computational design of messenger RNA (mRNA) sequences is a critical technology for both scientific research and industrial applications. Recent advances in prediction and optimization models have enabled the automatic scoring and optimization of 5' UTR sequences, key upstream elements of mRNA. However, fully automated design of 5' UTR sequences with more than two objective scores has not yet been explored. In this study, we present a computational pipeline that optimizes human 5' UTR sequences in a multi-objective framework, addressing up to four distinct and conflicting objectives. Our work represents an important advancement in the multi-objective computational design of mRNA sequences, paving the way for more sophisticated mRNA engineering.
Affiliation(s)
- Keisuke Yamada
- Department of Electrical Engineering and Bioscience, Graduate School of Advanced Science and Engineering, Waseda University, 3-4-1, Okubo Shinjuku-ku, Tokyo 169-8555, Japan
- Department of Bioengineering, University of Pennsylvania, 210 South 33rd Street, Philadelphia, PA 19104, United States
- Kanta Suga
- Department of Electrical Engineering and Bioscience, Graduate School of Advanced Science and Engineering, Waseda University, 3-4-1, Okubo Shinjuku-ku, Tokyo 169-8555, Japan
- Naoko Abe
- Department of Chemistry, Graduate School of Science, Nagoya University, Furo-cho, Chikusa-ku, Nagoya 464-8602, Aichi, Japan
- Koji Hashimoto
- Department of Chemistry, Graduate School of Science, Nagoya University, Furo-cho, Chikusa-ku, Nagoya 464-8602, Aichi, Japan
- Graduate School of Arts and Sciences, The University of Tokyo, 3-8-1 Komaba, Meguro-ku, Tokyo 153-8902, Japan
- Susumu Tsutsumi
- Department of Chemistry, Graduate School of Science, Nagoya University, Furo-cho, Chikusa-ku, Nagoya 464-8602, Aichi, Japan
- Masahito Inagaki
- Department of Chemistry, Graduate School of Science, Nagoya University, Furo-cho, Chikusa-ku, Nagoya 464-8602, Aichi, Japan
- Fumitaka Hashiya
- Research Center for Materials Science, Nagoya University, Furo-cho, Chikusa-ku, Nagoya 464-8602, Aichi, Japan
- Hiroshi Abe
- Department of Chemistry, Graduate School of Science, Nagoya University, Furo-cho, Chikusa-ku, Nagoya 464-8602, Aichi, Japan
- Institute for Glyco-core Research (iGCORE), Nagoya University, Furo-cho, Chikusa-ku, Nagoya 464-8601, Aichi, Japan
- Michiaki Hamada
- Department of Electrical Engineering and Bioscience, Graduate School of Advanced Science and Engineering, Waseda University, 3-4-1, Okubo Shinjuku-ku, Tokyo 169-8555, Japan
- Cellular and Molecular Biotechnology Research Institute (CMB), National Institute of Advanced Industrial Science and Technology (AIST), 2-4-7, Aomi, Koto-ku, Tokyo 135-0064, Japan
- Graduate School of Medicine, Nippon Medical School, 1-1-5, Sendagi, Bunkyo-ku, Tokyo 113-8602, Japan
3
Song D, Luo J, Duan X, Jin F, Lu YJ. Identification of G-quadruplex nucleic acid structures by high-throughput sequencing: A review. Int J Biol Macromol 2025; 297:139896. [PMID: 39818384 DOI: 10.1016/j.ijbiomac.2025.139896] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/31/2024] [Revised: 01/04/2025] [Accepted: 01/13/2025] [Indexed: 01/18/2025]
Abstract
G-quadruplexes (G4s) are non-canonical nucleic acid secondary structures formed by guanine-rich DNA or RNA sequences. These structures play pivotal roles in cellular processes, including DNA replication, transcription, RNA splicing, and protein translation. High-throughput sequencing has significantly advanced the study of G4s by enabling genome-wide mapping and detailed characterization. This review provides a comprehensive overview of current methods for G4 identification using high-throughput sequencing, focusing on key techniques such as G4-seq, G4-ChIP-seq, G4-CUT&Tag, LiveG4ID-seq, G4assess, HepG4-seq, rG4-seq, RT-stop profiling with DMS-m7G footprinting, G4RP-seq, Keth-seq, and SHALIPE-seq. We discuss the principles, advantages, limitations, and applications of these methods, highlighting their contribution to our understanding of G4 biology. The review also emphasizes the need for improved tools to explore the dynamic behavior of G4s, particularly in living organisms.
Affiliation(s)
- Delong Song
- School of Biomedical and Pharmaceutical Sciences, Guangdong University of Technology, Guangzhou 510006, China
- Junren Luo
- School of Biomedical and Pharmaceutical Sciences, Guangdong University of Technology, Guangzhou 510006, China
- Xuan Duan
- School of Biomedical and Pharmaceutical Sciences, Guangdong University of Technology, Guangzhou 510006, China
- Fujun Jin
- School of Biomedical and Pharmaceutical Sciences, Guangdong University of Technology, Guangzhou 510006, China; Smart Medical Innovation Technology Center, Guangdong University of Technology, Guangzhou 510006, China.
- Yu-Jing Lu
- School of Biomedical and Pharmaceutical Sciences, Guangdong University of Technology, Guangzhou 510006, China; Smart Medical Innovation Technology Center, Guangdong University of Technology, Guangzhou 510006, China.
4
Obermann T, Sakshaug T, Kanagaraj VV, Abentung A, Sousa MMLD, Hagen L, Sarno A, Bjørås M, Scheffler K. Genomic 8-oxoguanine modulates gene transcription independent of its repair by DNA glycosylases OGG1 and MUTYH. Redox Biol 2025; 79:103461. [PMID: 39662289 PMCID: PMC11697278 DOI: 10.1016/j.redox.2024.103461] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/25/2024] [Revised: 12/03/2024] [Accepted: 12/05/2024] [Indexed: 12/13/2024] Open
Abstract
8-oxo-7,8-dihydroguanine (OG) is one of the most abundant oxidative lesions in the genome and is associated with genome instability. Its mutagenic potential is counteracted by a concerted action of 8-oxoguanine DNA glycosylase (OGG1) and mutY homolog DNA glycosylase (MUTYH). It has been suggested that OG and its repair has epigenetic-like properties and mediates transcription, but genome-wide evidence of this interdependence is lacking. Here, we applied an improved OG-sequencing approach reducing artificial background oxidation and RNA-sequencing to correlate genome-wide distribution of OG with gene transcription in OGG1 and/or MUTYH-deficient cells. Our data identified moderate enrichment of OG in the genome that is mainly dependent on the genomic context and not affected by DNA glycosylase-initiated repair. Interestingly, no association was found between genomic OG deposition and gene expression changes upon loss of OGG1 and MUTYH. Regardless of DNA glycosylase activity, OG in promoter regions correlated with expression of genes related to metabolic processes and damage response pathways indicating that OG functions as a cellular stress sensor to regulate transcription. Our work provides novel insights into the mechanism underlying transcriptional regulation by OG and DNA glycosylases OGG1 and MUTYH and suggests that oxidative DNA damage accumulation and its repair utilize different pathways.
Affiliation(s)
- Tobias Obermann
- Department of Clinical and Molecular Medicine, Faculty of Medicine and Health Sciences, Norwegian University of Science and Technology, 7491 Trondheim, Norway
- Teri Sakshaug
- Department of Neuromedicine and Movement Science, Faculty of Medicine and Health Sciences, Norwegian University of Science and Technology, 7491, Trondheim, Norway
- Vishnu Vignesh Kanagaraj
- Department of Clinical and Molecular Medicine, Faculty of Medicine and Health Sciences, Norwegian University of Science and Technology, 7491 Trondheim, Norway
- Andreas Abentung
- Department of Clinical and Molecular Medicine, Faculty of Medicine and Health Sciences, Norwegian University of Science and Technology, 7491 Trondheim, Norway; Department of Neurology and Clinical Neurophysiology, University Hospital of Trondheim, 7006, Trondheim, Norway
- Mirta Mittelstedt Leal de Sousa
- Department of Clinical and Molecular Medicine, Faculty of Medicine and Health Sciences, Norwegian University of Science and Technology, 7491 Trondheim, Norway; Proteomics and Modomics Experimental Core (PROMEC), NTNU and the Central Norway Regional Health Authority, N-7491, Trondheim, Norway
- Lars Hagen
- Department of Clinical and Molecular Medicine, Faculty of Medicine and Health Sciences, Norwegian University of Science and Technology, 7491 Trondheim, Norway; Proteomics and Modomics Experimental Core (PROMEC), NTNU and the Central Norway Regional Health Authority, N-7491, Trondheim, Norway
- Antonio Sarno
- Department of Clinical and Molecular Medicine, Faculty of Medicine and Health Sciences, Norwegian University of Science and Technology, 7491 Trondheim, Norway
- Magnar Bjørås
- Department of Clinical and Molecular Medicine, Faculty of Medicine and Health Sciences, Norwegian University of Science and Technology, 7491 Trondheim, Norway; Centre for Embryology and Healthy Development, University of Oslo, Oslo, 0373, Norway; Department of Microbiology, Oslo University Hospital and University of Oslo, Oslo, 0424, Norway
- Katja Scheffler
- Department of Neuromedicine and Movement Science, Faculty of Medicine and Health Sciences, Norwegian University of Science and Technology, 7491, Trondheim, Norway; Department of Neurology and Clinical Neurophysiology, University Hospital of Trondheim, 7006, Trondheim, Norway.
5
Cherednichenko O, Poptsova M. Data augmentation with generative models improves detection of Non-B DNA structures. Comput Biol Med 2025; 184:109440. [PMID: 39550912 DOI: 10.1016/j.compbiomed.2024.109440] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/11/2024] [Revised: 11/10/2024] [Accepted: 11/12/2024] [Indexed: 11/19/2024]
Abstract
Non-B DNA structures, or flipons, are important functional elements that regulate a large spectrum of cellular programs. Experimental technologies for flipon detection are limited to the subsets that are active at the time of an experiment and cannot capture whole-genome functional set. Thus, the task of generating reliable whole-genome annotations of non-B DNA structures is put on deep learning models, however their quality depends on the available experimental data for training. The data augmentation approach as the combination of synthetic and real data is widely used in various fields. Deep generative models demonstrated promising results in data augmentation improving classifiers' performance. Here we aimed at testing performance of diffusion models in comparison to other generative models in generating synthetic non-B DNA structures for data augmentation approach. We tested denoising diffusion probabilistic and implicit models (DDPM and DDIM), Wasserstein generative adversarial network (WGAN), vector quantised variational autoencoder (VQ-VAE) and showed that data augmentation improves the quality of classifiers. Diffusion models overall show the best results, but when considering three criteria of generative trilemma - quality of generated samples, diversity and sampling speed, we conclude that trade-off is possible between generative diffusion model and other architectures such as WGAN and VQ-VAE.
Affiliation(s)
- Maria Poptsova
- International Laboratory of Bioinformatics, HSE University, Moscow, Russia.
6
Liew D, Lim ZW, Yong EH. Machine learning-based prediction of DNA G-quadruplex folding topology with G4ShapePredictor. Sci Rep 2024; 14:24238. [PMID: 39414858 PMCID: PMC11484705 DOI: 10.1038/s41598-024-74826-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/24/2024] [Accepted: 09/30/2024] [Indexed: 10/18/2024] Open
Abstract
Deoxyribonucleic acid (DNA) is able to form non-canonical four-stranded helical structures with diverse folding patterns known as G-quadruplexes (G4s). G4 topologies are classified based on their relative strand orientation following the 5' to 3' phosphate backbone polarity. Broadly, G4 topologies are either parallel (4+0), antiparallel (2+2), or hybrid (3+1). G4s play crucial roles in biological processes such as DNA repair, DNA replication, transcription and have thus emerged as biological targets in drug design. While computational models have been developed to predict G4 formation, there is currently no existing model capable of predicting G4 folding topology based on its nucleic acid sequence. Therefore, we introduce G4ShapePredictor (G4SP), an application featuring a collection of multi-classification machine learning models that are trained on a custom G4 dataset combining entries from existing literature and in-house circular dichroism experiments. G4ShapePredictor is designed to accurately predict G4 folding topologies in potassium (K+) buffer based on its primary sequence and is able to incorporate a threshold optimization strategy allowing users to maximise precision. Furthermore, we have identified three topological sequence motifs that suggest specific G4 folding topologies of (4+0), (2+2) or (3+1) when utilising the decision-making mechanisms of G4ShapePredictor.
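The abstract above does not spell out how the threshold strategy works. As a generic illustration only, one common way a probability threshold can trade coverage for precision in a three-class setting is to abstain whenever the top class probability is too low. The function name, the dictionary input, and the 0.7 default are hypothetical and are not taken from G4ShapePredictor.

```python
from typing import Dict, Optional


def predict_with_threshold(probs: Dict[str, float], threshold: float = 0.7) -> Optional[str]:
    """Return the most probable topology label, or None to abstain.

    probs maps a topology label such as "4+0", "2+2" or "3+1" to a
    class probability. The 0.7 default is an arbitrary illustrative value.
    """
    label, best = max(probs.items(), key=lambda item: item[1])
    # Abstaining on low-confidence calls keeps only the predictions the
    # model is most sure about, which tends to raise precision.
    return label if best >= threshold else None
```

Raising the threshold in such a scheme yields fewer but more confident calls; a suitable value would normally be chosen on held-out data.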
Affiliation(s)
- Donn Liew
- Division of Physics and Applied Physics, School of Physical and Mathematical Sciences, Nanyang Technological University, 637371, Singapore, Singapore
- Zi Way Lim
- Division of Physics and Applied Physics, School of Physical and Mathematical Sciences, Nanyang Technological University, 637371, Singapore, Singapore
- Ee Hou Yong
- Division of Physics and Applied Physics, School of Physical and Mathematical Sciences, Nanyang Technological University, 637371, Singapore, Singapore.
7
Cui Y, Liu H, Ming Y, Zhang Z, Liu L, Liu R. Prediction of strand-specific and cell-type-specific G-quadruplexes based on high-resolution CUT&Tag data. Brief Funct Genomics 2024; 23:265-275. [PMID: 37357985 DOI: 10.1093/bfgp/elad024] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/17/2023] [Revised: 05/20/2023] [Accepted: 06/01/2023] [Indexed: 06/27/2023] Open
Abstract
G-quadruplex (G4), a non-classical deoxyribonucleic acid structure, is widely distributed in the genome and involved in various biological processes. In vivo, high-throughput sequencing has indicated that G4s are significantly enriched at functional regions in a cell-type-specific manner. Therefore, the prediction of G4s based on computational methods is necessary instead of the time-consuming and laborious experimental methods. Recently, G4 CUT&Tag has been developed to generate higher-resolution sequencing data than ChIP-seq, which provides more accurate training samples for model construction. In this paper, we present a new dataset construction method based on G4 CUT&Tag sequencing data and an XGBoost prediction model based on the machine learning boost method. The results show that our model performs well within and across cell types. Furthermore, sequence analysis indicates that the formation of G4 structure is greatly affected by the flanking sequences, and the GC content of the G4 flanking sequences is higher than non-G4. Moreover, we also identified G4 motifs in the high-resolution dataset, among which we found several motifs for known transcription factors (TFs), such as SP2 and BPC. These TFs may directly or indirectly affect the formation of the G4 structure.
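As a rough illustration of the flanking-sequence comparison mentioned in this abstract, the following Python sketch computes the GC content on either side of a G4 interval. The function names, the 0-based half-open coordinates, and the 50-nt flank width are assumptions made for the example, not details taken from the paper.

```python
from typing import Tuple


def gc_content(seq: str) -> float:
    """Fraction of G/C bases in seq (0.0 for an empty string)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return sum(base in "GC" for base in seq) / len(seq)


def flank_gc(genome: str, start: int, end: int, flank: int = 50) -> Tuple[float, float]:
    """GC content of the upstream and downstream flanks of genome[start:end].

    Coordinates are 0-based and half-open; flanks are clipped at the
    sequence boundaries. The 50-nt default width is an arbitrary choice.
    """
    upstream = genome[max(0, start - flank):start]
    downstream = genome[end:end + flank]
    return gc_content(upstream), gc_content(downstream)
```

Applying `flank_gc` to G4 and non-G4 intervals and comparing the resulting distributions would be one simple way to check a flank-composition claim of this kind.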
Affiliation(s)
- Yizhi Cui
- School of Computer Science and Engineering, Beijing Technology and Business University, Beijing, 100048, China
- Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou, 324003, Zhejiang, China
- Hongzhi Liu
- School of Computer Science and Engineering, Beijing Technology and Business University, Beijing, 100048, China
- Yutong Ming
- School of Computer Science and Engineering, Beijing Technology and Business University, Beijing, 100048, China
- Zheng Zhang
- Department of Computer Science and Software Engineering, Auburn University, Auburn, 36830, Alabama, USA
- Li Liu
- Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou, 324003, Zhejiang, China
- Ruijun Liu
- School of Computer Science and Engineering, Beijing Technology and Business University, Beijing, 100048, China
8
Yang B, Guneri D, Yu H, Wright EP, Chen W, Waller ZE, Ding Y. Prediction of DNA i-motifs via machine learning. Nucleic Acids Res 2024; 52:2188-2197. [PMID: 38364855 PMCID: PMC10954440 DOI: 10.1093/nar/gkae092] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/06/2023] [Revised: 01/24/2024] [Accepted: 01/29/2024] [Indexed: 02/18/2024] Open
Abstract
i-Motifs (iMs) are secondary structures formed in cytosine-rich DNA sequences and are involved in multiple functions in the genome. Although putative iM forming sequences are widely distributed in the human genome, the folding status and strength of putative iMs vary dramatically. Much previous research on iM has focused on assessing the iM folding properties using biophysical experiments. However, there are no dedicated computational tools for predicting the folding status and strength of iM structures. Here, we introduce a machine learning pipeline, iM-Seeker, to predict both folding status and structural stability of DNA iMs. The programme iM-Seeker incorporates a Balanced Random Forest classifier trained on genome-wide iMab antibody-based CUT&Tag sequencing data to predict the folding status and an Extreme Gradient Boosting regressor to estimate the folding strength according to both literature biophysical data and our in-house biophysical experiments. iM-Seeker predicts DNA iM folding status with a classification accuracy of 81% and estimates the folding strength with coefficient of determination (R²) of 0.642 on the test set. Model interpretation confirms that the nucleotide composition of the C-rich sequence significantly affects iM stability, with a positive correlation with sequences containing cytosine and thymine and a negative correlation with guanine and adenine.
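For readers unfamiliar with the two metrics quoted in this abstract, the sketch below shows their standard definitions in plain Python. It is not the iM-Seeker evaluation code; the function names and list-based inputs are illustrative assumptions.

```python
from typing import Sequence


def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Fraction of predictions that match the true class labels."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def r_squared(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when all true values are equal")
    return 1 - ss_res / ss_tot
```

An R² of 1 means the predictions reproduce the measured values exactly, while 0 means they do no better than always predicting the mean.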
Affiliation(s)
- Bibo Yang
- Department of Cell and Developmental Biology, John Innes Centre, Norwich Research Park, Norwich NR4 7UH, UK
- Dilek Guneri
- School of Pharmacy, University College London, London WC1N 1AX, UK
- Haopeng Yu
- Department of Cell and Developmental Biology, John Innes Centre, Norwich Research Park, Norwich NR4 7UH, UK
- Elisé P Wright
- Molecular Physiology School of Medicine, and Molecular Medicine Research Group, University of Western Sydney, Campbelltown, NSW 1797, Australia
- Wenqian Chen
- School of Pharmacy, University College London, London WC1N 1AX, UK
- Zoë A E Waller
- School of Pharmacy, University College London, London WC1N 1AX, UK
- Yiliang Ding
- Department of Cell and Developmental Biology, John Innes Centre, Norwich Research Park, Norwich NR4 7UH, UK
9
Qian SH, Shi MW, Xiong YL, Zhang Y, Zhang ZH, Song XM, Deng XY, Chen ZX. EndoQuad: a comprehensive genome-wide experimentally validated endogenous G-quadruplex database. Nucleic Acids Res 2024; 52:D72-D80. [PMID: 37904589 PMCID: PMC10767823 DOI: 10.1093/nar/gkad966] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/14/2023] [Revised: 09/22/2023] [Accepted: 10/14/2023] [Indexed: 11/01/2023] Open
Abstract
G-quadruplexes (G4s) are non-canonical four-stranded structures and are emerging as novel genetic regulatory elements. However, a comprehensive genomic annotation of endogenous G4s (eG4s) and systematic characterization of their regulatory network are still lacking, posing major challenges for eG4 research. Here, we present EndoQuad (https://EndoQuad.chenzxlab.cn/) to address these pressing issues by integrating high-throughput experimental data. First, based on high-quality genome-wide eG4s mapping datasets (human: 1181; mouse: 24; chicken: 2) generated by G4 ChIP-seq/CUT&Tag, we generate a reference set of genome-wide eG4s. Our multi-omics analyses show that most eG4s are identified in one or a few cell types. The eG4s with higher occurrences across samples are more structurally stable, evolutionarily conserved, enriched in promoter regions, mark highly expressed genes and associate with complex regulatory programs, demonstrating higher confidence level for further experiments. Finally, we integrate millions of functional genomic variants and prioritize eG4s with regulatory functions in disease and cancer contexts. These efforts have culminated in the comprehensive and interactive database of experimentally validated DNA eG4s. As such, EndoQuad enables users to easily access, download and repurpose these data for their own research. EndoQuad will become a one-stop resource for eG4 research and lay the foundation for future functional studies.
Affiliation(s)
- Sheng Hu Qian
- Hubei Hongshan Laboratory, College of Life Science and Technology, College of Biomedicine and Health, Interdisciplinary Sciences Institute, Huazhong Agricultural University, Wuhan 430070, PR China
- Meng-Wei Shi
- Hubei Hongshan Laboratory, College of Life Science and Technology, College of Biomedicine and Health, Interdisciplinary Sciences Institute, Huazhong Agricultural University, Wuhan 430070, PR China
- Yu-Li Xiong
- Hubei Hongshan Laboratory, College of Life Science and Technology, College of Biomedicine and Health, Interdisciplinary Sciences Institute, Huazhong Agricultural University, Wuhan 430070, PR China
- Yuan Zhang
- Hubei Hongshan Laboratory, College of Life Science and Technology, College of Biomedicine and Health, Interdisciplinary Sciences Institute, Huazhong Agricultural University, Wuhan 430070, PR China
- Ze-Hao Zhang
- Hubei Hongshan Laboratory, College of Life Science and Technology, College of Biomedicine and Health, Interdisciplinary Sciences Institute, Huazhong Agricultural University, Wuhan 430070, PR China
- Xue-Mei Song
- Hubei Hongshan Laboratory, College of Life Science and Technology, College of Biomedicine and Health, Interdisciplinary Sciences Institute, Huazhong Agricultural University, Wuhan 430070, PR China
- Xin-Yin Deng
- Hubei Hongshan Laboratory, College of Life Science and Technology, College of Biomedicine and Health, Interdisciplinary Sciences Institute, Huazhong Agricultural University, Wuhan 430070, PR China
- Zhen-Xia Chen
- Hubei Hongshan Laboratory, College of Life Science and Technology, College of Biomedicine and Health, Interdisciplinary Sciences Institute, Huazhong Agricultural University, Wuhan 430070, PR China
- Shenzhen Institute of Nutrition and Health, Huazhong Agricultural University, Shenzhen 518000, China
- Shenzhen Branch, Guangdong Laboratory for Lingnan Modern Agriculture, Genome Analysis Laboratory of the Ministry of Agriculture, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Shenzhen 518000, China
10
Vannutelli A, Ouangraoua A, Perreault JP. Toward a Better Understanding of G4 Evolution in the 3 Living Kingdoms. Evol Bioinform Online 2023; 19:11769343231212075. [PMID: 38046653 PMCID: PMC10693206 DOI: 10.1177/11769343231212075] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/13/2023] [Accepted: 10/18/2023] [Indexed: 12/05/2023] Open
Abstract
Background: G-quadruplexes (G4s) are secondary structures in DNA and RNA that impact various cellular processes, such as transcription, splicing, and translation. Due to their numerous functions, G4s are involved in many diseases, making their study important. Yet, G4s evolution remains largely unknown, due to their low sequence similarity and the poor quality of their sequence alignments across several species. To address this, we designed a strategy that avoids direct G4s alignment to study G4s evolution in the 3 species kingdoms. We also explored the coevolution between RBPs and G4s. Methods: We retrieved one-to-one orthologous genes from the Ensembl Compara database and computed groups of one-to-one orthologous genes. For each group, we aligned gene sequences and identified G4 families as groups of overlapping G4s in the alignment. We analyzed these G4 families using Count, a tool to infer feature evolution into a gene or a species tree. Additionally, we utilized these G4 families to predict G4s by homology. To establish a control dataset, we performed mono-, di- and tri-nucleotide shuffling. Results: Only a few conserved G4s occur among all living kingdoms. In eukaryotes, G4s exhibit slight conservation among vertebrates, and few are conserved between plants. In archaea and bacteria, at most, only 2 G4s are common. The G4 homology-based prediction increases the number of conserved G4s in common ancestors. The coevolution between RNA-binding proteins and G4s was investigated and revealed a modest impact of RNA-binding proteins evolution on G4 evolution. However, the details of this relationship remain unclear. Conclusion: Even if G4 evolution still eludes us, the present study provides key information to compute groups of homologous G4 and to reveal the evolution history of G4 families.
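The shuffled control mentioned in the Methods can be illustrated in its simplest form. The sketch below performs only mono-nucleotide shuffling, which preserves single-base composition; the di- and tri-nucleotide variants need a more involved procedure and are not shown. The function names and the seed parameter are assumptions for the example, not the authors' code.

```python
import random
from collections import Counter
from typing import Optional


def mononucleotide_shuffle(seq: str, seed: Optional[int] = None) -> str:
    """Return a random permutation of seq with the same base composition."""
    rng = random.Random(seed)  # local generator, so a seed gives reproducible output
    bases = list(seq)
    rng.shuffle(bases)
    return "".join(bases)


def same_composition(a: str, b: str) -> bool:
    """True when both sequences contain each base the same number of times."""
    return Counter(a) == Counter(b)
```

A shuffled sequence keeps the guanine count of the original but breaks up the G-runs, so it serves as a composition-matched background when counting predicted G4s.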
Affiliation(s)
- Anaïs Vannutelli
- Département de biochimie et de génomique fonctionnelle, faculté de médecine et des sciences de la santé, pavillon de recherche appliquée sur le cancer, Université de Sherbrooke, Sherbrooke, QC, Canada
- Département d’informatique, faculté des sciences, Université de Sherbrooke, Sherbrooke, QC, Canada
- Aïda Ouangraoua
- Département d’informatique, faculté des sciences, Université de Sherbrooke, Sherbrooke, QC, Canada
- Jean-Pierre Perreault
- Département de biochimie et de génomique fonctionnelle, faculté de médecine et des sciences de la santé, pavillon de recherche appliquée sur le cancer, Université de Sherbrooke, Sherbrooke, QC, Canada
11
Matos-Rodrigues G, Hisey JA, Nussenzweig A, Mirkin SM. Detection of alternative DNA structures and its implications for human disease. Mol Cell 2023; 83:3622-3641. [PMID: 37863029 DOI: 10.1016/j.molcel.2023.08.018] [Citation(s) in RCA: 25] [Impact Index Per Article: 12.5] [Received: 06/19/2023] [Revised: 08/01/2023] [Accepted: 08/16/2023] [Indexed: 10/22/2023]
Abstract
Around 3% of the genome consists of simple DNA repeats that are prone to forming alternative (non-B) DNA structures, such as hairpins, cruciforms, triplexes (H-DNA), four-stranded guanine quadruplexes (G4-DNA), and others, as well as composite RNA:DNA structures (e.g., R-loops, G-loops, and H-loops). These DNA structures are dynamic and favored by the unwinding of duplex DNA. For many years, the association of alternative DNA structures with genome function was limited by the lack of methods to detect them in vivo. Here, we review the recent advancements in the field and present state-of-the-art technologies and methods to study alternative DNA structures. We discuss the limitations of these methods as well as how they are beginning to provide insights into causal relationships between alternative DNA structures, genome function and stability, and human disease.
Affiliation(s)
- Julia A Hisey
- Department of Biology, Tufts University, Medford, MA, USA
- André Nussenzweig
- Laboratory of Genome Integrity, National Cancer Institute, NIH, Bethesda, MD, USA.
12
Sato K, Knipscheer P. G-quadruplex resolution: From molecular mechanisms to physiological relevance. DNA Repair (Amst) 2023; 130:103552. [PMID: 37572578 DOI: 10.1016/j.dnarep.2023.103552] [Citation(s) in RCA: 23] [Impact Index Per Article: 11.5] [Received: 05/24/2023] [Revised: 07/29/2023] [Accepted: 08/01/2023] [Indexed: 08/14/2023]
Abstract
Guanine-rich DNA sequences can fold into stable four-stranded structures called G-quadruplexes or G4s. Research in the past decade demonstrated that G4 structures are widespread in the genome and prevalent in regulatory regions of actively transcribed genes. The formation of G4s has been tightly linked to important biological processes including regulation of gene expression and genome maintenance. However, they can also pose a serious threat to genome integrity especially by impeding DNA replication, and G4-associated somatic mutations have been found accumulated in the cancer genomes. Specialised DNA helicases and single stranded DNA binding proteins that can resolve G4 structures play a crucial role in preventing genome instability. The large variety of G4 unfolding proteins suggest the presence of multiple G4 resolution mechanisms in cells. Recently, there has been considerable progress in our detailed understanding of how G4s are resolved, especially during DNA replication. In this review, we first discuss the current knowledge of the genomic G4 landscapes and the impact of G4 structures on DNA replication and genome integrity. We then describe the recent progress on the mechanisms that resolve G4 structures and their physiological relevance. Finally, we discuss therapeutic opportunities to target G4 structures.
Affiliation(s)
- Koichi Sato
- Oncode Institute, Hubrecht Institute-KNAW & University Medical Center Utrecht, Utrecht, the Netherlands.
- Puck Knipscheer
- Oncode Institute, Hubrecht Institute-KNAW & University Medical Center Utrecht, Utrecht, the Netherlands; Department of Human Genetics, Leiden University Medical Center, Leiden, the Netherlands.
13
Korsakova A, Phan AT. Prediction of G4 formation in live cells with epigenetic data: a deep learning approach. NAR Genom Bioinform 2023; 5:lqad071. [PMID: 37636021 PMCID: PMC10448861 DOI: 10.1093/nargab/lqad071] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/16/2023] [Revised: 05/25/2023] [Accepted: 07/26/2023] [Indexed: 08/29/2023] Open
Abstract
G-quadruplexes (G4s) are secondary structures abundant in DNA that may play regulatory roles in cells. Despite the ubiquity of the putative G-quadruplex-forming sequences (PQS) in the human genome, only a small fraction forms G4 structures in cells. Folded G4, histone methylation and chromatin accessibility are all parts of the complex cis regulatory landscape. We propose an approach for prediction of G4 formation in cells that incorporates epigenetic and chromatin accessibility data. The novel approach termed epiG4NN efficiently predicts cell-specific G4 formation in live cells based on a local epigenomic snapshot. Our results confirm the close relationship between H3K4me3 histone methylation, chromatin accessibility and G4 structure formation. Trained on A549 cell data, epiG4NN was then able to predict G4 formation in HEK293T and K562 cell lines. We observe the dependency of model performance with different epigenetic features on the underlying experimental condition of G4 detection. We expect that this approach will contribute to the systematic understanding of correlations between structural and epigenomic feature landscape.
Affiliation(s)
- Anna Korsakova
- School of Physical and Mathematical Sciences, Nanyang Technological University, 637371, Singapore
- Anh Tuân Phan
- School of Physical and Mathematical Sciences, Nanyang Technological University, 637371, Singapore
- NTU Institute of Structural Biology, Nanyang Technological University, 636921, Singapore
14
Elimelech-Zohar K, Orenstein Y. An overview on nucleic-acid G-quadruplex prediction: from rule-based methods to deep neural networks. Brief Bioinform 2023:bbad252. [PMID: 37438149 DOI: 10.1093/bib/bbad252] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Received: 05/11/2023] [Revised: 05/11/2023] [Accepted: 06/18/2023] [Indexed: 07/14/2023] Open
Abstract
Nucleic-acid G-quadruplexes (G4s) play vital roles in many cellular processes. Due to their importance, researchers have developed experimental assays to measure nucleic-acid G4s in high throughput. The generated high-throughput datasets gave rise to unique opportunities to develop machine-learning-based methods, and in particular deep neural networks, to predict G4s in any given nucleic-acid sequence and any species. In this paper, we review the success stories of deep-neural-network applications for G4 prediction. We first cover the experimental technologies that generated the most comprehensive nucleic-acid G4 high-throughput datasets in recent years. We then review classic rule-based methods for G4 prediction. We proceed by reviewing the major machine-learning and deep-neural-network applications to nucleic-acid G4 datasets and report a novel comparison between them. Next, we present the interpretability techniques used on the trained neural networks to learn key molecular principles underlying nucleic-acid G4 folding. As a new result, we calculate the overlap between measured DNA and RNA G4s and compare the performance of DNA- and RNA-G4 predictors on RNA- and DNA-G4 datasets, respectively, to demonstrate the potential of transfer learning from DNA G4s to RNA G4s. Last, we conclude with open questions in the field of nucleic-acid G4 prediction and computational modeling.
Affiliation(s)
- Yaron Orenstein
- Department of Computer Science, Bar-Ilan University, Ramat Gan, 5290002, Israel
- The Mina & Everard Goodman Faculty of Life Sciences, Bar-Ilan University, Ramat Gan, 5290002, Israel
15
Hosseini M, Palmer A, Manka W, Grady PGS, Patchigolla V, Bi J, O'Neill RJ, Chi Z, Aguiar D. Deep statistical modelling of nanopore sequencing translocation times reveals latent non-B DNA structures. Bioinformatics 2023; 39:i242-i251. [PMID: 37387144 DOI: 10.1093/bioinformatics/btad220] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Indexed: 07/01/2023] Open
Abstract
MOTIVATION: Non-canonical (or non-B) DNA are genomic regions whose three-dimensional conformation deviates from the canonical double helix. Non-B DNA play an important role in basic cellular processes and are associated with genomic instability, gene regulation, and oncogenesis. Experimental methods are low-throughput and can detect only a limited set of non-B DNA structures, while computational methods rely on non-B DNA base motifs, which are necessary but not sufficient indicators of non-B structures. Oxford Nanopore sequencing is an efficient and low-cost platform, but it is currently unknown whether nanopore reads can be used for identifying non-B structures. RESULTS: We build the first computational pipeline to predict non-B DNA structures from nanopore sequencing. We formalize non-B detection as a novelty detection problem and develop the GoFAE-DND, an autoencoder that uses goodness-of-fit (GoF) tests as a regularizer. A discriminative loss encourages non-B DNA to be poorly reconstructed and optimizing Gaussian GoF tests allows for the computation of P-values that indicate non-B structures. Based on whole genome nanopore sequencing of NA12878, we show that there exist significant differences between the timing of DNA translocation for non-B DNA bases compared with B-DNA. We demonstrate the efficacy of our approach through comparisons with novelty detection methods using experimental data and data synthesized from a new translocation time simulator. Experimental validations suggest that reliable detection of non-B DNA from nanopore sequencing is achievable. AVAILABILITY AND IMPLEMENTATION: Source code is available at https://github.com/bayesomicslab/ONT-nonb-GoFAE-DND.
Affiliation(s)
- Marjan Hosseini
- Department of Computer Science and Engineering, University of Connecticut, Storrs, CT 06269-4155, United States
- Aaron Palmer
- Department of Computer Science and Engineering, University of Connecticut, Storrs, CT 06269-4155, United States
- William Manka
- Department of Computer Science and Engineering, University of Connecticut, Storrs, CT 06269-4155, United States
- Patrick G S Grady
- Institute for Systems Genomics and Department of Molecular and Cell Biology, University of Connecticut, Storrs, CT 06269-3003, United States
- Venkata Patchigolla
- Department of Computer Science and Engineering, University of Connecticut, Storrs, CT 06269-4155, United States
- Jinbo Bi
- Department of Computer Science and Engineering, University of Connecticut, Storrs, CT 06269-4155, United States
- Rachel J O'Neill
- Institute for Systems Genomics and Department of Molecular and Cell Biology, University of Connecticut, Storrs, CT 06269-3003, United States
- Zhiyi Chi
- Department of Statistics, University of Connecticut, Storrs, CT 06269-4120, United States
- Derek Aguiar
- Department of Computer Science and Engineering, University of Connecticut, Storrs, CT 06269-4155, United States
16
Abstract
Repetitive elements in the human genome, once considered 'junk DNA', are now known to adopt more than a dozen alternative (that is, non-B) DNA structures, such as self-annealed hairpins, left-handed Z-DNA, three-stranded triplexes (H-DNA) or four-stranded guanine quadruplex structures (G4 DNA). These dynamic conformations can act as functional genomic elements involved in DNA replication and transcription, chromatin organization and genome stability. In addition, recent studies have revealed a role for these alternative structures in triggering error-generating DNA repair processes, thereby actively enabling genome plasticity. As a driving force for genetic variation, non-B DNA structures thus contribute to both disease aetiology and evolution.
Affiliation(s)
- Guliang Wang
- Division of Pharmacology and Toxicology, College of Pharmacy, The University of Texas at Austin, Dell Paediatric Research Institute, Austin, TX, USA
- Karen M Vasquez
- Division of Pharmacology and Toxicology, College of Pharmacy, The University of Texas at Austin, Dell Paediatric Research Institute, Austin, TX, USA.
17
G4Beacon: An In Vivo G4 Prediction Method Using Chromatin and Sequence Information. Biomolecules 2023; 13:biom13020292. [PMID: 36830661 PMCID: PMC9953394 DOI: 10.3390/biom13020292] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/12/2022] [Revised: 01/28/2023] [Accepted: 02/01/2023] [Indexed: 02/09/2023] Open
Abstract
G-quadruplex (G4) structures are critical epigenetic regulatory elements, which usually form in guanine-rich regions in DNA. However, predicting the formation of G4 structures within living cells remains a challenge. Here, we present an ultra-robust machine learning method, G4Beacon, which utilizes the Gradient-Boosting Decision Tree (GBDT) algorithm, coupled with the ATAC-seq data and the surrounding sequences of in vitro G4s, to accurately predict the formation ability of these in vitro G4s in different cell types. As a result, our model achieved excellent performance even when the test set was extremely skewed. Besides this, G4Beacon can also identify the in vivo G4s of other cell lines precisely with the model built on a special cell line, regardless of the experimental techniques or platforms. Altogether, G4Beacon is an accurate, reliable, and easy-to-use method for the prediction of in vivo G4s of various cell lines.
18
Shi X, Teng H, Sun Z. An updated overview of experimental and computational approaches to identify non-canonical DNA/RNA structures with emphasis on G-quadruplexes and R-loops. Brief Bioinform 2022; 23:bbac441. [PMID: 36208174 PMCID: PMC9677470 DOI: 10.1093/bib/bbac441] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/21/2022] [Revised: 08/22/2022] [Accepted: 09/13/2022] [Indexed: 12/14/2022] Open
Abstract
Multiple types of non-canonical nucleic acid structures play essential roles in DNA recombination and replication, transcription, and genomic instability and have been associated with several human diseases. Thus, an increasing number of experimental and bioinformatics methods have been developed to identify these structures. To date, most reviews have focused on the features of non-canonical DNA/RNA structure formation, experimental approaches to mapping these structures, and the association of these structures with diseases. In addition, two reviews of computational algorithms for the prediction of non-canonical nucleic acid structures have been published. One of these reviews focused only on computational approaches for G4 detection until 2020. The other mainly summarized the computational tools for predicting cruciform, H-DNA and Z-DNA, in which the algorithms discussed were published before 2012. Since then, several experimental and computational methods have been developed. However, a systematic review including the conformation, sequencing mapping methods and computational prediction strategies for these structures has not yet been published. The purpose of this review is to provide an updated overview of conformation, current sequencing technologies and computational identification methods for non-canonical nucleic acid structures, as well as their strengths and weaknesses. We expect that this review will aid in understanding how these structures are characterised and how they contribute to related biological processes and diseases.
Affiliation(s)
- Xiaohui Shi
- Key Laboratory of Clinical Laboratory Diagnosis and Translational Research of Zhejiang Province, The first Affiliated Hospital of WMU; Beijing Institutes of Life Science, Chinese Academy of Sciences; University of Chinese Academy of Sciences, Ouhai District, Wenzhou 325000, China
- Huajing Teng
- Department of Radiation Oncology, Key Laboratory of Carcinogenesis and Translational Research (Ministry of Education) at Peking University Cancer Hospital and Institute, Ouhai District, Wenzhou 325000, China
- Zhongsheng Sun
- Key Laboratory of Clinical Laboratory Diagnosis and Translational Research of Zhejiang Province, The first Affiliated Hospital of WMU; Beijing Institutes of Life Science, Chinese Academy of Sciences; CAS Center for Excellence in Biotic Interactions and State Key Laboratory of Integrated Management of Pest Insects and Rodents, University of Chinese Academy of Sciences; Institute of Genomic Medicine, Wenzhou Medical University; IBMC-BGI Center, the Cancer Hospital of the University of Chinese Academy of Sciences (Zhejiang Cancer Hospital); Institute of Basic Medicine and Cancer (IBMC), Chinese Academy of Sciences, Ouhai District, Wenzhou 325000, China
19
Fang S, Liu S, Yang D, Yang L, Hu CD, Wan J. Decoding regulatory associations of G-quadruplex with epigenetic and transcriptomic functional components. Front Genet 2022; 13:957023. [PMID: 36092921 PMCID: PMC9452811 DOI: 10.3389/fgene.2022.957023] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/30/2022] [Accepted: 07/29/2022] [Indexed: 02/02/2023] Open
Abstract
G-quadruplex (G4) has been previously observed to be associated with gene expression. In this study, we performed integrative analysis on G4 multi-omics data from in silico prediction and ChIP-seq in the human genome. Potential G4 sites were classified into three distinct groups: one group of high-confidence G4-forming locations (G4-II) and groups only containing either ChIP-seq detected G4s (G4-I) or predicted G4 motif candidates (G4-III). We explored the associations of different-confidence G4 groups with other epigenetic regulatory elements, including CpG islands, chromatin status, enhancers, super-enhancers, G4 locations compared to the genes, and DNA methylation. Our elastic net regression model revealed that G4 structures could correlate with gene expression in two opposite ways depending on their locations relative to the genes as well as the G4-forming DNA strand. Some transcription factors were identified to be over-represented with G4 emergence. The motif analysis discovered distinct consensus sequences enriched in the G4 feet, the flanking regions of two groups of G4s. We found high GC content in the feet of high-confidence G4s (G4-II) when compared to high TA content in solely predicted G4 feet of G4-III. Overall, we uncovered the comprehensive associations of G4 formations or predictions with other epigenetic and transcriptional elements which potentially coordinate gene transcription.
Affiliation(s)
- Shuyi Fang
- Department of BioHealth Informatics, Indiana University School of Informatics and Computing, Indiana University—Purdue University Indianapolis, Indianapolis, IN, United States
- Sheng Liu
- Department of Medical and Molecular Genetics, Indiana University School of Medicine, Indianapolis, IN, United States
- The Collaborative Core for Cancer Bioinformatics (CB) shared by Indiana University Simon Comprehensive Cancer Center and Purdue University Center for Cancer Research, Indianapolis, IN, United States
- Danzhou Yang
- Department of Medicinal Chemistry and Molecular Pharmacology, Purdue University, West Lafayette, IN, United States
- Purdue University Center for Cancer Research, Purdue University, West Lafayette, IN, United States
- Lei Yang
- Department of Pediatrics, Indiana University School of Medicine, Indianapolis, IN, United States
- Herman B Wells Center for Pediatric Research, Indiana University School of Medicine, Indianapolis, IN, United States
- Center for Computational Biology and Bioinformatics, Indiana University School of Medicine, Indianapolis, IN, United States
- Chang-Deng Hu
- Department of Medicinal Chemistry and Molecular Pharmacology, Purdue University, West Lafayette, IN, United States
- Purdue University Center for Cancer Research, Purdue University, West Lafayette, IN, United States
- Jun Wan
- Department of BioHealth Informatics, Indiana University School of Informatics and Computing, Indiana University—Purdue University Indianapolis, Indianapolis, IN, United States
- Department of Medical and Molecular Genetics, Indiana University School of Medicine, Indianapolis, IN, United States
- The Collaborative Core for Cancer Bioinformatics (CB) shared by Indiana University Simon Comprehensive Cancer Center and Purdue University Center for Cancer Research, Indianapolis, IN, United States
- Center for Computational Biology and Bioinformatics, Indiana University School of Medicine, Indianapolis, IN, United States
20
Cagirici HB, Budak H, Sen TZ. G4Boost: a machine learning-based tool for quadruplex identification and stability prediction. BMC Bioinformatics 2022; 23:240. [PMID: 35717172 PMCID: PMC9206279 DOI: 10.1186/s12859-022-04782-z] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Received: 03/14/2022] [Accepted: 06/09/2022] [Indexed: 11/10/2022] Open
Abstract
Background: G-quadruplexes (G4s), formed within guanine-rich nucleic acids, are secondary structures involved in important biological processes. Although every G4 motif has the potential to form a stable G4 structure, not every G4 motif would, and accurate energy-based methods are needed to assess their structural stability. Here, we present a decision tree-based prediction tool, G4Boost, to identify G4 motifs and predict their secondary structure folding probability and thermodynamic stability based on their sequences, nucleotide compositions, and estimated structural topologies.
Results: G4Boost predicted the quadruplex folding state with an accuracy greater than 93% and an F1-score of 0.96, and the folding energy with an RMSE of 4.28 and R2 of 0.95, using only sequence-intrinsic features. G4Boost was successfully applied and validated to predict the stability of experimentally determined G4 structures, including for plants and humans. Conclusion: G4Boost outperformed the three machine-learning-based prediction tools, DeepG4, Quadron, and G4RNA Screener, in terms of both accuracy and F1-score, and can be highly useful for G4 prediction to understand gene regulation across species including plants and humans. Supplementary Information: The online version contains supplementary material available at 10.1186/s12859-022-04782-z.
Affiliation(s)
- H Busra Cagirici
- US Department of Agriculture - Agricultural Research Service, Crop Improvement Genetics Research Unit, Western Regional Research Center, 800 Buchanan St, Albany, CA, 94710, USA
- Taner Z Sen
- US Department of Agriculture - Agricultural Research Service, Crop Improvement Genetics Research Unit, Western Regional Research Center, 800 Buchanan St, Albany, CA, 94710, USA.
21
Yu H, Qi Y, Ding Y. Deep Learning in RNA Structure Studies. Front Mol Biosci 2022; 9:869601. [PMID: 35677883 PMCID: PMC9168262 DOI: 10.3389/fmolb.2022.869601] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 02/04/2022] [Accepted: 05/04/2022] [Indexed: 01/27/2023] Open
Abstract
Deep learning, or artificial neural networks, is a type of machine learning algorithm that can decipher underlying relationships from large volumes of data and has been successfully applied to solve structural biology questions, such as RNA structure. RNA can fold into complex RNA structures by forming hydrogen bonds, thereby playing an essential role in biological processes. While experimental effort has enabled resolving RNA structure at the genome-wide scale, deep learning has been more recently introduced for studying RNA structure and its functionality. Here, we discuss successful applications of deep learning to solve RNA problems, including predictions of RNA structures, non-canonical G-quadruplex, RNA-protein interactions and RNA switches. Following these cases, we give a general guide to deep learning for solving RNA structure problems.
Affiliation(s)
- Haopeng Yu
- Department of Cell and Developmental Biology, John Innes Centre, Norwich Research Park, Norwich, United Kingdom
- Yiliang Ding
- Department of Cell and Developmental Biology, John Innes Centre, Norwich Research Park, Norwich, United Kingdom
22
Rossi F, Paiardini A. A Machine Learning Perspective on DNA and RNA G-quadruplexes. Curr Bioinform 2022. [DOI: 10.2174/1574893617666220224105702] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/22/2022]
Abstract
G-quadruplexes (G4s) are particular structures found in guanine-rich DNA and RNA sequences that exhibit a wide diversity of three-dimensional conformations and exert key functions in the control of gene expression. G4s are able to interact with numerous small molecules and endogenous proteins, and their dysregulation can lead to a variety of disorders and diseases. Characterization and prediction of G4-forming sequences could elucidate their mechanism of action and could thus represent an important step in the discovery of potential therapeutic drugs. In this perspective, we propose an overview of G4s, discussing the state of the art of methodologies and tools developed to characterize and predict the presence of these structures in genomic sequences. In particular, we report on machine learning (ML) approaches and artificial neural networks (ANNs) that could open new avenues for the accurate analysis of quadruplexes, given their potential to derive informative features by learning from large, high-density datasets.
Affiliation(s)
- Fabiana Rossi
- Department of Biochemical Sciences 'A. Rossi Fanelli', University of Rome La Sapienza, Rome, Italy
- Alessandro Paiardini
- Department of Biochemical Sciences 'A. Rossi Fanelli', University of Rome La Sapienza, Rome, Italy