1
Yao L, Xie P, Dong D, Guo Y, Guan J, Zhang W, Chung CR, Zhao Z, Chiang YC, Lee TY. Caps-ac4C: An effective computational framework for identifying N4-acetylcytidine sites in human mRNA based on deep learning. J Mol Biol 2025; 437:168961. [PMID: 39884569 DOI: 10.1016/j.jmb.2025.168961] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/14/2024] [Revised: 01/20/2025] [Accepted: 01/21/2025] [Indexed: 02/01/2025]
Abstract
N4-acetylcytidine (ac4C) is a crucial post-transcriptional modification in human mRNA, involving the acetylation of the nitrogen atom at the fourth position of cytidine. This modification, catalyzed by N-acetyltransferases such as NAT10, is primarily found in mRNA coding regions and enhances translation efficiency and mRNA stability. ac4C is closely associated with various diseases, including cancer. Therefore, accurately identifying ac4C in human mRNA is essential for gaining deeper insights into disease pathogenesis and provides potential pathways for the development of novel medical interventions. In silico methods for identifying ac4C are gaining increasing attention due to their cost-effectiveness, requiring minimal human and material resources. In this study, we propose an efficient and accurate computational framework, Caps-ac4C, for the precise detection of ac4C in human mRNA. Caps-ac4C utilizes chaos game representation to encode RNA sequences into "images" and employs capsule networks to learn global and local features from these RNA "images". Experimental results demonstrate that Caps-ac4C achieves state-of-the-art performance, with 95.47% accuracy and 0.912 MCC on the test set, surpassing the current best methods by 10.69% accuracy and 0.216 MCC. In summary, Caps-ac4C represents the most accurate tool for predicting ac4C sites in human mRNA, highlighting its significant contribution to RNA modification research. For user convenience, we developed a user-friendly web server, which can be accessed for free at https://awi.cuhk.edu.cn/~Caps-ac4C/index.php.
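The chaos game representation mentioned in the abstract can be illustrated with a minimal sketch. The corner assignment, the starting point, the image resolution, and the function names `cgr_points` and `cgr_image` are assumptions chosen for illustration; the abstract does not state which conventions Caps-ac4C uses.

```python
# Illustrative chaos game representation (CGR) for an RNA sequence.
# Corner layout is one common convention, assumed here; it is not
# taken from the Caps-ac4C paper.
CORNERS = {"A": (0.0, 0.0), "C": (0.0, 1.0), "G": (1.0, 1.0), "U": (1.0, 0.0)}


def cgr_points(seq):
    """Return the CGR trajectory: each base moves the point halfway to its corner."""
    x, y = 0.5, 0.5  # start at the centre of the unit square
    points = []
    for base in seq.upper().replace("T", "U"):
        if base not in CORNERS:
            continue  # skip ambiguous bases such as N
        cx, cy = CORNERS[base]
        x, y = (x + cx) / 2.0, (y + cy) / 2.0
        points.append((x, y))
    return points


def cgr_image(seq, resolution=8):
    """Bin the trajectory into a resolution x resolution matrix of counts."""
    image = [[0] * resolution for _ in range(resolution)]
    for x, y in cgr_points(seq):
        col = min(int(x * resolution), resolution - 1)
        row = min(int(y * resolution), resolution - 1)
        image[row][col] += 1
    return image
```

Each cell of the resulting matrix counts how often the trajectory visited that region, so the matrix can be fed to an image model in the same way as a single-channel picture.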
Affiliation(s)
- Lantian Yao
  - Kobilka Institute of Innovative Drug Discovery, School of Medicine, The Chinese University of Hong Kong, Shenzhen, 2001 Longxiang Road, 518172 Shenzhen, China; School of Science and Engineering, The Chinese University of Hong Kong, Shenzhen, 2001 Longxiang Road, 518172 Shenzhen, China
- Peilin Xie
  - Kobilka Institute of Innovative Drug Discovery, School of Medicine, The Chinese University of Hong Kong, Shenzhen, 2001 Longxiang Road, 518172 Shenzhen, China; School of Science and Engineering, The Chinese University of Hong Kong, Shenzhen, 2001 Longxiang Road, 518172 Shenzhen, China
- Danhong Dong
  - School of Medicine, The Chinese University of Hong Kong, Shenzhen, 2001 Longxiang Road, 518172 Shenzhen, China
- Yilin Guo
  - School of Medicine, The Chinese University of Hong Kong, Shenzhen, 2001 Longxiang Road, 518172 Shenzhen, China
- Jiahui Guan
  - Kobilka Institute of Innovative Drug Discovery, School of Medicine, The Chinese University of Hong Kong, Shenzhen, 2001 Longxiang Road, 518172 Shenzhen, China; School of Medicine, The Chinese University of Hong Kong, Shenzhen, 2001 Longxiang Road, 518172 Shenzhen, China
- Wenyang Zhang
  - School of Medicine, The Chinese University of Hong Kong, Shenzhen, 2001 Longxiang Road, 518172 Shenzhen, China
- Chia-Ru Chung
  - Department of Computer Science and Information Engineering, National Central University, Taoyuan, Taiwan
- Zhihao Zhao
  - Kobilka Institute of Innovative Drug Discovery, School of Medicine, The Chinese University of Hong Kong, Shenzhen, 2001 Longxiang Road, 518172 Shenzhen, China
- Ying-Chih Chiang
  - Kobilka Institute of Innovative Drug Discovery, School of Medicine, The Chinese University of Hong Kong, Shenzhen, 2001 Longxiang Road, 518172 Shenzhen, China; School of Science and Engineering, The Chinese University of Hong Kong, Shenzhen, 2001 Longxiang Road, 518172 Shenzhen, China
- Tzong-Yi Lee
  - Institute of Bioinformatics and Systems Biology, National Yang Ming Chiao Tung University, Hsinchu, Taiwan
2
Deng L, Gòdia M, Derks MFL, Harlizius B, Farhangi S, Tang Z, Groenen MAM, Madsen O. Comprehensive expression genome-wide association study of long non-coding RNAs in four porcine tissues. Genomics 2025; 117:111026. [PMID: 40049421 DOI: 10.1016/j.ygeno.2025.111026] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/29/2024] [Revised: 02/27/2025] [Accepted: 03/03/2025] [Indexed: 03/10/2025]
Abstract
BACKGROUND Long non-coding RNAs (lncRNAs), a type of non-coding RNA molecules, are known to play critical regulatory roles in various biological processes. However, the functions of the majority of lncRNAs remain largely unknown, and little is understood about the regulation of lncRNA expression. In this study, high-throughput DNA genotyping and RNA sequencing were applied to investigate genomic regions associated with lncRNA expression, commonly referred to as lncRNA expression quantitative trait loci (eQTLs). We analyzed the liver, lung, spleen, and muscle transcriptomes of 100 three-way crossbred sows to identify lncRNA transcripts, explore genomic regions that might influence lncRNA expression, and identify potential regulators interacting with these regions. RESULT We identified 6380 lncRNA transcripts and 3733 lncRNA genes. Correlation tests between the expression of lncRNAs and protein-coding genes were performed. Subsequently, functional enrichment analyses were carried out on protein-coding genes highly correlated with lncRNAs. Our correlation results of these protein-coding genes uncovered terms that are related to tissue specific functions. Additionally, heatmaps of lncRNAs and protein-coding genes at different correlation levels revealed several distinct clusters. An expression genome-wide association study (eGWAS) was conducted using 535,896 genotypes and 1829, 1944, 2089, and 2074 expressed lncRNA genes for liver, spleen, lung, and muscle, respectively. This analysis identified 520,562 significant associations and 6654, 4525, 4842, and 7125 eQTLs for the respective tissues. Only a small portion of these eQTLs were classified as cis-eQTLs. Fifteen regions with the highest eQTL density were selected as eGWAS hotspots and potential mechanisms of lncRNA regulation in these hotspots were explored. 
However, we did not identify any interactions between the transcription factors or miRNAs in the hotspots and the lncRNAs, nor did we observe a significant enrichment of regulatory elements in these hotspots. While we could not pinpoint the key factors regulating lncRNA expression, our results suggest that the regulation of lncRNAs involves more complex mechanisms. CONCLUSION Our findings provide insights into several features and potential functions of lncRNAs in various tissues. However, the mechanisms by which lncRNA eQTLs regulate lncRNA expression remain unclear. Further research is needed to explore the regulation of lncRNA expression and the mechanisms underlying lncRNA interactions with small molecules and regulatory proteins.
Affiliation(s)
- Liyan Deng
  - Animal Breeding and Genomics, Wageningen University & Research, Wageningen, the Netherlands; Kunpeng Institute of Modern Agriculture at Foshan, Agricultural Genomics Institute, Chinese Academy of Agricultural Sciences, Foshan 528226, China; Shenzhen Branch, Guangdong Laboratory of Lingnan Modern Agriculture, Key Laboratory of Livestock and Poultry Multi-omics of MARA, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Shenzhen, China
- Marta Gòdia
  - Animal Breeding and Genomics, Wageningen University & Research, Wageningen, the Netherlands
- Martijn F L Derks
  - Animal Breeding and Genomics, Wageningen University & Research, Wageningen, the Netherlands; Topigs Norsvin Research Center, 's-Hertogenbosch, the Netherlands
- Samin Farhangi
  - Animal Breeding and Genomics, Wageningen University & Research, Wageningen, the Netherlands
- Zhonglin Tang
  - Kunpeng Institute of Modern Agriculture at Foshan, Agricultural Genomics Institute, Chinese Academy of Agricultural Sciences, Foshan 528226, China; Shenzhen Branch, Guangdong Laboratory of Lingnan Modern Agriculture, Key Laboratory of Livestock and Poultry Multi-omics of MARA, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Shenzhen, China
- Martien A M Groenen
  - Animal Breeding and Genomics, Wageningen University & Research, Wageningen, the Netherlands
- Ole Madsen
  - Animal Breeding and Genomics, Wageningen University & Research, Wageningen, the Netherlands
3
Pike AMC, Amal S, Maginnis MS, Wilczek MP. Evaluating Neural Network Performance in Predicting Disease Status and Tissue Source of JC Polyomavirus from Patient Isolates Based on the Hypervariable Region of the Viral Genome. Viruses 2024; 17:12. [PMID: 39861801 PMCID: PMC11769028 DOI: 10.3390/v17010012] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/18/2024] [Revised: 12/23/2024] [Accepted: 12/24/2024] [Indexed: 01/27/2025]
Abstract
JC polyomavirus (JCPyV) establishes a persistent, asymptomatic kidney infection in most of the population. However, JCPyV can reactivate in immunocompromised individuals and cause progressive multifocal leukoencephalopathy (PML), a fatal demyelinating disease with no approved treatment. Mutations in the hypervariable non-coding control region (NCCR) of the JCPyV genome have been linked to disease outcomes and neuropathogenesis, yet few meta-analyses document these associations. Many online sequence entries, including those in NCBI databases, lack sufficient sample information, limiting large-scale analyses of NCCR sequences. Machine learning techniques, however, can augment available data for analysis. This study employs a previously compiled dataset of 989 JCPyV NCCR sequences from GenBank with associated patient PML status and viral tissue source to train multilayer perceptrons for predicting missing information within the dataset. The PML status and tissue source models were 100% and 87.8% accurate, respectively. Within the dataset, 348 samples had an unconfirmed PML status, of which 259 were predicted as No PML and 89 as PML sequences. Of the 63 sequences with unconfirmed tissue sources, eight samples were predicted as urine, 13 as blood, and 42 as cerebrospinal fluid. These models can improve viral sequence identification and provide insights into viral mutations and pathogenesis.
Affiliation(s)
- Aiden M. C. Pike
  - Maine Space Grant Consortium, Augusta, ME 04330, USA
  - Life Sciences, Health, and Engineering Department, The Roux Institute, Northeastern University, Portland, ME 04101, USA
  - Department of Molecular and Biomedical Sciences, University of Maine, Orono, ME 04469, USA
- Saeed Amal
  - The Roux Institute, Northeastern University, Portland, ME 04101, USA
  - Department of Bioengineering, College of Engineering, Northeastern University, Boston, MA 02115, USA
- Melissa S. Maginnis
  - Department of Molecular and Biomedical Sciences, University of Maine, Orono, ME 04469, USA
  - Graduate School in Biomedical Science and Engineering, University of Maine, Orono, ME 04469, USA
- Michael P. Wilczek
  - Life Sciences, Health, and Engineering Department, The Roux Institute, Northeastern University, Portland, ME 04101, USA
  - Observational Health Data Sciences and Informatics Center, The Roux Institute, Northeastern University, Portland, ME 04101, USA
  - Department of Chemistry and Chemical Biology, College of Science, Northeastern University, Boston, MA 02115, USA
4
Santucci K, Cheng Y, Xu SM, Janitz M. Enhancing novel isoform discovery: leveraging nanopore long-read sequencing and machine learning approaches. Brief Funct Genomics 2024; 23:683-694. [PMID: 39158328 DOI: 10.1093/bfgp/elae031] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/22/2024] [Revised: 07/29/2024] [Accepted: 07/31/2024] [Indexed: 08/20/2024]
Abstract
Long-read sequencing technologies can capture entire RNA transcripts in a single sequencing read, reducing the ambiguity in constructing and quantifying transcript models in comparison to more common and earlier methods, such as short-read sequencing. Recent improvements in the accuracy of long-read sequencing technologies have expanded the scope for novel splice isoform detection and have also enabled a far more accurate reconstruction of complex splicing patterns and transcriptomes. Additionally, the incorporation and advancements of machine learning and deep learning algorithms in bioinformatic software have significantly improved the reliability of long-read sequencing transcriptomic studies. However, there is a lack of consensus on what bioinformatic tools and pipelines produce the most precise and consistent results. Thus, this review aims to discuss and compare the performance of available methods for novel isoform discovery with long-read sequencing technologies, with 25 tools being presented. Furthermore, this review intends to demonstrate the need for developing standard analytical pipelines, tools, and transcript model conventions for novel isoform discovery and transcriptomic studies.
Affiliation(s)
- Kristina Santucci
  - School of Biotechnology and Biomolecular Sciences, University of New South Wales, Sydney, NSW 2052, Australia
- Yuning Cheng
  - School of Biotechnology and Biomolecular Sciences, University of New South Wales, Sydney, NSW 2052, Australia
- Si-Mei Xu
  - School of Biotechnology and Biomolecular Sciences, University of New South Wales, Sydney, NSW 2052, Australia
- Michael Janitz
  - School of Biotechnology and Biomolecular Sciences, University of New South Wales, Sydney, NSW 2052, Australia
5
Li A, Zhou H, Xiong S, Li J, Mallik S, Fei R, Liu Y, Zhou H, Wang X, Hei X, Wang L. PLEKv2: predicting lncRNAs and mRNAs based on intrinsic sequence features and the coding-net model. BMC Genomics 2024; 25:756. [PMID: 39095710 PMCID: PMC11295476 DOI: 10.1186/s12864-024-10662-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/02/2024] [Accepted: 07/25/2024] [Indexed: 08/04/2024]
Abstract
BACKGROUND Long non-coding RNAs (lncRNAs) are RNA transcripts of more than 200 nucleotides that do not encode canonical proteins. Their biological structure is similar to messenger RNAs (mRNAs). To distinguish between lncRNA and mRNA transcripts quickly and accurately, we upgraded the PLEK alignment-free tool to its next version, PLEKv2, and constructed models tailored for both animals and plants. RESULTS PLEKv2 can achieve 98.7% prediction accuracy for human datasets. Compared with classical tools and deep learning-based models, this is 8.1%, 3.7%, 16.6%, 1.4%, 4.9%, and 48.9% higher than CPC2, CNCI, Wen et al.'s CNN, LncADeep, PLEK, and NcResNet, respectively. The accuracy of PLEKv2 was > 90% for cross-species prediction. PLEKv2 is more effective and robust than CPC2, CNCI, LncADeep, PLEK, and NcResNet for primate datasets (including chimpanzees, macaques, and gorillas). Moreover, PLEKv2 is not only suitable for non-human primates that are closely related to humans, but can also predict the coding ability of RNA sequences in plants such as Arabidopsis. CONCLUSIONS The experimental results illustrate that the model constructed by PLEKv2 can distinguish lncRNAs and mRNAs better than PLEK. The PLEKv2 software is freely available at https://sourceforge.net/projects/plek2/ .
Affiliation(s)
- Aimin Li
  - Shaanxi Key Laboratory for Network Computing and Security Technology, School of Computer Science and Engineering, Xi'an University of Technology, Xi'an, Shaanxi, 710048, China
- Haotian Zhou
  - Shaanxi Key Laboratory for Network Computing and Security Technology, School of Computer Science and Engineering, Xi'an University of Technology, Xi'an, Shaanxi, 710048, China
- Siqi Xiong
  - Department of Information Engineering, College of Technology, Hubei Engineering University, Xiaogan, Hubei, 432000, China
- Junhuai Li
  - Shaanxi Key Laboratory for Network Computing and Security Technology, School of Computer Science and Engineering, Xi'an University of Technology, Xi'an, Shaanxi, 710048, China
- Saurav Mallik
  - Center for Precision Health, School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX, 77030, USA
- Rong Fei
  - Shaanxi Key Laboratory for Network Computing and Security Technology, School of Computer Science and Engineering, Xi'an University of Technology, Xi'an, Shaanxi, 710048, China
- Yajun Liu
  - Shaanxi Key Laboratory for Network Computing and Security Technology, School of Computer Science and Engineering, Xi'an University of Technology, Xi'an, Shaanxi, 710048, China
- Hongfang Zhou
  - Shaanxi Key Laboratory for Network Computing and Security Technology, School of Computer Science and Engineering, Xi'an University of Technology, Xi'an, Shaanxi, 710048, China
- Xiaofan Wang
  - Shaanxi Key Laboratory for Network Computing and Security Technology, School of Computer Science and Engineering, Xi'an University of Technology, Xi'an, Shaanxi, 710048, China
- Xinhong Hei
  - Shaanxi Key Laboratory for Network Computing and Security Technology, School of Computer Science and Engineering, Xi'an University of Technology, Xi'an, Shaanxi, 710048, China
- Lei Wang
  - Shaanxi Key Laboratory for Network Computing and Security Technology, School of Computer Science and Engineering, Xi'an University of Technology, Xi'an, Shaanxi, 710048, China
6
Yao L, Xie P, Guan J, Chung CR, Huang Y, Pang Y, Wu H, Chiang YC, Lee TY. CapsEnhancer: An Effective Computational Framework for Identifying Enhancers Based on Chaos Game Representation and Capsule Network. J Chem Inf Model 2024; 64:5725-5736. [PMID: 38946113 PMCID: PMC11267569 DOI: 10.1021/acs.jcim.4c00546] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/29/2024] [Revised: 06/21/2024] [Accepted: 06/21/2024] [Indexed: 07/02/2024]
Abstract
Enhancers are a class of noncoding DNA, serving as crucial regulatory elements in governing gene expression by binding to transcription factors. The identification of enhancers holds paramount importance in the field of biology. However, traditional experimental methods for enhancer identification demand substantial human and material resources. Consequently, there is a growing interest in employing computational methods for enhancer prediction. In this study, we propose a two-stage framework based on deep learning, termed CapsEnhancer, for the identification of enhancers and their strengths. CapsEnhancer utilizes chaos game representation to encode DNA sequences into unique images and employs a capsule network to extract local and global features from sequence "images". Experimental results demonstrate that CapsEnhancer achieves state-of-the-art performance in both stages. In the first and second stages, the accuracy surpasses the previous best methods by 8 and 3.5%, reaching accuracies of 94.5 and 95%, respectively. Notably, this study represents the pioneering application of computer vision methods to enhancer identification tasks. Our work not only contributes novel insights to enhancer identification but also provides a fresh perspective for other biological sequence analysis tasks.
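For readers unfamiliar with capsule networks, the sketch below shows the squash nonlinearity from the original capsule network formulation (Sabour et al., 2017), which rescales a capsule's output vector so that its length lies between 0 and 1 while its direction is preserved. Whether CapsEnhancer uses exactly this form is an assumption; the abstract does not say. The function name `squash` and the `eps` guard are illustrative.

```python
import math


def squash(vector, eps=1e-9):
    """Squash a capsule vector: short vectors shrink toward 0, long ones approach length 1.

    Assumed standard form: v = (|s|^2 / (1 + |s|^2)) * (s / |s|).
    eps avoids division by zero for the all-zero vector.
    """
    norm_sq = sum(v * v for v in vector)
    norm = math.sqrt(norm_sq)
    scale = norm_sq / (1.0 + norm_sq) / (norm + eps)
    return [scale * v for v in vector]
```

The length of the squashed vector is commonly read as the probability that the entity the capsule represents is present in the input.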
Affiliation(s)
- Lantian Yao
  - Kobilka Institute of Innovative Drug Discovery, School of Medicine, The Chinese University of Hong Kong, Shenzhen 518172, China
  - School of Science and Engineering, The Chinese University of Hong Kong, Shenzhen 518172, China
- Peilin Xie
  - Kobilka Institute of Innovative Drug Discovery, School of Medicine, The Chinese University of Hong Kong, Shenzhen 518172, China
- Jiahui Guan
  - School of Medicine, The Chinese University of Hong Kong, Shenzhen 518172, China
- Chia-Ru Chung
  - Department of Computer Science and Information Engineering, National Central University, Taoyuan 320317, Taiwan
- Yixian Huang
  - School of Medicine, The Chinese University of Hong Kong, Shenzhen 518172, China
- Yuxuan Pang
  - Division of Health Medical Intelligence, Human Genome Center, The Institute of Medical Science, The University of Tokyo, Tokyo 108-8639, Japan
- Huacong Wu
  - School of Medicine, The Chinese University of Hong Kong, Shenzhen 518172, China
- Ying-Chih Chiang
  - Kobilka Institute of Innovative Drug Discovery, School of Medicine, The Chinese University of Hong Kong, Shenzhen 518172, China
  - School of Medicine, The Chinese University of Hong Kong, Shenzhen 518172, China
- Tzong-Yi Lee
  - Institute of Bioinformatics and Systems Biology, National Yang Ming Chiao Tung University, Hsinchu 300093, Taiwan
  - Center for Intelligent Drug Systems and Smart Bio-devices (IDS2B), National Yang Ming Chiao Tung University, Hsinchu 300093, Taiwan
7
Diao B, Luo J, Guo Y. A comprehensive survey on deep learning-based identification and predicting the interaction mechanism of long non-coding RNAs. Brief Funct Genomics 2024; 23:314-324. [PMID: 38576205 DOI: 10.1093/bfgp/elae010] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/06/2023] [Revised: 02/25/2024] [Accepted: 03/14/2024] [Indexed: 04/06/2024]
Abstract
Long noncoding RNAs (lncRNAs) have been discovered to be extensively involved in eukaryotic epigenetic, transcriptional, and post-transcriptional regulatory processes with the advancements in sequencing technology and genomics research. Therefore, they play crucial roles in the body's normal physiology and various disease outcomes. Presently, numerous unknown lncRNA sequencing data require exploration. Establishing deep learning-based prediction models for lncRNAs provides valuable insights for researchers, substantially reducing time and costs associated with trial and error and facilitating the disease-relevant lncRNA identification for prognosis analysis and targeted drug development as the era of artificial intelligence progresses. However, most lncRNA-related researchers lack awareness of the latest advancements in deep learning models and model selection and application in functional research on lncRNAs. Thus, we elucidate the concept of deep learning models, explore several prevalent deep learning algorithms and their data preferences, conduct a comprehensive review of recent literature studies with exemplary predictive performance over the past 5 years in conjunction with diverse prediction functions, critically analyze and discuss the merits and limitations of current deep learning models and solutions, while also proposing prospects based on cutting-edge advancements in lncRNA research.
Affiliation(s)
- Biyu Diao
  - Department of Breast Surgery, The First Affiliated Hospital of Ningbo University, No. 59, Liuting Street, Haishu District, Ningbo 315000, China
- Jin Luo
  - Department of Breast Surgery, The First Affiliated Hospital of Ningbo University, No. 59, Liuting Street, Haishu District, Ningbo 315000, China
- Yu Guo
  - Department of Breast Surgery, The First Affiliated Hospital of Ningbo University, No. 59, Liuting Street, Haishu District, Ningbo 315000, China
8
Hazan JM, Amador R, Ali-Nasser T, Lahav T, Shotan SR, Steinberg M, Cohen Z, Aran D, Meiri D, Assaraf YG, Guigó R, Bester AC. Integration of transcription regulation and functional genomic data reveals lncRNA SNHG6's role in hematopoietic differentiation and leukemia. J Biomed Sci 2024; 31:27. [PMID: 38419051 PMCID: PMC10900714 DOI: 10.1186/s12929-024-01015-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/21/2023] [Accepted: 02/22/2024] [Indexed: 03/02/2024]
Abstract
BACKGROUND Long non-coding RNAs (lncRNAs) are pivotal players in cellular processes, and their unique cell-type specific expression patterns render them attractive biomarkers and therapeutic targets. Yet, the functional roles of most lncRNAs remain enigmatic. To address the need to identify new druggable lncRNAs, we developed a comprehensive approach integrating transcription factor binding data with other genetic features to generate a machine learning model, which we have called INFLAMeR (Identifying Novel Functional LncRNAs with Advanced Machine Learning Resources). METHODS INFLAMeR was trained on high-throughput CRISPR interference (CRISPRi) screens across seven cell lines, and the algorithm was based on 71 genetic features. To validate the predictions, we selected candidate lncRNAs in the human K562 leukemia cell line and determined the impact of their knockdown (KD) on cell proliferation and chemotherapeutic drug response. We further performed transcriptomic analysis for candidate genes. Based on these findings, we assessed the lncRNA small nucleolar RNA host gene 6 (SNHG6) for its role in myeloid differentiation. Finally, we established a mouse K562 leukemia xenograft model to determine whether SNHG6 KD attenuates tumor growth in vivo. RESULTS The INFLAMeR model successfully reconstituted CRISPRi screening data and predicted functional lncRNAs that were previously overlooked. Intensive cell-based and transcriptomic validation of nearly fifty genes in K562 revealed cell type-specific functionality for 85% of the predicted lncRNAs. In this respect, our cell-based and transcriptomic analyses predicted a role for SNHG6 in hematopoiesis and leukemia. Consistent with its predicted role in hematopoietic differentiation, SNHG6 transcription is regulated by hematopoiesis-associated transcription factors. SNHG6 KD reduced the proliferation of leukemia cells and sensitized them to differentiation. 
Treatment of K562 leukemic cells with hemin and PMA, respectively, demonstrated that SNHG6 inhibits red blood cell differentiation but strongly promotes megakaryocyte differentiation. Using a xenograft mouse model, we demonstrate that SNHG6 KD attenuated tumor growth in vivo. CONCLUSIONS Our approach not only improved the identification and characterization of functional lncRNAs through genomic approaches in a cell type-specific manner, but also identified new lncRNAs with roles in hematopoiesis and leukemia. Such approaches can be readily applied to identify novel targets for precision medicine.
Affiliation(s)
- Joshua M Hazan
  - Department of Biology, Technion-Israel Institute of Technology, 3200003, Haifa, Israel
- Raziel Amador
  - Centre for Genomic Regulation (CRG), Doctor Aiguader 88, 08003, Barcelona, Catalonia, Spain
  - Universitat de Barcelona (UB), Barcelona, Catalonia, Spain
- Tahleel Ali-Nasser
  - Department of Biology, Technion-Israel Institute of Technology, 3200003, Haifa, Israel
- Tamar Lahav
  - Department of Biology, Technion-Israel Institute of Technology, 3200003, Haifa, Israel
- Stav Roni Shotan
  - Department of Biology, Technion-Israel Institute of Technology, 3200003, Haifa, Israel
- Miryam Steinberg
  - Department of Biology, Technion-Israel Institute of Technology, 3200003, Haifa, Israel
- Ziv Cohen
  - Department of Biology, Technion-Israel Institute of Technology, 3200003, Haifa, Israel
  - The Taub Faculty of Computer Science, Technion-Israel Institute of Technology, 3200003, Haifa, Israel
- Dvir Aran
  - Department of Biology, Technion-Israel Institute of Technology, 3200003, Haifa, Israel
  - The Taub Faculty of Computer Science, Technion-Israel Institute of Technology, 3200003, Haifa, Israel
- David Meiri
  - Department of Biology, Technion-Israel Institute of Technology, 3200003, Haifa, Israel
- Yehuda G Assaraf
  - The Fred Wyszkowski Cancer Research Laboratory, Department of Biology, Technion-Israel Institute of Technology, 3200003, Haifa, Israel
- Roderic Guigó
  - Centre for Genomic Regulation (CRG), Doctor Aiguader 88, 08003, Barcelona, Catalonia, Spain
  - Universitat Pompeu Fabra (UPF), Barcelona, Catalonia, Spain
- Assaf C Bester
  - Department of Biology, Technion-Israel Institute of Technology, 3200003, Haifa, Israel
9
Dhakal P, Tayara H, Chong KT. An ensemble of stacking classifiers for improved prediction of miRNA-mRNA interactions. Comput Biol Med 2023; 164:107242. [PMID: 37473564 DOI: 10.1016/j.compbiomed.2023.107242] [Citation(s) in RCA: 6] [Impact Index Per Article: 3.0] [Received: 05/01/2023] [Revised: 06/21/2023] [Accepted: 07/07/2023] [Indexed: 07/22/2023]
Abstract
MicroRNAs (miRNAs) are small non-coding RNA molecules that play a crucial role in regulating gene expression at the post-transcriptional level by binding to potential target sites of messenger RNAs (mRNAs), facilitated by the Argonaute family of proteins. Selecting the conservative candidate target sites (CTS) is a challenging step, considering that most of the existing computational algorithms primarily focus on canonical site types, which is a time-consuming and inefficient utilization of miRNA target site interactions. We developed a stacking classifier algorithm that addresses the CTS selection criteria using feature-encoding techniques that generate feature vectors, including k-mer nucleotide composition, dinucleotide composition, pseudo-nucleotide composition, and sequence order coupling. This innovative stacking classifier algorithm surpassed previous state-of-the-art algorithms in predicting functional miRNA targets. We evaluated the performance of the proposed model on 10 independent test datasets and obtained an average accuracy of 79.77%, which is a significant improvement of 7.26% over previous models. This improvement shows that the proposed method has great potential for distinguishing highly functional miRNA targets and can serve as a valuable tool in biomedical and drug development research.
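Of the feature encodings listed in the abstract, k-mer nucleotide composition is the simplest to illustrate. The sketch below is a generic version under stated assumptions: the function name `kmer_composition`, the ACGU alphabet ordering, and the normalization by the number of valid windows are illustrative choices, not details taken from the paper. With `k=2` it yields the dinucleotide composition.

```python
from itertools import product


def kmer_composition(seq, k=2, alphabet="ACGU"):
    """Return the relative frequency of every possible k-mer, in lexicographic alphabet order."""
    seq = seq.upper().replace("T", "U")
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    total = 0
    # Slide a window of width k over the sequence, one base at a time.
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if kmer in counts:  # windows containing ambiguous bases are ignored
            counts[kmer] += 1
            total += 1
    if total == 0:
        return [0.0] * len(kmers)
    return [counts[kmer] / total for kmer in kmers]
```

The output has a fixed length of `len(alphabet) ** k` regardless of sequence length, which is what makes this encoding convenient as input to a classical classifier.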
Affiliation(s)
- Priyash Dhakal
  - Department of Electronics and Information Engineering, Jeonbuk National University, Jeonju-si, 54896, Jeollabuk-do, Republic of Korea
- Hilal Tayara
  - School of International Engineering and Science, Jeonbuk National University, Jeonju-si, 54896, Jeollabuk-do, Republic of Korea
- Kil To Chong
  - Department of Electronics and Information Engineering, Jeonbuk National University, Jeonju-si, 54896, Jeollabuk-do, Republic of Korea; Advanced Electronics and Information Research Center, Jeonbuk National University, Jeonju-si, 54896, Jeollabuk-do, Republic of Korea
10
Wei C, Ye Z, Zhang J, Li A. CPPVec: an accurate coding potential predictor based on a distributed representation of protein sequence. BMC Genomics 2023; 24:264. [PMID: 37198531 DOI: 10.1186/s12864-023-09365-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/17/2022] [Accepted: 05/07/2023] [Indexed: 05/19/2023]
Abstract
Long non-coding RNAs (lncRNAs) play a crucial role in a number of biological processes and have received wide attention in recent years. Since the rapid development of high-throughput transcriptome sequencing technologies (RNA-seq) has led to a large amount of RNA data, it is urgent to develop a fast and accurate coding potential predictor. Many computational methods have been proposed to address this issue; they usually exploit information on open reading frame (ORF), protein sequence, k-mer, evolutionary signatures, or homology. Despite the effectiveness of these approaches, there is still much room for improvement. Indeed, none of these methods exploit the contextual information of the RNA sequence; for example, k-mer features that count the occurrence frequencies of contiguous nucleotides (k-mers) in the whole RNA sequence cannot reflect the local contextual information of each k-mer. In view of this shortcoming, we present a novel alignment-free method, CPPVec, which exploits the contextual information of the RNA sequence for coding potential prediction for the first time. It can be easily implemented by a distributed representation (e.g., doc2vec) of the protein sequence translated from the longest ORF. The experimental findings demonstrate that CPPVec is an accurate coding potential predictor and significantly outperforms existing state-of-the-art methods.
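As a rough illustration of the preprocessing this abstract describes (translating the longest ORF into a protein sequence before embedding it), the sketch below uses only the Python standard library. It is not the CPPVec implementation: the helper names `longest_orf`, `translate`, and `protein_words`, the restriction to the three forward reading frames, the requirement that an ORF end in a stop codon, and the 3-residue word size are all assumptions made for this example. The doc2vec embedding step itself is omitted.

```python
from itertools import product

# Standard genetic code (NCBI translation table 1), listed in TCAG codon order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}
STOP_CODONS = {"TAA", "TAG", "TGA"}


def longest_orf(rna):
    """Return the longest ATG...stop ORF in the three forward frames ('' if none)."""
    seq = rna.upper().replace("U", "T")
    best = ""
    for frame in range(3):
        start = None  # index of the first ATG seen since the last stop codon
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                if i + 3 - start > len(best):
                    best = seq[start:i + 3]
                start = None
    return best


def translate(orf):
    """Translate codon by codon; drop the trailing stop, map unknown codons to 'X'."""
    protein = "".join(CODON_TABLE.get(orf[i:i + 3], "X") for i in range(0, len(orf) - 2, 3))
    return protein.rstrip("*")


def protein_words(protein, k=3):
    """Split a protein into overlapping k-residue 'words' (hypothetical tokenization)."""
    return [protein[i:i + k] for i in range(len(protein) - k + 1)]
```

For instance, `translate(longest_orf("CCAUGGCUUAAGG"))` returns `"MA"`. The resulting word lists are the kind of token sequences a doc2vec-style model could be trained on.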
Affiliation(s)
- Chao Wei
  - School of Computer Science, Hubei University of Technology, Wuhan, China.
- Zhiwei Ye
  - School of Computer Science, Hubei University of Technology, Wuhan, China.
- Junying Zhang
  - School of Computer Science and Technology, Xidian University, Xi'an, China.
- Aimin Li
  - School of Computer Science and Engineering, Xi'an University of Technology, Xi'an, China.
|
11
|
Schleif WS, Sarasua SM, DeLuca JM. Preanalytic and Analytic Quality System Considerations in Noncoding RNA Biomarker Development for Clinical Diagnostics. Genet Test Mol Biomarkers 2023; 27:172-182. [PMID: 37257182 DOI: 10.1089/gtmb.2022.0086] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 06/02/2023] Open
Abstract
A frequent topic of biomedical research is the potential clinical use of non-coding (nc) RNAs as quantitative biomarkers for a broad spectrum of health and disease. However, ncRNA analyses have not been pressed into widespread diagnostic use. Strong preclinical evidence suggests obstacles in the translation and reproducibility of this type of biomarker which may result from preanalytical and analytical variation in the non-standardized processes used to collect, process, and store samples, as well as the substantive differences between small and long ncRNA. We performed a narrative review of selected literature, through the lens of key laboratory-developed test (LDT) regulations under the Clinical Laboratory Improvement Amendments (CLIA) in the United States, to study critical gaps in ncRNA validation studies. This review describes the leading candidate ncRNA subclasses, their biogenesis and cellular function, and identifies specific pre-analytical variables with disproportionate impact on testing performance. We summarize these findings with strategic recommendations to clinicians and biomedical scientists involved in the design, conduct, and translation of ncRNA biomarker development.
Affiliation(s)
- William S Schleif
  - Healthcare Genetics Program, School of Nursing, College of Health, Education, and Human Development, Clemson University, Clemson, South Carolina, USA.
  - Program in Pediatric Biospecimen Science, Johns Hopkins All Children's Institute for Clinical and Translational Research, St. Petersburg, Florida, USA.
- Sara M Sarasua
  - Healthcare Genetics Program, School of Nursing, College of Health, Education, and Human Development, Clemson University, Clemson, South Carolina, USA.
- Jane M DeLuca
  - Healthcare Genetics Program, School of Nursing, College of Health, Education, and Human Development, Clemson University, Clemson, South Carolina, USA.
|
12
|
A novel feature and sample joint transfer learning method with feature selection in semi-supervised scenarios for identifying the sequence of some species with less known genetic data. Soft comput 2023. [DOI: 10.1007/s00500-022-07773-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/13/2023]
|
13
|
Alsinet T, Argelich J, Béjar R, Gibert D, Planes J. Argumentation Reasoning with Graph Isomorphism Networks for Reddit Conversation Analysis. INT J COMPUT INT SYS 2022. [DOI: 10.1007/s44196-022-00147-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/25/2022] Open
Abstract
The automated analysis of different trends in online debating forums is an interesting tool for sampling the agreement between citizens in different topics. In previous work, we have defined computational models to measure different values in these online debating forums. One component in these models has been the identification of the set of accepted posts by an argumentation problem that characterizes this accepted set through a particular argumentation acceptance semantics. A second component is the classification of posts into two groups: the ones that agree with the root post of the debate, and the ones that disagree with it. Once we compute the set of accepted posts, we compute the different measures we are interested in obtaining from the debate, as functions defined over the bipartition of the posts and the set of accepted posts. In this work, we propose to explore the use of graph neural networks (GNNs), based on graph isomorphism networks, to solve the problem of computing these measures, using the debate tree as input instead of our previous argumentation reasoning system. We focus on the particular online debate forum Reddit, and on the computation of a measure of the polarization in the debate. We explore the use of two different approaches: one where a single GNN model directly computes the polarization of the debate, and another where the polarization is computed using two different GNNs: the first one to compute the accepted posts of the debate, and the second one to compute the bipartition of the posts of the debate. Our results over a set of Reddit debates show that GNNs can be used to compute the polarization measure with an acceptable error, even if the number of layers of the network is bounded by a constant. We observed that the model based on a single GNN shows the lowest error, yet the one based on two GNNs has more flexibility to compute additional measures from the debates. We also compared the execution time of our GNN-based models with a previous approach based on a distributed algorithm for the computation of the accepted posts, and observed a better performance.
|
14
|
Liu J, Zhou D. Minimum Functional Length Analysis of K-Mer Based on BPNN. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2022; 19:2920-2925. [PMID: 34310316 DOI: 10.1109/tcbb.2021.3098512] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/13/2023]
Abstract
The BP neural network (BPNN), a multilayer feed-forward network, can achieve deep cognition of target data and high accuracy in its output. However, no k-mer research based on BPNN has been reported so far. In the present study, BPNN was used to train and test binary classification data for each classification mode. All k-mers were divided into two categories according to either the X + Y content or a completely random mode. Results showed that 1) for the X + Y content classification mode, the accuracy of k-mer classification was 100 percent, whether k ≤ 6 or k ≥ 7; 2) for the completely random classification mode, the accuracy was 100 percent for k-mers of k ≤ 6, but for k-mers of k ≥ 7 the accuracy was less than 100 percent and decreased gradually as k increased (approaching 50 percent). The k-mers of k ≥ 7 should be the basic functional fragments of nucleic acid, performing basic nucleic acid function in the DNA sequence. The k-mers of k ≤ 6 should be the basic component fragments of nucleic acid, which no longer perform basic nucleic acid function.
|
15
|
A grey convolutional neural network model for traffic flow prediction under traffic accidents. Neurocomputing 2022. [DOI: 10.1016/j.neucom.2022.05.072] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/18/2022]
|
16
|
Bhattacharya D, Kleeblatt DC, Statt A, Reinhart WF. Predicting aggregate morphology of sequence-defined macromolecules with recurrent neural networks. SOFT MATTER 2022; 18:5037-5051. [PMID: 35748651 DOI: 10.1039/d2sm00452f] [Citation(s) in RCA: 9] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/15/2023]
Abstract
Self-assembly of dilute sequence-defined macromolecules is a complex phenomenon in which the local arrangement of chemical moieties can lead to the formation of long-range structure. The dependence of this structure on the sequence necessarily implies that a mapping between the two exists, yet it has been difficult to model so far. Predicting the aggregation behavior of these macromolecules is challenging due to the lack of effective order parameters, a vast design space, inherent variability, and high computational costs associated with currently available simulation techniques. Here, we accurately predict the morphology of aggregates self-assembled from sequence-defined macromolecules using supervised machine learning. We find that regression models with implicit representation learning perform significantly better than those based on engineered features such as k-mer counting, and a recurrent-neural-network-based regressor performs the best out of nine model architectures we tested. Furthermore, we demonstrate the high-throughput screening of monomer sequences using the regression model to identify candidates for self-assembly into selected morphologies. Our strategy is shown to successfully identify multiple suitable sequences in every test we performed, so we hope the insights gained here can be extended to other increasingly complex design scenarios in the future, such as the design of sequences under polydispersity and at varying environmental conditions.
Affiliation(s)
- Debjyoti Bhattacharya
  - Materials Science and Engineering, Pennsylvania State University, University Park, PA 16802, USA.
- Devon C Kleeblatt
  - Materials Science and Engineering, Pennsylvania State University, University Park, PA 16802, USA.
- Antonia Statt
  - Materials Science and Engineering, Grainger College of Engineering, University of Illinois, Urbana-Champaign, IL 61801, USA.
- Wesley F Reinhart
  - Materials Science and Engineering, Pennsylvania State University, University Park, PA 16802, USA.
  - Institute for Computational and Data Sciences, Pennsylvania State University, University Park, PA 16802, USA.
|
17
|
Ammunét T, Wang N, Khan S, Elo LL. Deep learning tools are top performers in long non-coding RNA prediction. Brief Funct Genomics 2022; 21:230-241. [PMID: 35136929 PMCID: PMC9123429 DOI: 10.1093/bfgp/elab045] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/18/2021] [Revised: 11/08/2021] [Accepted: 12/02/2021] [Indexed: 11/23/2022] Open
Abstract
The increasing amount of transcriptomic data has brought to light vast numbers of potential novel RNA transcripts. Accurately distinguishing novel long non-coding RNAs (lncRNAs) from protein-coding messenger RNAs (mRNAs) has challenged bioinformatic tool developers. Most recently, tools implementing deep learning architectures have been developed for this task, with the potential of discovering sequence features and their interactions still not surfaced in current knowledge. We compared the performance of deep learning tools with other predictive tools that are currently used in lncRNA coding potential prediction. A total of 15 tools representing the variety of available methods were investigated. In addition to known annotated transcripts, we also evaluated the use of the tools in actual studies with real-life data. The robustness and scalability of the tools' performance was tested with varying sized test sets and test sets with different proportions of lncRNAs and mRNAs. In addition, the ease-of-use for each tested tool was scored. Deep learning tools were top performers in most metrics and labelled transcripts similarly with each other in the real-life dataset. However, the proportion of lncRNAs and mRNAs in the test sets affected the performance of all tools. Computational resources were utilized differently between the top-ranking tools, thus the nature of the study may affect the decision of choosing one well-performing tool over another. Nonetheless, the results suggest favouring the novel deep learning tools over other tools currently in broad use.
Affiliation(s)
- Tea Ammunét
  - Turku Bioscience Centre, University of Turku and Åbo Akademi University, Turku, Finland.
- Ning Wang
  - Turku Bioscience Centre, University of Turku and Åbo Akademi University, Turku, Finland.
- Sofia Khan
  - Turku Bioscience Centre, University of Turku and Åbo Akademi University, Turku, Finland.
- Laura L Elo
  - Turku Bioscience Centre, University of Turku and Åbo Akademi University, Turku, Finland.
  - Institute of Biomedicine, University of Turku, Turku, Finland.
|
18
|
Gutiérrez-Cárdenas J, Wang Z. Prediction of binding miRNAs involved with immune genes to the SARS-CoV-2 by using sequence features extraction and One-class SVM. INFORMATICS IN MEDICINE UNLOCKED 2022; 30:100958. [PMID: 35528315 PMCID: PMC9057929 DOI: 10.1016/j.imu.2022.100958] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/08/2022] [Revised: 04/25/2022] [Accepted: 04/25/2022] [Indexed: 10/24/2022] Open
Abstract
The prediction of host human miRNA binding to the SARS-CoV-2 RNA sequence is of particular interest. This biological process could lead to virus repression, serve as a biomarker for diagnosis, or point to potential treatments for this disease. One source of concern is attempting to uncover the viral regions in which this binding could occur, as well as how this miRNA binding could affect the SARS-CoV-2 virus's processes. Using extracted sequence features from this base pairing, we predicted the relationships between miRNAs that interact with genes involved in immune function and bind to the SARS-CoV-2 genome in its 5' UTR region. We compared two supervised models, SVM and Random Forest, with an unsupervised One-Class SVM. When the confusion matrices were inspected, the results of the supervised models were misleading, resulting in a Type II error. However, with the latter model, we achieved an average accuracy of 92%, sensitivity of 96.18%, and specificity of 78%. We hypothesize that studying the binding of miRNAs that affect immunological genes and bind to the SARS-CoV-2 virus will lead to potential genetic therapies for fighting the disease or to an understanding of how the immune system is affected when this type of viral infection occurs.
Affiliation(s)
- Juan Gutiérrez-Cárdenas
  - Universidad de Lima, Lima, Peru.
  - College of Science, Engineering and Technology, University of South Africa, Florida, 1710, South Africa.
- Zenghui Wang
  - College of Science, Engineering and Technology, University of South Africa, Florida, 1710, South Africa.
|
19
|
Periwal N, Sharma P, Arora P, Pandey S, Kaur B, Sood V. A novel binary k-mer approach for classification of coding and non-coding RNAs across diverse species. Biochimie 2022; 199:112-122. [PMID: 35476940 DOI: 10.1016/j.biochi.2022.04.012] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/20/2021] [Revised: 03/12/2022] [Accepted: 04/21/2022] [Indexed: 12/01/2022]
Abstract
Classification of coding sequences (CDS) and non-coding RNA (ncRNA) sequences is a challenge, and several machine learning models have been developed for it. Since the frequency of curated CDS is many-fold higher than that of the ncRNAs, we devised a novel approach to work with the complete datasets from fifteen diverse species. In our proposed binary approach, we replaced all the 'A's and 'T's with '0's and all the 'G's and 'C's with '1's to obtain a binary form of CDS and ncRNAs. The k-mer analysis of these binary sequences revealed that the frequency of binary patterns among the CDS and ncRNAs can be used as features to distinguish between them. Using insights from these distinguishing frequencies, we used a k-nearest neighbor classifier to classify them. Our strategy is not only time-efficient but also leads to significantly increased performance metrics in terms of Matthews Correlation Coefficient (MCC), Accuracy, F1 score, Precision, Recall and AUC-ROC, for species like P. paniscus, M. mulatta, M. lucifugus, G. gallus, C. japonica, C. abingdonii, A. carolinensis, D. melanogaster and C. elegans when compared with the conventional ATGC approach. Additionally, we show that the performance obtained for diverse species tested on the model based on H. sapiens correlated with the geological evolutionary timeline, thereby further strengthening our approach. Therefore, we propose that CDS and ncRNAs can be efficiently classified using "2-character" binary frequency as compared to the "4-character" frequency of the ATGC approach. Thus, our highly efficient binary approach can successfully replace the more complex ATGC approach.
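The binary encoding described in this abstract can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function names `to_binary` and `binary_kmer_frequencies`, the default k = 3, the treatment of U as T, and the silent dropping of ambiguous bases are all assumptions made here. The k-nearest neighbor classification step is not shown.

```python
from collections import Counter
from itertools import product

# Mapping from the abstract: A and T become '0', G and C become '1'.
# Mapping U like T is an assumption so that RNA input also works.
BINARY_MAP = {"A": "0", "T": "0", "U": "0", "G": "1", "C": "1"}


def to_binary(seq):
    """Translate a nucleotide sequence into a 0/1 string; unknown symbols are dropped."""
    return "".join(BINARY_MAP[b] for b in seq.upper() if b in BINARY_MAP)


def binary_kmer_frequencies(seq, k=3):
    """Return the relative frequency of each of the 2**k binary k-mers (overlapping windows)."""
    bits = to_binary(seq)
    counts = Counter(bits[i:i + k] for i in range(len(bits) - k + 1))
    total = sum(counts.values())
    kmers = ["".join(p) for p in product("01", repeat=k)]
    return {km: (counts[km] / total if total else 0.0) for km in kmers}
```

As a quick check, `to_binary("ATGC")` returns `"0011"`. With k = 3 the feature vector has 8 entries, compared with 64 for the same k over the four-letter alphabet, which is where the reduced cost of the binary representation comes from.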
Affiliation(s)
- Neha Periwal
  - Department of Biochemistry, School of Chemical and Life Sciences, Jamia Hamdard, Delhi, 110062, India.
- Priya Sharma
  - Department of Biochemistry, School of Chemical and Life Sciences, Jamia Hamdard, Delhi, 110062, India.
- Pooja Arora
  - Department of Zoology, Hansraj College, University of Delhi, Delhi, 110007, India.
- Saurabh Pandey
  - Department of Biochemistry, School of Chemical and Life Sciences, Jamia Hamdard, Delhi, 110062, India.
- Baljeet Kaur
  - Department of Computer Science, Hansraj College, University of Delhi, Delhi, 110007, India.
- Vikas Sood
  - Department of Biochemistry, School of Chemical and Life Sciences, Jamia Hamdard, Delhi, 110062, India.
|
20
|
Bohnsack KS, Kaden M, Abel J, Saralajew S, Villmann T. The Resolved Mutual Information Function as a Structural Fingerprint of Biomolecular Sequences for Interpretable Machine Learning Classifiers. ENTROPY (BASEL, SWITZERLAND) 2021; 23:1357. [PMID: 34682081 PMCID: PMC8534762 DOI: 10.3390/e23101357] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 09/19/2021] [Revised: 10/11/2021] [Accepted: 10/14/2021] [Indexed: 11/16/2022]
Abstract
In the present article we propose the application of variants of the mutual information function as characteristic fingerprints of biomolecular sequences for classification analysis. In particular, we consider the resolved mutual information functions based on Shannon-, Rényi-, and Tsallis-entropy. In combination with interpretable machine learning classifier models based on generalized learning vector quantization, a powerful methodology for sequence classification is achieved which allows substantial knowledge extraction in addition to the high classification ability due to the model-inherent robustness. Any potential (slightly) inferior performance of the used classifier is compensated by the additional knowledge provided by interpretable models. This knowledge may assist the user in the analysis and understanding of the used data and considered task. After theoretical justification of the concepts, we demonstrate the approach for various example data sets covering different areas in biomolecular sequence analysis.
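To make the idea of a mutual-information fingerprint concrete, the sketch below computes the classical Shannon mutual information between symbols a fixed distance apart, and a profile of that value over several distances. It is only an assumed baseline: the "resolved" variants and the Rényi and Tsallis versions studied in the article are not reproduced here, the function names are hypothetical, and estimating the marginals from the observed symbol pairs is a choice made for this example.

```python
from collections import Counter
from math import log2


def mutual_information(seq, lag):
    """Shannon mutual information (in bits) between symbols `lag` positions apart."""
    pairs = [(seq[i], seq[i + lag]) for i in range(len(seq) - lag)]
    if not pairs:
        return 0.0
    n = len(pairs)
    joint = Counter(pairs)
    left = Counter(a for a, _ in pairs)    # marginal of the first symbol
    right = Counter(b for _, b in pairs)   # marginal of the second symbol
    return sum(
        (c / n) * log2((c / n) / ((left[a] / n) * (right[b] / n)))
        for (a, b), c in joint.items()
    )


def mutual_information_profile(seq, max_lag):
    """Fingerprint vector: mutual information at lags 1..max_lag."""
    return [mutual_information(seq, lag) for lag in range(1, max_lag + 1)]
```

A fixed-length profile like this can be fed to any vector-based classifier, which is what makes such fingerprints convenient for sequences of differing lengths.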
Affiliation(s)
- Katrin Sophie Bohnsack
  - Saxon Institute for Computational Intelligence and Machine Learning, University of Applied Sciences Mittweida, 09648 Mittweida, Germany.
- Marika Kaden
  - Saxon Institute for Computational Intelligence and Machine Learning, University of Applied Sciences Mittweida, 09648 Mittweida, Germany.
- Julia Abel
  - Saxon Institute for Computational Intelligence and Machine Learning, University of Applied Sciences Mittweida, 09648 Mittweida, Germany.
- Sascha Saralajew
  - Bosch Center for Artificial Intelligence, 71272 Renningen, Germany.
- Thomas Villmann
  - Saxon Institute for Computational Intelligence and Machine Learning, University of Applied Sciences Mittweida, 09648 Mittweida, Germany.
|
21
|
Asim MN, Ibrahim MA, Imran Malik M, Dengel A, Ahmed S. Advances in Computational Methodologies for Classification and Sub-Cellular Locality Prediction of Non-Coding RNAs. Int J Mol Sci 2021; 22:8719. [PMID: 34445436 PMCID: PMC8395733 DOI: 10.3390/ijms22168719] [Citation(s) in RCA: 14] [Impact Index Per Article: 3.5] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/17/2021] [Revised: 08/02/2021] [Accepted: 08/03/2021] [Indexed: 02/06/2023] Open
Abstract
Apart from protein-coding ribonucleic acids (RNAs), there exists a variety of non-coding RNAs (ncRNAs) which regulate complex cellular and molecular processes. High-throughput sequencing technologies and bioinformatics approaches have largely promoted the exploration of ncRNAs, which has revealed their crucial roles in gene regulation, miRNA binding, protein interactions, and splicing. Furthermore, ncRNAs are involved in the development of complicated diseases like cancer. Categorization of ncRNAs is essential to understand the mechanisms of diseases and to develop effective treatments. Sub-cellular localization information of ncRNAs demystifies diverse functionalities of ncRNAs. To date, several computational methodologies have been proposed to precisely identify the class as well as the sub-cellular localization patterns of RNAs. This paper discusses different types of ncRNAs, reviews computational approaches proposed in the last 10 years to distinguish coding RNA from ncRNA, to identify sub-types of ncRNAs such as piwi-associated RNA, micro RNA, long ncRNA, and circular RNA, and to determine sub-cellular localization of distinct ncRNAs and RNAs. Furthermore, it summarizes diverse ncRNA classification and sub-cellular localization determination datasets along with benchmark performance to aid the development and evaluation of novel computational methodologies. It identifies research gaps, heterogeneity, and challenges in the development of computational approaches for RNA sequence analysis. We consider that our expert analysis will assist Artificial Intelligence researchers with knowing state-of-the-art performance, model selection for various tasks on one platform, dominantly used sequence descriptors, neural architectures, and interpreting inter-species and intra-species performance deviation.
Affiliation(s)
- Muhammad Nabeel Asim
  - German Research Center for Artificial Intelligence (DFKI), 67663 Kaiserslautern, Germany.
  - Department of Computer Science, Technical University of Kaiserslautern, 67663 Kaiserslautern, Germany.
- Muhammad Ali Ibrahim
  - German Research Center for Artificial Intelligence (DFKI), 67663 Kaiserslautern, Germany.
  - Department of Computer Science, Technical University of Kaiserslautern, 67663 Kaiserslautern, Germany.
- Muhammad Imran Malik
  - National Center for Artificial Intelligence (NCAI), National University of Sciences and Technology, Islamabad 44000, Pakistan.
  - School of Electrical Engineering & Computer Science, National University of Sciences and Technology, Islamabad 44000, Pakistan.
- Andreas Dengel
  - German Research Center for Artificial Intelligence (DFKI), 67663 Kaiserslautern, Germany.
  - Department of Computer Science, Technical University of Kaiserslautern, 67663 Kaiserslautern, Germany.
- Sheraz Ahmed
  - German Research Center for Artificial Intelligence (DFKI), 67663 Kaiserslautern, Germany.
  - DeepReader GmbH, Trippstadter Str. 122, 67663 Kaiserslautern, Germany.
|
22
|
Singh OP, Vallejo M, El-Badawy IM, Aysha A, Madhanagopal J, Mohd Faudzi AA. Classification of SARS-CoV-2 and non-SARS-CoV-2 using machine learning algorithms. Comput Biol Med 2021; 136:104650. [PMID: 34329865 PMCID: PMC8294595 DOI: 10.1016/j.compbiomed.2021.104650] [Citation(s) in RCA: 13] [Impact Index Per Article: 3.3] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/28/2021] [Revised: 07/08/2021] [Accepted: 07/13/2021] [Indexed: 11/28/2022]
Abstract
Due to the continued evolution of the SARS-CoV-2 pandemic, researchers worldwide are working to mitigate and suppress its spread, and to better understand it, by deploying digital signal processing (DSP) and machine learning approaches. This study presents an alignment-free approach to classify SARS-CoV-2 using complementary DNA, which is DNA synthesized from the single-stranded RNA virus. Herein, a total of 1582 samples, with different lengths of genome sequences from different regions, were collected from various data sources and divided into a SARS-CoV-2 and a non-SARS-CoV-2 group. We extracted eight biomarkers based on three-base periodicity, using DSP techniques, and ranked them using filter-based feature selection. The ranked biomarkers were fed into k-nearest neighbor, support vector machine, decision tree, and random forest classifiers for the classification of SARS-CoV-2 from other coronaviruses. The training dataset was used to test the performance of the classifiers based on accuracy and F-measure via 10-fold cross-validation. Kappa scores were estimated to check the influence of unbalanced data. Further, a 10 × 10 cross-validation paired t-test was utilized to test the best model with unseen data. Random forest was selected as the best model, differentiating the SARS-CoV-2 coronavirus from other coronaviruses and a control group with an accuracy of 97.4%, sensitivity of 96.2%, and specificity of 98.2% when tested with unseen samples. Moreover, the proposed algorithm was computationally efficient, taking only 0.31 s to compute the genome biomarkers, outperforming previous studies.
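Three-base periodicity is usually measured by mapping a sequence to numeric indicator signals and looking at the Fourier power at one third of the sequence length. The sketch below shows that generic technique with the Python standard library only. It does not reproduce the eight biomarkers of the study, which the abstract does not specify: the function names, the use of binary (Voss-style) indicator sequences, and the peak-to-mean power ratio as the score are assumptions made for illustration.

```python
import cmath


def power_at(seq, k):
    """Total DFT power at frequency index k, summed over the A, C, G, T indicator signals."""
    n = len(seq)
    total = 0.0
    for base in "ACGT":
        # DFT of the 0/1 indicator signal for this base, evaluated at index k only.
        x = sum(cmath.exp(-2j * cmath.pi * k * i / n) for i, b in enumerate(seq) if b == base)
        total += abs(x) ** 2
    return total


def three_base_periodicity(seq):
    """Ratio of the power at N/3 to the mean power over indices 1..N/2 (0.0 if too short)."""
    seq = seq.upper().replace("U", "T")
    n = len(seq) - len(seq) % 3  # truncate so that N/3 is an integer index
    if n < 3:
        return 0.0
    seq = seq[:n]
    peak = power_at(seq, n // 3)
    mean = sum(power_at(seq, k) for k in range(1, n // 2 + 1)) / (n // 2)
    return peak / mean if mean else 0.0
```

Computing the full spectrum this way costs O(N²) per sequence, so for genome-length inputs an FFT routine would be the practical choice. Sequences with almost no spectral power (for example a single repeated base) give numerically unstable ratios and would need separate handling.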
Affiliation(s)
- Marta Vallejo
  - School of Engineering & Physical Sciences, Heriot-Watt University, Edinburgh, UK.
- Ismail M El-Badawy
  - Electronics and Communications Engineering Department, Arab Academy for Science and Technology, Cairo, Egypt.
- Ali Aysha
  - School of Chemistry, University of Edinburgh, Edinburgh, UK.
- Jagannathan Madhanagopal
  - School of Physiotherapy, Faculty of Allied Health Professional, AIMST University, Semeling Campus, Bedong, Kedah, Malaysia.
|
23
|
Classification of Breast Cancer and Breast Neoplasm Scenarios Based on Machine Learning and Sequence Features from lncRNAs-miRNAs-Diseases Associations. Interdiscip Sci 2021; 13:572-581. [PMID: 34152557 DOI: 10.1007/s12539-021-00451-6] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/19/2021] [Revised: 04/28/2021] [Accepted: 06/09/2021] [Indexed: 10/21/2022]
Abstract
The influence of non-coding RNAs, such as lncRNAs (long non-coding RNAs) and miRNAs (microRNAs), is undeniable in several diseases, for example, in the formation of neoplasms and cancer scenarios. However, there are challenges due to the scarcity of validated datasets and the imbalance in the data. We found that the research of associations between miRNAs-lncRNAs and diseases is limited or done separately. In addition, those investigations, which use Machine Learning models joined with genomic sequence features extracted from miRNAs and lncRNAs, are few compared with using some methods such as genomic expression or Deep Learning techniques. In this paper, we propose a structure of using supervised and unsupervised machine learning models with genomic sequence features, such as k-mers, sequence alignments, and energy folding values, to validate miRNAs and lncRNAs association with breast cancer and neoplasms scenarios. Using One-Class SVM for outlier detection and comparing two supervised models such as SVM and Random Forest, we manage to obtain accuracy results of 95.44% for the One-class model, with 88.79% and 99.65% for the SVM and Random Forest models, respectively. The results showed a promising path for the study of sequence features interactions joined with Machine Learning models comparable to those found in the existing literature.
|
24
|
PlncRNA-HDeep: plant long noncoding RNA prediction using hybrid deep learning based on two encoding styles. BMC Bioinformatics 2021; 22:242. [PMID: 33980138 PMCID: PMC8114701 DOI: 10.1186/s12859-020-03870-2] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.8] [Reference Citation Analysis] [Abstract] [Key Words] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/29/2020] [Accepted: 11/09/2020] [Indexed: 11/10/2022] Open
Abstract
Background: Long noncoding RNAs (lncRNAs) play an important role in regulating biological activities, and their prediction is significant for exploring biological processes. Long short-term memory (LSTM) and convolutional neural network (CNN) models can automatically extract and learn abstract information from encoded RNA sequences, avoiding complex feature engineering. An ensemble model learns information from multiple perspectives and shows better performance than a single model. It is feasible and interesting to treat the RNA sequence as a sentence and as an image to train an LSTM and a CNN respectively, and then to hybridize the trained models to predict lncRNAs. To date, there are various predictors for lncRNAs, but few of them have been proposed for plants. A reliable and powerful predictor for plant lncRNAs is necessary. Results: To boost the performance of predicting lncRNAs, this paper proposes a hybrid deep learning model based on two encoding styles (PlncRNA-HDeep), which does not require prior knowledge and only uses RNA sequences to train the models for predicting plant lncRNAs. It not only learns diversified information from RNA sequences encoded by p-nucleotide and one-hot encodings, but also takes advantage of lncRNA-LSTM, proposed in our previous study, and CNN. The parameters are adjusted and three hybrid strategies are tested to maximize its performance. Experimental results show that PlncRNA-HDeep is more effective than lncRNA-LSTM and CNN and obtains 97.9% sensitivity, 95.1% precision, 96.5% accuracy and 96.5% F1 score on the Zea mays dataset, which are better than those of several shallow machine learning methods (support vector machine, random forest, k-nearest neighbor, decision tree, naive Bayes and logistic regression) and some existing tools (CNCI, PLEK, CPC2, LncADeep and lncRNAnet). Conclusions: PlncRNA-HDeep is feasible and obtains credible predictive results. It may also provide valuable references for other related research.
|
25
|
MET Exon 14 Skipping: A Case Study for the Detection of Genetic Variants in Cancer Driver Genes by Deep Learning. Int J Mol Sci 2021; 22:ijms22084217. [PMID: 33921709 PMCID: PMC8072630 DOI: 10.3390/ijms22084217] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Received: 03/25/2021] [Revised: 04/13/2021] [Accepted: 04/17/2021] [Indexed: 11/17/2022] Open
Abstract
BACKGROUND: Disruption of alternative splicing (AS) is frequently observed in cancer and might represent an important signature for tumor progression and therapy. Exon skipping (ES) is one of the most frequent AS events, and in non-small cell lung cancer (NSCLC) MET exon 14 skipping has been shown to be targetable. METHODS: We constructed neural networks (NN/CNN) specifically designed to detect MET exon 14 skipping events using RNAseq data. Furthermore, for discovery purposes we also developed a sparsely connected autoencoder to identify uncharacterized MET isoforms. RESULTS: The neural networks had a MET exon 14 skipping detection rate greater than 94% when tested on a manually curated set of 690 TCGA bronchus and lung samples. When the networks were applied globally to 2605 TCGA samples, we observed that the majority of false positives were characterized by blurry coverage of exon 14. Interestingly, these samples share a common coverage peak in the second intron, and we speculate that this event could be the transcription signature of a LINE1 (Long Interspersed Nuclear Element 1)-MET (Mesenchymal Epithelial Transition receptor tyrosine kinase) fusion. CONCLUSIONS: Taken together, our results indicate that neural networks can be an effective tool for quick classification of pathological transcription events, and that sparsely connected autoencoders could form the basis of an effective discovery tool.
|
26
|
Emami N, Ferdousi R. AptaNet as a deep learning approach for aptamer-protein interaction prediction. Sci Rep 2021; 11:6074. [PMID: 33727685 PMCID: PMC7971039 DOI: 10.1038/s41598-021-85629-0] [Citation(s) in RCA: 23] [Impact Index Per Article: 5.8] [Received: 09/26/2020] [Accepted: 03/03/2021] [Indexed: 02/08/2023] Open
Abstract
Aptamers are short oligonucleotides (DNA/RNA) or peptide molecules that can selectively bind to their specific targets with high specificity and affinity. As a powerful new class of amino acid ligands, aptamers have high potential in biosensing, therapeutic, and diagnostic fields. Here, we present AptaNet, a new deep neural network, to predict aptamer-protein interaction pairs by integrating features derived from both the aptamers and the target proteins. Aptamers were encoded using two different strategies: k-mer frequency and reverse complement k-mer frequency. Amino acid composition (AAC) and pseudo amino acid composition (PseAAC) were applied to represent target information using 24 physicochemical and conformational properties of the proteins. To handle the imbalance problem in the data, we applied a neighborhood cleaning algorithm. The predictor was constructed based on a deep neural network, and optimal features were selected using the random forest algorithm. As a result, 99.79% accuracy was achieved for the training dataset, and 91.38% accuracy was obtained for the testing dataset. AptaNet achieved high performance on our constructed aptamer-protein benchmark dataset. The results indicate that AptaNet can help identify novel aptamer-protein interacting pairs and provide better insight into the relationship between aptamers and proteins. Our benchmark dataset and the source code for AptaNet are available at: https://github.com/nedaemami/AptaNet.
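The two aptamer encodings mentioned in this abstract can be sketched as follows. This is a minimal Python illustration, not the AptaNet code: the function names, the default k = 2, and the convention of merging each k-mer with its reverse complement under the lexicographically smaller string are assumptions.

```python
from collections import Counter
from itertools import product

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(kmer):
    """Return the reverse complement of a DNA k-mer."""
    return kmer.translate(COMPLEMENT)[::-1]


def kmer_frequency(seq, k=2):
    """Relative frequency of every possible k-mer (4**k features)."""
    seq = seq.upper().replace("U", "T")  # treat RNA aptamers as DNA for counting
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts[m] for m in kmers) or 1  # avoid division by zero
    return {m: counts[m] / total for m in kmers}


def rc_kmer_frequency(seq, k=2):
    """Merge each k-mer with its reverse complement (10 features for k = 2)."""
    merged = {}
    for kmer, freq in kmer_frequency(seq, k).items():
        canonical = min(kmer, reverse_complement(kmer))
        merged[canonical] = merged.get(canonical, 0.0) + freq
    return merged
```

Merging reverse complements shrinks the feature vector (16 to 10 features for dinucleotides) because a k-mer and its reverse complement describe the same site read from opposite strands.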
Affiliation(s)
- Neda Emami
  - Department of Health Information Technology, School of Management and Medical Informatics, Tabriz University of Medical Sciences, Tabriz, Iran.
- Reza Ferdousi
  - Department of Health Information Technology, School of Management and Medical Informatics, Tabriz University of Medical Sciences, Tabriz, Iran.
  - Research Center for Pharmaceutical Nanotechnology, Biomedicine Institute, Tabriz University of Medical Sciences, Tabriz, Iran.
|
27
|
Pinkney HR, Wright BM, Diermeier SD. The lncRNA Toolkit: Databases and In Silico Tools for lncRNA Analysis. Noncoding RNA 2020; 6:E49. [PMID: 33339309 PMCID: PMC7768357 DOI: 10.3390/ncrna6040049] [Citation(s) in RCA: 35] [Impact Index Per Article: 7.0] [Received: 11/29/2020] [Revised: 12/14/2020] [Accepted: 12/15/2020] [Indexed: 02/07/2023] Open
Abstract
Long non-coding RNAs (lncRNAs) are a rapidly expanding field of research, with many new transcripts identified each year. However, only a small subset of lncRNAs has been characterized functionally thus far. Bioinformatic tools and databases are invaluable aids for investigating the mechanisms by which new lncRNAs act. Here, we review a selection of computational tools and databases for the in silico analysis of lncRNAs, including tissue-specific expression, protein coding potential, subcellular localization, structural conformation, and interaction partners. The assembled lncRNA toolkit is aimed primarily at experimental researchers as a useful starting point to guide wet-lab experiments, and mainly contains tools with multi-functional, user-friendly interfaces. As more lncRNA analysis tools become available, it will be essential to provide continuous updates and maintain the availability of key software in the future.
Affiliation(s)
- Sarah D. Diermeier
  - Department of Biochemistry, University of Otago, Dunedin 9016, New Zealand; (H.R.P.); (B.M.W.)
|
28
|
XG-ac4C: identification of N4-acetylcytidine (ac4C) in mRNA using eXtreme gradient boosting with electron-ion interaction pseudopotentials. Sci Rep 2020; 10:20942. [PMID: 33262392 PMCID: PMC7708984 DOI: 10.1038/s41598-020-77824-2] [Citation(s) in RCA: 33] [Impact Index Per Article: 6.6] [Received: 08/24/2020] [Accepted: 10/22/2020] [Indexed: 02/06/2023] Open
Abstract
N4-acetylcytidine (ac4C) is a post-transcriptional modification in mRNA that plays a major role in the stability and regulation of mRNA translation. The working mechanism of ac4C modification in mRNA is still unclear, and traditional laboratory experiments are time-consuming and expensive. Therefore, we propose XG-ac4C, a machine learning model based on the eXtreme Gradient Boost classifier, for the identification of ac4C sites. The XG-ac4C model combines the electron-ion interaction pseudopotentials of the nucleotides in ac4C sites with the electron-ion interaction pseudopotentials of their trinucleotides. Moreover, Shapley additive explanations and local interpretable model-agnostic explanations are applied to understand the importance of features and their contribution to the final prediction outcome. The obtained results demonstrate that XG-ac4C outperforms existing state-of-the-art methods. Specifically, the proposed model improves the area under the precision-recall curve by 9.4% and 9.6% in cross-validation and independent tests, respectively. Finally, a user-friendly web server based on the proposed model for ac4C site identification is made freely available at http://nsclbio.jbnu.ac.kr/tools/xgac4c/.
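As an illustration of the two feature types described in this abstract, the sketch below encodes a sequence with electron-ion interaction pseudopotential (EIIP) values and with a trinucleotide variant. It is not the XG-ac4C code. The EIIP constants are the commonly tabulated nucleotide values, with U assigned the value usually listed for T (an assumption for RNA). The trinucleotide feature follows one common formulation, which is also an assumption here: the sum of the three nucleotide EIIP values multiplied by the trinucleotide's relative frequency. Function names are hypothetical.

```python
from itertools import product

# Commonly tabulated EIIP values; U reuses the value usually given for T (assumption).
EIIP = {"A": 0.1260, "C": 0.1340, "G": 0.0806, "U": 0.1335}


def eiip_sequence(seq):
    """Replace each nucleotide with its EIIP value (unknown symbols become 0.0)."""
    seq = seq.upper().replace("T", "U")
    return [EIIP.get(base, 0.0) for base in seq]


def pse_eiip(seq):
    """64 features: trinucleotide EIIP sum weighted by trinucleotide frequency."""
    seq = seq.upper().replace("T", "U")
    trimers = ["".join(p) for p in product("ACGU", repeat=3)]
    counts = {t: 0 for t in trimers}
    for i in range(max(len(seq) - 2, 0)):
        trimer = seq[i:i + 3]
        if trimer in counts:  # skip windows containing unknown symbols
            counts[trimer] += 1
    total = sum(counts.values()) or 1  # avoid division by zero on short input
    return [sum(EIIP[b] for b in t) * counts[t] / total for t in trimers]
```

The concatenation `eiip_sequence(window) + pse_eiip(window)` would give one fixed-length numeric vector per fixed-length sequence window, which is the kind of input a gradient boosting classifier expects.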
|