1
|
Hu X, Sun H, Shan L, Ma C, Quan H, Zhang Y, Zhang J, Fan Z, Tang Y, Deng L. Unraveling Disease-Associated PIWI-Interacting RNAs with a Contrastive Learning Methods. J Chem Inf Model 2025; 65:4687-4697. [PMID: 40263714 DOI: 10.1021/acs.jcim.5c00173] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 04/24/2025]
Abstract
PIWI-interacting RNAs (piRNAs) are a class of small, non-coding RNAs predominantly expressed in the germ cells of animals and play a crucial role in maintaining genomic integrity, mediating transposon suppression, and ensuring gene stability. Beyond their functions in reproductive cells, piRNAs also play roles in various human diseases, including cancer, suggesting their potential as significant biomarkers critical for disease diagnosis and treatment. Wet-lab methods to identify piRNA-disease associations require substantial resources and are often hit-or-miss. With advancements in computational technologies, an increasing number of researchers are employing computational methods to efficiently predict potential piRNA-disease associations. The sparsity of data in piRNA-disease association studies significantly limits model performance improvement. In this study, we propose a novel computational model, iPiDA_CL, to predict potential piRNA-disease associations through contrastive learning methods, which do not require negative samples. The model represents piRNA-disease association pairs as a bipartite graph and computes the initial embeddings of piRNAs and diseases using Gaussian kernel similarity, with features updated via LightGCN. Based on the siamese network framework, iPiDA_CL constructs online and target networks and employs data augmentation in the target network to build a contrastive learning objective that optimizes model parameters without introducing negative samples. Finally, cross-prediction methods are used to calculate specific piRNA-disease association scores. A series of experimental results demonstrate that iPiDA_CL surpasses state-of-the-art methods in both performance and computational efficiency. The application of iPiDA_CL to the miRNA-disease association dataset underscores its versatility across various ncRNA-disease association task. Furthermore, a case study highlights iPiDA_CL as an efficient and promising tool for predicting piRNA-disease associations.
Collapse
Affiliation(s)
- Xiaowen Hu
- School of Computer Science and Engineering, Center South University, Changsha 410083, China
| | - Hao Sun
- School of Computer Science and Engineering, Center South University, Changsha 410083, China
| | - Linchao Shan
- School of Computer Science and Engineering, Center South University, Changsha 410083, China
| | - Chenxi Ma
- School of Computer Science and Engineering, Center South University, Changsha 410083, China
| | - Hanming Quan
- School of Computer Science and Engineering, Center South University, Changsha 410083, China
| | - Yuanpeng Zhang
- School of software, Xinjiang University, Urumqi 830049, China
| | - Jiaxuan Zhang
- Department of Electrical and Computer Engineering, University of California, San Diego, California 92161, United States
| | - Ziyu Fan
- School of Computer Science and Engineering, Center South University, Changsha 410083, China
| | - Yongjun Tang
- Department of Pediatrics, Xiangya Hospital, Central South University, Changsha 410083, China
| | - Lei Deng
- School of Computer Science and Engineering, Center South University, Changsha 410083, China
| |
Collapse
|
2
|
Wei H, Hou J, Liu Y, Shaytan AK, Liu B, Wu H. iPiDA-LGE: a local and global graph ensemble learning framework for identifying piRNA-disease associations. BMC Biol 2025; 23:119. [PMID: 40346532 PMCID: PMC12065364 DOI: 10.1186/s12915-025-02221-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/05/2024] [Accepted: 04/23/2025] [Indexed: 05/11/2025] Open
Abstract
BACKGROUND Exploring piRNA-disease associations can help discover candidate diagnostic or prognostic biomarkers and therapeutic targets. Several computational methods have been presented for identifying associations between piRNAs and diseases. However, the existing methods encounter challenges such as over-smoothing in feature learning and overlooking specific local proximity relationships, resulting in limited representation of piRNA-disease pairs and insufficient detection of association patterns. RESULTS In this study, we propose a novel computational method called iPiDA-LGE for piRNA-disease association identification. iPiDA-LGE comprises two graph convolutional neural network modules based on local and global piRNA-disease graphs, aimed at capturing specific and general features of piRNA-disease pairs. Additionally, it integrates their refined and macroscopic inferences to derive the final prediction result. CONCLUSIONS The experimental results show that iPiDA-LGE effectively leverages the advantages of both local and global graph learning, thereby achieving more discriminative pair representation and superior predictive performance.
Collapse
Affiliation(s)
- Hang Wei
- School of Computer Science and Technology, Xidian University, Xi'an, Shaanxi, 710126, China.
| | - Jialu Hou
- School of Computer Science and Technology, Beijing Institute of Technology, Beijing, 100081, China
| | - Yumeng Liu
- College of Big Data and Internet, Shenzhen Technology University, Shenzhen, Guangdong, 518118, China
| | - Alexey K Shaytan
- Department of Biology, Lomonosov Moscow State University, Moscow, 119234, Russia
- International Laboratory of Bioinformatics, AI and Digital Sciences Institute, Faculty of Computer Science, HSE University, Moscow, 109028, Russia
| | - Bin Liu
- SMBU-MSU-BIT Joint Laboratory On Bioinformatics and Engineering Biology, Shenzhen MSU-BIT University, Shenzhen, Guangdong, 518172, China.
- School of Computer Science and Technology, Beijing Institute of Technology, Beijing, 100081, China.
- Zhongguancun Academy, Beijing, 100094, China.
| | - Hao Wu
- School of Computer Science and Technology, Beijing Institute of Technology, Beijing, 100081, China.
| |
Collapse
|
3
|
Guo C, Wang X, Ren H. Databases and computational methods for the identification of piRNA-related molecules: A survey. Comput Struct Biotechnol J 2024; 23:813-833. [PMID: 38328006 PMCID: PMC10847878 DOI: 10.1016/j.csbj.2024.01.011] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/11/2023] [Revised: 12/31/2023] [Accepted: 01/15/2024] [Indexed: 02/09/2024] Open
Abstract
Piwi-interacting RNAs (piRNAs) are a class of small non-coding RNAs (ncRNAs) that plays important roles in many biological processes and major cancer diagnosis and treatment, thus becoming a hot research topic. This study aims to provide an in-depth review of computational piRNA-related research, including databases and computational models. Herein, we perform literature analysis and use comparative evaluation methods to summarize and analyze three aspects of computational piRNA-related research: (i) computational models for piRNA-related molecular identification tasks, (ii) computational models for piRNA-disease association prediction tasks, and (iii) computational resources and evaluation metrics for these tasks. This study shows that computational piRNA-related research has significantly progressed, exhibiting promising performance in recent years, whereas they also suffer from the emerging challenges of inconsistent naming systems and the lack of data. Different from other reviews on piRNA-related identification tasks that focus on the organization of datasets and computational methods, we pay more attention to the analysis of computational models, algorithms, and performances that aim to provide valuable references for computational piRNA-related identification tasks. This study will benefit the theoretical development and practical application of piRNAs by better understanding computational models and resources to investigate the biological functions and clinical implications of piRNA.
Collapse
Affiliation(s)
- Chang Guo
- Laboratory of Language Engineering and Computing, Guangdong University of Foreign Studies, Guangzhou 510420, China
| | - Xiaoli Wang
- Institute of Reproductive Health, Tongji Medical College, Huazhong University of Science and Technology, Wuhan 430030, China
| | - Han Ren
- Laboratory of Language Engineering and Computing, Guangdong University of Foreign Studies, Guangzhou 510420, China
- Laboratory of Language and Artificial Intelligence, Guangdong University of Foreign Studies, Guangzhou 510420, China
| |
Collapse
|
4
|
Liu Y, Zhang F, Ding Y, Fei R, Li J, Wu FX. MRDPDA: A multi-Laplacian regularized deepFM model for predicting piRNA-disease associations. J Cell Mol Med 2024; 28:e70046. [PMID: 39228010 PMCID: PMC11371490 DOI: 10.1111/jcmm.70046] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/18/2024] [Revised: 07/15/2024] [Accepted: 08/16/2024] [Indexed: 09/05/2024] Open
Abstract
PIWI-interacting RNAs (piRNAs) are a typical class of small non-coding RNAs, which are essential for gene regulation, genome stability and so on. Accumulating studies have revealed that piRNAs have significant potential as biomarkers and therapeutic targets for a variety of diseases. However current computational methods face the challenge in effectively capturing piRNA-disease associations (PDAs) from limited data. In this study, we propose a novel method, MRDPDA, for predicting PDAs based on limited data from multiple sources. Specifically, MRDPDA integrates a deep factorization machine (deepFM) model with regularizations derived from multiple yet limited datasets, utilizing separate Laplacians instead of a simple average similarity network. Moreover, a unified objective function to combine embedding loss about similarities is proposed to ensure that the embedding is suitable for the prediction task. In addition, a balanced benchmark dataset based on piRPheno is constructed and a deep autoencoder is applied for creating reliable negative set from the unlabeled dataset. Compared with three latest methods, MRDPDA achieves the best performance on the pirpheno dataset in terms of the five-fold cross validation test and independent test set, and case studies further demonstrate the effectiveness of MRDPDA.
Collapse
Affiliation(s)
- Yajun Liu
- Shaanxi Key Laboratory for Network Computing and Security Technology, School of Computer Science and Engineering, Xi'an University of Technology, Xi'an, China
| | - Fan Zhang
- Shaanxi Key Laboratory for Network Computing and Security Technology, School of Computer Science and Engineering, Xi'an University of Technology, Xi'an, China
| | - Yulian Ding
- Center for High Performance Computing, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen, China
| | - Rong Fei
- Shaanxi Key Laboratory for Network Computing and Security Technology, School of Computer Science and Engineering, Xi'an University of Technology, Xi'an, China
| | - Junhuai Li
- Shaanxi Key Laboratory for Network Computing and Security Technology, School of Computer Science and Engineering, Xi'an University of Technology, Xi'an, China
| | - Fang-Xiang Wu
- Department of Computer Science, Biomedical Engineering and Mechanical Engineering, University of Saskatchewan, Saskatoon, Saskatchewan, Canada
| |
Collapse
|
5
|
Sun W, Guo C, Wan J, Ren H. piRNA-disease association prediction based on multi-channel graph variational autoencoder. PeerJ Comput Sci 2024; 10:e2216. [PMID: 39145234 PMCID: PMC11323097 DOI: 10.7717/peerj-cs.2216] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/30/2023] [Accepted: 07/03/2024] [Indexed: 08/16/2024]
Abstract
Piwi-interacting RNA (piRNA) is a type of non-coding small RNA that is highly expressed in mammalian testis. PiRNA has been implicated in various human diseases, but the experimental validation of piRNA-disease associations is costly and time-consuming. In this article, a novel computational method for predicting piRNA-disease associations using a multi-channel graph variational autoencoder (MC-GVAE) is proposed. This method integrates four types of similarity networks for piRNAs and diseases, which are derived from piRNA sequences, disease semantics, piRNA Gaussian Interaction Profile (GIP) kernel, and disease GIP kernel, respectively. These networks are modeled by a graph VAE framework, which can learn low-dimensional and informative feature representations for piRNAs and diseases. Then, a multi-channel method is used to fuse the feature representations from different networks. Finally, a three-layer neural network classifier is applied to predict the potential associations between piRNAs and diseases. The method was evaluated on a benchmark dataset containing 5,002 experimentally validated associations with 4,350 piRNAs and 21 diseases, constructed from the piRDisease v1.0 database. It achieved state-of-the-art performance, with an average AUC value of 0.9310 and an AUPR value of 0.9247 under five-fold cross-validation. This demonstrates the method's effectiveness and superiority in piRNA-disease association prediction.
Collapse
Affiliation(s)
- Wei Sun
- School of Information Science and Technology, Qiongtai Normal University, Haikou, China
| | - Chang Guo
- School of Modern Information Industry, Guangzhou College of Commerce, Guangzhou, China
| | - Jing Wan
- Center for Lexicographical Studies, Guangdong University of Foreign Studies, Guangzhou, China
| | - Han Ren
- Laboratory of Language Engineering and Computing, Guangdong University of Foreign Studies, Guangzhou, China
- Laboratory of Language and Artificial Intelligence, Guangdong University of Foreign Studies, Guangzhou, China
| |
Collapse
|
6
|
Zhang J, Lang M, Zhou Y, Zhang Y. Predicting RNA structures and functions by artificial intelligence. Trends Genet 2023; 40:S0168-9525(23)00229-9. [PMID: 39492264 DOI: 10.1016/j.tig.2023.10.001] [Citation(s) in RCA: 6] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/25/2023] [Revised: 08/22/2023] [Accepted: 10/03/2023] [Indexed: 11/05/2024]
Abstract
RNA functions by interacting with its intended targets structurally. However, due to the dynamic nature of RNA molecules, RNA structures are difficult to determine experimentally or predict computationally. Artificial intelligence (AI) has revolutionized many biomedical fields and has been progressively utilized to deduce RNA structures, target binding, and associated functionality. Integrating structural and target binding information could also help improve the robustness of AI-based RNA function prediction and RNA design. Given the rapid development of deep learning (DL) algorithms, AI will provide an unprecedented opportunity to elucidate the sequence-structure-function relation of RNAs.
Collapse
Affiliation(s)
- Jun Zhang
- National Engineering Laboratory for Big Data System Computing Technology, College of Computer Science and Software Engineering, Shenzhen University, Shenzhen, Guangdong, 518060, China
| | - Mei Lang
- Institute of Systems and Physical Biology, Shenzhen Bay Laboratory, Shenzhen, Guangdong, 518106, China
| | - Yaoqi Zhou
- Institute of Systems and Physical Biology, Shenzhen Bay Laboratory, Shenzhen, Guangdong, 518106, China.
| | - Yang Zhang
- School of Science, Harbin Institute of Technology, Shenzhen, Guangdong, 518055, China.
| |
Collapse
|
7
|
Hou J, Wei H, Liu B. iPiDA-SWGCN: Identification of piRNA-disease associations based on Supplementarily Weighted Graph Convolutional Network. PLoS Comput Biol 2023; 19:e1011242. [PMID: 37339125 DOI: 10.1371/journal.pcbi.1011242] [Citation(s) in RCA: 8] [Impact Index Per Article: 4.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/23/2022] [Accepted: 06/05/2023] [Indexed: 06/22/2023] Open
Abstract
Accurately identifying potential piRNA-disease associations is of great importance in uncovering the pathogenesis of diseases. Recently, several machine-learning-based methods have been proposed for piRNA-disease association detection. However, they are suffering from the high sparsity of piRNA-disease association network and the Boolean representation of piRNA-disease associations ignoring the confidence coefficients. In this study, we propose a supplementarily weighted strategy to solve these disadvantages. Combined with Graph Convolutional Networks (GCNs), a novel predictor called iPiDA-SWGCN is proposed for piRNA-disease association prediction. There are three main contributions of iPiDA-SWGCN: (i) Potential piRNA-disease associations are preliminarily supplemented in the sparse piRNA-disease network by integrating various basic predictors to enrich network structure information. (ii) The original Boolean piRNA-disease associations are assigned with different relevance confidence to learn node representations from neighbour nodes in varying degrees. (iii) The experimental results show that iPiDA-SWGCN achieves the best performance compared with the other state-of-the-art methods, and can predict new piRNA-disease associations.
Collapse
Affiliation(s)
- Jialu Hou
- School of Computer Science and Technology, Beijing Institute of Technology, Beijing, China
| | - Hang Wei
- School of Computer Science and Technology, Xidian University, Xi'an, Shaanxi, China
| | - Bin Liu
- School of Computer Science and Technology, Beijing Institute of Technology, Beijing, China
- Advanced Research Institute of Multidisciplinary Science, Beijing Institute of Technology, Beijing, China
| |
Collapse
|
8
|
Bai T, Yan K, Liu B. DAmiRLocGNet: miRNA subcellular localization prediction by combining miRNA-disease associations and graph convolutional networks. Brief Bioinform 2023:bbad212. [PMID: 37332057 DOI: 10.1093/bib/bbad212] [Citation(s) in RCA: 7] [Impact Index Per Article: 3.5] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/13/2023] [Revised: 05/17/2023] [Accepted: 05/18/2023] [Indexed: 06/20/2023] Open
Abstract
MicroRNAs (miRNAs) are human post-transcriptional regulators in humans, which are involved in regulating various physiological processes by regulating the gene expression. The subcellular localization of miRNAs plays a crucial role in the discovery of their biological functions. Although several computational methods based on miRNA functional similarity networks have been presented to identify the subcellular localization of miRNAs, it remains difficult for these approaches to effectively extract well-referenced miRNA functional representations due to insufficient miRNA-disease association representation and disease semantic representation. Currently, there has been a significant amount of research on miRNA-disease associations, making it possible to address the issue of insufficient miRNA functional representation. In this work, a novel model is established, named DAmiRLocGNet, based on graph convolutional network (GCN) and autoencoder (AE) for identifying the subcellular localizations of miRNA. The DAmiRLocGNet constructs the features based on miRNA sequence information, miRNA-disease association information and disease semantic information. GCN is utilized to gather the information of neighboring nodes and capture the implicit information of network structures from miRNA-disease association information and disease semantic information. AE is employed to capture sequence semantics from sequence similarity networks. The evaluation demonstrates that the performance of DAmiRLocGNet is superior to other competing computational approaches, benefiting from implicit features captured by using GCNs. The DAmiRLocGNet has the potential to be applied to the identification of subcellular localization of other non-coding RNAs. Moreover, it can facilitate further investigation into the functional mechanisms underlying miRNA localization. The source code and datasets are accessed at http://bliulab.net/DAmiRLocGNet.
Collapse
Affiliation(s)
- Tao Bai
- School of Computer Science and Technology, Beijing Institute of Technology, Beijing 100081, China
- School of Mathematics & Computer Science, Yan'an University, Shaanxi 716000, China
| | - Ke Yan
- School of Computer Science and Technology, Beijing Institute of Technology, Beijing 100081, China
| | - Bin Liu
- School of Computer Science and Technology, Beijing Institute of Technology, Beijing 100081, China
- Advanced Research Institute of Multidisciplinary Science, Beijing Institute of Technology, Beijing 100081, China
| |
Collapse
|
9
|
Meng X, Shang J, Ge D, Yang Y, Zhang T, Liu JX. ETGPDA: identification of piRNA-disease associations based on embedding transformation graph convolutional network. BMC Genomics 2023; 24:279. [PMID: 37226081 PMCID: PMC10210294 DOI: 10.1186/s12864-023-09380-8] [Citation(s) in RCA: 6] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/24/2023] [Accepted: 05/15/2023] [Indexed: 05/26/2023] Open
Abstract
BACKGROUND Piwi-interacting RNAs (piRNAs) have been proven to be closely associated with human diseases. The identification of the potential associations between piRNA and disease is of great significance for complex diseases. Traditional "wet experiment" is time-consuming and high-priced, predicting the piRNA-disease associations by computational methods is of great significance. METHODS In this paper, a method based on the embedding transformation graph convolution network is proposed to predict the piRNA-disease associations, named ETGPDA. Specifically, a heterogeneous network is constructed based on the similarity information of piRNA and disease, as well as the known piRNA-disease associations, which is applied to extract low-dimensional embeddings of piRNA and disease based on graph convolutional network with an attention mechanism. Furthermore, the embedding transformation module is developed for the problem of embedding space inconsistency, which is lightweighter, stronger learning ability and higher accuracy. Finally, the piRNA-disease association score is calculated by the similarity of the piRNA and disease embedding. RESULTS Evaluated by fivefold cross-validation, the AUC of ETGPDA achieves 0.9603, which is better than the other five selected computational models. The case studies based on Head and neck squamous cell carcinoma and Alzheimer's disease further prove the superior performance of ETGPDA. CONCLUSIONS Hence, the ETGPDA is an effective method for predicting the hidden piRNA-disease associations.
Collapse
Affiliation(s)
- Xianghan Meng
- School of Computer Science, Qufu Normal University, Rizhao, 276826 China
| | - Junliang Shang
- School of Computer Science, Qufu Normal University, Rizhao, 276826 China
| | - Daohui Ge
- School of Computer Science, Qufu Normal University, Rizhao, 276826 China
| | - Yi Yang
- School of Computer Science, Qufu Normal University, Rizhao, 276826 China
| | - Tongdui Zhang
- Science and Technology Innovation Service Institution of Rizhao, Rizhao, 276826 China
| | - Jin-Xing Liu
- School of Computer Science, Qufu Normal University, Rizhao, 276826 China
| |
Collapse
|
10
|
Zhang L, Bai T, Wu H. sgRNA-2wPSM: Identify sgRNAs on-target activity by combining two-window-based position specific mismatch and synthetic minority oversampling technique. Comput Biol Med 2023; 155:106489. [PMID: 36841059 DOI: 10.1016/j.compbiomed.2022.106489] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/22/2022] [Accepted: 12/27/2022] [Indexed: 12/30/2022]
Abstract
MOTIVATION sgRNAs on-target activity prediction is a critical step in the CRISPR-Cas9 system. Due to its importance to RNA function research and genome editing application, some computational methods were introduced, treating it as a binary classification task or a regression task. Among these methods, sgRNA-PSM is a state-of-the-art method. In this work, we improved this method by proposing a new feature extraction method called two-window-based PSM, which divides the DNA sequences into two non-overlapping segments so as to extract different patterns in the two different segments. The two-window-based PSM were fed into Support Vector Machines (SVMs), and a new method called sgRNA-2wPSM was proposed. Furthermore, a new oversampling method called SCORE-SVM-SMOTE was proposed to solve the imbalanced training set problem based on the SVM-SMOTE algorithm. Results on the benchmark datasets indicated that sgRNA-2wPSM is superior to other methods.
Collapse
Affiliation(s)
- Lichao Zhang
- School of Intelligent Manufacturing and Equipment, Shenzhen Institute of Information Technology, Shenzhen, China.
| | - Tao Bai
- School of Computer Science and Technology, Beijing Institute of Technology, Beijing, 100081, China; School of Mathematics & Computer Science, Yanan University, Shanxi, 716000, China.
| | - Hao Wu
- School of Computer Science and Technology, Beijing Institute of Technology, Beijing, 100081, China.
| |
Collapse
|
11
|
Shi H, Wu C, Bai T, Chen J, Li Y, Wu H. Identify essential genes based on clustering based synthetic minority oversampling technique. Comput Biol Med 2023; 153:106523. [PMID: 36652869 DOI: 10.1016/j.compbiomed.2022.106523] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/22/2022] [Revised: 12/13/2022] [Accepted: 12/31/2022] [Indexed: 01/03/2023]
Abstract
Prediction of essential genes in a life organism is one of the central tasks in synthetic biology. Computational predictors are desired because experimental data is often unavailable. Recently, some sequence-based predictors have been constructed to identify essential genes. However, their predictive performance should be further improved. One key problem is how to effectively extract the sequence-based features, which are able to discriminate the essential genes. Another problem is the imbalanced training set. The amount of essential genes in human cell lines is lower than that of non-essential genes. Therefore, predictors trained with such imbalanced training set tend to identify an unseen sequence as a non-essential gene. Here, a new over-sampling strategy was proposed called Clustering based Synthetic Minority Oversampling Technique (CSMOTE) to overcome the imbalanced data issue. Combining CSMOTE with the Z curve, the global features, and Support Vector Machines, a new protocol called iEsGene-CSMOTE was proposed to identify essential genes. The rigorous jackknife cross validation results indicated that iEsGene-CSMOTE is better than the other competing methods. The proposed method outperformed λ-interval Z curve by 35.48% and 11.25% in terms of Sn and BACC, respectively.
Collapse
Affiliation(s)
- Hua Shi
- School of Opto-electronic and Communication Engineering, Xiamen University of Technology, Xiamen, China.
| | - Chenjin Wu
- School of Opto-electronic and Communication Engineering, Xiamen University of Technology, Xiamen, China.
| | - Tao Bai
- School of Computer Science and Technology, Beijing Institute of Technology, Beijing, 100081, China; School of Mathematics & Computer Science, Yanan University, Shanxi, 716000, China.
| | - Jiahai Chen
- Xiamen Sankuai Online Technology Co., Ltd, Xiamen, China.
| | - Yan Li
- School of Opto-electronic and Communication Engineering, Xiamen University of Technology, Xiamen, China.
| | - Hao Wu
- School of Computer Science and Technology, Beijing Institute of Technology, Beijing, 100081, China.
| |
Collapse
|