1
|
Feng H, Nie Q, Yang S. SORFPP: Enhancing rich sequence-driven information to identify SEPs based on fused framework on validation datasets. PLoS One 2025; 20:e0320314. [PMID: 40294059 PMCID: PMC12036913 DOI: 10.1371/journal.pone.0320314] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/18/2024] [Accepted: 02/17/2025] [Indexed: 04/30/2025] Open
Abstract
BACKGROUND Genome sequencing has enabled us to find functional peptides encoded by short open read frames (sORFs) in long non-coding RNAs (lncRNAs). sORFs-encoded peptides (SEPs) regulate gene expression, signaling, and so on and have significant roles, unlike common peptides. Various computational methods have been proposed. However, there is a lack of contributive features and effective models. Therefore, a high-throughput computational method to predict SEPs is needed. RESULTS We propose a computational method, SORFPP, to predict SEPs by mining feature information from multiple perspectives in an experimentally validated dataset from TranLnc. SORFPP fully extracts SEP sequence information using the protein language model ESM-2 and curated traditional encoding, including QSOrder, k-mer, etc. SORFPP uses CatBoost to solve the sparsity problem of traditional encoding. SORFPP also analyzes ESM-2 pre-training characterization information with the Self-attention model. Finally, an ensemble learning framework combines the two models and their results are fed into Logistic Regression model for accurate and robust predictions. For comparison, SORFPP outperforms other state-of-the-art models in Matthew correlation coefficient by 12.2%-24.2% on three benchmark datasets. CONCLUSION Integrating the ensemble learning strategy with contributive traditional features and the protein language encoding methods shows better performance. Datasets and codes are accessible at https://doi.org/10.6084/m9.figshare.28079897 and http://111.229.198.94:5000/.
Collapse
Affiliation(s)
- Hongqi Feng
- School of Computer Science and Artificial Intelligence Aliyun School of Big Data School of Software, Changzhou University, Changzhou, China
| | - Qi Nie
- School of Computer Science and Artificial Intelligence Aliyun School of Big Data School of Software, Changzhou University, Changzhou, China
| | - Sen Yang
- School of Computer Science and Artificial Intelligence Aliyun School of Big Data School of Software, Changzhou University, Changzhou, China
- The Affiliated Changzhou No.2 People’s Hospital of Nanjing Medical University, Changzhou, China
| |
Collapse
|
2
|
Li H, Meng J, Wang Z, Luan Y. misORFPred: A Novel Method to Mine Translatable sORFs in Plant Pri-miRNAs Using Enhanced Scalable k-mer and Dynamic Ensemble Voting Strategy. Interdiscip Sci 2025; 17:114-133. [PMID: 39397199 DOI: 10.1007/s12539-024-00661-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/27/2023] [Revised: 09/18/2024] [Accepted: 09/22/2024] [Indexed: 10/15/2024]
Abstract
The primary microRNAs (pri-miRNAs) have been observed to contain translatable small open reading frames (sORFs) that can encode peptides as an independent element. Relevant studies have proven that those of sORFs are of significance in regulating the expression of biological traits. The existing methods for predicting the coding potential of sORFs frequently overlook this data or categorize them as negative samples, impeding the identification of additional translatable sORFs in pri-miRNAs. In light of this, a novel method named misORFPred has been proposed. Specifically, an enhanced scalable k-mer (ESKmer) that simultaneously integrates the composition information within a sequence and distance information between sequences is designed to extract the nucleotide sequence features. After feature selection, the optimal features and several machine learning classifiers are combined to construct the ensemble model, where a newly devised dynamic ensemble voting strategy (DEVS) is proposed to dynamically adjust the weights of base classifiers and adaptively select the optimal base classifiers for each unlabeled sample. Cross-validation results suggest that ESKmer and DEVS are essential for this classification task and could boost model performance. Independent testing results indicate that misORFPred outperforms the state-of-the-art methods. Furthermore, we execute misORFPerd on the genomes of various plant species and perform a thorough analysis of the predicted outcomes. Taken together, misORFPred is a powerful tool for identifying the translatable sORFs in plant pri-miRNAs and can provide highly trusted candidates for subsequent biological experiments.
Collapse
Affiliation(s)
- Haibin Li
- School of Computer Science and Technology, Dalian University of Technology, Dalian, 116024, China
| | - Jun Meng
- School of Computer Science and Technology, Dalian University of Technology, Dalian, 116024, China.
| | - Zhaowei Wang
- School of Computer Science and Technology, Dalian University of Technology, Dalian, 116024, China
| | - Yushi Luan
- School of Bioengineering, Dalian University of Technology, Dalian, 116024, China.
| |
Collapse
|
3
|
Zhang Q, Liu L. Novel insights into small open reading frame-encoded micropeptides in hepatocellular carcinoma: A potential breakthrough. Cancer Lett 2024; 587:216691. [PMID: 38360139 DOI: 10.1016/j.canlet.2024.216691] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/23/2023] [Revised: 01/13/2024] [Accepted: 01/27/2024] [Indexed: 02/17/2024]
Abstract
Traditionally, non-coding RNAs (ncRNAs) are regarded as a class of RNA transcripts that lack encoding capability; however, advancements in technology have revealed that some ncRNAs contain small open reading frames (sORFs) that are capable of encoding micropeptides of approximately 150 amino acids in length. sORF-encoded micropeptides (SEPs) have emerged as intriguing entities in hepatocellular carcinoma (HCC) research, shedding light on this previously unexplored realm. Recent studies have highlighted the regulatory functions of SEPs in the occurrence and progression of HCC. Some SEPs exhibit inhibitory effects on HCC, but others facilitate its development. This discovery has revolutionized the landscape of HCC research and clinical management. Here, we introduce the concept and characteristics of SEPs, summarize their associations with HCC, and elucidate their carcinogenic mechanisms in HCC metabolism, signaling pathways, cell proliferation, and metastasis. In addition, we propose a step-by-step workflow for the investigation of HCC-associated SEPs. Lastly, we discuss the challenges and prospects of applying SEPs in the diagnosis and treatment of HCC. This review aims to facilitate the discovery, optimization, and clinical application of HCC-related SEPs, inspiring the development of early diagnostic, individualized, and precision therapeutic strategies for HCC.
Collapse
Affiliation(s)
- Qiangnu Zhang
- Division of Hepatobiliary and Pancreas Surgery, Department of General Surgery, Shenzhen People's Hospital (The Second Clinical Medical College, Jinan University, The First Affiliated Hospital, Southern University of Science and Technology), 518020, Shenzhen, China
| | - Liping Liu
- Division of Hepatobiliary and Pancreas Surgery, Department of General Surgery, Shenzhen People's Hospital (The Second Clinical Medical College, Jinan University, The First Affiliated Hospital, Southern University of Science and Technology), 518020, Shenzhen, China.
| |
Collapse
|
4
|
Azad H, Akbar MY, Sarfraz J, Haider W, Riaz MN, Ali GM, Ghazanfar S. G-ACP: a machine learning approach to the prediction of therapeutic peptides for gastric cancer. J Biomol Struct Dyn 2024:1-14. [PMID: 38450672 DOI: 10.1080/07391102.2024.2323141] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/14/2023] [Accepted: 02/15/2024] [Indexed: 03/08/2024]
Abstract
Conventional Gastrointestinal (GI) cancer treatments are quite expensive and have major hazards. Nowadays, a different strategy places more emphasis on creating tiny biologically active peptides that do not cause severe poisoning. Anticancer peptides (ACPs) are found through experimental screening, which is time-dependent and frequently fraught with difficulties. Gastric ACPs are emerging as a promising GI cancer treatment in the current day. It is crucial to identify novel gastric ACPs to have an improved knowledge of their functioning processes and treatment of gastric cancer. As a result of the post-genomic era's massive production of peptide sequences, rapid and effective ACPs using a computational method are essential. Several adaptive statistical techniques for distinguishing ACPs and non-ACPs have recently been developed. A variety of adapted statistically significant methods have been developed to differentiate between ACPs and non-ACPs. Despite significant progress, there is no specific model for the prediction of gastric ACPs because the specific model will predict a particular type of peptide more accurately and quickly. To overcome this, an initiative is taken for the creation of a reliable framework for the accurate identification of gastric ACPs. The current technique in particular contains four possible features along with one hybrid feature encoding mechanisms which are the target-class motif previously indicated by Amino Acid Composition, Dipeptide Composition, Tripeptide Composition (TPC), Pseudo Amino Acid Composition (PAAC), and their Hybrid. Machine Learning algorithms make high-performance and accurate prediction tools. Moreover, highly variable and ideal deep feature selection is done using an ANOVA-based F score for feature pruning. Experiments on a range of algorithms are carried out to identify the optimal operating strategy due to the diverse nature of learning. Following analysis of the empirical results, Naïve Bayes with TPC and Hybrid feature space outperforms other methods with 0.99 accuracy score on the testing dataset. To find the model generalization an external validation is carried out. In external datasets, the Extra Trees with PAAC features outperforms with the accuracy of 0.94. The comparison study shows that our suggested model will predict gastric ACPs more accurately and will be useful in drug development and gastric cancer. The predictive model can be freely accessed at https://github.com/humeraazad10/G-ACP.git.
Collapse
Affiliation(s)
- Humera Azad
- Department of Biosciences (Bioinformatics) Islamabad, Comsats University Islamabad, Pakistan
| | - Muhammad Yasir Akbar
- National Institute for Genomics and Advanced Biotechnology (NIGAB), National Agricultural Research Center (NARC), Pakistan
| | | | - Waseem Haider
- Department of Biosciences (Bioinformatics) Islamabad, Comsats University Islamabad, Pakistan
| | - Muhammad Naeem Riaz
- National Institute for Genomics and Advanced Biotechnology (NIGAB), National Agricultural Research Center (NARC), Pakistan
| | - Ghulam Muhammad Ali
- Department of Biosciences (Bioinformatics) Islamabad, Comsats University Islamabad, Pakistan
| | - Shakira Ghazanfar
- National Institute for Genomics and Advanced Biotechnology (NIGAB), National Agricultural Research Center (NARC), Pakistan
| |
Collapse
|
5
|
Jastrzebski JP, Pascarella S, Lipka A, Dorocki S. IncRna: The R Package for Optimizing lncRNA Identification Processes. J Comput Biol 2023; 30:1322-1326. [PMID: 37878344 DOI: 10.1089/cmb.2023.0091] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/26/2023] Open
Abstract
In silico identification of long noncoding RNAs (lncRNAs) is a multistage process including filtering of transcripts according to their physical characteristics (e.g., length, exon-intron structure) and determination of the coding potential of the sequence. A common issue within this process is the choice of the most suitable method of coding potential analysis for the conducted research. Selection of tools on the sole basis of their single performance may not provide the most effective choice for a specific problem. To overcome these limitations, we developed the R library lncRna, which provides functions to easily carry out the entire lncRNA identification process. For example, the package prepares the data files for coding potential analysis to perform error analysis. Moreover, the package gives the opportunity to analyze the effectiveness of various combinations of the lncRNA prediction methods to select the optimal configuration of the entire process.
Collapse
Affiliation(s)
- Jan Pawel Jastrzebski
- Faculty of Biology and Biotechnology, University of Warmia and Mazury in Olsztyn, Olsztyn, Poland
| | - Stefano Pascarella
- Department of Biochemical Sciences "A. Rossi Fanelli" Sapienza University of Rome, Rome, Italy
| | - Aleksandra Lipka
- Institute of Oral Biology, Faculty of Dentistry University of Oslo, Oslo, Norway
| | | |
Collapse
|
6
|
Wang Y, Pan Z, Mou M, Xia W, Zhang H, Zhang H, Liu J, Zheng L, Luo Y, Zheng H, Yu X, Lian X, Zeng Z, Li Z, Zhang B, Zheng M, Li H, Hou T, Zhu F. A task-specific encoding algorithm for RNAs and RNA-associated interactions based on convolutional autoencoder. Nucleic Acids Res 2023; 51:e110. [PMID: 37889083 PMCID: PMC10682500 DOI: 10.1093/nar/gkad929] [Citation(s) in RCA: 30] [Impact Index Per Article: 15.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/10/2022] [Revised: 08/01/2023] [Accepted: 10/10/2023] [Indexed: 10/28/2023] Open
Abstract
RNAs play essential roles in diverse physiological and pathological processes by interacting with other molecules (RNA/protein/compound), and various computational methods are available for identifying these interactions. However, the encoding features provided by existing methods are limited and the existing tools does not offer an effective way to integrate the interacting partners. In this study, a task-specific encoding algorithm for RNAs and RNA-associated interactions was therefore developed. This new algorithm was unique in (a) realizing comprehensive RNA feature encoding by introducing a great many of novel features and (b) enabling task-specific integration of interacting partners using convolutional autoencoder-directed feature embedding. Compared with existing methods/tools, this novel algorithm demonstrated superior performances in diverse benchmark testing studies. This algorithm together with its source code could be readily accessed by all user at: https://idrblab.org/corain/ and https://github.com/idrblab/corain/.
Collapse
Affiliation(s)
- Yunxia Wang
- College of Pharmaceutical Sciences, The Second Affiliated Hospital, Zhejiang University School of Medicine, Polytechnic Institute, Zhejiang University, Hangzhou 310058, China
| | - Ziqi Pan
- College of Pharmaceutical Sciences, The Second Affiliated Hospital, Zhejiang University School of Medicine, Polytechnic Institute, Zhejiang University, Hangzhou 310058, China
| | - Minjie Mou
- College of Pharmaceutical Sciences, The Second Affiliated Hospital, Zhejiang University School of Medicine, Polytechnic Institute, Zhejiang University, Hangzhou 310058, China
| | - Weiqi Xia
- College of Pharmaceutical Sciences, The Second Affiliated Hospital, Zhejiang University School of Medicine, Polytechnic Institute, Zhejiang University, Hangzhou 310058, China
| | - Hongning Zhang
- College of Pharmaceutical Sciences, The Second Affiliated Hospital, Zhejiang University School of Medicine, Polytechnic Institute, Zhejiang University, Hangzhou 310058, China
| | - Hanyu Zhang
- College of Pharmaceutical Sciences, The Second Affiliated Hospital, Zhejiang University School of Medicine, Polytechnic Institute, Zhejiang University, Hangzhou 310058, China
| | - Jin Liu
- College of Pharmaceutical Sciences, The Second Affiliated Hospital, Zhejiang University School of Medicine, Polytechnic Institute, Zhejiang University, Hangzhou 310058, China
| | - Lingyan Zheng
- College of Pharmaceutical Sciences, The Second Affiliated Hospital, Zhejiang University School of Medicine, Polytechnic Institute, Zhejiang University, Hangzhou 310058, China
- Innovation Institute for Artificial Intelligence in Medicine of Zhejiang University, Alibaba-ZJU Joint Research Center of Future Digital Healthcare, Hangzhou 330110, China
| | - Yongchao Luo
- College of Pharmaceutical Sciences, The Second Affiliated Hospital, Zhejiang University School of Medicine, Polytechnic Institute, Zhejiang University, Hangzhou 310058, China
| | - Hanqi Zheng
- College of Pharmaceutical Sciences, The Second Affiliated Hospital, Zhejiang University School of Medicine, Polytechnic Institute, Zhejiang University, Hangzhou 310058, China
| | - Xinyuan Yu
- College of Pharmaceutical Sciences, The Second Affiliated Hospital, Zhejiang University School of Medicine, Polytechnic Institute, Zhejiang University, Hangzhou 310058, China
| | - Xichen Lian
- College of Pharmaceutical Sciences, The Second Affiliated Hospital, Zhejiang University School of Medicine, Polytechnic Institute, Zhejiang University, Hangzhou 310058, China
| | - Zhenyu Zeng
- Innovation Institute for Artificial Intelligence in Medicine of Zhejiang University, Alibaba-ZJU Joint Research Center of Future Digital Healthcare, Hangzhou 330110, China
| | - Zhaorong Li
- Innovation Institute for Artificial Intelligence in Medicine of Zhejiang University, Alibaba-ZJU Joint Research Center of Future Digital Healthcare, Hangzhou 330110, China
| | - Bing Zhang
- Innovation Institute for Artificial Intelligence in Medicine of Zhejiang University, Alibaba-ZJU Joint Research Center of Future Digital Healthcare, Hangzhou 330110, China
| | - Mingyue Zheng
- College of Pharmaceutical Sciences, The Second Affiliated Hospital, Zhejiang University School of Medicine, Polytechnic Institute, Zhejiang University, Hangzhou 310058, China
- Drug Discovery and Design Center, State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, Shanghai 201203, China
| | - Honglin Li
- College of Pharmaceutical Sciences, The Second Affiliated Hospital, Zhejiang University School of Medicine, Polytechnic Institute, Zhejiang University, Hangzhou 310058, China
- School of Pharmacy, East China University of Science and Technology, Shanghai 200237, China
| | - Tingjun Hou
- College of Pharmaceutical Sciences, The Second Affiliated Hospital, Zhejiang University School of Medicine, Polytechnic Institute, Zhejiang University, Hangzhou 310058, China
| | - Feng Zhu
- College of Pharmaceutical Sciences, The Second Affiliated Hospital, Zhejiang University School of Medicine, Polytechnic Institute, Zhejiang University, Hangzhou 310058, China
- Innovation Institute for Artificial Intelligence in Medicine of Zhejiang University, Alibaba-ZJU Joint Research Center of Future Digital Healthcare, Hangzhou 330110, China
- Westlake Laboratory of Life Sciences and Biomedicine, Hangzhou, Zhejiang, China
| |
Collapse
|
7
|
Chen XG, Yang X, Li C, Lin X, Zhang W. Non-coding RNA identification with pseudo RNA sequences and feature representation learning. Comput Biol Med 2023; 165:107355. [PMID: 37639767 DOI: 10.1016/j.compbiomed.2023.107355] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/29/2023] [Revised: 07/16/2023] [Accepted: 08/12/2023] [Indexed: 08/31/2023]
Abstract
Distinguishing non-coding RNAs (ncRNAs) from coding RNAs is very important in bioinformatics. Although many methods have been proposed for solving this task, it remains highly challenging to further improve the accuracy of ncRNA identification. In this paper, we propose a coding potential predictor using feature representation learning based on pseudo RNA sequences named CPPFLPS. In this method, we use the pseudo RNA sequences generated by simulating RNA sequence mutations as new samples for data augmentation, and six string operations simulating RNA sequence mutations are considered: base replacement, base insertion, base deletion, subsequence reversion, subsequence repetition and subsequence deletion. In the feature representation learning framework, different types of pseudo RNA sequences are added to the training set to form new training sets that can be used to train baseline classifiers, thus obtaining baseline models. The resulting labels of these baseline models are used as feature vectors to represent RNA sequences, and the resulting feature vectors acquired after feature selection are used to train a predictive model for distinguishing ncRNAs from coding RNAs. Our method achieves better performance compared with that of existing state-of-the-art methods. The implementation of the proposed method is available at https://github.com/chenxgscuec/CPPFLPS.
Collapse
Affiliation(s)
- Xian-Gan Chen
- School of Biomedical Engineering, South-Central Minzu University, Wuhan, 430074, China; Hubei Key Laboratory of Medical Information Analysis and Tumor Diagnosis & Treatment, South-Central Minzu University, Wuhan, 430074, China; Key Laboratory of Cognitive Science(South-Central Minzu University), State Ethnic Affairs Commission, Wuhan, 430074, China.
| | - Xiaofei Yang
- School of Biomedical Engineering, South-Central Minzu University, Wuhan, 430074, China; Hubei Key Laboratory of Medical Information Analysis and Tumor Diagnosis & Treatment, South-Central Minzu University, Wuhan, 430074, China; Key Laboratory of Cognitive Science(South-Central Minzu University), State Ethnic Affairs Commission, Wuhan, 430074, China.
| | - Chenhong Li
- School of Biomedical Engineering, South-Central Minzu University, Wuhan, 430074, China; Hubei Key Laboratory of Medical Information Analysis and Tumor Diagnosis & Treatment, South-Central Minzu University, Wuhan, 430074, China; Key Laboratory of Cognitive Science(South-Central Minzu University), State Ethnic Affairs Commission, Wuhan, 430074, China.
| | - Xianguang Lin
- School of Biomedical Engineering, South-Central Minzu University, Wuhan, 430074, China; Hubei Key Laboratory of Medical Information Analysis and Tumor Diagnosis & Treatment, South-Central Minzu University, Wuhan, 430074, China; Key Laboratory of Cognitive Science(South-Central Minzu University), State Ethnic Affairs Commission, Wuhan, 430074, China.
| | - Wen Zhang
- College of Informatics, Huazhong Agricultural University, Wuhan, 430070, China.
| |
Collapse
|
8
|
Deng L, Jiang Y, Hu X, Zheng R, Huang Z, Zhang J. ABLNCPP: Attention Mechanism-Based Bidirectional Long Short-Term Memory for Noncoding RNA Coding Potential Prediction. J Chem Inf Model 2023; 63:3955-3966. [PMID: 37294848 DOI: 10.1021/acs.jcim.3c00366] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 06/11/2023]
Abstract
With the continuous development of ribosome profiling, sequencing technology, and proteomics, evidence is mounting that noncoding RNA (ncRNA) may be a novel source of peptides or proteins. These peptides and proteins play crucial roles in inhibiting tumor progression and interfering with cancer metabolism and other essential physiological processes. Therefore, identifying ncRNAs with coding potential is vital to ncRNA functional research. However, existing studies perform well in classifying ncRNAs and mRNAs, and no research has been explicitly raised to distinguish whether ncRNA transcripts have coding potential. For this reason, we propose an attention mechanism-based bidirectional LSTM network called ABLNCPP to assess the coding possibility of ncRNA sequences. Considering the sequential information loss in previous methods, we introduce a novel nonoverlapping trinucleotide embedding (NOLTE) method for ncRNAs to obtain embeddings containing sequential features. The extensive evaluations show that ABLNCPP outperforms other state-of-the-art models. In general, ABLNCPP overcomes the bottleneck of ncRNA coding potential prediction and is expected to provide valuable contributions to cancer discovery and treatment in the future. The source code and data sets are freely available at https://github.com/YinggggJ/ABLNCPP.
Collapse
Affiliation(s)
- Lei Deng
- School of Computer Science and Engineering, Central South University, Changsha 410018, China
| | - Ying Jiang
- School of Computer Science and Engineering, Central South University, Changsha 410018, China
| | - Xiaowen Hu
- School of Computer Science and Engineering, Central South University, Changsha 410018, China
| | - Rongtao Zheng
- School of Computer Science and Engineering, Central South University, Changsha 410018, China
| | - Zhijian Huang
- School of Computer Science and Engineering, Central South University, Changsha 410018, China
| | - Jingpu Zhang
- School of Computer and Data Science, Henan University of Urban Construction, Pingdingshan 467000, China
| |
Collapse
|
9
|
Wang Z, Cui Q, Su C, Zhao S, Wang R, Wang Z, Meng J, Luan Y. Unveiling the secrets of non-coding RNA-encoded peptides in plants: A comprehensive review of mining methods and research progress. Int J Biol Macromol 2023:124952. [PMID: 37257526 DOI: 10.1016/j.ijbiomac.2023.124952] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/22/2023] [Revised: 05/15/2023] [Accepted: 05/16/2023] [Indexed: 06/02/2023]
Abstract
Non-coding RNAs (ncRNAs) are not conventionally involved in protein encoding. However, recent findings indicate that ncRNAs possess the capacity to code for proteins or peptides. These ncRNA-encoded peptides (ncPEPs) are vital for diverse plant life processes and exhibit significant potential value. Despite their importance, research on plant ncPEPs is limited, with only a few studies conducted and less information on the underlying mechanisms, and the field remains in its nascent stage. This manuscript provides a comprehensive overview of ncPEPs mining methods in plants, focusing on prediction, identification, and functional analysis. We discuss the strengths and weaknesses of various techniques, identify future research directions in the ncPEPs domain, and elucidate the biological functions and agricultural application prospects of plant ncPEPs. By highlighting the immense potential and research value of ncPEPs, we aim to lay a solid foundation for more in-depth studies in plant science.
Collapse
Affiliation(s)
- Zhengjie Wang
- School of Bioengineering, Dalian University of Technology, Dalian 116024, China
| | - Qi Cui
- School of Bioengineering, Dalian University of Technology, Dalian 116024, China
| | - Chenglin Su
- School of Bioengineering, Dalian University of Technology, Dalian 116024, China
| | - Siyuan Zhao
- School of Computer Science and Technology, Dalian University of Technology, Dalian 116024, China
| | - Ruiming Wang
- School of Bioengineering, Dalian University of Technology, Dalian 116024, China
| | - Zhicheng Wang
- School of Bioengineering, Dalian University of Technology, Dalian 116024, China
| | - Jun Meng
- School of Computer Science and Technology, Dalian University of Technology, Dalian 116024, China
| | - Yushi Luan
- School of Bioengineering, Dalian University of Technology, Dalian 116024, China.
| |
Collapse
|
10
|
Chen X, Huang J, He B. AntiDMPpred: a web service for identifying anti-diabetic peptides. PeerJ 2022; 10:e13581. [PMID: 35722269 PMCID: PMC9205309 DOI: 10.7717/peerj.13581] [Citation(s) in RCA: 9] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/03/2022] [Accepted: 05/23/2022] [Indexed: 01/17/2023] Open
Abstract
Diabetes mellitus (DM) is a chronic metabolic disease that has been a major threat to human health globally, causing great economic and social adversities. The oral administration of anti-diabetic peptide drugs has become a novel route for diabetes therapy. Numerous bioactive peptides have demonstrated potential anti-diabetic properties and are promising as alternative treatment measures to prevent and manage diabetes. The computational prediction of anti-diabetic peptides can help promote peptide-based drug discovery in the process of searching newly effective therapeutic peptide agents for diabetes treatment. Here, we resorted to random forest to develop a computational model, named AntiDMPpred, for predicting anti-diabetic peptides. A benchmark dataset with 236 anti-diabetic and 236 non-anti-diabetic peptides was first constructed. Four types of sequence-derived descriptors were used to represent the peptide sequences. We then combined four machine learning methods and six feature scoring methods to select the non-redundant features, which were fed into diverse machine learning classifiers to train the models. Experimental results show that AntiDMPpred reached an accuracy of 77.12% and area under the receiver operating curve (AUCROC) of 0.8193 in the nested five-fold cross-validation, yielding a satisfactory performance and surpassing other classifiers implemented in the study. The web service is freely accessible at http://i.uestc.edu.cn/AntiDMPpred/cgi-bin/AntiDMPpred.pl. We hope AntiDMPpred could improve the discovery of anti-diabetic bioactive peptides.
Collapse
Affiliation(s)
- Xue Chen
- Medical College, Guizhou University, Guiyang, China
| | - Jian Huang
- School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu, China
| | - Bifang He
- Medical College, Guizhou University, Guiyang, China
| |
Collapse
|
11
|
Asim MN, Ibrahim MA, Imran Malik M, Dengel A, Ahmed S. Advances in Computational Methodologies for Classification and Sub-Cellular Locality Prediction of Non-Coding RNAs. Int J Mol Sci 2021; 22:8719. [PMID: 34445436 PMCID: PMC8395733 DOI: 10.3390/ijms22168719] [Citation(s) in RCA: 14] [Impact Index Per Article: 3.5] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/17/2021] [Revised: 08/02/2021] [Accepted: 08/03/2021] [Indexed: 02/06/2023] Open
Abstract
Apart from protein-coding Ribonucleic acids (RNAs), there exists a variety of non-coding RNAs (ncRNAs) which regulate complex cellular and molecular processes. High-throughput sequencing technologies and bioinformatics approaches have largely promoted the exploration of ncRNAs which revealed their crucial roles in gene regulation, miRNA binding, protein interactions, and splicing. Furthermore, ncRNAs are involved in the development of complicated diseases like cancer. Categorization of ncRNAs is essential to understand the mechanisms of diseases and to develop effective treatments. Sub-cellular localization information of ncRNAs demystifies diverse functionalities of ncRNAs. To date, several computational methodologies have been proposed to precisely identify the class as well as sub-cellular localization patterns of RNAs). This paper discusses different types of ncRNAs, reviews computational approaches proposed in the last 10 years to distinguish coding-RNA from ncRNA, to identify sub-types of ncRNAs such as piwi-associated RNA, micro RNA, long ncRNA, and circular RNA, and to determine sub-cellular localization of distinct ncRNAs and RNAs. Furthermore, it summarizes diverse ncRNA classification and sub-cellular localization determination datasets along with benchmark performance to aid the development and evaluation of novel computational methodologies. It identifies research gaps, heterogeneity, and challenges in the development of computational approaches for RNA sequence analysis. We consider that our expert analysis will assist Artificial Intelligence researchers with knowing state-of-the-art performance, model selection for various tasks on one platform, dominantly used sequence descriptors, neural architectures, and interpreting inter-species and intra-species performance deviation.
Collapse
Affiliation(s)
- Muhammad Nabeel Asim
- German Research Center for Artificial Intelligence (DFKI), 67663 Kaiserslautern, Germany; (M.A.I.); (A.D.); (S.A.)
- Department of Computer Science, Technical University of Kaiserslautern, 67663 Kaiserslautern, Germany
| | - Muhammad Ali Ibrahim
- German Research Center for Artificial Intelligence (DFKI), 67663 Kaiserslautern, Germany; (M.A.I.); (A.D.); (S.A.)
- Department of Computer Science, Technical University of Kaiserslautern, 67663 Kaiserslautern, Germany
| | - Muhammad Imran Malik
- National Center for Artificial Intelligence (NCAI), National University of Sciences and Technology, Islamabad 44000, Pakistan;
- School of Electrical Engineering & Computer Science, National University of Sciences and Technology, Islamabad 44000, Pakistan
| | - Andreas Dengel
- German Research Center for Artificial Intelligence (DFKI), 67663 Kaiserslautern, Germany; (M.A.I.); (A.D.); (S.A.)
- Department of Computer Science, Technical University of Kaiserslautern, 67663 Kaiserslautern, Germany
| | - Sheraz Ahmed
- German Research Center for Artificial Intelligence (DFKI), 67663 Kaiserslautern, Germany; (M.A.I.); (A.D.); (S.A.)
- DeepReader GmbH, Trippstadter Str. 122, 67663 Kaiserslautern, Germany
| |
Collapse
|
12
|
Chen XG, Zhang W, Yang X, Li C, Chen H. ACP-DA: Improving the Prediction of Anticancer Peptides Using Data Augmentation. Front Genet 2021; 12:698477. [PMID: 34276801 PMCID: PMC8279753 DOI: 10.3389/fgene.2021.698477] [Citation(s) in RCA: 23] [Impact Index Per Article: 5.8] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/21/2021] [Accepted: 06/07/2021] [Indexed: 12/09/2022] Open
Abstract
Anticancer peptides (ACPs) have provided a promising perspective for cancer treatment, and the prediction of ACPs is very important for the discovery of new cancer treatment drugs. It is time consuming and expensive to use experimental methods to identify ACPs, so computational methods for ACP identification are urgently needed. There have been many effective computational methods, especially machine learning-based methods, proposed for such predictions. Most of the current machine learning methods try to find suitable features or design effective feature learning techniques to accurately represent ACPs. However, the performance of these methods can be further improved for cases with insufficient numbers of samples. In this article, we propose an ACP prediction model called ACP-DA (Data Augmentation), which uses data augmentation for insufficient samples to improve the prediction performance. In our method, to better exploit the information of peptide sequences, peptide sequences are represented by integrating binary profile features and AAindex features, and then the samples in the training set are augmented in the feature space. After data augmentation, the samples are used to train the machine learning model, which is used to predict ACPs. The performance of ACP-DA exceeds that of existing methods, and ACP-DA achieves better performance in the prediction of ACPs compared with a method without data augmentation. The proposed method is available at http://github.com/chenxgscuec/ACPDA.
Collapse
Affiliation(s)
- Xian-Gan Chen
- School of Biomedical Engineering, South-Central University for Nationalities, Wuhan, China.,Hubei Key Laboratory of Medical Information Analysis and Tumor Diagnosis & Treatment, South-Central University for Nationalities, Wuhan, China.,Key Laboratory of Cognitive Science (South-Central University for Nationalities), State Ethnic Affairs Commission, Wuhan, China
| | - Wen Zhang
- College of Informatics, Huazhong Agricultural University, Wuhan, China.,Hubei Engineering Technology Research Center of Agricultural Big Data, Wuhan, China
| | - Xiaofei Yang
- School of Biomedical Engineering, South-Central University for Nationalities, Wuhan, China.,Hubei Key Laboratory of Medical Information Analysis and Tumor Diagnosis & Treatment, South-Central University for Nationalities, Wuhan, China.,Key Laboratory of Cognitive Science (South-Central University for Nationalities), State Ethnic Affairs Commission, Wuhan, China
| | - Chenhong Li
- School of Biomedical Engineering, South-Central University for Nationalities, Wuhan, China.,Hubei Key Laboratory of Medical Information Analysis and Tumor Diagnosis & Treatment, South-Central University for Nationalities, Wuhan, China.,Key Laboratory of Cognitive Science (South-Central University for Nationalities), State Ethnic Affairs Commission, Wuhan, China
| | - Hengling Chen
- School of Biomedical Engineering, South-Central University for Nationalities, Wuhan, China.,Hubei Key Laboratory of Medical Information Analysis and Tumor Diagnosis & Treatment, South-Central University for Nationalities, Wuhan, China.,Key Laboratory of Cognitive Science (South-Central University for Nationalities), State Ethnic Affairs Commission, Wuhan, China
| |
Collapse
|