1
Li H, Meng J, Wang Z, Tang Y, Xia S, Wang Y, Qin Z, Luan Y. miPEPPred-FRL: A Novel Method for Predicting Plant MiRNA-Encoded Peptides Using Adaptive Feature Representation Learning. J Chem Inf Model 2024; 64:2889-2900. [PMID: 37733290 DOI: 10.1021/acs.jcim.3c01020] [Indexed: 09/22/2023]
Abstract
MicroRNAs (miRNAs) are an essential type of small molecule RNAs that play significant regulatory roles in organisms. Recent studies have demonstrated that small open reading frames (sORFs) harbored in primary miRNAs (pri-miRNAs) can encode small peptides, known as miPEPs. Plant miPEPs can increase the abundance and activity of cognate miRNAs by promoting the transcription of their corresponding pri-miRNAs, thereby modulating plant traits. Biological experiments are the most effective way to accurately identify miPEPs; however, they are time-consuming and expensive. Hence, an efficient computational method for the identification of miPEPs on a large scale is highly desirable. Up to now, there have been no specialized computational tools for identifying miPEPs. In this work, a novel predictor named miPEPPred-FRL based on an adaptive feature representation learning framework that consists of the feature transformation module and the cascade architecture has been proposed. The feature transformation module integrating a newly designed feature selection method and classifier selection rule is developed to convert sequence-based features into primary class and probabilistic features, which are then fed into the improved cascade architecture to obtain more stable and discriminative augmented features. Finally, the augmented features are utilized to construct the final predictor. Cross-validation experiments illustrate that the novel feature selection method and classifier selection rule contribute to boosting the feature representation ability of the framework. Furthermore, the high accuracy of miPEPPred-FRL on independent testing data suggests that it is a trustworthy and valuable tool for the identification of miPEPs.
Affiliation(s)
- Haibin Li
  - School of Computer Science and Technology, Dalian University of Technology, Dalian, Liaoning 116024, China
- Jun Meng
  - School of Computer Science and Technology, Dalian University of Technology, Dalian, Liaoning 116024, China
- Zhaowei Wang
  - School of Computer Science and Technology, Dalian University of Technology, Dalian, Liaoning 116024, China
- Youwei Tang
  - School of Computer Science and Technology, Dalian University of Technology, Dalian, Liaoning 116024, China
- Shihao Xia
  - School of Computer Science and Technology, Dalian University of Technology, Dalian, Liaoning 116024, China
- Yu Wang
  - School of Computer Science and Technology, Dalian University of Technology, Dalian, Liaoning 116024, China
- Zhaojing Qin
  - School of Computer Science and Technology, Dalian University of Technology, Dalian, Liaoning 116024, China
- Yushi Luan
  - School of Bioengineering, Dalian University of Technology, Dalian, Liaoning 116024, China
2
Peng Z, Li J, Jiang X, Wan C. sOCP: a framework predicting smORF coding potential based on TIS and in-frame features and effectively applied in the human genome. Brief Bioinform 2024; 25:bbae147. [PMID: 38600664 PMCID: PMC11006793 DOI: 10.1093/bib/bbae147] [Received: 12/05/2023] [Revised: 02/25/2024] [Accepted: 03/19/2024] [Indexed: 04/12/2024] Open
Abstract
Small open reading frames (smORFs) have been acknowledged to play various roles in essential biological pathways and to affect human health from diabetes to tumorigenesis. Predicting smORFs in silico is a prerequisite for processing the omics data. Here, we proposed the smORF-coding-potential-predicting framework, sOCP, which provides functions to construct a model for predicting novel smORFs in some species. The sOCP model constructed for human was based on in-frame features and the nucleotide bias around the start codon, and this small feature subset proved sufficient while avoiding the overfitting problems of complicated models. It showed better prediction metrics than previous methods and correlated closely with experimental evidence in a heterogeneous dataset. The model was applied to Rattus norvegicus and exhibited satisfactory performance. We then scanned smORFs with ATG and non-ATG start codons from the human genome and generated a database containing about a million novel smORFs with coding potential. Around 72 000 smORFs are located on the lncRNA regions of the genome. The smORF-encoded peptides may be involved in biological pathways rare for canonical proteins, including glucocorticoid catabolic process and the prokaryotic defense system. Our work provides a model and database for human smORF investigation and a convenient tool for further smORF prediction in other species.
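The genome scan mentioned in this abstract can be illustrated with a minimal sketch. This is not the sOCP implementation: the function name `find_smorfs`, the forward-strand-only scan, and the length bounds of 10 to 100 codons are all assumptions made for illustration, and the abstract does not say which non-ATG start codons were used.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_smorfs(seq, start_codons=("ATG",), min_codons=10, max_codons=100):
    """Return (start, end, orf) tuples for candidate smORFs on the forward strand.

    Hypothetical helper: the length bounds are illustrative assumptions,
    not values taken from the sOCP paper.
    """
    seq = seq.upper()
    hits = []
    for i in range(len(seq) - 2):
        if seq[i:i + 3] not in start_codons:
            continue
        # Walk downstream codon by codon until the first in-frame stop codon.
        for j in range(i + 3, len(seq) - 2, 3):
            if seq[j:j + 3] in STOP_CODONS:
                n_codons = (j - i) // 3  # coding codons, stop codon excluded
                if min_codons <= n_codons <= max_codons:
                    hits.append((i, j + 3, seq[i:j + 3]))
                break
    return hits
```

Passing a longer tuple as `start_codons`, for example `("ATG", "CTG")`, extends the same scan to alternative start codons. A scan like this only proposes candidates; scoring their coding potential is the part the sOCP model addresses.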
Affiliation(s)
- Zhao Peng
  - School of Life Sciences, and Hubei Key Laboratory of Genetic Regulation and Integrative Biology, Central China Normal University, Wuhan 430079, Hubei, People’s Republic of China
- Jiaqiang Li
  - School of Computer Science, and Hubei Provincial Key Laboratory of Artificial Intelligence and Smart Learning, Central China Normal University, Wuhan 430079, Hubei, People’s Republic of China
- Xingpeng Jiang
  - School of Computer Science, and Hubei Provincial Key Laboratory of Artificial Intelligence and Smart Learning, Central China Normal University, Wuhan 430079, Hubei, People’s Republic of China
- Cuihong Wan
  - School of Life Sciences, and Hubei Key Laboratory of Genetic Regulation and Integrative Biology, Central China Normal University, Wuhan 430079, Hubei, People’s Republic of China
3
Valdivia-Francia F, Sendoel A. No country for old methods: New tools for studying microproteins. iScience 2024; 27:108972. [PMID: 38333695 PMCID: PMC10850755 DOI: 10.1016/j.isci.2024.108972] [Indexed: 02/10/2024] Open
Abstract
Microproteins encoded by small open reading frames (sORFs) have emerged as a fascinating frontier in genomics. Traditionally overlooked due to their small size, recent technological advancements such as ribosome profiling, mass spectrometry-based strategies and advanced computational approaches have led to the annotation of more than 7000 sORFs in the human genome. Despite the vast progress, only a tiny portion of these microproteins have been characterized and an important challenge in the field lies in identifying functionally relevant microproteins and understanding their role in different cellular contexts. In this review, we explore the recent advancements in sORF research, focusing on the new methodologies and computational approaches that have facilitated their identification and functional characterization. Leveraging these new tools holds great promise for dissecting the diverse cellular roles of microproteins and will ultimately pave the way for understanding their role in the pathogenesis of diseases and identifying new therapeutic targets.
Affiliation(s)
- Fabiola Valdivia-Francia
  - University of Zurich, Institute for Regenerative Medicine (IREM), Wagistrasse 12, 8952 Schlieren-Zurich, Switzerland
  - Life Science Zurich Graduate School, Molecular Life Science Program, University of Zurich/ETH Zurich, Schlieren-Zurich, Switzerland
- Ataman Sendoel
  - University of Zurich, Institute for Regenerative Medicine (IREM), Wagistrasse 12, 8952 Schlieren-Zurich, Switzerland
4
Bosch JA, Keith N, Escobedo F, Fisher WW, LaGraff JT, Rabasco J, Wan KH, Weiszmann R, Hu Y, Kondo S, Brown JB, Perrimon N, Celniker SE. Molecular and functional characterization of the Drosophila melanogaster conserved smORFome. Cell Rep 2023; 42:113311. [PMID: 37889754 PMCID: PMC10843857 DOI: 10.1016/j.celrep.2023.113311] [Received: 05/02/2022] [Revised: 08/24/2023] [Accepted: 10/04/2023] [Indexed: 10/29/2023] Open
Abstract
Short polypeptides encoded by small open reading frames (smORFs) are ubiquitously found in eukaryotic genomes and are important regulators of physiology, development, and mitochondrial processes. Here, we focus on a subset of 298 smORFs that are evolutionarily conserved between Drosophila melanogaster and humans. Many of these smORFs are conserved broadly in the bilaterian lineage, and ∼182 are conserved in plants. We observe remarkably heterogeneous spatial and temporal expression patterns of smORF transcripts, indicating widespread tissue-specific and stage-specific mitochondrial architectures. In addition, an analysis of annotated functional domains reveals a predicted enrichment of smORF polypeptides localizing to mitochondria. We conduct an embryonic ribosome profiling experiment and find support for translation of 137 of these smORFs during embryogenesis. We further embark on functional characterization using CRISPR knockout/activation, RNAi knockdown, and cDNA overexpression, revealing diverse phenotypes. This study underscores the importance of identifying smORF function in disease and phenotypic diversity.
Affiliation(s)
- Justin A Bosch
  - Department of Genetics, Blavatnik Institute, Harvard Medical School, Boston, MA 02115, USA
- Nathan Keith
  - Division of Biological Systems and Engineering, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA; Division of Environmental Genomics and Systems Biology, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA
- Felipe Escobedo
  - Department of Genetics, Blavatnik Institute, Harvard Medical School, Boston, MA 02115, USA
- William W Fisher
  - Division of Biological Systems and Engineering, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA
- James Thai LaGraff
  - Department of Genetics, Blavatnik Institute, Harvard Medical School, Boston, MA 02115, USA
- Jorden Rabasco
  - Department of Genetics, Blavatnik Institute, Harvard Medical School, Boston, MA 02115, USA
- Kenneth H Wan
  - Division of Biological Systems and Engineering, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA
- Richard Weiszmann
  - Division of Biological Systems and Engineering, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA
- Yanhui Hu
  - Department of Genetics, Blavatnik Institute, Harvard Medical School, Boston, MA 02115, USA
- Shu Kondo
  - Laboratory of Invertebrate Genetics, National Institute of Genetics, Mishima, Shizuoka 411-8540, Japan
- James B Brown
  - Division of Biological Systems and Engineering, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA; Division of Environmental Genomics and Systems Biology, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA
- Norbert Perrimon
  - Department of Genetics, Blavatnik Institute, Harvard Medical School, Boston, MA 02115, USA; Howard Hughes Medical Institute, Harvard Medical School, Boston, MA 02115, USA
- Susan E Celniker
  - Division of Biological Systems and Engineering, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA
5
Wang X, Zhang Z, Shi C, Wang Y, Zhou T, Lin A. Clinical prospects and research strategies of long non-coding RNA encoding micropeptides. Zhejiang Da Xue Xue Bao Yi Xue Ban 2023; 52:397-405. [PMID: 37643974 PMCID: PMC10495248 DOI: 10.3724/zdxbyxb-2023-0128] [Received: 03/17/2023] [Accepted: 07/20/2023] [Indexed: 08/12/2023]
Abstract
Long non-coding RNAs (lncRNAs), which are usually thought to have no protein coding ability, are widely involved in cell proliferation, signal transduction and other biological activities. However, recent studies have suggested that short open reading frames (sORFs) of some lncRNAs can encode small functional peptides (micropeptides). These micropeptides appear to play important roles in calcium homeostasis, embryonic development and tumorigenesis, suggesting their potential as therapeutic targets and diagnostic biomarkers. Currently, bioinformatic tools as well as experimental methods such as ribosome mapping and in vitro translation are applied to predict the coding potential of lncRNAs. Furthermore, mass spectrometry, specific antibodies and epitope tags are used for validating the expression of micropeptides. Here, we review the physiological and pathological functions of recently identified micropeptides as well as research strategies for predicting the coding potential of lncRNAs to facilitate further research on lncRNA-encoded micropeptides.
Affiliation(s)
- Xinyi Wang
  - College of Life Sciences, Zhejiang University, Hangzhou 310058, China
  - Zhejiang University Cancer Center, Hangzhou 310058, China
- Zhen Zhang
  - College of Life Sciences, Zhejiang University, Hangzhou 310058, China
  - Zhejiang University Cancer Center, Hangzhou 310058, China
- Chengyu Shi
  - College of Life Sciences, Zhejiang University, Hangzhou 310058, China
  - Zhejiang University Cancer Center, Hangzhou 310058, China
- Ying Wang
  - College of Life Sciences, Zhejiang University, Hangzhou 310058, China
  - Zhejiang University Cancer Center, Hangzhou 310058, China
- Tianhua Zhou
  - Zhejiang University Cancer Center, Hangzhou 310058, China
  - The Fourth Affiliated Hospital, Zhejiang University School of Medicine, Center for RNA Medicine, International Institutes of Medicine, Zhejiang University, Jinhua 322000, Zhejiang Province, China
  - Department of Cell Biology, Zhejiang University School of Medicine, Hangzhou 310058, China
- Aifu Lin
  - College of Life Sciences, Zhejiang University, Hangzhou 310058, China
  - Zhejiang University Cancer Center, Hangzhou 310058, China
  - The Fourth Affiliated Hospital, Zhejiang University School of Medicine, Center for RNA Medicine, International Institutes of Medicine, Zhejiang University, Jinhua 322000, Zhejiang Province, China
6
Deng L, Jiang Y, Hu X, Zheng R, Huang Z, Zhang J. ABLNCPP: Attention Mechanism-Based Bidirectional Long Short-Term Memory for Noncoding RNA Coding Potential Prediction. J Chem Inf Model 2023. [PMID: 37294848 DOI: 10.1021/acs.jcim.3c00366] [Indexed: 06/11/2023]
Abstract
With the continuous development of ribosome profiling, sequencing technology, and proteomics, evidence is mounting that noncoding RNA (ncRNA) may be a novel source of peptides or proteins. These peptides and proteins play crucial roles in inhibiting tumor progression and interfering with cancer metabolism and other essential physiological processes. Therefore, identifying ncRNAs with coding potential is vital to ncRNA functional research. However, existing studies perform well in classifying ncRNAs and mRNAs, and no research has been explicitly raised to distinguish whether ncRNA transcripts have coding potential. For this reason, we propose an attention mechanism-based bidirectional LSTM network called ABLNCPP to assess the coding possibility of ncRNA sequences. Considering the sequential information loss in previous methods, we introduce a novel nonoverlapping trinucleotide embedding (NOLTE) method for ncRNAs to obtain embeddings containing sequential features. The extensive evaluations show that ABLNCPP outperforms other state-of-the-art models. In general, ABLNCPP overcomes the bottleneck of ncRNA coding potential prediction and is expected to provide valuable contributions to cancer discovery and treatment in the future. The source code and data sets are freely available at https://github.com/YinggggJ/ABLNCPP.
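The idea of a nonoverlapping trinucleotide representation can be sketched as below. This only illustrates how a sequence might be split into consecutive 3-mers and mapped to integer ids for an embedding layer; the abstract does not describe NOLTE's actual details, so the vocabulary layout, the reserved id 0, and the `tokenize` helper are assumptions rather than the authors' code.

```python
from itertools import product

# Hypothetical vocabulary: the 64 RNA trinucleotides get ids 1..64;
# id 0 is reserved here for padding and for tokens with unknown bases.
VOCAB = {"".join(p): i + 1 for i, p in enumerate(product("ACGU", repeat=3))}

def tokenize(seq, max_len=None):
    """Split seq into non-overlapping 3-mers and map each to an integer id."""
    seq = seq.upper().replace("T", "U")  # accept DNA-style input as well
    usable = len(seq) - len(seq) % 3     # drop a trailing partial codon
    ids = [VOCAB.get(seq[i:i + 3], 0) for i in range(0, usable, 3)]
    if max_len is not None:
        # Truncate or right-pad so every sequence has the same length.
        ids = ids[:max_len] + [0] * (max_len - len(ids))
    return ids
```

Because the windows do not overlap, a sequence of length n yields about n/3 tokens in their original order, which is the kind of input a bidirectional LSTM can consume after an embedding lookup.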
Affiliation(s)
- Lei Deng
  - School of Computer Science and Engineering, Central South University, Changsha 410018, China
- Ying Jiang
  - School of Computer Science and Engineering, Central South University, Changsha 410018, China
- Xiaowen Hu
  - School of Computer Science and Engineering, Central South University, Changsha 410018, China
- Rongtao Zheng
  - School of Computer Science and Engineering, Central South University, Changsha 410018, China
- Zhijian Huang
  - School of Computer Science and Engineering, Central South University, Changsha 410018, China
- Jingpu Zhang
  - School of Computer and Data Science, Henan University of Urban Construction, Pingdingshan 467000, China
7
Zhao S, Meng J, Wekesa JS, Luan Y. Identification of small open reading frames in plant lncRNA using class-imbalance learning. Comput Biol Med 2023; 157:106773. [PMID: 36924731 DOI: 10.1016/j.compbiomed.2023.106773] [Received: 02/01/2023] [Revised: 02/21/2023] [Accepted: 03/09/2023] [Indexed: 03/12/2023]
Abstract
Recently, small open reading frames (sORFs) in long noncoding RNA (lncRNA) have been demonstrated to encode small peptides that can help study the mechanisms of growth and development in organisms. Since machine learning-based computational methods are less costly compared with biological experiments, they can be used to identify sORFs and provide a basis for biological experiments. However, few computational methods and data resources have been exploited for identifying sORFs in plant lncRNA. Besides, machine learning models produce underperforming classifiers when faced with a class-imbalance problem. In this study, a SMOTE variant based on weighted cosine distance (WCDSMOTE), which enables interaction with feature selection, is put forward to synthesize minority class samples, and weighted edited nearest neighbor (WENN) is applied to clean up majority class samples. The resulting hybrid sampling method, WCDSMOTE-ENN, is proposed to deal with imbalanced datasets with the multi-angle feature. A heterogeneous classifier ensemble is introduced to complete the classification task. Therefore, a novel computational method that is based on class-imbalance learning to identify the sORFs with coding potential in plant lncRNA (sORFplnc) is presented. Experimental results show that sORFplnc outperforms existing computational methods in identifying sORFs with coding potential. We anticipate that the proposed work can be a reference for relevant research and contribute to agriculture and biomedicine.
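The oversampling step named in this abstract can be illustrated with a generic SMOTE-style sketch that picks neighbours by a weighted cosine distance. It is not the WCDSMOTE algorithm from the paper: the exact distance formula, how the feature weights `w` are obtained, and the helper names are assumptions chosen for illustration.

```python
import math
import random

def weighted_cosine_distance(x, y, w):
    """1 - weighted cosine similarity; w holds non-negative feature weights (assumed form)."""
    num = sum(wi * xi * yi for wi, xi, yi in zip(w, x, y))
    nx = math.sqrt(sum(wi * xi * xi for wi, xi in zip(w, x)))
    ny = math.sqrt(sum(wi * yi * yi for wi, yi in zip(w, y)))
    if nx == 0 or ny == 0:
        return 1.0  # treat a zero vector as maximally distant
    return 1.0 - num / (nx * ny)

def smote_sample(minority, w, k=2, rng=None):
    """Create one synthetic minority sample by SMOTE-style interpolation."""
    rng = rng or random.Random(0)
    x = rng.choice(minority)
    # k nearest minority neighbours of x under the weighted cosine distance.
    others = [m for m in minority if m is not x]
    neighbours = sorted(others, key=lambda m: weighted_cosine_distance(x, m, w))[:k]
    nn = rng.choice(neighbours)
    lam = rng.random()  # interpolation factor in [0, 1)
    return [xi + lam * (ni - xi) for xi, ni in zip(x, nn)]
```

Each synthetic point lies on the segment between a minority sample and one of its neighbours, so it stays inside the region already occupied by the minority class. Cleaning majority samples with an edited nearest neighbour rule would be a separate step.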
Affiliation(s)
- Siyuan Zhao
  - School of Computer Science and Technology, Dalian University of Technology, Dalian, Liaoning, 116024, China
- Jun Meng
  - School of Computer Science and Technology, Dalian University of Technology, Dalian, Liaoning, 116024, China
- Jael Sanyanda Wekesa
  - Department of Information Technology, Jomo Kenyatta University of Agriculture and Technology, Nairobi, 62000-00200, Kenya
- Yushi Luan
  - School of Bioengineering, Dalian University of Technology, Dalian, Liaoning, 116024, China