51
Li F, Dong S, Leier A, Han M, Guo X, Xu J, Wang X, Pan S, Jia C, Zhang Y, Webb GI, Coin LJM, Li C, Song J. Positive-unlabeled learning in bioinformatics and computational biology: a brief review. Brief Bioinform 2021; 23:6415313. [PMID: 34729589 DOI: 10.1093/bib/bbab461] [Citation(s) in RCA: 33] [Impact Index Per Article: 8.3] [Received: 08/30/2021] [Revised: 09/27/2021] [Accepted: 10/07/2021] [Indexed: 12/14/2022] Open
Abstract
Conventional supervised binary classification algorithms have been widely applied to address significant research questions using biological and biomedical data. This classification scheme requires two fully labeled classes of data (e.g. positive and negative samples) to train a classification model. However, in many bioinformatics applications, labeling data is laborious, and the negative samples might be potentially mislabeled due to the limited sensitivity of the experimental equipment. The positive unlabeled (PU) learning scheme was therefore proposed to enable the classifier to learn directly from limited positive samples and a large number of unlabeled samples (i.e. a mixture of positive or negative samples). To date, several PU learning algorithms have been developed to address various biological questions, such as sequence identification, functional site characterization and interaction prediction. In this paper, we revisit a collection of 29 state-of-the-art PU learning bioinformatic applications to address various biological questions. Various important aspects are extensively discussed, including PU learning methodology, biological application, classifier design and evaluation strategy. We also comment on the existing issues of PU learning and offer our perspectives for the future development of PU learning applications. We anticipate that our work serves as an instrumental guideline for a better understanding of the PU learning framework in bioinformatics and further developing next-generation PU learning frameworks for critical biological applications.
Affiliation(s)
- Fuyi Li
  - Monash University, Australia
- André Leier
  - Department of Genetics, UAB School of Medicine, USA
- Meiya Han
  - Department of Biochemistry and Molecular Biology, Monash University, Australia
- Jing Xu
  - Computer Science and Technology from Nankai University, China
- Xiaoyu Wang
  - Department of Biochemistry and Molecular Biology and Biomedicine Discovery Institute, Monash University, Australia
- Shirui Pan
  - University of Technology Sydney (UTS), Ultimo, NSW, Australia
- Cangzhi Jia
  - College of Science, Dalian Maritime University, China
- Yang Zhang
  - Northwestern Polytechnical University, China
- Geoffrey I Webb
  - Faculty of Information Technology at Monash University, Australia
- Lachlan J M Coin
  - Department of Clinical Pathology, University of Melbourne, Australia
- Chen Li
  - Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Australia
- Jiangning Song
  - Monash Biomedicine Discovery Institute, Monash University, Melbourne, Australia
52
Jiao S, Zou Q, Guo H, Shi L. iTTCA-RF: a random forest predictor for tumor T cell antigens. J Transl Med 2021; 19:449. [PMID: 34706730 PMCID: PMC8554859 DOI: 10.1186/s12967-021-03084-x] [Citation(s) in RCA: 29] [Impact Index Per Article: 7.3] [Received: 06/23/2021] [Accepted: 09/16/2021] [Indexed: 12/21/2022] Open
Abstract
BACKGROUND: Cancer is one of the most serious diseases threatening human health. Cancer immunotherapy represents the most promising treatment strategy due to its high efficacy and selectivity and lower side effects compared with traditional treatment. The identification of tumor T cell antigens is one of the most important tasks for antitumor vaccine development and molecular function investigation. Although several machine learning predictors have been developed to identify tumor T cell antigens, more accurate tumor T cell antigen identification by existing methodology is still challenging. METHODS: In this study, we used a non-redundant dataset of 592 tumor T cell antigens (positive samples) and 393 non-tumor T cell antigens (negative samples). Four types of feature encoding methods have been studied to build an efficient predictor, including amino acid composition, global protein sequence descriptors and grouped amino acid and peptide composition. To improve the feature representation ability of the hybrid features, we further employed a two-step feature selection technique to search for the optimal feature subset. The final prediction model was constructed using the random forest algorithm. RESULTS: Finally, the top 263 informative features were selected to train the random forest classifier for detecting tumor T cell antigen peptides. iTTCA-RF provides satisfactory performance, with balanced accuracy, specificity and sensitivity values of 83.71%, 78.73% and 88.69% over tenfold cross-validation as well as 73.14%, 62.67% and 83.61% over independent tests, respectively. The online prediction server was freely accessible at http://lab.malab.cn/~acy/iTTCA. CONCLUSIONS: We have proven that the proposed predictor iTTCA-RF is superior to the other latest models, and will hopefully become an effective and useful tool for identifying tumor T cell antigens presented in the context of major histocompatibility complex class I.
Affiliation(s)
- Shihu Jiao
  - Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou, China
- Quan Zou
  - Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou, China
  - Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, China
- Huannan Guo
  - Department of Oncology, General Hospital of Heilongjiang Province Land Reclamation Bureau, Harbin, China
- Lei Shi
  - Department of Spine Surgery, Changzheng Hospital, Naval Medical University, Shanghai, China
53
Zou H, Yin Z. m7G-DPP: Identifying N7-methylguanosine sites based on dinucleotide physicochemical properties of RNA. Biophys Chem 2021; 279:106697. [PMID: 34628276 DOI: 10.1016/j.bpc.2021.106697] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 07/18/2021] [Revised: 10/01/2021] [Accepted: 10/02/2021] [Indexed: 11/17/2022]
Abstract
N7-methylguanosine (m7G) modification is one of the most common post-transcriptional RNA modifications and plays a vital role in the regulation of gene expression. Dysfunction of m7G may result in developmental defects and the appearance of some serious diseases. Thus, fast and accurate identification of m7G sites is an urgent task. Because experimental approaches are costly and time-consuming, researchers have focused their attention on computational models. Hence, in the current study, we proposed a novel predictor called m7G-DPP to identify m7G sites. In the predictor, the RNA sequences were first encoded by physicochemical (PC) properties of dinucleotides. Then, a sliding window approach was adopted to divide the PC matrix into multiple matrices, and Pearson's correlation coefficient (PCC), dynamic time warping (DTW), and distance correlation (DC) were employed to extract classification features at each window. Next, the least absolute shrinkage and selection operator (LASSO) algorithm was applied to select discriminative features. Finally, these selected features were fed into a support vector machine to identify m7G sites. Experimental results showed that the proposed method is effective and may play a complementary role in current m7G site prediction studies. The MATLAB code and dataset can be obtained from https://figshare.com/articles/online_resource/m7G-DPP/15000348.
Affiliation(s)
- Hongliang Zou
  - School of Communications and Electronics, Jiangxi Science and Technology Normal University, Nanchang 330003, China
- Zhijian Yin
  - School of Communications and Electronics, Jiangxi Science and Technology Normal University, Nanchang 330003, China
54
Zhou Y, Yang J, Tian Z, Zeng J, Shen W. Research progress concerning m6A methylation and cancer. Oncol Lett 2021; 22:775. [PMID: 34589154 PMCID: PMC8442141 DOI: 10.3892/ol.2021.13036] [Citation(s) in RCA: 9] [Impact Index Per Article: 2.3] [Received: 05/28/2021] [Accepted: 08/20/2021] [Indexed: 12/12/2022] Open
Abstract
N6-methyladenosine (m6A) methylation is a type of methylation modification on RNA molecules, which was first discovered in 1974, and has become a hot topic in life science in recent years. m6A modification is an epigenetic regulation similar to DNA and histone modification and is dynamically reversible in mammalian cells. This chemical marker of RNA is produced by m6A 'writers' (methylase) and can be degraded by m6A 'erasers' (demethylase). Methylated reading protein is the 'reader', that can recognize the mRNA containing m6A and regulate the expression of downstream genes accordingly. m6A methylation is involved in all stages of the RNA life cycle, including RNA processing, nuclear export, translation and regulation of RNA degradation, indicating that m6A plays a crucial role in RNA metabolism. Recent studies have shown that m6A modification is a complicated regulatory network in different cell lines, tissues and spatio-temporal models, and m6A methylation is associated with the occurrence and development of tumors. The present review describes the regulatory mechanism and physiological functions of m6A methylation, and its research progress in several types of human tumor, to provide novel approaches for early diagnosis and targeted treatment of cancer.
Affiliation(s)
- Yang Zhou
  - Department of Cell Biology, School of Medicine of Yangzhou University, Yangzhou, Jiangsu 225000, P.R. China
- Jie Yang
  - Department of Cell Biology, School of Medicine of Yangzhou University, Yangzhou, Jiangsu 225000, P.R. China
- Zheng Tian
  - Department of Cell Biology, School of Medicine of Yangzhou University, Yangzhou, Jiangsu 225000, P.R. China
- Jing Zeng
  - Department of Cell Biology, School of Medicine of Yangzhou University, Yangzhou, Jiangsu 225000, P.R. China
- Weigan Shen
  - Department of Cell Biology, School of Medicine of Yangzhou University, Yangzhou, Jiangsu 225000, P.R. China
55
Malik AA, Chotpatiwetchkul W, Phanus-Umporn C, Nantasenamat C, Charoenkwan P, Shoombuatong W. StackHCV: a web-based integrative machine-learning framework for large-scale identification of hepatitis C virus NS5B inhibitors. J Comput Aided Mol Des 2021; 35:1037-1053. [PMID: 34622387 DOI: 10.1007/s10822-021-00418-1] [Citation(s) in RCA: 19] [Impact Index Per Article: 4.8] [Received: 07/07/2021] [Accepted: 09/17/2021] [Indexed: 01/07/2023]
Abstract
Fast and accurate identification of inhibitors with potency against HCV NS5B polymerase is currently a challenging task. Although conventional experimental methods are the gold standard for the design and development of new HCV inhibitors, they often require a costly investment of time and resources. In this study, we develop a novel machine learning-based meta-predictor (termed StackHCV) for accurate and large-scale identification of HCV inhibitors. Unlike the existing method, which is based on a single-feature approach, we first constructed a pool of various baseline models by employing a wide range of heterogeneous molecular fingerprints with five popular machine learning algorithms (k-nearest neighbor, multi-layer perceptron, partial least squares, random forest and support vector machine). Secondly, we integrated these baseline models to develop the final meta-based model by means of the stacking strategy. Extensive benchmarking experiments showed that StackHCV achieved a more accurate and stable performance as compared to its constituent baseline models on the training dataset and also outperformed the existing predictor on the independent test dataset. To facilitate the high-throughput identification of HCV inhibitors, we built a web server that can be freely accessed at http://camt.pythonanywhere.com/StackHCV. It is expected that StackHCV could be a useful tool for fast and precise identification of potential drugs against HCV NS5B, particularly for liver cancer therapy and other clinical applications.
Affiliation(s)
- Aijaz Ahmad Malik
  - Center of Data Mining and Biomedical Informatics, Faculty of Medical Technology, Mahidol University, Bangkok, 10700, Thailand
- Warot Chotpatiwetchkul
  - Applied Computational Chemistry Research Unit, Department of Chemistry, School of Science, King Mongkut's Institute of Technology Ladkrabang, Bangkok, 10520, Thailand
- Chuleeporn Phanus-Umporn
  - Center of Data Mining and Biomedical Informatics, Faculty of Medical Technology, Mahidol University, Bangkok, 10700, Thailand
- Chanin Nantasenamat
  - Center of Data Mining and Biomedical Informatics, Faculty of Medical Technology, Mahidol University, Bangkok, 10700, Thailand
- Phasit Charoenkwan
  - Modern Management and Information Technology, College of Arts, Media and Technology, Chiang Mai University, Chiang Mai, 50200, Thailand
- Watshara Shoombuatong
  - Center of Data Mining and Biomedical Informatics, Faculty of Medical Technology, Mahidol University, Bangkok, 10700, Thailand
56
El Allali A, Elhamraoui Z, Daoud R. Machine learning applications in RNA modification sites prediction. Comput Struct Biotechnol J 2021; 19:5510-5524. [PMID: 34712397 PMCID: PMC8517552 DOI: 10.1016/j.csbj.2021.09.025] [Citation(s) in RCA: 22] [Impact Index Per Article: 5.5] [Received: 06/30/2021] [Revised: 09/24/2021] [Accepted: 09/25/2021] [Indexed: 12/15/2022] Open
Abstract
Ribonucleic acid (RNA) modifications are post-transcriptional chemical composition changes that have a fundamental role in regulating the main aspect of RNA function. Recently, large datasets have become available thanks to the recent development in deep sequencing and large-scale profiling. This availability of transcriptomic datasets has led to increased use of machine learning based approaches in epitranscriptomics, particularly in identifying RNA modifications. In this review, we comprehensively explore machine learning based approaches used for the prediction of 11 RNA modification types, namely m1A, m6A, m5C, 5hmC, ψ, 2'-O-Me, ac4C, m7G, A-to-I, m2G, and D. This review covers the life cycle of machine learning methods to predict RNA modification sites including available benchmark datasets, feature extraction, and classification algorithms. We compare available methods in terms of datasets, target species, approach, and accuracy for each RNA modification type. Finally, we discuss the advantages and limitations of the reviewed approaches and suggest future perspectives.
Affiliation(s)
- A. El Allali
  - African Genome Center, University Mohamed VI Polytechnic, Morocco
- Zahra Elhamraoui
  - African Genome Center, University Mohamed VI Polytechnic, Morocco
- Rachid Daoud
  - African Genome Center, University Mohamed VI Polytechnic, Morocco
57
Basith S, Lee G, Manavalan B. STALLION: a stacking-based ensemble learning framework for prokaryotic lysine acetylation site prediction. Brief Bioinform 2021; 23:6370848. [PMID: 34532736 PMCID: PMC8769686 DOI: 10.1093/bib/bbab376] [Citation(s) in RCA: 44] [Impact Index Per Article: 11.0] [Received: 06/04/2021] [Revised: 08/22/2021] [Accepted: 08/24/2021] [Indexed: 12/13/2022] Open
Abstract
Protein post-translational modification (PTM) is an important regulatory mechanism that plays a key role in both normal and disease states. Acetylation on lysine residues is one of the most potent PTMs owing to its critical role in cellular metabolism and regulatory processes. Identifying protein lysine acetylation (Kace) sites is a challenging task in bioinformatics. To date, several machine learning-based methods for the in silico identification of Kace sites have been developed. Of those, a few are prokaryotic species-specific. Despite their attractive advantages and performances, these methods have certain limitations. Therefore, this study proposes a novel predictor STALLION (STacking-based Predictor for ProkAryotic Lysine AcetyLatION), containing six prokaryotic species-specific models to identify Kace sites accurately. To extract crucial patterns around Kace sites, we employed 11 different encodings representing three different characteristics. Subsequently, a systematic and rigorous feature selection approach was employed to identify the optimal feature set independently for five tree-based ensemble algorithms and built their respective baseline model for each species. Finally, the predicted values from baseline models were utilized and trained with an appropriate classifier using the stacking strategy to develop STALLION. Comparative benchmarking experiments showed that STALLION significantly outperformed existing predictor on independent tests. To expedite direct accessibility to the STALLION models, a user-friendly online predictor was implemented, which is available at: http://thegleelab.org/STALLION.
Affiliation(s)
- Shaherin Basith
  - Department of Physiology, Ajou University School of Medicine, Republic of Korea
- Gwang Lee
  - Department of Molecular Science and Technology, Ajou University, Suwon 16499, Republic of Korea
58
Ao C, Gao L, Yu L. Research progress in predicting DNA methylation modifications and the relation with human diseases. Curr Med Chem 2021; 29:822-836. [PMID: 34533438 DOI: 10.2174/0929867328666210917115733] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.5] [Received: 03/24/2021] [Revised: 07/05/2021] [Accepted: 07/11/2021] [Indexed: 11/22/2022]
Abstract
DNA methylation is an important mode of regulation in epigenetic mechanisms, and it is one of the research foci in the field of epigenetics. DNA methylation modification affects a series of biological processes, such as eukaryotic cell growth, differentiation and transformation mechanisms, by regulating gene expression. In this review, we systematically summarized the DNA methylation databases, prediction tools for DNA methylation modification, machine learning algorithms for predicting DNA methylation modification, and the relationship between DNA methylation modification and diseases such as hypertension, Alzheimer's disease, diabetic nephropathy, and cancer. An in-depth understanding of DNA methylation mechanisms can promote accurate prediction of DNA methylation modifications and the treatment and diagnosis of related diseases.
Affiliation(s)
- Chunyan Ao
  - School of Computer Science and Technology, Xidian University, Xi'an, China
- Lin Gao
  - School of Computer Science and Technology, Xidian University, Xi'an, China
- Liang Yu
  - School of Computer Science and Technology, Xidian University, Xi'an, China
59
Zhao YW, Zhang S, Ding H. Recent development of machine learning methods in sumoylation sites prediction. Curr Med Chem 2021; 29:894-907. [PMID: 34525906 DOI: 10.2174/0929867328666210915112030] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 03/10/2021] [Revised: 07/24/2021] [Accepted: 08/07/2021] [Indexed: 11/22/2022]
Abstract
Sumoylation of proteins is an important reversible post-translational modification of proteins and mediates a variety of cellular processes. Sumo-modified proteins can change their subcellular localization, activity and stability. In addition, it also plays an important role in various cellular processes such as transcriptional regulation and signal transduction. The abnormal sumoylation is involved in many diseases, including neurodegeneration and immune-related diseases, as well as the development of cancer. Therefore, identification of the sumoylation site (SUMO site) is fundamental to understanding their molecular mechanisms and regulatory roles. In contrast to labor-intensive and costly experimental approaches, computational prediction of sumoylation sites in silico also attracted much attention for its accuracy, convenience and speed. At present, many computational prediction models have been used to identify SUMO sites, but these contents have not been comprehensively summarized and reviewed. Therefore, the research progress of relevant models is summarized and discussed in this paper. We will briefly summarize the development of bioinformatics methods on sumoylation site prediction. We will mainly focus on the benchmark dataset construction, feature extraction, machine learning method, published results and online tools. We hope the review will provide more help for wet-experimental scholars.
Affiliation(s)
- Yi-Wei Zhao
  - School of Medicine, University of Electronic Science and Technology of China, Chengdu 610054, China
- Shihua Zhang
  - College of Life Science and Health, Wuhan University of Science and Technology, Wuhan 430065, China
- Hui Ding
  - School of Life Science and Technology and Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China
60
Identifying Dipeptidyl Peptidase-IV Inhibitory Peptides Based on Correlation Information of Physicochemical Properties. Int J Pept Res Ther 2021. [DOI: 10.1007/s10989-021-10280-2] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Indexed: 12/22/2022]
61
Yang YH, Wang JS, Yuan SS, Liu ML, Su W, Lin H, Zhang ZY. A Survey for Predicting ATP Binding Residues of Proteins Using Machine Learning Methods. Curr Med Chem 2021; 29:789-806. [PMID: 34514982 DOI: 10.2174/0929867328666210910125802] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 03/24/2021] [Revised: 06/29/2021] [Accepted: 07/04/2021] [Indexed: 11/22/2022]
Abstract
Protein-ligand interactions are necessary for the majority of protein functions. Adenosine-5'-triphosphate (ATP) is one such ligand that plays a vital role as a coenzyme in providing energy for cellular activities, catalyzing biological reactions and signaling. Knowing the ATP binding residues of proteins is helpful for annotation of protein function and drug design. However, due to the huge influx of protein sequences into databases in the post-genome era, experimentally identifying ATP binding residues is cost-ineffective and time-consuming. To address this problem, computational methods have been developed to predict ATP binding residues. In this review, we briefly summarized the application of machine learning methods in detecting ATP binding residues of proteins. We expect this review will be helpful for further research.
Affiliation(s)
- Yu-He Yang
  - Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China
- Jia-Shu Wang
  - Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China
- Shi-Shi Yuan
  - Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China
- Meng-Lu Liu
  - Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China
- Wei Su
  - Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China
- Hao Lin
  - Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China
- Zhao-Yue Zhang
  - Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China
62
Qin S, Mao Y, Chen X, Xiao J, Qin Y, Zhao L. The functional roles, cross-talk and clinical implications of m6A modification and circRNA in hepatocellular carcinoma. Int J Biol Sci 2021; 17:3059-3079. [PMID: 34421350 PMCID: PMC8375232 DOI: 10.7150/ijbs.62767] [Citation(s) in RCA: 38] [Impact Index Per Article: 9.5] [Received: 05/15/2021] [Accepted: 07/06/2021] [Indexed: 12/13/2022] Open
Abstract
Hepatocellular carcinoma (HCC) is one of the leading causes of cancer-related deaths worldwide. HCC has high rates of death and recurrence, as well as very low survival rates. N6-methyladenosine (m6A) is the most abundant modification in eukaryotic RNAs, and circRNAs are a class of circular noncoding RNAs that are generated by back-splicing and they modulate multiple functions in a variety of cellular processes. Although the carcinogenesis of HCC is complex, emerging evidence has indicated that m6A modification and circRNA play vital roles in HCC development and progression. However, the underlying mechanisms governing HCC, their cross-talk, and clinical implications have not been fully elucidated. Therefore, in this paper, we elucidated the biological functions and molecular mechanisms of m6A modification in the carcinogenesis of HCC by illustrating three different regulatory factors ("writer", "eraser", and "reader") of the m6A modification process. Additionally, we dissected the functional roles of circRNAs in various malignant behaviors of HCC, thereby contributing to HCC initiation, progression and relapse. Furthermore, we demonstrated the cross-talk and interplay between m6A modification and circRNA by revealing the effects of the collaboration of circRNA and m6A modification on HCC progression. Finally, we proposed the clinical potential and implications of m6A modifiers and circRNAs as diagnostic biomarkers and therapeutic targets for HCC diagnosis, treatment and prognosis evaluation.
Affiliation(s)
- Sha Qin
  - Department of Pathology, Xiangya Hospital, Central South University, Changsha, Hunan, China; and Department of Pathology, School of Basic Medical Science, Xiangya School of Medicine, Central South University, Changsha, Hunan, China
- Yitao Mao
  - Department of Radiology, Xiangya Hospital, Central South University, Changsha, Hunan, China
  - National Clinical Research Center for Geriatric Disorders, Xiangya Hospital, Central South University, Changsha, Hunan, China
- Xue Chen
  - Early Clinical Trial Center, Hunan Cancer Hospital and The Affiliated Cancer Hospital of Xiangya School of Medicine, Central South University, Changsha, Hunan, China
- Juxiong Xiao
  - Department of Radiology, Xiangya Hospital, Central South University, Changsha, Hunan, China
- Yan Qin
  - Department of Radiology, Xiangya Hospital, Central South University, Changsha, Hunan, China
- Luqing Zhao
  - Department of Pathology, Xiangya Hospital, Central South University, Changsha, Hunan, China; and Department of Pathology, School of Basic Medical Science, Xiangya School of Medicine, Central South University, Changsha, Hunan, China
  - National Clinical Research Center for Geriatric Disorders, Xiangya Hospital, Central South University, Changsha, Hunan, China
63
iBitter-Fuse: A Novel Sequence-Based Bitter Peptide Predictor by Fusing Multi-View Features. Int J Mol Sci 2021; 22:ijms22168958. [PMID: 34445663 PMCID: PMC8396555 DOI: 10.3390/ijms22168958] [Citation(s) in RCA: 28] [Impact Index Per Article: 7.0] [Received: 07/08/2021] [Revised: 08/08/2021] [Accepted: 08/17/2021] [Indexed: 12/19/2022] Open
Abstract
Accurate identification of bitter peptides is of great importance for better understanding their biochemical and biophysical properties. To date, machine learning-based methods have become effective approaches for identifying potential bitter peptides from large-scale protein datasets. Although a few machine learning-based predictors have been developed for identifying the bitterness of peptides, their prediction performances could be improved. In this study, we developed a new predictor (named iBitter-Fuse) for achieving more accurate identification of bitter peptides. In the proposed iBitter-Fuse, we have integrated a variety of feature encoding schemes for providing sufficient information from different aspects, namely compositional information and physicochemical properties. To enhance the predictive performance, the customized genetic algorithm utilizing self-assessment-report (GA-SAR) was employed for identifying informative features, followed by inputting optimal ones into a support vector machine (SVM)-based classifier for developing the final model (iBitter-Fuse). Benchmarking experiments based on both 10-fold cross-validation and independent tests indicated that the iBitter-Fuse was able to achieve more accurate performance as compared to state-of-the-art methods. To facilitate the high-throughput identification of bitter peptides, the iBitter-Fuse web server was established and made freely available online. It is anticipated that the iBitter-Fuse will be a useful tool for aiding the discovery and de novo design of bitter peptides.
64
Jiang P, Ning W, Shi Y, Liu C, Mo S, Zhou H, Liu K, Guo Y. FSL-Kla: A few-shot learning-based multi-feature hybrid system for lactylation site prediction. Comput Struct Biotechnol J 2021; 19:4497-4509. [PMID: 34471495 PMCID: PMC8385177 DOI: 10.1016/j.csbj.2021.08.013] [Citation(s) in RCA: 28] [Impact Index Per Article: 7.0] [Received: 05/19/2021] [Revised: 08/05/2021] [Accepted: 08/08/2021] [Indexed: 01/04/2023] Open
Abstract
As a novel lactate-derived post-translational modification (PTM), lysine lactylation (Kla) is involved in diverse biological processes, and participates in human tumorigenesis. Identification of Kla substrates with their exact sites is crucial for revealing the molecular mechanisms of lactylation. In contrast with labor-intensive and time-consuming experimental approaches, computational prediction of Kla could provide convenience and increased speed, but is still lacking. In this work, although current identified Kla sites are limited, we constructed the first Kla benchmark dataset and developed a few-shot learning-based architecture approach to leverage the power of small datasets and reduce the impact of imbalance and overfitting. A maximum 11.7% (0.745 versus 0.667) increase of area under the curve (AUC) value was achieved in contrast to conventional machine learning methods. We conducted a comprehensive survey of the performance by combining 8 sequence-based features and 3 structure-based features and tailored a multi-feature hybrid system for synergistic combination. This system achieved >16.2% improvement of the AUC value (0.889 versus 0.765) compared with single feature-based models for the prediction of Kla sites in silico. Taken few-shot learning and hybrid system together, we present our newly designed predictor named FSL-Kla, which is not only a cutting-edge tool for Kla site profile but also could generate candidates for further experimental approaches. The webserver of FSL-Kla is freely accessible for academic research at http://kla.zbiolab.cn/.
Affiliation(s)
- Peiran Jiang
- Department of Pathophysiology, School of Basic Medical Sciences, Zhengzhou University, Zhengzhou, Henan 450001, China
- Computational Biology Department, School of Computer Science, Carnegie Mellon University, 5000 Forbes Avenue, Pittsburgh, PA 15213, USA
| | - Wanshan Ning
- MOE Key Laboratory of Molecular Biophysics, Hubei Bioinformatics and Molecular Imaging Key Laboratory, Center for Artificial Intelligence Biology, College of Life Science and Technology, Huazhong University of Science and Technology, Wuhan, Hubei 430074, China
| | - Yunshu Shi
- Department of Pathophysiology, School of Basic Medical Sciences, Zhengzhou University, Zhengzhou, Henan 450001, China
- Henan Provincial Cooperative Innovation Center for Cancer Chemoprevention, Zhengzhou, Henan 450001, China
| | - Chuan Liu
- State Key Laboratory of Digital Manufacturing Equipment and Technology, Huazhong University of Science and Technology, Wuhan, Hubei 430074, China
| | - Saijun Mo
- Department of Pathophysiology, School of Basic Medical Sciences, Zhengzhou University, Zhengzhou, Henan 450001, China
| | - Haoran Zhou
- Department of Pathophysiology, School of Basic Medical Sciences, Zhengzhou University, Zhengzhou, Henan 450001, China
| | - Kangdong Liu
- Department of Pathophysiology, School of Basic Medical Sciences, Zhengzhou University, Zhengzhou, Henan 450001, China
- State Key Laboratory of Esophageal Cancer Prevention and Treatment, Zhengzhou, Henan 450001, China
- Academy of Medical Sciences, Zhengzhou University, Zhengzhou, Henan 450001, China
| | - Yaping Guo
- Department of Pathophysiology, School of Basic Medical Sciences, Zhengzhou University, Zhengzhou, Henan 450001, China
- State Key Laboratory of Esophageal Cancer Prevention and Treatment, Zhengzhou, Henan 450001, China
| |
|
65
|
Islam N, Park J. bCNN-Methylpred: Feature-Based Prediction of RNA Sequence Modification Using Branch Convolutional Neural Network. Genes (Basel) 2021; 12:genes12081155. [PMID: 34440330 PMCID: PMC8392086 DOI: 10.3390/genes12081155] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Received: 06/01/2021] [Revised: 07/24/2021] [Accepted: 07/26/2021] [Indexed: 11/16/2022] Open
Abstract
RNA modification is vital to various cellular and biological processes. Among the existing RNA modifications, N6-methyladenosine (m6A) is considered the most important modification owing to its involvement in many biological processes. The prediction of m6A sites is crucial because it can provide a better understanding of their functional mechanisms. In this regard, although experimental methods are useful, they are time consuming. Previously, researchers have attempted to predict m6A sites using computational methods to overcome the limitations of experimental methods. Some of these approaches are based on classical machine-learning techniques that rely on handcrafted features and require domain knowledge, whereas other methods are based on deep learning. However, both methods lack robustness and yield low accuracy. Hence, we develop a branch-based convolutional neural network and a novel RNA sequence representation. The proposed network automatically extracts features from each branch of the designated inputs. Subsequently, these features are concatenated in the feature space to predict the m6A sites. Finally, we conduct experiments using four different species. The proposed approach outperforms existing state-of-the-art methods, achieving accuracies of 94.91%, 94.28%, 88.46%, and 94.8% for the H. sapiens, M. musculus, S. cerevisiae, and A. thaliana datasets, respectively.
Affiliation(s)
- Naeem Islam
- Core Research Institute of Intelligent Robots, Jeonbuk National University, Jeonju 54896, Korea
- College of Electrical & Mechanical Engineering, NUST, Islamabad 44000, Pakistan
| | - Jaebyung Park
- Core Research Institute of Intelligent Robots, Jeonbuk National University, Jeonju 54896, Korea
- Division of Electronics and Information Engineering, Jeonbuk National University, Jeonju 54896, Korea
- Correspondence: Tel.: +82-63-270-4283
| |
|
66
|
Zulfiqar H, Yuan SS, Huang QL, Sun ZJ, Dao FY, Yu XL, Lin H. Identification of cyclin protein using gradient boost decision tree algorithm. Comput Struct Biotechnol J 2021; 19:4123-4131. [PMID: 34527186 PMCID: PMC8346528 DOI: 10.1016/j.csbj.2021.07.013] [Citation(s) in RCA: 35] [Impact Index Per Article: 8.8] [Received: 03/31/2021] [Revised: 07/15/2021] [Accepted: 07/15/2021] [Indexed: 12/12/2022] Open
Abstract
Cyclin proteins are capable to regulate the cell cycle by forming a complex with cyclin-dependent kinases to activate cell cycle. Correct recognition of cyclin proteins could provide key clues for studying their functions. However, their sequences share low similarity, which results in poor prediction for sequence similarity-based methods. Thus, it is urgent to construct a machine learning model to identify cyclin proteins. This study aimed to develop a computational model to discriminate cyclin proteins from non-cyclin proteins. In our model, protein sequences were encoded by seven kinds of features that are amino acid composition, composition of k-spaced amino acid pairs, tri peptide composition, pseudo amino acid composition, geary correlation, normalized moreau-broto autocorrelation and composition/transition/distribution. Afterward, these features were optimized by using analysis of variance (ANOVA) and minimum redundancy maximum relevance (mRMR) with incremental feature selection (IFS) technique. A gradient boost decision tree (GBDT) classifier was trained on the optimal features. Five-fold cross-validated results showed that our model would identify cyclins with an accuracy of 93.06% and AUC value of 0.971, which are higher than the two recent studies on the same data.
Affiliation(s)
- Hasan Zulfiqar
- School of Life Science and Technology and Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China
| | - Shi-Shi Yuan
- School of Life Science and Technology and Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China
| | - Qin-Lai Huang
- School of Life Science and Technology and Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China
| | - Zi-Jie Sun
- School of Life Science and Technology and Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China
| | - Fu-Ying Dao
- School of Life Science and Technology and Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China
| | - Xiao-Long Yu
- School of Materials Science and Engineering, Hainan University, Haikou 570228, China
| | - Hao Lin
- School of Life Science and Technology and Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China
| |
|
67
|
Nie F, Feng P, Song X, Wu M, Tang Q, Chen W. RNAWRE: a resource of writers, readers and erasers of RNA modifications. Database: The Journal of Biological Databases and Curation 2021; 2020:5865458. [PMID: 32608478 PMCID: PMC7327530 DOI: 10.1093/database/baaa049] [Citation(s) in RCA: 19] [Impact Index Per Article: 4.8] [Received: 04/12/2020] [Revised: 05/25/2020] [Accepted: 05/29/2020] [Indexed: 12/12/2022]
Abstract
RNA modifications are involved in various kinds of cellular biological processes. Accumulated evidences have demonstrated that the functions of RNA modifications are determined by the effectors that can catalyze, recognize and remove RNA modifications. They are called ‘writers’, ‘readers’ and ‘erasers’. The identification of RNA modification effectors will be helpful for understanding the regulatory mechanisms and biological functions of RNA modifications. In this work, we developed a database called RNAWRE that specially deposits RNA modification effectors. The current version of RNAWRE stored 2045 manually curated writers, readers and erasers for the six major kinds of RNA modifications, namely Cap, m1A, m6A, m5C, ψ and Poly A. The main modules of RNAWRE not only allow browsing and downloading the RNA modification effectors but also support the BLAST search of the potential RNA modification effectors in other species. We hope that RNAWRE will be helpful for the researches on RNA modifications. Database URL: http://rnawre.bio2db.com
Affiliation(s)
- Fulei Nie
- School of Life Sciences, Center for Genomics and Computational Biology, North China University of Science and Technology, 21 Bohai Road, Caofeidian Xincheng, Tangshan 063009, China
| | - Pengmian Feng
- School of Basic Medical Sciences, 1166 Liutai Avenue, Wenjiang District, Chengdu University of Traditional Chinese Medicine, Chengdu 611137, China
| | - Xiaoming Song
- School of Life Sciences, Center for Genomics and Computational Biology, North China University of Science and Technology, 21 Bohai Road, Caofeidian Xincheng, Tangshan 063009, China
| | - Meng Wu
- School of Life Sciences, Center for Genomics and Computational Biology, North China University of Science and Technology, 21 Bohai Road, Caofeidian Xincheng, Tangshan 063009, China
| | - Qiang Tang
- Innovative Institute of Chinese Medicine and Pharmacy, Chengdu University of Traditional Chinese Medicine, 1166 Liutai Avenue, Wenjiang District, Chengdu 611137, China
| | - Wei Chen
- School of Life Sciences, Center for Genomics and Computational Biology, North China University of Science and Technology, 21 Bohai Road, Caofeidian Xincheng, Tangshan 063009, China; Innovative Institute of Chinese Medicine and Pharmacy, Chengdu University of Traditional Chinese Medicine, 1166 Liutai Avenue, Wenjiang District, Chengdu 611137, China
| |
|
68
|
Basith S, Hasan MM, Lee G, Wei L, Manavalan B. Integrative machine learning framework for the identification of cell-specific enhancers from the human genome. Brief Bioinform 2021; 22:6315815. [PMID: 34226917 DOI: 10.1093/bib/bbab252] [Citation(s) in RCA: 43] [Impact Index Per Article: 10.8] [Received: 03/16/2021] [Revised: 06/08/2021] [Accepted: 06/14/2021] [Indexed: 02/06/2023] Open
Abstract
Enhancers are deoxyribonucleic acid (DNA) fragments which when bound by transcription factors enhance the transcription of related genes. Due to its sporadic distribution and similar fractions, identification of enhancers from the human genome seems a daunting task. Compared to the traditional experimental approaches, computational methods with easy-to-use platforms could be efficiently applied to annotate enhancers' functions and physiological roles. In this aspect, several bioinformatics tools have been developed to identify enhancers. Despite their spectacular performances, existing methods have certain drawbacks and limitations, including fixed length of sequences being utilized for model development and cell-specificity negligence. A novel predictor would be beneficial in the context of genome-wide enhancer prediction by addressing the above-mentioned issues. In this study, we constructed new datasets for eight different cell types. Utilizing these data, we proposed an integrative machine learning (ML)-based framework called Enhancer-IF for identifying cell-specific enhancers. Enhancer-IF comprehensively explores a wide range of heterogeneous features with five commonly used ML methods (random forest, extremely randomized tree, multilayer perceptron, support vector machine and extreme gradient boosting). Specifically, these five classifiers were trained with seven encodings and obtained 35 baseline models. The output of these baseline models was integrated and again inputted to five classifiers for the construction of five meta-models. Finally, the integration of five meta-models through ensemble learning improved the model robustness. Our proposed approach showed an excellent prediction performance compared to the baseline models on both training and independent datasets in different cell types, thus highlighting the superiority of our approach in the identification of the enhancers. We assume that Enhancer-IF will be a valuable tool for screening and identifying potential enhancers from the human DNA sequences.
Affiliation(s)
- Shaherin Basith
- Department of Physiology, Ajou University School of Medicine, Republic of Korea
| | - Md Mehedi Hasan
- Tulane University, USA; Kyushu Institute of Technology, Japan
| | - Gwang Lee
- Department of Physiology, Ajou University School of Medicine, Republic of Korea
| | - Leyi Wei
- Xiamen University, China; Shandong University, China
| | | |
|
69
|
Song Z, Huang D, Song B, Chen K, Song Y, Liu G, Su J, Magalhães JPD, Rigden DJ, Meng J. Attention-based multi-label neural networks for integrated prediction and interpretation of twelve widely occurring RNA modifications. Nat Commun 2021; 12:4011. [PMID: 34188054 PMCID: PMC8242015 DOI: 10.1038/s41467-021-24313-3] [Citation(s) in RCA: 58] [Impact Index Per Article: 14.5] [Received: 11/21/2020] [Accepted: 06/07/2021] [Indexed: 02/08/2023] Open
Abstract
Recent studies suggest that epi-transcriptome regulation via post-transcriptional RNA modifications is vital for all RNA types. Precise identification of RNA modification sites is essential for understanding the functions and regulatory mechanisms of RNAs. Here, we present MultiRM, a method for the integrated prediction and interpretation of post-transcriptional RNA modifications from RNA sequences. Built upon an attention-based multi-label deep learning framework, MultiRM not only simultaneously predicts the putative sites of twelve widely occurring transcriptome modifications (m6A, m1A, m5C, m5U, m6Am, m7G, Ψ, I, Am, Cm, Gm, and Um), but also returns the key sequence contents that contribute most to the positive predictions. Importantly, our model revealed a strong association among different types of RNA modifications from the perspective of their associated sequence contexts. Our work provides a solution for detecting multiple RNA modifications, enabling an integrated analysis of these RNA modifications, and gaining a better understanding of sequence-based RNA modification mechanisms.
Affiliation(s)
- Zitao Song
- Department of Mathematical Sciences, Xi'an Jiaotong-Liverpool University, Suzhou, PR China
| | - Daiyun Huang
- Department of Biological Sciences, Xi'an Jiaotong-Liverpool University, Suzhou, PR China.
- Department of Computer Sciences, University of Liverpool, Liverpool, United Kingdom.
| | - Bowen Song
- Department of Mathematical Sciences, Xi'an Jiaotong-Liverpool University, Suzhou, PR China
- Institute of Systems, Molecular and Integrative Biology, University of Liverpool, Liverpool, United Kingdom
| | - Kunqi Chen
- Laboratory of Ministry of Education for Gastrointestinal Cancer, School of Basic Medical Sciences, Fujian Medical University, Fuzhou, PR China
| | - Yiyou Song
- Department of Biological Sciences, Xi'an Jiaotong-Liverpool University, Suzhou, PR China
| | - Gang Liu
- Department of Mathematical Sciences, Xi'an Jiaotong-Liverpool University, Suzhou, PR China
| | - Jionglong Su
- School of AI and Advanced Computing, XJTLU Entrepreneur College (Taicang), Xi'an Jiaotong-Liverpool University, Suzhou, PR China
| | | | - Daniel J Rigden
- Institute of Systems, Molecular and Integrative Biology, University of Liverpool, Liverpool, United Kingdom
| | - Jia Meng
- Department of Biological Sciences, Xi'an Jiaotong-Liverpool University, Suzhou, PR China.
- Institute of Systems, Molecular and Integrative Biology, University of Liverpool, Liverpool, United Kingdom.
- AI University Research Centre, Xi'an Jiaotong-Liverpool University, Suzhou, PR China.
| |
|
70
|
Pan J, Huang Z, Xu Y. m5C-Related lncRNAs Predict Overall Survival of Patients and Regulate the Tumor Immune Microenvironment in Lung Adenocarcinoma. Front Cell Dev Biol 2021; 9:671821. [PMID: 34268304 PMCID: PMC8277384 DOI: 10.3389/fcell.2021.671821] [Citation(s) in RCA: 46] [Impact Index Per Article: 11.5] [Received: 02/24/2021] [Accepted: 06/01/2021] [Indexed: 12/24/2022] Open
Abstract
Long non-coding RNAs (lncRNAs), which are involved in the regulation of RNA methylation, can be used to evaluate tumor prognosis. lncRNAs are closely related to the prognosis of patients with lung adenocarcinoma (LUAD); thus, it is crucial to identify RNA methylation-associated lncRNAs with definitive prognostic value. We used Pearson correlation analysis to construct a 5-Methylcytosine (m5C)-related lncRNAs–mRNAs coexpression network. Univariate and multivariate Cox proportional risk analyses were then used to determine a risk model for m5C-associated lncRNAs with prognostic value. The risk model was verified using Kaplan–Meier analysis, univariate and multivariate Cox regression analysis, and receiver operating characteristic curve analysis. We used principal component analysis and gene set enrichment analysis functional annotation to analyze the risk model. We also verified the expression level of m5C-related lncRNAs in vitro. The association between the risk model and tumor-infiltrating immune cells was assessed using the CIBERSORT tool and the TIMER database. Based on these analyses, a total of 14 m5C-related lncRNAs with prognostic value were selected to build the risk model. Patients were divided into high- and low-risk groups according to the median risk score. The prognosis of the high-risk group was worse than that of the low-risk group, suggesting the good sensitivity and specificity of the constructed risk model. In addition, 5 types of immune cells were significantly different in the high-and low-risk groups, and 6 types of immune cells were negatively correlated with the risk score. These results suggested that the risk model based on 14 m5C-related lncRNAs with prognostic value might be a promising prognostic tool for LUAD and might facilitate the management of patients with LUAD.
Affiliation(s)
- Junfan Pan
- Shengli Clinical Medical College of Fujian Medical University, Fuzhou, China
| | - Zhidong Huang
- Quanzhou First Hospital, Fujian Medical University, Quanzhou, China
| | - Yiquan Xu
- Department of Thoracic Oncology, Fujian Medical University Cancer Hospital, Fujian Cancer Hospital, Fuzhou, China
| |
|
71
|
Wang M, Xie J, Xu S. M6A-BiNP: predicting N6-methyladenosine sites based on bidirectional position-specific propensities of polynucleotides and pointwise joint mutual information. RNA Biol 2021; 18:2498-2512. [PMID: 34161188 PMCID: PMC8632114 DOI: 10.1080/15476286.2021.1930729] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.5] [Indexed: 12/13/2022] Open
Abstract
N6-methyladenosine (m6A) plays an important role in various biological processes. Identifying m6A site is a key step in exploring its biological functions. One of the biggest challenges in identifying m6A sites is how to extract features comprising rich categorical information to distinguish m6A and non-m6A sites. To address this challenge, we propose bidirectional dinucleotide and trinucleotide position-specific propensities, respectively, in this paper. Based on this, we propose two feature-encoding algorithms: Position-Specific Propensities and Pointwise Mutual Information (PSP-PMI) and Position-Specific Propensities and Pointwise Joint Mutual Information (PSP-PJMI). PSP-PMI is based on the bidirectional dinucleotide propensity and the pointwise mutual information, while PSP-PJMI is based on the bidirectional trinucleotide position-specific propensity and the proposed pointwise joint mutual information in this paper. We introduce parameters α and β in PSP-PMI and PSP-PJMI, respectively, to represent the distance from the nucleotide to its forward or backward adjacent nucleotide or dinucleotide, so as to extract features containing local and global classification information. Finally, we propose the M6A-BiNP predictor based on PSP-PMI or PSP-PJMI and SVM classifier. The 10-fold cross-validation experimental results on the benchmark datasets of non-single-base resolution and single-base resolution demonstrate that PSP-PMI and PSP-PJMI can extract features with strong capabilities to identify m6A and non-m6A sites. The M6A-BiNP predictor based on our proposed feature encoding algorithm PSP-PJMI is better than the state-of-the-art predictors, and it is so far the best model to identify m6A and non-m6A sites.
Affiliation(s)
- Mingzhao Wang
- College of Life Sciences, Shaanxi Normal University, Xi'an, China; School of Computer Science, Shaanxi Normal University, Xi'an, China
| | - Juanying Xie
- School of Computer Science, Shaanxi Normal University, Xi'an, China
| | - Shengquan Xu
- College of Life Sciences, Shaanxi Normal University, Xi'an, China
| |
|
72
|
Wang Y, Guo R, Huang L, Yang S, Hu X, He K. m6AGE: A Predictor for N6-Methyladenosine Sites Identification Utilizing Sequence Characteristics and Graph Embedding-Based Geometrical Information. Front Genet 2021; 12:670852. [PMID: 34122525 PMCID: PMC8191635 DOI: 10.3389/fgene.2021.670852] [Citation(s) in RCA: 9] [Impact Index Per Article: 2.3] [Received: 02/22/2021] [Accepted: 04/29/2021] [Indexed: 11/30/2022] Open
Abstract
N6-methyladenosine (m6A) is one of the most prevalent RNA post-transcriptional modifications and is involved in various vital biological processes such as mRNA splicing, exporting, stability, and so on. Identifying m6A sites contributes to understanding the functional mechanism and biological significance of m6A. The existing biological experimental methods for identifying m6A sites are time-consuming and costly. Thus, developing a high confidence computational method is significant to explore m6A intrinsic characters. In this study, we propose a predictor called m6AGE which utilizes sequence-derived and graph embedding features. To the best of our knowledge, our predictor is the first to combine sequence-derived features and graph embeddings for m6A site prediction. Comparison results show that our proposed predictor achieved the best performance compared with other predictors on four public datasets across three species. On the A101 dataset, our predictor outperformed 1.34% (accuracy), 0.0227 (Matthew's correlation coefficient), 5.63% (specificity), and 0.0081 (AUC) than comparing predictors, which indicates that m6AGE is a useful tool for m6A site prediction. The source code of m6AGE is available at https://github.com/bokunoBike/m6AGE.
Affiliation(s)
- Yan Wang
- Key Laboratory of Symbol Computation and Knowledge Engineering of Ministry of Education, and College of Computer Science and Technology, Jilin University, Changchun, China
- School of Artificial Intelligence, Jilin University, Changchun, China
| | - Rui Guo
- Key Laboratory of Symbol Computation and Knowledge Engineering of Ministry of Education, and College of Computer Science and Technology, Jilin University, Changchun, China
| | - Lan Huang
- Key Laboratory of Symbol Computation and Knowledge Engineering of Ministry of Education, and College of Computer Science and Technology, Jilin University, Changchun, China
| | - Sen Yang
- Key Laboratory of Symbol Computation and Knowledge Engineering of Ministry of Education, and College of Computer Science and Technology, Jilin University, Changchun, China
| | - Xuemei Hu
- Key Laboratory of Symbol Computation and Knowledge Engineering of Ministry of Education, and College of Computer Science and Technology, Jilin University, Changchun, China
| | - Kai He
- Key Laboratory of Symbol Computation and Knowledge Engineering of Ministry of Education, and College of Computer Science and Technology, Jilin University, Changchun, China
| |
|
73
|
Ao C, Zou Q, Yu L. RFhy-m2G: Identification of RNA N2-methylguanosine modification sites based on random forest and hybrid features. Methods 2021; 203:32-39. [PMID: 34033879 DOI: 10.1016/j.ymeth.2021.05.016] [Citation(s) in RCA: 32] [Impact Index Per Article: 8.0] [Received: 04/18/2021] [Revised: 05/04/2021] [Accepted: 05/20/2021] [Indexed: 12/31/2022] Open
Abstract
N2-methylguanosine is a post-transcriptional modification of RNA that is found in eukaryotes and archaea. The biological function of m2G modification discovered so far is to control and stabilize the three-dimensional structure of tRNA and the dynamic barrier of reverse transcription. To discover additional biological functions of m2G, it is necessary to develop time-saving and labor-saving calculation tools to identify m2G. In this paper, based on hybrid features and a random forest, a novel predictor, RFhy-m2G, was developed to identify the m2G modification sites for three species. The hybrid feature used by the predictor is used to fuse the three features of ENAC, PseDNC, and NPPS. These three features include primary sequence derivation properties, physicochemical properties, and position-specific properties. Since there are redundant features in hybrid features, MRMD2.0 is used for optimal feature selection. Through feature analysis, it is found that the optimal hybrid features obtained still contain three kinds of properties, and the hybrid features can more accurately identify m2G modification sites and improve prediction performance. Based on five-fold cross-validation and independent testing to evaluate the prediction model, the accuracies obtained were 0.9982 and 0.9417, respectively. The robustness of the predictor is demonstrated by comparisons with other predictors.
Affiliation(s)
- Chunyan Ao
- School of Computer Science and Technology, Xidian University, Xi'an, China; Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, China
| | - Quan Zou
- Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, China
| | - Liang Yu
- School of Computer Science and Technology, Xidian University, Xi'an, China.
| |
|
74
|
Zhang SY, Zhang SW, Zhang T, Fan XN, Meng J. Recent advances in functional annotation and prediction of the epitranscriptome. Comput Struct Biotechnol J 2021; 19:3015-3026. [PMID: 34136099 PMCID: PMC8175281 DOI: 10.1016/j.csbj.2021.05.030] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.8] [Received: 02/25/2021] [Revised: 05/16/2021] [Accepted: 05/18/2021] [Indexed: 12/17/2022] Open
Abstract
RNA modifications, in particular N6-methyladenosine (m6A), participate in every stages of RNA metabolism and play diverse roles in essential biological processes and disease pathogenesis. Thanks to the advances in sequencing technology, tens of thousands of RNA modification sites can be identified in a typical high-throughput experiment; however, it remains a major challenge to decipher the functional relevance of these sites, such as, affecting alternative splicing, regulation circuit in essential biological processes or association to diseases. As the focus of RNA epigenetics gradually shifts from site discovery to functional studies, we review here recent progress in functional annotation and prediction of RNA modification sites from a bioinformatics perspective. The review covers naïve annotation with associated biological events, e.g., single nucleotide polymorphism (SNP), RNA binding protein (RBP) and alternative splicing, prediction of key sites and their regulatory functions, inference of disease association, and mining the diagnosis and prognosis value of RNA modification regulators. We further discussed the limitations of existing approaches and some future perspectives.
Affiliation(s)
- Song-Yao Zhang
- Key Laboratory of Information Fusion Technology of Ministry of Education, School of Automation, Northwestern Polytechnical University, Xi'an 710072, China
| | - Shao-Wu Zhang
- Key Laboratory of Information Fusion Technology of Ministry of Education, School of Automation, Northwestern Polytechnical University, Xi'an 710072, China
| | - Teng Zhang
- Key Laboratory of Information Fusion Technology of Ministry of Education, School of Automation, Northwestern Polytechnical University, Xi'an 710072, China
| | - Xiao-Nan Fan
- Key Laboratory of Information Fusion Technology of Ministry of Education, School of Automation, Northwestern Polytechnical University, Xi'an 710072, China
| | - Jia Meng
- Department of Biological Sciences, Xi'an Jiaotong-Liverpool University, Suzhou, Jiangsu 215123, China
| |
|
75
|
Charoenkwan P, Chiangjong W, Nantasenamat C, Hasan MM, Manavalan B, Shoombuatong W. StackIL6: a stacking ensemble model for improving the prediction of IL-6 inducing peptides. Brief Bioinform 2021; 22:6271998. [PMID: 33963832 DOI: 10.1093/bib/bbab172] [Citation(s) in RCA: 95] [Impact Index Per Article: 23.8] [Received: 01/15/2021] [Revised: 03/30/2021] [Accepted: 04/10/2021] [Indexed: 12/13/2022] Open
Abstract
The release of interleukin (IL)-6 is stimulated by antigenic peptides from pathogens as well as by immune cells for activating aggressive inflammation. IL-6 inducing peptides are derived from pathogens and can be used as diagnostic biomarkers for predicting various stages of disease severity as well as being used as IL-6 inhibitors for the suppression of aggressive multi-signaling immune responses. Thus, the accurate identification of IL-6 inducing peptides is of great importance for investigating their mechanism of action as well as for developing diagnostic and immunotherapeutic applications. This study proposes a novel stacking ensemble model (termed StackIL6) for accurately identifying IL-6 inducing peptides. More specifically, StackIL6 was constructed from twelve different feature descriptors derived from three major groups of features (composition-based features, composition-transition-distribution-based features and physicochemical properties-based features) and five popular machine learning algorithms (extremely randomized trees, logistic regression, multi-layer perceptron, support vector machine and random forest). To enhance the utility of the baseline models, they were systematically integrated through a stacking strategy to build the final meta-based model. Extensive benchmarking experiments demonstrated that StackIL6 achieved significantly better performance than the existing method (IL6PRED) and outperformed its constituent baseline models on both training and independent test datasets, thereby supporting its excellent discrimination and generalization abilities. To facilitate easy access to the StackIL6 model, it was established as a freely available web server accessible at http://camt.pythonanywhere.com/StackIL6. It is anticipated that StackIL6 can help to facilitate rapid screening of promising IL-6 inducing peptides for the development of diagnostic and immunotherapeutic applications in the future.
Affiliation(s)
- Phasit Charoenkwan
- Modern Management and Information Technology, College of Arts, Media and Technology, Chiang Mai University, Chiang Mai 50200, Thailand
| | - Wararat Chiangjong
- Pediatric Translational Research Unit, Department of Pediatrics, Faculty of Medicine, Ramathibodi Hospital, Mahidol University, Bangkok 10400, Thailand
| | - Chanin Nantasenamat
- Center of Data Mining and Biomedical Informatics, Faculty of Medical Technology, Mahidol University, Bangkok 10700, Thailand
| | - Md Mehedi Hasan
- Department of Bioscience and Bioinformatics, Kyushu Institute of Technology, 680-4 Kawazu, Iizuka, Fukuoka 820-8502, Japan
| | | | - Watshara Shoombuatong
- Center of Data Mining and Biomedical Informatics, Faculty of Medical Technology, Mahidol University, Bangkok 10700, Thailand
|
76
|
Zhang D, Xu ZC, Su W, Yang YH, Lv H, Yang H, Lin H. iCarPS: a computational tool for identifying protein carbonylation sites by novel encoded features. Bioinformatics 2021; 37:171-177. [PMID: 32766811 DOI: 10.1093/bioinformatics/btaa702] [Citation(s) in RCA: 46] [Impact Index Per Article: 11.5] [Received: 05/11/2020] [Revised: 07/12/2020] [Accepted: 07/28/2020] [Indexed: 12/13/2022]
Abstract
MOTIVATION Protein carbonylation is one of the most important oxidative stress-induced post-translational modifications, which is generally characterized by stability, irreversibility and relatively early formation. It plays a significant role in orchestrating various biological processes and has already been shown to be related to many diseases. However, the experimental technologies for carbonylation site identification are not only costly and time consuming, but also unable to process a large number of proteins at a time. Thus, rapidly and effectively identifying carbonylation sites by computational methods will provide key clues for the analysis of the occurrence and development of diseases. RESULTS In this study, we developed a predictor called iCarPS to identify carbonylation sites based on sequence information. A novel feature encoding scheme called residues conical coordinates combined with their physicochemical properties was proposed to formulate carbonylated protein and non-carbonylated protein samples. To remove potentially redundant features and improve the prediction performance, a feature selection technique was used. The accuracy and robustness of iCarPS were demonstrated by experiments on training and independent datasets. Comparison with other published methods showed that the proposed method provides strong performance for carbonylation site identification. AVAILABILITY AND IMPLEMENTATION Based on the proposed model, a user-friendly webserver and a software package were constructed, which can be freely accessed at http://lin-group.cn/server/iCarPS. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Dan Zhang
- School of Life Science and Technology and Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China
| | - Zhao-Chun Xu
- Computer Department, Jingdezhen Ceramic Institute, Jingdezhen 333403, China
| | - Wei Su
- School of Life Science and Technology and Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China
| | - Yu-He Yang
- School of Life Science and Technology and Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China
| | - Hao Lv
- School of Life Science and Technology and Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China
| | - Hui Yang
- School of Life Science and Technology and Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China
| | - Hao Lin
- School of Life Science and Technology and Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China
|
77
|
Feng P, Chen W. iRNA-m5U: A sequence based predictor for identifying 5-methyluridine modification sites in Saccharomyces cerevisiae. Methods 2021; 203:28-31. [PMID: 33882361 DOI: 10.1016/j.ymeth.2021.04.013] [Citation(s) in RCA: 14] [Impact Index Per Article: 3.5] [Received: 03/22/2021] [Revised: 04/11/2021] [Accepted: 04/15/2021] [Indexed: 01/28/2023]
Abstract
The 5-methyluridine (m5U) modification plays important roles in a series of biological processes. Accurate identification of m5U sites will help to decode its biological functions. Although experimental techniques have been proposed to detect m5U, they are still expensive and time consuming. In the present work, a support vector machine based method, called iRNA-m5U, was developed to identify the m5U sites in the Saccharomyces cerevisiae transcriptome. The performance of iRNA-m5U was validated on different datasets. The accuracies obtained by iRNA-m5U are promising, indicating that it holds the potential to become a useful tool for the identification of m5U sites.
Affiliation(s)
- Pengmian Feng
- School of Basic Medical Sciences, Chengdu University of Traditional Chinese Medicine, Chengdu 611730, China
| | - Wei Chen
- School of Basic Medical Sciences, Chengdu University of Traditional Chinese Medicine, Chengdu 611730, China; Innovative Institute of Chinese Medicine and Pharmacy, Chengdu University of Traditional Chinese Medicine, Chengdu 611730, China; School of Life Sciences, North China University of Science and Technology, Tangshan 063000, China.
|
78
|
Zulfiqar H, Khan RS, Hassan F, Hippe K, Hunt C, Ding H, Song XM, Cao R. Computational identification of N4-methylcytosine sites in the mouse genome with machine-learning method. MATHEMATICAL BIOSCIENCES AND ENGINEERING : MBE 2021; 18:3348-3363. [PMID: 34198389 DOI: 10.3934/mbe.2021167] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.5] [Indexed: 04/24/2023]
Abstract
N4-methylcytosine (4mC) is a kind of DNA modification which could regulate multiple biological processes. Correctly identifying 4mC sites in genomic sequences can provide precise knowledge about their genetic roles. This study aimed to develop an ensemble model to predict 4mC sites in the mouse genome. In the proposed model, DNA sequences were encoded by k-mer, enhanced nucleic acid composition and composition of k-spaced nucleic acid pairs. Subsequently, these features were optimized by using minimum redundancy maximum relevance (mRMR) with incremental feature selection (IFS) and five-fold cross-validation. The obtained optimal features were input into a random forest classifier for discriminating 4mC from non-4mC sites in mouse. On the independent dataset, our model achieved an overall accuracy of 85.41%, which was approximately 3.8%-6.3% higher than the two existing models, i4mC-Mouse and 4mCpred-EL, respectively. The data and source code of the model can be freely downloaded from https://github.com/linDing-groups/model_4mc.
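The k-mer encoding named in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation (their code is at the GitHub link given in the abstract); the function name `kmer_frequencies`, the choice k = 2 and the example sequence are assumptions made for illustration only.

```python
from itertools import product

def kmer_frequencies(seq, k=2, alphabet="ACGT"):
    """Return normalized k-mer frequencies in a fixed lexicographic order."""
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    total = 0
    for i in range(len(seq) - k + 1):
        window = seq[i:i + k].upper()
        if window in counts:  # skip windows with ambiguous bases such as N
            counts[window] += 1
            total += 1
    # Divide by the number of counted windows; all zeros if none were valid.
    return [counts[km] / total if total else 0.0 for km in kmers]

# Hypothetical example sequence: 5 overlapping 2-mers (AC, CG, GT, TA, AC).
features = kmer_frequencies("ACGTAC", k=2)
```

Each sequence becomes a fixed-length vector (4^k entries), which is the form a feature selection step or a random forest classifier would consume.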
Affiliation(s)
- Hasan Zulfiqar
- School of Life Science and Technology and Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China
| | - Rida Sarwar Khan
- School of Life Science and Technology and Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China
| | - Farwa Hassan
- School of Life Science and Technology and Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China
| | - Kyle Hippe
- Department of Computer Science, Pacific Lutheran University, Tacoma 98447, USA
| | - Cassandra Hunt
- Department of Computer Science, Pacific Lutheran University, Tacoma 98447, USA
| | - Hui Ding
- School of Life Science and Technology and Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China
| | - Xiao-Ming Song
- School of Life Science and Technology and Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China
- School of Life Sciences, North China University of Science and Technology, Tangshan, Hebei 063210, China
| | - Renzhi Cao
- Department of Computer Science, Pacific Lutheran University, Tacoma 98447, USA
|
79
|
Tang Q, Nie F, Kang J, Chen W. mRNALocater: Enhance the prediction accuracy of eukaryotic mRNA subcellular localization by using model fusion strategy. Mol Ther 2021; 29:2617-2623. [PMID: 33823302 DOI: 10.1016/j.ymthe.2021.04.004] [Citation(s) in RCA: 42] [Impact Index Per Article: 10.5] [Received: 12/13/2020] [Revised: 03/23/2021] [Accepted: 03/31/2021] [Indexed: 02/07/2023]
Abstract
The functions of mRNAs are closely correlated with their locations in cells. Knowledge about the subcellular locations of mRNA is helpful to understand their biological functions. In recent years, it has become a hot topic to develop effective computational models to predict eukaryotic mRNA subcellular localizations. However, existing state-of-the-art models still have certain deficiencies in terms of prediction accuracy and generalization ability. Therefore, it is urgent to develop novel methods to accurately predict mRNA subcellular localizations. In this study, a novel method called mRNALocater was proposed to detect the subcellular localization of eukaryotic mRNA by adopting the model fusion strategy. To fully extract information from mRNA sequences, the electron-ion interaction pseudopotential and pseudo k-tuple nucleotide composition were used to encode the sequences. Moreover, the correlation coefficient filtering algorithm and feature forward search technology were used to mine hidden feature information, which guarantees that mRNALocater can be more effectively applied to new sequences. The results based on the independent dataset tests demonstrate that mRNALocater yields promising performances for predicting eukaryotic mRNA subcellular localizations and is a powerful tool in practical applications. A freely available online web server for mRNALocater has been established at http://bio-bigdata.cn/mRNALocater.
Affiliation(s)
- Qiang Tang
- State Key Laboratory of Southwestern Chinese Medicine Resources, Innovative Institute of Chinese Medicine and Pharmacy, Chengdu University of Traditional Chinese Medicine, Chengdu 611137, China
| | - Fulei Nie
- School of Life Sciences, North China University of Science and Technology, Tangshan 063210, China; School of Public Health, North China University of Science and Technology, Tangshan 063210, China
| | - Juanjuan Kang
- Affiliated Foshan Maternity & Child Healthcare Hospital, Southern Medical University (Foshan Maternity & Child Healthcare Hospital), Foshan 528000, China
| | - Wei Chen
- State Key Laboratory of Southwestern Chinese Medicine Resources, Innovative Institute of Chinese Medicine and Pharmacy, Chengdu University of Traditional Chinese Medicine, Chengdu 611137, China; School of Life Sciences, North China University of Science and Technology, Tangshan 063210, China; School of Public Health, North China University of Science and Technology, Tangshan 063210, China.
|
80
|
Sivasudhan E, Blake N, Lu ZL, Meng J, Rong R. Dynamics of m6A RNA Methylome on the Hallmarks of Hepatocellular Carcinoma. Front Cell Dev Biol 2021; 9:642443. [PMID: 33869193 PMCID: PMC8047153 DOI: 10.3389/fcell.2021.642443] [Citation(s) in RCA: 15] [Impact Index Per Article: 3.8] [Received: 12/16/2020] [Accepted: 02/23/2021] [Indexed: 12/19/2022]
Abstract
Epidemiological data consistently rank hepatocellular carcinoma (HCC) as one of the leading causes of cancer-related deaths worldwide, often posing severe economic burden on health care. While the molecular etiopathogenesis associated with genetic and epigenetic modifications has been extensively explored, the biological influence of the emerging field of epitranscriptomics and its associated aberrant RNA modifications on tumorigenesis is a largely unexplored territory with immense potential for discovering new therapeutic approaches. In particular, the underlying cellular mechanisms of different hallmarks of hepatocarcinogenesis that are governed by the complex dynamics of m6A RNA methylation demand further investigation. In this review, we reveal the up-to-date knowledge on the mechanistic and functional link between m6A RNA methylation and pathogenesis of HCC.
Affiliation(s)
- Enakshi Sivasudhan
- Department of Biological Sciences, Xi'an Jiaotong-Liverpool University, Suzhou, China.,Department of Clinical Infection, Microbiology and Immunology, Institute of Infection, Veterinary and Ecological Sciences, University of Liverpool, Liverpool, United Kingdom
| | - Neil Blake
- Department of Clinical Infection, Microbiology and Immunology, Institute of Infection, Veterinary and Ecological Sciences, University of Liverpool, Liverpool, United Kingdom
| | - Zhi-Liang Lu
- Department of Biological Sciences, Xi'an Jiaotong-Liverpool University, Suzhou, China.,Institute of Integrative Biology, University of Liverpool, Liverpool, United Kingdom
| | - Jia Meng
- Department of Biological Sciences, Xi'an Jiaotong-Liverpool University, Suzhou, China.,Institute of Integrative Biology, University of Liverpool, Liverpool, United Kingdom
| | - Rong Rong
- Department of Biological Sciences, Xi'an Jiaotong-Liverpool University, Suzhou, China.,Institute of Integrative Biology, University of Liverpool, Liverpool, United Kingdom
|
81
|
Dao FY, Lv H, Su W, Sun ZJ, Huang QL, Lin H. iDHS-Deep: an integrated tool for predicting DNase I hypersensitive sites by deep neural network. Brief Bioinform 2021; 22:6158360. [PMID: 33751027 DOI: 10.1093/bib/bbab047] [Citation(s) in RCA: 28] [Impact Index Per Article: 7.0] [Received: 12/25/2020] [Revised: 01/28/2021] [Accepted: 01/29/2021] [Indexed: 01/09/2023]
Abstract
A DNase I hypersensitive site (DHS) is a region of chromatin that is hypersensitive to the DNase I enzyme. It is an important part of the noncoding region and contains a variety of regulatory elements, such as promoters, enhancers and transcription factor-binding sites. Moreover, disease- (or trait-) related loci are usually enriched in DHS regions. Therefore, the detection of DHS regions is of great significance. In this study, we develop a deep learning-based algorithm to identify whether an unknown sequence region is a potential DHS. The proposed method showed high prediction performance on both training datasets and independent datasets in different cell types and developmental stages, demonstrating its strength in the identification of DHSs. Furthermore, for the convenience of wet-lab researchers, the user-friendly web server iDHS-Deep was established at http://lin-group.cn/server/iDHS-Deep/, by which users can easily distinguish DHS from non-DHS regions and obtain the corresponding developmental stage of the DHS.
Affiliation(s)
- Fu-Ying Dao
- Informational Biology at University of Electronic Science and Technology of China, China
| | - Hao Lv
- Informational Biology at University of Electronic Science and Technology of China, China
| | - Wei Su
- Informational Biology at University of Electronic Science and Technology of China, China
| | - Zi-Jie Sun
- Informational Biology at University of Electronic Science and Technology of China, China
| | - Qin-Lai Huang
- Informational Biology at University of Electronic Science and Technology of China, China
| | - Hao Lin
- Informational Biology at University of Electronic Science and Technology of China, China
|
82
|
Charoenkwan P, Nantasenamat C, Hasan MM, Manavalan B, Shoombuatong W. BERT4Bitter: a bidirectional encoder representations from transformers (BERT)-based model for improving the prediction of bitter peptides. Bioinformatics 2021; 37:2556-2562. [PMID: 33638635 DOI: 10.1093/bioinformatics/btab133] [Citation(s) in RCA: 102] [Impact Index Per Article: 25.5] [Received: 12/29/2020] [Revised: 02/08/2021] [Accepted: 02/24/2021] [Indexed: 12/11/2022]
Abstract
MOTIVATION The identification of bitter peptides through experimental approaches is an expensive and time-consuming endeavor. Due to the huge number of newly available peptide sequences in the post-genomic era, the development of automated computational models for the identification of novel bitter peptides is highly desirable. RESULTS In this work, we present BERT4Bitter, a bidirectional encoder representation from transformers (BERT)-based model for predicting bitter peptides directly from their amino acid sequence without using any structural information. To the best of our knowledge, this is the first time a BERT-based model has been employed to identify bitter peptides. Compared to widely used machine learning models, BERT4Bitter achieved the best performance with accuracy of 0.861 and 0.922 for cross-validation and independent tests, respectively. Furthermore, extensive empirical benchmarking experiments on the independent dataset demonstrated that BERT4Bitter clearly outperformed the existing method with improvements of >8% accuracy and >16% Matthews correlation coefficient, highlighting the effectiveness and robustness of BERT4Bitter. We believe that the BERT4Bitter method proposed herein will be a useful tool for rapidly screening and identifying novel bitter peptides for drug development and nutritional research. AVAILABILITY The user-friendly web server of the proposed BERT4Bitter is freely accessible at: http://pmlab.pythonanywhere.com/BERT4Bitter. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Phasit Charoenkwan
- Modern Management and Information Technology, College of Arts, Media and Technology, Chiang Mai University, Chiang Mai 50200, Thailand
| | - Chanin Nantasenamat
- Center of Data Mining and Biomedical Informatics, Faculty of Medical Technology, Mahidol University, Bangkok 10700, Thailand
| | - Md Mehedi Hasan
- Department of Bioscience and Bioinformatics, Kyushu Institute of Technology, 680-4 Kawazu, Iizuka, Fukuoka 820-8502, Japan
- Tulane Center for Biomedical Informatics and Genomics, Division of Biomedical Informatics and Genomics, John W. Deming Department of Medicine, School of Medicine, Tulane University, New Orleans, LA, 70112 USA
| | | | - Watshara Shoombuatong
- Center of Data Mining and Biomedical Informatics, Faculty of Medical Technology, Mahidol University, Bangkok 10700, Thailand
|
83
|
Recent Advances in Predicting Protein S-Nitrosylation Sites. BIOMED RESEARCH INTERNATIONAL 2021; 2021:5542224. [PMID: 33628788 PMCID: PMC7892234 DOI: 10.1155/2021/5542224] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Received: 01/07/2021] [Revised: 01/24/2021] [Accepted: 01/25/2021] [Indexed: 01/09/2023]
Abstract
Protein S-nitrosylation (SNO) is a process of covalent modification of nitric oxide (NO) and its derivatives and cysteine residues. SNO plays an essential role in reversible posttranslational modifications of proteins. The accurate prediction of SNO sites is crucial in revealing a certain biological mechanism of NO regulation and related drug development. Identification of the sites of SNO in proteins is currently a very hot topic. In this review, we briefly summarize recent advances in computationally identifying SNO sites. The challenges and future perspectives for identifying SNO sites are also discussed. We anticipate that this review will provide insights into research on SNO site prediction.
|
84
|
Le NQK, Ho QT, Nguyen TTD, Ou YY. A transformer architecture based on BERT and 2D convolutional neural network to identify DNA enhancers from sequence information. Brief Bioinform 2021; 22:6128847. [PMID: 33539511 DOI: 10.1093/bib/bbab005] [Citation(s) in RCA: 99] [Impact Index Per Article: 24.8] [Received: 12/18/2020] [Revised: 01/01/2021] [Accepted: 01/03/2021] [Indexed: 01/11/2023]
Abstract
Recently, language representation models have drawn a lot of attention in the natural language processing field due to their remarkable results. Among them, bidirectional encoder representations from transformers (BERT) has proven to be a simple, yet powerful language model that achieved novel state-of-the-art performance. BERT adopted the concept of contextualized word embedding to capture the semantics and context of the words in which they appeared. In this study, we present a novel technique by incorporating BERT-based multilingual model in bioinformatics to represent the information of DNA sequences. We treated DNA sequences as natural sentences and then used BERT models to transform them into fixed-length numerical matrices. As a case study, we applied our method to DNA enhancer prediction, which is a well-known and challenging problem in this field. We then observed that our BERT-based features improved more than 5-10% in terms of sensitivity, specificity, accuracy and Matthews correlation coefficient compared to the current state-of-the-art features in bioinformatics. Moreover, advanced experiments show that deep learning (as represented by 2D convolutional neural networks; CNN) holds potential in learning BERT features better than other traditional machine learning techniques. In conclusion, we suggest that BERT and 2D CNNs could open a new avenue in biological modeling using sequence information.
Affiliation(s)
- Nguyen Quoc Khanh Le
- Professional Master Program in Artificial Intelligence in Medicine, Taipei Medical University, Taipei, Taiwan
| | - Quang-Thai Ho
- College of Information and Communication Technology, Can Tho University, Vietnam
| | | | - Yu-Yen Ou
- Department of Computer Science and Engineering, Yuan Ze University, Taiwan
|
85
|
Cui F, Zhang Z, Zou Q. Sequence representation approaches for sequence-based protein prediction tasks that use deep learning. Brief Funct Genomics 2021; 20:61-73. [PMID: 33527980 DOI: 10.1093/bfgp/elaa030] [Citation(s) in RCA: 46] [Impact Index Per Article: 11.5] [Received: 12/01/2020] [Revised: 12/16/2020] [Accepted: 12/18/2020] [Indexed: 11/12/2022]
Abstract
Deep learning has been increasingly used in bioinformatics, especially in sequence-based protein prediction tasks, as large amounts of biological data are available and deep learning techniques have been developed rapidly in recent years. For sequence-based protein prediction tasks, the selection of a suitable model architecture is essential, whereas sequence data representation is a major factor in controlling model performance. Here, we summarized all the main approaches that are used to represent protein sequence data (amino acid sequence encoding or embedding), which include end-to-end embedding methods, non-contextual embedding methods and embedding methods that use transfer learning and others that are applied for some specific tasks (such as protein sequence embedding based on extracted features for protein structure predictions and graph convolutional network-based embedding for drug discovery tasks). We have also reviewed the architectures of various types of embedding models theoretically and the development of these types of sequence embedding approaches to facilitate researchers and users in selecting the model that best suits their requirements.
Affiliation(s)
- Feifei Cui
- University of Electronic Science and Technology of China, Chengdu, Sichuan, China
| | - Zilong Zhang
- University of Electronic Science and Technology of China, Chengdu, Sichuan, China
| | - Quan Zou
- Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, Sichuan, China
|
86
|
Chen K, Song B, Tang Y, Wei Z, Xu Q, Su J, de Magalhães JP, Rigden DJ, Meng J. RMDisease: a database of genetic variants that affect RNA modifications, with implications for epitranscriptome pathogenesis. Nucleic Acids Res 2021; 49:D1396-D1404. [PMID: 33010174 PMCID: PMC7778951 DOI: 10.1093/nar/gkaa790] [Citation(s) in RCA: 68] [Impact Index Per Article: 17.0] [Received: 07/20/2020] [Revised: 09/08/2020] [Accepted: 09/11/2020] [Indexed: 12/11/2022]
Abstract
Deciphering the biological impacts of millions of single nucleotide variants remains a major challenge. Recent studies suggest that RNA modifications play versatile roles in essential biological mechanisms, and are closely related to the progression of various diseases including multiple cancers. To comprehensively unveil the association between disease-associated variants and their epitranscriptome disturbance, we built RMDisease, a database of genetic variants that can affect RNA modifications. By integrating the prediction results of 18 different RNA modification prediction tools and also 303,426 experimentally-validated RNA modification sites, RMDisease identified a total of 202,307 human SNPs that may affect (add or remove) sites of eight types of RNA modifications (m6A, m5C, m1A, m5U, Ψ, m6Am, m7G and Nm). These include 4,289 disease-associated variants that may imply disease pathogenesis functioning at the epitranscriptome layer. These SNPs were further annotated with essential information such as post-transcriptional regulations (sites for miRNA binding, interaction with RNA-binding proteins and alternative splicing) revealing putative regulatory circuits. A convenient graphical user interface was constructed to support the query, exploration and download of the relevant information. RMDisease should make a useful resource for studying the epitranscriptome impact of genetic variants via multiple RNA modifications with emphasis on their potential disease relevance. RMDisease is freely accessible at: www.xjtlu.edu.cn/biologicalsciences/rmd.
Affiliation(s)
- Kunqi Chen
- Department of Biological Sciences, Xi’an Jiaotong-Liverpool University, Suzhou, Jiangsu 215123, China
- Institute of Ageing & Chronic Disease, University of Liverpool, L7 8TX Liverpool, UK
| | - Bowen Song
- Institute of Systems, Molecular and Integrative Biology, University of Liverpool, L7 8TX Liverpool, UK
- Department of Mathematical Sciences, Xi’an Jiaotong-Liverpool University, Suzhou, Jiangsu 215123, China
| | - Yujiao Tang
- Department of Biological Sciences, Xi’an Jiaotong-Liverpool University, Suzhou, Jiangsu 215123, China
- Institute of Systems, Molecular and Integrative Biology, University of Liverpool, L7 8TX Liverpool, UK
| | - Zhen Wei
- Department of Biological Sciences, Xi’an Jiaotong-Liverpool University, Suzhou, Jiangsu 215123, China
- Institute of Systems, Molecular and Integrative Biology, University of Liverpool, L7 8TX Liverpool, UK
| | - Qingru Xu
- Department of Biological Sciences, Xi’an Jiaotong-Liverpool University, Suzhou, Jiangsu 215123, China
| | - Jionglong Su
- Department of Mathematical Sciences, Xi’an Jiaotong-Liverpool University, Suzhou, Jiangsu 215123, China
| | | | - Daniel J Rigden
- Institute of Systems, Molecular and Integrative Biology, University of Liverpool, L7 8TX Liverpool, UK
| | - Jia Meng
- Department of Biological Sciences, Xi’an Jiaotong-Liverpool University, Suzhou, Jiangsu 215123, China
- Institute of Systems, Molecular and Integrative Biology, University of Liverpool, L7 8TX Liverpool, UK
- AI University Research Centre, Xi’an Jiaotong-Liverpool University, Suzhou, Jiangsu 215123, China
|
87
|
Li J, Zhang L, He S, Guo F, Zou Q. SubLocEP: a novel ensemble predictor of subcellular localization of eukaryotic mRNA based on machine learning. Brief Bioinform 2021; 22:6059770. [PMID: 33388743 DOI: 10.1093/bib/bbaa401] [Citation(s) in RCA: 23] [Impact Index Per Article: 5.8] [Received: 09/02/2020] [Revised: 11/28/2020] [Accepted: 12/08/2020] [Indexed: 01/23/2023]
Abstract
MOTIVATION mRNA location corresponds to the location of protein translation and contributes to precise spatial and temporal management of the protein function. However, current assignment of subcellular localization of eukaryotic mRNA reveals important limitations: (1) turning multiple classifications into multiple dichotomies makes the training process tedious; (2) the majority of the models trained by classical algorithms are based on the extraction of single sequence information; (3) the existing state-of-the-art models have not reached an ideal level in terms of prediction and generalization ability. To achieve better assignment of subcellular localization of eukaryotic mRNA, a better and more comprehensive model must be developed. RESULTS In this paper, SubLocEP is proposed as a two-layer integrated prediction model for accurate prediction of the location of sequence samples. Unlike the existing models based on limited features, SubLocEP comprehensively considers additional feature attributes and is combined with LightGBM to generate single-feature classifiers. The initial integration model (single-layer model) is generated according to the categories of a feature. Subsequently, two single-layer integration models are weighted (sequence-based: physicochemical properties = 3:2) to produce the final two-layer model. The performance of SubLocEP on independent datasets is sufficient to indicate that SubLocEP is an accurate and stable prediction model with strong generalization ability. Additionally, an online tool has been developed that contains experimental data and can maximize user convenience for estimation of subcellular localization of eukaryotic mRNA.
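The 3:2 weighting of the two single-layer models can be sketched as a weighted average of class-probability vectors. This is an illustration only: the abstract does not say how the weights are applied, so the weighted averaging rule, the function names `fuse_probabilities` and `predict_class`, and the example probabilities are all assumptions.

```python
def fuse_probabilities(seq_probs, phys_probs, weights=(3, 2)):
    """Weighted average of two class-probability vectors (hypothetical fusion rule)."""
    if len(seq_probs) != len(phys_probs):
        raise ValueError("probability vectors must have the same length")
    w_seq, w_phys = weights
    total = w_seq + w_phys
    return [(w_seq * s + w_phys * p) / total
            for s, p in zip(seq_probs, phys_probs)]

def predict_class(probs):
    """Index of the class with the highest fused probability."""
    return max(range(len(probs)), key=probs.__getitem__)

# Made-up probabilities over three hypothetical localization classes.
fused = fuse_probabilities([0.6, 0.3, 0.1], [0.2, 0.5, 0.3])
label = predict_class(fused)
```

Because the weights are normalized by their sum, the fused vector remains a valid probability distribution whenever both inputs are.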
Affiliation(s)
| | - Lichao Zhang
- School of Intelligent Manufacturing and Equipment, Shenzhen Institute of Information Technology
|
88
|
Jiang P, Ning W, Shi Y, Liu C, Mo S, Zhou H, Liu K, Guo Y. FSL-Kla: A few-shot learning-based multi-feature hybrid system for lactylation site prediction. Comput Struct Biotechnol J 2021. [DOI: 10.1016/j.csbj.2021.08.013] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 12/18/2022]
|
89
|
ReFold-MAP: Protein remote homology detection and fold recognition based on features extracted from profiles. Anal Biochem 2020; 611:114013. [PMID: 33160906 DOI: 10.1016/j.ab.2020.114013] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Received: 08/27/2020] [Revised: 10/24/2020] [Accepted: 11/02/2020] [Indexed: 11/23/2022]
Abstract
Protein remote homology detection and protein fold recognition are two important tasks in protein structure and function prediction. There are three kinds of methods in this field: discriminative methods, alignment methods, and ranking methods. In this study, a new discriminative method called ReFold-MAP is proposed. The proposed method extracts comprehensive features based on three profile-based features: Motif-PSSM, ACC-PSSM, and PDT-profile. We call these MAP features; they incorporate the structural motif kernel information, the evolutionary information, and the sequence information. The experiments show that ReFold-MAP outperforms other approaches. Therefore, ReFold-MAP will be a useful tool for protein remote homology detection and fold recognition.
|
90
|
Jiang J, Song B, Tang Y, Chen K, Wei Z, Meng J. m5UPred: A Web Server for the Prediction of RNA 5-Methyluridine Sites from Sequences. MOLECULAR THERAPY. NUCLEIC ACIDS 2020; 22:742-747. [PMID: 33230471 PMCID: PMC7595847 DOI: 10.1016/j.omtn.2020.09.031] [Citation(s) in RCA: 26] [Impact Index Per Article: 5.2] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 07/07/2020] [Accepted: 09/25/2020] [Indexed: 11/16/2022]
Abstract
As one of the widely occurring RNA modifications, 5-methyluridine (m5U) has recently been shown to play critical roles in various biological functions and disease pathogenesis, such as under stress response and during breast cancer development. Precise identification of m5U sites on RNA is vital for the understanding of the regulatory mechanisms of RNA life. We present here m5UPred, the first web server for in silico identification of m5U sites from the primary sequences of RNA. Built upon the support vector machine (SVM) algorithm and the biochemical encoding scheme, m5UPred achieved reasonable prediction performance with the area under the receiver operating characteristic curve (AUC) greater than 0.954 by 5-fold cross-validation and independent testing datasets. To critically test and validate the performance of our newly proposed predictor, the experimentally validated m5U sites were further separated by high-throughput sequencing techniques (miCLIP-Seq and FICC-Seq) and cell types (HEK293 and HAP1). When tested on cross-technique and cross-cell-type validation using independent datasets, m5UPred achieved an average AUC of 0.922 and 0.926 under mature mRNA mode, respectively, showing reasonable accuracy and reliability. The m5UPred web server is freely accessible now and it should make a useful tool for the researchers who are interested in m5U RNA modification.
Affiliation(s)
- Jie Jiang
- Department of Biological Sciences, Xi’an Jiaotong-Liverpool University, Suzhou, Jiangsu, 215123, China
- Institute of Systems, Molecular and Integrative Biology, University of Liverpool, L7 8TX, Liverpool, UK
- Bowen Song
- Department of Mathematical Sciences, Xi’an Jiaotong-Liverpool University, Suzhou, Jiangsu, 215123, China
- Institute of Systems, Molecular and Integrative Biology, University of Liverpool, L7 8TX, Liverpool, UK
- Yujiao Tang
- Department of Biological Sciences, Xi’an Jiaotong-Liverpool University, Suzhou, Jiangsu, 215123, China
- Institute of Systems, Molecular and Integrative Biology, University of Liverpool, L7 8TX, Liverpool, UK
- Kunqi Chen
- Department of Biological Sciences, Xi’an Jiaotong-Liverpool University, Suzhou, Jiangsu, 215123, China
- Institute of Ageing & Chronic Disease, University of Liverpool, L7 8TX, Liverpool, UK
- Zhen Wei
- Department of Biological Sciences, Xi’an Jiaotong-Liverpool University, Suzhou, Jiangsu, 215123, China
- Institute of Systems, Molecular and Integrative Biology, University of Liverpool, L7 8TX, Liverpool, UK
- Jia Meng
- Department of Biological Sciences, Xi’an Jiaotong-Liverpool University, Suzhou, Jiangsu, 215123, China
- AI University Research Centre, Xi’an Jiaotong-Liverpool University, Suzhou, Jiangsu, 215123, China
- Institute of Systems, Molecular and Integrative Biology, University of Liverpool, L7 8TX, Liverpool, UK
|
91
|
XG-ac4C: identification of N4-acetylcytidine (ac4C) in mRNA using eXtreme gradient boosting with electron-ion interaction pseudopotentials. Sci Rep 2020; 10:20942. [PMID: 33262392 PMCID: PMC7708984 DOI: 10.1038/s41598-020-77824-2] [Citation(s) in RCA: 40] [Impact Index Per Article: 8.0] [Reference Citation Analysis] [Abstract] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/24/2020] [Accepted: 10/22/2020] [Indexed: 02/06/2023] Open
Abstract
N4-acetylcytidine (ac4C) is a post-transcriptional modification in mRNA which plays a major role in the stability and regulation of mRNA translation. The working mechanism of ac4C modification in mRNA is still unclear and traditional laboratory experiments are time-consuming and expensive. Therefore, we propose an XG-ac4C machine learning model based on the eXtreme Gradient Boost classifier for the identification of ac4C sites. The XG-ac4C model uses a combination of electron-ion interaction pseudopotentials and electron-ion interaction pseudopotentials of trinucleotide of the nucleotides in ac4C sites. Moreover, Shapley additive explanations and local interpretable model-agnostic explanations are applied to understand the importance of features and their contribution to the final prediction outcome. The obtained results demonstrate that XG-ac4C outperforms existing state-of-the-art methods. In more detail, the proposed model improves the area under the precision-recall curve by 9.4% and 9.6% in cross-validation and independent tests, respectively. Finally, a user-friendly web server based on the proposed model for ac4C site identification is made freely available at http://nsclbio.jbnu.ac.kr/tools/xgac4c/ .
|
92
|
Manavalan B, Basith S, Shin TH, Lee G. Computational prediction of species-specific yeast DNA replication origin via iterative feature representation. Brief Bioinform 2020; 22:6000361. [PMID: 33232970 PMCID: PMC8294535 DOI: 10.1093/bib/bbaa304] [Citation(s) in RCA: 23] [Impact Index Per Article: 4.6] [Reference Citation Analysis] [Abstract] [Key Words] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/04/2020] [Revised: 10/08/2020] [Accepted: 10/09/2020] [Indexed: 12/13/2022] Open
Abstract
Deoxyribonucleic acid replication is one of the most crucial tasks taking place in the cell, and it has to be precisely regulated. This process is initiated in the replication origins (ORIs), and thus it is essential to identify such sites for a deeper understanding of the cellular processes and functions related to the regulation of gene expression. Considering the important tasks performed by ORIs, several experimental and computational approaches have been developed for the prediction of such sites. However, existing computational predictors for ORIs have certain limitations, such as building only single-feature encoding models, limited systematic feature engineering efforts and failure to validate model robustness. Hence, we developed a novel species-specific yeast predictor called yORIpred that accurately identifies ORIs in the yeast genomes. To develop yORIpred, we first constructed 40 optimal baseline models by exploring eight different sequence-based encodings and five different machine learning classifiers. Subsequently, the predicted probabilities of the 40 models were considered as the novel feature vector, and an iterative feature learning approach was carried out independently using five different classifiers. Our systematic analysis revealed that the feature representation learned by the support vector machine algorithm (yORIpred) could well discriminate the distribution characteristics between ORIs and non-ORIs when compared with the other four algorithms. Comprehensive benchmarking experiments showed that yORIpred achieved superior and stable performance when compared with the existing predictors on the same training datasets. Furthermore, independent evaluation showed that yORIpred achieved the best and most accurate performance, thus underscoring the significance of iterative feature representation. To help users obtain their desired results without undergoing any mathematical, statistical or computational hassles, we developed a web server for the yORIpred predictor, which is available at: http://thegleelab.org/yORIpred.
Affiliation(s)
- Shaherin Basith
- Department of Physiology, Ajou University School of Medicine, Republic of Korea
- Tae Hwan Shin
- Department of Physiology, Ajou University School of Medicine, Republic of Korea
- Gwang Lee
- Department of Physiology, Ajou University School of Medicine, Republic of Korea
|
93
|
Feng P, Liu W, Huang C, Tang Z. Classifying the superfamily of small heat shock proteins by using g-gap dipeptide compositions. Int J Biol Macromol 2020; 167:1575-1578. [PMID: 33212104 DOI: 10.1016/j.ijbiomac.2020.11.111] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [Abstract] [Key Words] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/07/2020] [Revised: 11/02/2020] [Accepted: 11/13/2020] [Indexed: 01/16/2023]
Abstract
Small heat shock proteins (sHSPs) form a superfamily of molecular chaperones found from archaea to humans. Recent research has demonstrated that sHSPs participate in a series of biological processes and are even closely associated with serious diseases. Since sHSP is a very large superfamily and members from different subfamilies exhibit distinct functions, accurate classification of the subfamily of sHSP will be helpful for revealing its functions. In the present work, a support vector machine-based method was proposed to classify the subfamily of sHSPs. In the 10-fold cross validation test, an overall accuracy of 93.25% was obtained for classifying the subfamily of sHSPs. The superiority of the proposed method was also demonstrated by comparing it with the other methods. It is anticipated that the proposed method will become a useful tool for classifying the subfamily of sHSPs.
Affiliation(s)
- Pengmian Feng
- School of Basic Medical Sciences, Chengdu University of Traditional Chinese Medicine, Chengdu 611730, China
- Weiwei Liu
- School of Basic Medical Sciences, Chengdu University of Traditional Chinese Medicine, Chengdu 611730, China
- Cong Huang
- School of Basic Medical Sciences, Chengdu University of Traditional Chinese Medicine, Chengdu 611730, China
- Zhaohui Tang
- School of Basic Medical Sciences, Chengdu University of Traditional Chinese Medicine, Chengdu 611730, China
|
94
|
Wei L, He W, Malik A, Su R, Cui L, Manavalan B. Computational prediction and interpretation of cell-specific replication origin sites from multiple eukaryotes by exploiting stacking framework. Brief Bioinform 2020; 22:5956930. [PMID: 33152766 DOI: 10.1093/bib/bbaa275] [Citation(s) in RCA: 76] [Impact Index Per Article: 15.2] [Reference Citation Analysis] [Abstract] [Key Words] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/26/2020] [Revised: 09/14/2020] [Accepted: 09/21/2020] [Indexed: 12/13/2022] Open
Abstract
Origins of replication sites (ORIs), which refer to the initiation locations of genomic DNA replication, play essential roles in the DNA replication process. Detecting the genome-scale distribution of ORIs is one of the key steps toward an in-depth understanding of their regulation mechanisms. In this study, we presented a novel machine learning-based approach called Stack-ORI encompassing 10 cell-specific prediction models for identifying ORIs from four different eukaryotic species (Homo sapiens, Mus musculus, Drosophila melanogaster and Arabidopsis thaliana). For each cell-specific model, we employed 12 feature encoding schemes that cover nucleic acid composition, position-specific and physicochemical properties information. The optimal feature set was identified from each encoding individually and used to develop the respective baseline models with the eXtreme Gradient Boosting (XGBoost) classifier. Subsequently, the predicted scores of the 12 baseline models are integrated as a novel feature vector to train XGBoost and develop the final model. Extensive experimental results show that Stack-ORI achieves significantly better performance than its baseline models on both training and independent datasets. Interestingly, Stack-ORI consistently outperforms the existing predictor in all cell-specific models, not only on the training datasets but also on the independent test datasets. Moreover, our novel approach provides necessary interpretations that help in understanding model success by leveraging the powerful SHapley Additive exPlanation algorithm, thus underlining the feature encoding schemes most important for predicting cell-specific ORIs.
Affiliation(s)
- Leyi Wei
- computer science from Xiamen University, China
- Wenjia He
- School of Software at Shandong University, China
- Adeel Malik
- Institute of Intelligence Informatics Technology, Sangmyung University, Seoul, Republic of Korea
- Ran Su
- College of Intelligence and Computing, Tianjin University, Tianjin, China
- Lizhen Cui
- School of Software, Shandong University, the Deputy Director of the E-Commerce Research Center and the Director of the Research Center of Software and Data Engineering, Jinan
|
95
|
Sequence based prediction of pattern recognition receptors by using feature selection technique. Int J Biol Macromol 2020; 162:931-934. [DOI: 10.1016/j.ijbiomac.2020.06.234] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/22/2020] [Revised: 06/23/2020] [Accepted: 06/24/2020] [Indexed: 01/04/2023]
|
96
|
Lv H, Dao FY, Guan ZX, Yang H, Li YW, Lin H. Deep-Kcr: accurate detection of lysine crotonylation sites using deep learning method. Brief Bioinform 2020; 22:5937175. [PMID: 33099604 DOI: 10.1093/bib/bbaa255] [Citation(s) in RCA: 88] [Impact Index Per Article: 17.6] [Reference Citation Analysis] [Abstract] [Key Words] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/13/2020] [Revised: 08/31/2020] [Accepted: 09/08/2020] [Indexed: 12/23/2022] Open
Abstract
As a newly discovered protein posttranslational modification, histone lysine crotonylation (Kcr) is involved in cellular regulation and human diseases. Various proteomics technologies have been developed to detect Kcr sites. However, experimental approaches for identifying Kcr sites are often time-consuming and labor-intensive, which makes them difficult to apply widely across large numbers of species. Computational approaches are cost-effective and can be used in a high-throughput manner to generate relatively precise identification. In this study, we develop a deep learning-based method termed Deep-Kcr for Kcr site prediction by combining sequence-based features, physicochemical property-based features and numerical space-derived information with information gain feature selection. We investigate the performance of a convolutional neural network (CNN) and five commonly used classifiers (long short-term memory network, random forest, LogitBoost, naive Bayes and logistic regression) using 10-fold cross-validation and an independent test set. Results show that the CNN consistently displayed the best performance with high computational efficiency on a large dataset. We also compare Deep-Kcr with other existing tools to demonstrate the excellent predictive power and robustness of our method. Based on the proposed model, a webserver called Deep-Kcr was established and is freely accessible at http://lin-group.cn/server/Deep-Kcr.
Affiliation(s)
- Hao Lv
- Center for Informational Biology at the University of Electronic Science and Technology of China
- Fu-Ying Dao
- Center for Informational Biology at the University of Electronic Science and Technology of China
- Zheng-Xing Guan
- Center for Informational Biology at the University of Electronic Science and Technology of China
- Hui Yang
- Center for Informational Biology at the University of Electronic Science and Technology of China
- Hao Lin
- Center for Informational Biology at the University of Electronic Science and Technology of China
|
97
|
Guo Z, Wang P, Liu Z, Zhao Y. Discrimination of Thermophilic Proteins and Non-thermophilic Proteins Using Feature Dimension Reduction. Front Bioeng Biotechnol 2020; 8:584807. [PMID: 33195148 PMCID: PMC7642589 DOI: 10.3389/fbioe.2020.584807] [Citation(s) in RCA: 31] [Impact Index Per Article: 6.2] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/18/2020] [Accepted: 09/11/2020] [Indexed: 01/19/2023] Open
Abstract
Thermophilicity is a very important property of proteins, as it sometimes determines denaturation and cell death. Thus, methods for predicting thermophilic proteins and non-thermophilic proteins are of interest and can contribute to the design and engineering of proteins. In this article, we describe the use of feature dimension reduction technology and LIBSVM to identify thermophilic proteins. The highest accuracy obtained by cross-validation was 96.02% with 119 parameters. When using only 16 features, we obtained an accuracy of 93.33%. We discuss the importance of the different characteristics in identification and report a comparison of the performance of support vector machine to that of other methods.
Affiliation(s)
- Zifan Guo
- School of Aeronautics and Astronautic, Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, China
- Pingping Wang
- School of Life Science and Technology, Harbin Institute of Technology, Harbin, China
- Zhendong Liu
- School of Computer Science and Technology, Shandong Jianzhu University, Jinan, China
- Yuming Zhao
- Information and Computer Engineering College, Northeast Forestry University, Harbin, China
|
98
|
A Method for Identifying Vesicle Transport Proteins Based on LibSVM and MRMD. COMPUTATIONAL AND MATHEMATICAL METHODS IN MEDICINE 2020; 2020:8926750. [PMID: 33133228 PMCID: PMC7591939 DOI: 10.1155/2020/8926750] [Citation(s) in RCA: 45] [Impact Index Per Article: 9.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 08/04/2020] [Revised: 08/14/2020] [Accepted: 09/16/2020] [Indexed: 12/14/2022]
Abstract
With the development of computer technology, many machine learning algorithms have been applied to the field of biology, forming the discipline of bioinformatics. Protein function prediction is a classic research topic in this subject area. Though many scholars have made achievements in identifying proteins with different algorithms, they often extract a large number of feature types and use very complex classification methods to obtain little improvement in the classification effect, and this process is very time-consuming. In this research, we attempt to utilize as few features as possible to classify vesicular transportation proteins and to simultaneously obtain a comparatively satisfactory classification result. We adopt CTDC, which is a submethod of the composition, transition, and distribution (CTD) method, to extract only 39 features from each sequence, and LibSVM is used as the classification method. We use the SMOTE method to deal with the problem of dataset imbalance. There are 11619 protein sequences in our dataset. We selected 4428 sequences to train our classification model and selected another 1832 sequences from our dataset to test the classification effect, finally achieving an accuracy of 71.77%. After dimension reduction by MRMD, the accuracy is 72.16%.
|
99
|
Zhang ZM, Wang JS, Zulfiqar H, Lv H, Dao FY, Lin H. Early Diagnosis of Pancreatic Ductal Adenocarcinoma by Combining Relative Expression Orderings With Machine-Learning Method. Front Cell Dev Biol 2020; 8:582864. [PMID: 33178697 PMCID: PMC7593596 DOI: 10.3389/fcell.2020.582864] [Citation(s) in RCA: 24] [Impact Index Per Article: 4.8] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/13/2020] [Accepted: 09/15/2020] [Indexed: 12/16/2022] Open
Abstract
Pancreatic ductal adenocarcinoma (PDAC) is an aggressive and lethal cancer deeply affecting human health. Diagnosing early-stage PDAC is the key to PDAC patients' survival. However, the biomarkers for diagnosing early PDAC are inexact in most cases. Therefore, it is highly desirable to identify an effective PDAC diagnostic biomarker. In the current work, we designed a novel computational approach based on within-sample relative expression orderings (REOs). A feature selection technique called minimum redundancy maximum relevance was used to pick out optimal REOs. We then compared the performances of different classification algorithms for discriminating PDAC and its adjacent normal tissues from non-PDAC tissues. The support vector machine algorithm performed best for identifying early PDAC diagnostic biomarkers. At first, a signature composed of nine gene pairs was acquired from microarray gene expression data sets. These gene pairs could produce satisfactory classification accuracy up to 97.53% in fivefold cross-validation. Subsequently, two types of data from diverse platforms, namely, microarray and RNA-Seq, were used to validate this signature. For microarray data, all (100.00%) of 115 PDAC tissues and all (100.00%) of 31 PDAC adjacent normal tissues were correctly recognized as PDAC. In addition, 88.24% of 17 non-PDAC (normal or pancreatitis) tissues were correctly classified. For the RNA-Seq data, all (100.00%) of 177 PDAC tissues and all (100.00%) of 4 PDAC adjacent normal tissues were correctly recognized as PDAC. Validation results demonstrated that the signature had a good cross-platform effect for early detection of PDAC. This work developed a new robust signature that might be a promising biomarker for early PDAC diagnosis.
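As a rough illustration of within-sample relative expression orderings, the sketch below turns one sample's expression values into binary features, one per gene pair. The gene names, expression values and pairs are hypothetical placeholders and are not the nine-pair signature reported in the abstract.

```python
def reo_features(expression, gene_pairs):
    """Return 1 for each pair (a, b) where expression[a] > expression[b], else 0."""
    return [1 if expression[a] > expression[b] else 0 for a, b in gene_pairs]


# Hypothetical gene names and expression values for a single sample.
sample = {"GENE_A": 8.2, "GENE_B": 5.1, "GENE_C": 6.7}
pairs = [("GENE_A", "GENE_B"), ("GENE_B", "GENE_C"), ("GENE_C", "GENE_A")]

features = reo_features(sample, pairs)

# Rescaling every value in the sample by the same positive factor keeps the orderings.
rescaled = {gene: value * 10 for gene, value in sample.items()}
features_rescaled = reo_features(rescaled, pairs)
```

Because only the ordering inside a sample is used, any monotonic rescaling of that sample leaves the features unchanged, which may be one reason such features can transfer between measurement platforms.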
Affiliation(s)
- Zi-Mei Zhang
- Key Laboratory for Neuro-Information of Ministry of Education, Center for Informational Biology, School of Life Sciences and Technology, University of Electronic Science and Technology of China, Chengdu, China
- Jia-Shu Wang
- Key Laboratory for Neuro-Information of Ministry of Education, Center for Informational Biology, School of Life Sciences and Technology, University of Electronic Science and Technology of China, Chengdu, China
- Hasan Zulfiqar
- Key Laboratory for Neuro-Information of Ministry of Education, Center for Informational Biology, School of Life Sciences and Technology, University of Electronic Science and Technology of China, Chengdu, China
- Hao Lv
- Key Laboratory for Neuro-Information of Ministry of Education, Center for Informational Biology, School of Life Sciences and Technology, University of Electronic Science and Technology of China, Chengdu, China
- Fu-Ying Dao
- Key Laboratory for Neuro-Information of Ministry of Education, Center for Informational Biology, School of Life Sciences and Technology, University of Electronic Science and Technology of China, Chengdu, China
- Hao Lin
- Key Laboratory for Neuro-Information of Ministry of Education, Center for Informational Biology, School of Life Sciences and Technology, University of Electronic Science and Technology of China, Chengdu, China
|
100
|
Identifying Heat Shock Protein Families from Imbalanced Data by Using Combined Features. COMPUTATIONAL AND MATHEMATICAL METHODS IN MEDICINE 2020; 2020:8894478. [PMID: 33029195 PMCID: PMC7530508 DOI: 10.1155/2020/8894478] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 08/28/2020] [Revised: 09/08/2020] [Accepted: 09/14/2020] [Indexed: 11/29/2022]
Abstract
Heat shock proteins (HSPs) are ubiquitous in living organisms. HSPs are an essential component for cell growth and survival; the main function of HSPs is controlling the folding and unfolding process of proteins. According to molecular function and mass, HSPs are categorized into six different families: HSP20 (small HSPs), HSP40 (J-proteins), HSP60, HSP70, HSP90, and HSP100. In this paper, improved methods for HSP prediction are proposed: the split amino acid composition (SAAC), the dipeptide composition (DC), the conjoint triad feature (CTF), and the pseudoaverage chemical shift (PseACS) were selected to predict the HSPs with a support vector machine (SVM). To overcome the imbalanced data classification problem, the synthetic minority oversampling technique (SMOTE) was used to balance the dataset. The overall accuracy was 99.72% with a balanced dataset in the jackknife test using the optimized combination feature SAAC+DC+CTF+PseACS, which was 4.81% higher than with the imbalanced dataset and the same combination feature. The Sn, Sp, Acc, and MCC of HSP families in our predictive model were higher than those in existing methods. This improved method may be helpful for protein function prediction.
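The core idea of SMOTE, creating a synthetic minority-class sample by interpolating between an existing sample and one of its minority-class neighbors, can be sketched as follows. This is a simplified illustration rather than the procedure used in the paper: the neighbor list is supplied by hand instead of being found by a k-nearest-neighbor search, and all names and values are hypothetical.

```python
import random


def smote_sample(point, neighbors, rng):
    """Create one synthetic point between `point` and a randomly chosen neighbor."""
    neighbor = rng.choice(neighbors)
    gap = rng.random()  # interpolation factor in [0, 1)
    return [p + gap * (n - p) for p, n in zip(point, neighbor)]


rng = random.Random(0)  # fixed seed so the sketch is repeatable
minority_point = [1.0, 2.0]
# Assumed to be the nearest minority-class neighbors of `minority_point`.
nearest_minority_neighbors = [[2.0, 4.0], [0.0, 2.0]]

synthetic = smote_sample(minority_point, nearest_minority_neighbors, rng)
```

Each synthetic point lies on the line segment between the original sample and the chosen neighbor, so oversampling this way adds new feature vectors instead of duplicating existing ones.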
|