1
|
Ali A, Salem M. Methods for Bioinformatic Prediction of Genuine sRNAs from Outer Membrane Vesicles. Methods Mol Biol 2024; 2843:37-54. [PMID: 39141293 DOI: 10.1007/978-1-0716-4055-5_4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 08/15/2024]
Abstract
The molecular pathogenesis of Gram-negative bacteria remains a complex and incompletely understood phenomenon. Various factors are believed to contribute to the pathogenicity of these bacteria. One key mechanism utilized by Gram-negative bacteria is the production of outer membrane vesicles (OMVs), which are small spherical particles derived from the bacterial outer membrane. These OMVs are crucial in delivering virulence factors to the host, facilitating host-pathogen interactions. Within these OMVs, small regulatory RNAs (sRNAs) have been identified as important players in modulating the host immune response. One of the main challenges in studying OMVs and their cargo of sRNAs is the difficulty in isolating and purifying sufficient quantities of OMVs, as well as accurately predicting genuine sRNAs computationally. In this chapter, we present protocols aimed at overcoming these obstacles.
Collapse
Affiliation(s)
- Ali Ali
- Department of Animal and Avian Sciences, University of Maryland, College Park, MD, USA
| | - Mohamed Salem
- Department of Animal and Avian Sciences, University of Maryland, College Park, MD, USA.
| |
Collapse
|
2
|
Shavkunov KS, Markelova NY, Glazunova OA, Kolzhetsov NP, Panyukov VV, Ozoline ON. The Fate and Functionality of Alien tRNA Fragments in Culturing Medium and Cells of Escherichia coli. Int J Mol Sci 2023; 24:12960. [PMID: 37629141 PMCID: PMC10455298 DOI: 10.3390/ijms241612960] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/14/2023] [Revised: 08/15/2023] [Accepted: 08/16/2023] [Indexed: 08/27/2023] Open
Abstract
Numerous observations have supported the idea that various types of noncoding RNAs, including tRNA fragments (tRFs), are involved in communications between the host and its microbial community. The possibility of using their signaling function has stimulated the study of secreted RNAs, potentially involved in the interspecies interaction of bacteria. This work aimed at identifying such RNAs and characterizing their maturation during transport. We applied an approach that allowed us to detect oligoribonucleotides secreted by Prevotella copri (Segatella copri) or Rhodospirillum rubrum inside Escherichia coli cells. Four tRFs imported by E. coli cells co-cultured with these bacteria were obtained via chemical synthesis, and all of them affected the growth of E. coli. Their successive modifications in the culture medium and recipient cells were studied by high-throughput cDNA sequencing. Instead of the expected accidental exonucleolysis, in the milieu, we observed nonrandom cleavage by endonucleases continued in recipient cells. We also found intramolecular rearrangements of synthetic oligonucleotides, which may be considered traces of intermediate RNA circular isomerization. Using custom software, we estimated the frequency of such events in transcriptomes and secretomes of E. coli and observed surprising reproducibility in positions of such rare events, assuming the functionality of ring isoforms or their permuted derivatives in bacteria.
Collapse
Affiliation(s)
- Konstantin S. Shavkunov
- Department of Functional Genomics of Prokaryotes, Institute of Cell Biophysics of the Russian Academy of Sciences, Federal Research Center Pushchino Scientific Center for Biological Research of the Russian Academy of Sciences, 142290 Pushchino, Russia
| | - Natalia Yu. Markelova
- Department of Functional Genomics of Prokaryotes, Institute of Cell Biophysics of the Russian Academy of Sciences, Federal Research Center Pushchino Scientific Center for Biological Research of the Russian Academy of Sciences, 142290 Pushchino, Russia
| | - Olga A. Glazunova
- Department of Functional Genomics of Prokaryotes, Institute of Cell Biophysics of the Russian Academy of Sciences, Federal Research Center Pushchino Scientific Center for Biological Research of the Russian Academy of Sciences, 142290 Pushchino, Russia
| | - Nikolay P. Kolzhetsov
- Department of Functional Genomics of Prokaryotes, Institute of Cell Biophysics of the Russian Academy of Sciences, Federal Research Center Pushchino Scientific Center for Biological Research of the Russian Academy of Sciences, 142290 Pushchino, Russia
| | - Valery V. Panyukov
- Institute of Mathematical Problems of Biology RAS—The Branch of Keldysh Institute of Applied Mathematics of the Russian Academy of Sciences, 142290 Pushchino, Russia
| | - Olga N. Ozoline
- Department of Functional Genomics of Prokaryotes, Institute of Cell Biophysics of the Russian Academy of Sciences, Federal Research Center Pushchino Scientific Center for Biological Research of the Russian Academy of Sciences, 142290 Pushchino, Russia
| |
Collapse
|
3
|
Jha T, Mendel J, Cho H, Choudhary M. Prediction of Bacterial sRNAs Using Sequence-Derived Features and Machine Learning. Bioinform Biol Insights 2022; 16:11779322221118335. [PMID: 36016866 PMCID: PMC9397377 DOI: 10.1177/11779322221118335] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/23/2022] [Accepted: 07/18/2022] [Indexed: 11/16/2022] Open
Abstract
Small ribonucleic acid (sRNA) sequences are 50–500 nucleotide long, noncoding RNA (ncRNA) sequences that play an important role in regulating transcription and translation within a bacterial cell. As such, identifying sRNA sequences within an organism’s genome is essential to understand the impact of the RNA molecules on cellular processes. Recently, numerous machine learning models have been applied to predict sRNAs within bacterial genomes. In this study, we considered the sRNA prediction as an imbalanced binary classification problem to distinguish minor positive sRNAs from major negative ones within imbalanced data and then performed a comparative study with six learning algorithms and seven assessment metrics. First, we collected numerical feature groups extracted from known sRNAs previously identified in Salmonella typhimurium LT2 (SLT2) and Escherichia coli K12 (E. coli K12) genomes. Second, as a preliminary study, we characterized the sRNA-size distribution with the conformity test for Benford’s law. Third, we applied six traditional classification algorithms to sRNA features and assessed classification performance with seven metrics, varying positive-to-negative instance ratios, and utilizing stratified 10-fold cross-validation. We revisited important individual features and feature groups and found that classification with combined features perform better than with either an individual feature or a single feature group in terms of Area Under Precision-Recall curve (AUPR). We reconfirmed that AUPR properly measures classification performance on imbalanced data with varying imbalance ratios, which is consistent with previous studies on classification metrics for imbalanced data. Overall, eXtreme Gradient Boosting (XGBoost), even without exploiting optimal hyperparameter values, performed better than the other five algorithms with specific optimal parameter settings. As a future work, we plan to extend XGBoost further to a large amount of published sRNAs in bacterial genomes and compare its classification performance with recent machine learning models’ performance.
Collapse
Affiliation(s)
- Tony Jha
- Department of Mathematics, University of California, Berkeley, Berkeley, CA, USA
| | - Jovinna Mendel
- Department of Biological Sciences, Sam Houston State University, Huntsville, TX, USA
| | - Hyuk Cho
- Department of Computer Science, Sam Houston State University, Huntsville, TX, USA
| | - Madhusudan Choudhary
- Department of Biological Sciences, Sam Houston State University, Huntsville, TX, USA
| |
Collapse
|
4
|
Yu H, Shen ZA, Zhou YK, Du PF. Recent advances in predicting protein-lncRNA interactions using machine learning methods. Curr Gene Ther 2021; 22:228-244. [PMID: 34254917 DOI: 10.2174/1566523221666210712190718] [Citation(s) in RCA: 16] [Impact Index Per Article: 4.0] [Reference Citation Analysis] [Abstract] [Key Words] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/18/2021] [Revised: 05/01/2021] [Accepted: 05/31/2021] [Indexed: 11/22/2022]
Abstract
Long non-coding RNAs (LncRNAs) are a type of RNA with little or no protein-coding ability. Their length is more than 200 nucleotides. A large number of studies have indicated that lncRNAs play a significant role in various biological processes, including chromatin organizations, epigenetic programmings, transcriptional regulations, post-transcriptional processing, and circadian mechanism at the cellular level. Since lncRNAs perform vast functions through their interactions with proteins, identifying lncRNA-protein interaction is crucial to the understandings of the lncRNA molecular functions. However, due to the high cost and time-consuming disadvantage of experimental methods, a variety of computational methods have emerged. Recently, many effective and novel machine learning methods have been developed. In general, these methods fall into two categories: semi-supervised learning methods and supervised learning methods. The latter category can be further classified into the deep learning-based method, the ensemble learning-based method, and the hybrid method. In this paper, we focused on supervised learning methods. We summarized the state-of-the-art methods in predicting lncRNA-protein interactions. Furthermore, the performance and the characteristics of different methods have also been compared in this work. Considering the limits of the existing models, we analyzed the problems and discussed future research potentials.
Collapse
Affiliation(s)
- Han Yu
- College of Intelligence and Computing, Tianjin University, Tianjin 300350, China
| | - Zi-Ang Shen
- College of Intelligence and Computing, Tianjin University, Tianjin 300350, China
| | - Yuan-Ke Zhou
- College of Intelligence and Computing, Tianjin University, Tianjin 300350, China
| | - Pu-Feng Du
- College of Intelligence and Computing, Tianjin University, Tianjin 300350, China
| |
Collapse
|
5
|
Abstract
Background Many transcripts have been generated due to the development of sequencing technologies, and lncRNA is an important type of transcript. Predicting lncRNAs from transcripts is a challenging and important task. Traditional experimental lncRNA prediction methods are time-consuming and labor-intensive. Efficient computational methods for lncRNA prediction are in demand. Results In this paper, we propose two lncRNA prediction methods based on feature ensemble learning strategies named LncPred-IEL and LncPred-ANEL. Specifically, we encode sequences into six different types of features including transcript-specified features and general sequence-derived features. Then we consider two feature ensemble strategies to utilize and integrate the information in different feature types, the iterative ensemble learning (IEL) and the attention network ensemble learning (ANEL). IEL employs a supervised iterative way to ensemble base predictors built on six different types of features. ANEL introduces an attention mechanism-based deep learning model to ensemble features by adaptively learning the weight of individual feature types. Experiments demonstrate that both LncPred-IEL and LncPred-ANEL can effectively separate lncRNAs and other transcripts in feature space. Moreover, comparison experiments demonstrate that LncPred-IEL and LncPred-ANEL outperform several state-of-the-art methods when evaluated by 5-fold cross-validation. Both methods have good performances in cross-species lncRNA prediction. Conclusions LncPred-IEL and LncPred-ANEL are promising lncRNA prediction tools that can effectively utilize and integrate the information in different types of features.
Collapse
Affiliation(s)
- Yanzhen Xu
- College of Informatics, Huazhong Agricultural University, Wuhan, 430070, China
| | - Xiaohan Zhao
- College of Informatics, Huazhong Agricultural University, Wuhan, 430070, China
| | - Shuai Liu
- College of Informatics, Huazhong Agricultural University, Wuhan, 430070, China
| | - Wen Zhang
- College of Informatics, Huazhong Agricultural University, Wuhan, 430070, China.
| |
Collapse
|
6
|
World competitive contest-based artificial neural network: A new class-specific method for classification of clinical and biological datasets. Genomics 2020; 113:541-552. [PMID: 32991962 PMCID: PMC7521912 DOI: 10.1016/j.ygeno.2020.09.047] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/15/2020] [Revised: 09/05/2020] [Accepted: 09/22/2020] [Indexed: 12/26/2022]
Abstract
Many data mining methods have been proposed to generate computer-aided diagnostic systems, which may determine diseases in their early stages by categorizing the data into some proper classes. Considering the importance of the existence of a suitable classifier, the present study aims to introduce an efficient approach based on the World Competitive Contests (WCC) algorithm as well as a multi-layer perceptron artificial neural network (ANN). Unlike the previously introduced methods, which each has developed a universal model for all different kinds of data classes, our proposed approach generates a single specific model for each individual class of data. The experimental results show that the proposed method (ANNWCC), which can be applied to both the balanced and unbalanced datasets, yields more than 76% (without applying feature selection methods) and 90% (with applying feature selection methods) of the average five-fold cross-validation accuracy on the 13 clinical and biological datasets. The findings also indicate that under different conditions, our proposed method can produce better results in comparison to some state-of-art meta-heuristic algorithms and methods in terms of various statistical and classification measurements. To classify the clinical and biological data, a multi-layer ANN and the WCC algorithm were combined. It was shown that developing a specific model for each individual class of data may yield better results compared with creating a universal model for all of the existing data classes. Besides, some efficient algorithms proved to be essential to generate acceptable biological results, and the methods' performance was found to be enhanced by fuzzifying or normalizing the biological data. We combined multi-layer artificial neural networks and world competitive contests algorithms to classify biological datasets The proposed method has been investigated on 13 clinical datasets with different properties Efficient models may yield better classification models and health diagnostic systems Feature selection methods can improve the performance of a model in separating case and control samples
Collapse
|
7
|
Zhu X, He J, Zhao S, Tao W, Xiong Y, Bi S. A comprehensive comparison and analysis of computational predictors for RNA N6-methyladenosine sites of Saccharomyces cerevisiae. Brief Funct Genomics 2020; 18:367-376. [PMID: 31609411 DOI: 10.1093/bfgp/elz018] [Citation(s) in RCA: 33] [Impact Index Per Article: 6.6] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/20/2019] [Revised: 07/07/2019] [Accepted: 07/15/2019] [Indexed: 12/16/2022] Open
Abstract
N6-methyladenosine (m6A) modification, as one of the commonest post-transcription modifications in RNAs, has been reported to be highly related to many biological processes. Over the past decade, several tools for m6A sites prediction of Saccharomyces cerevisiae have been developed and are freely available online. However, the quality of predictions by these tools is difficult to quantify and compare. In this study, an independent dataset M6Atest6540 was compiled to systematically evaluate nine publicly available m6A prediction tools for S. cerevisiae. The experimental results indicate that RAM-ESVM achieved the best performance on M6Atest6540; however, most models performed substantially worse than their performances reported in the original papers. The benchmark dataset Met2614, which was used as the training dataset for the nine methods, were further analyzed by using a position bias index. The results demonstrated the significantly different bias of dataset Met2614 compared with the RNA segments around m6A sites recorded in RMBase. Moreover, newMet2614 was collected by randomly selecting RNA segments from non-redundant data recorded in RMBase, and three different kinds of features were extracted. The performances of the models built on Met2614 and newMet2614 with the features were compared, which shows the better generalization of models built on newMet2614. Our results also indicate the position-specific propensity-based features outperform other features, although they are also easily over-fitted on a biased dataset.
Collapse
Affiliation(s)
- Xiaolei Zhu
- School of Sciences, Anhui Agricultural University, Hefei, Anhui 230036, China.,School of Life Sciences, Anhui University, Hefei, Anhui 230601, China
| | - Jingjing He
- School of Life Sciences, Anhui University, Hefei, Anhui 230601, China
| | - Shihao Zhao
- School of Sciences, Anhui Agricultural University, Hefei, Anhui 230036, China
| | - Wei Tao
- School of Sciences, Anhui Agricultural University, Hefei, Anhui 230036, China
| | - Yi Xiong
- School of Life Sciences and Biotechnology, Shanghai Jiao Tong University, Shanghai 200240, China
| | - Shoudong Bi
- School of Sciences, Anhui Agricultural University, Hefei, Anhui 230036, China
| |
Collapse
|
8
|
Wekesa JS, Meng J, Luan Y. Multi-feature fusion for deep learning to predict plant lncRNA-protein interaction. Genomics 2020; 112:2928-2936. [PMID: 32437848 DOI: 10.1016/j.ygeno.2020.05.005] [Citation(s) in RCA: 26] [Impact Index Per Article: 5.2] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/17/2019] [Revised: 04/22/2020] [Accepted: 05/05/2020] [Indexed: 12/28/2022]
Abstract
Long non-coding RNAs (lncRNAs) play key roles in regulating cellular biological processes through diverse molecular mechanisms including binding to RNA binding proteins. The majority of plant lncRNAs are functionally uncharacterized, thus, accurate prediction of plant lncRNA-protein interaction is imperative for subsequent functional studies. We present an integrative model, namely DRPLPI. Its uniqueness is that it predicts by multi-feature fusion. Structural and four groups of sequence features are used, including tri-nucleotide composition, gapped k-mer, recursive complement and binary profile. We design a multi-head self-attention long short-term memory encoder-decoder network to extract generative high-level features. To obtain robust results, DRPLPI combines categorical boosting and extra trees into a single meta-learner. Experiments on Zea mays and Arabidopsis thaliana obtained 0.9820 and 0.9652 area under precision/recall curve (AUPRC) respectively. The proposed method shows significant enhancement in the prediction performance compared with existing state-of-the-art methods.
Collapse
Affiliation(s)
- Jael Sanyanda Wekesa
- School of Computer Science and Technology, Dalian University of Technology, Dalian, Liaoning 116023, China; School of Computing and Information Technology, Jomo Kenyatta University of Agriculture and Technology, Nairobi 62000-00200, Kenya
| | - Jun Meng
- School of Computer Science and Technology, Dalian University of Technology, Dalian, Liaoning 116023, China.
| | - Yushi Luan
- School of Bioengineering, Dalian University of Technology, Dalian, Liaoning 116023, China
| |
Collapse
|
9
|
Dou L, Li X, Ding H, Xu L, Xiang H. Is There Any Sequence Feature in the RNA Pseudouridine Modification Prediction Problem? MOLECULAR THERAPY. NUCLEIC ACIDS 2020; 19:293-303. [PMID: 31865116 PMCID: PMC6931122 DOI: 10.1016/j.omtn.2019.11.014] [Citation(s) in RCA: 16] [Impact Index Per Article: 3.2] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 09/08/2019] [Revised: 10/29/2019] [Accepted: 11/11/2019] [Indexed: 01/01/2023]
Abstract
Pseudouridine (Ψ) is the most abundant RNA modification and has been found in many kinds of RNAs, including snRNA, rRNA, tRNA, mRNA, and snoRNA. Thus, Ψ sites play a significant role in basic research and drug development. Although some experimental techniques have been developed to identify Ψ sites, they are expensive and time consuming, especially in the post-genomic era with the explosive growth of known RNA sequences. Thus, highly accurate computational methods are urgently required to quickly detect the Ψ sites on uncharacterized RNA sequences. Several predictors have been proposed using multifarious features, but their evaluated performances are still unsatisfactory. In this study, we first identified Ψ sites for H. sapiens, S. cerevisiae, and M. musculus using the sequence features from the bi-profile Bayes (BPB) method based on the random forest (RF) and support vector machine (SVM) algorithms, where the performances were evaluated using 5-fold cross-validation and independent tests. It was found that the SVM-based accuracies were 3.55% and 5.09% lower than the iPseU-CUU predictor for the H_990 and S_628 datasets, respectively. Almost the same-level results were obtained for M_994 and an independent H_200 dataset, even showing a 5.0% improvement for S_200. Then, three different kinds of features, including basic Kmer, general parallel correlation pseudo-dinucleotide composition (PC-PseDNC-General), and nucleotide chemical property (NCP) and nucleotide density (ND) from the iRNA-PseU method, were combined with BPB to show their comprehensive performances, where the effective features are selected by the max-relevance-max-distance (MRMD) method. The best evaluated accuracies of the combined features for the S_628 and M_994 datasets were achieved at 70.54% and 72.45%, which were 2.39% and 0.65% higher than iPseU-CUU. For the S_200 dataset, it was also improved 8% from 69% to 77%. However, there was no obvious improvement for H. sapiens, which was evaluated as approximately 63.23% and 72.0% for the H_990 and H_200 datasets, respectively. The overall performances for Ψ identification using BPB features as well as the combined features were not obviously improved. Although some kinds of feature extraction methods based on the RNA sequence information have been applied to construct the predictors in previous studies, the corresponding accuracies are generally in the range of 60%-70%. Thus, researchers need to reconsider whether there is any sequence feature in the RNA Ψ modification prediction problem.
Collapse
Affiliation(s)
- Lijun Dou
- School of Automotive and Transportation Engineering, Shenzhen Polytechnic, Shenzhen, China; Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, China
| | - Xiaoling Li
- Department of Oncology, Heilongjiang Province Land Reclamation Headquarters General Hospital, Harbin, China
| | - Hui Ding
- Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, China
| | - Lei Xu
- School of Electronic and Communication Engineering, Shenzhen Polytechnic, Shenzhen, China.
| | - Huaikun Xiang
- School of Automotive and Transportation Engineering, Shenzhen Polytechnic, Shenzhen, China.
| |
Collapse
|
10
|
Rajaraman S, Antani SK. Modality-specific deep learning model ensembles toward improving TB detection in chest radiographs. IEEE ACCESS : PRACTICAL INNOVATIONS, OPEN SOLUTIONS 2020; 8:27318-27326. [PMID: 32257736 PMCID: PMC7120763 DOI: 10.1109/access.2020.2971257] [Citation(s) in RCA: 46] [Impact Index Per Article: 9.2] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 05/05/2023]
Abstract
The proposed study evaluates the efficacy of knowledge transfer gained through an ensemble of modality-specific deep learning models toward improving the state-of-the-art in Tuberculosis (TB) detection. A custom convolutional neural network (CNN) and selected popular pretrained CNNs are trained to learn modality-specific features from large-scale publicly available chest x-ray (CXR) collections including (i) RSNA dataset (normal = 8851, abnormal = 17833), (ii) Pediatric pneumonia dataset (normal = 1583, abnormal = 4273), and (iii) Indiana dataset (normal = 1726, abnormal = 2378). The knowledge acquired through modality-specific learning is transferred and fine-tuned for TB detection on the publicly available Shenzhen CXR collection (normal = 326, abnormal =336). The predictions of the best performing models are combined using different ensemble methods to demonstrate improved performance over any individual constituent model in classifying TB-infected and normal CXRs. The models are evaluated through cross-validation (n = 5) at the patient-level with an aim to prevent overfitting, improve robustness and generalization. It is observed that a stacked ensemble of the top-3 retrained models demonstrates promising performance (accuracy: 0.941; 95% confidence interval (CI): [0.899, 0.985], area under the curve (AUC): 0.995; 95% CI: [0.945, 1.00]). One-way ANOVA analyses show there are no statistically significant differences in accuracy (P = .759) and AUC (P = .831) among the ensemble methods. Knowledge transferred through modality-specific learning of relevant features helped improve the classification. The ensemble model resulted in reduced prediction variance and sensitivity to training data fluctuations. Results from their combined use are superior to the state-of-the-art.
Collapse
Affiliation(s)
- Sivaramakrishnan Rajaraman
- Lister Hill National Center for Biomedical Communications, National Library of Medicine, 8600 Rockville Pike, Bethesda, MD 20894 USA
| | - Sameer K. Antani
- Lister Hill National Center for Biomedical Communications, National Library of Medicine, 8600 Rockville Pike, Bethesda, MD 20894 USA
| |
Collapse
|
11
|
Cai J, Wang D, Chen R, Niu Y, Ye X, Su R, Xiao G, Wei L. A Bioinformatics Tool for the Prediction of DNA N6-Methyladenine Modifications Based on Feature Fusion and Optimization Protocol. Front Bioeng Biotechnol 2020; 8:502. [PMID: 32582654 PMCID: PMC7287168 DOI: 10.3389/fbioe.2020.00502] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/17/2020] [Accepted: 04/29/2020] [Indexed: 01/04/2023] Open
Abstract
DNA N6-methyladenine (6mA) is closely involved with various biological processes. Identifying the distributions of 6mA modifications in genome-scale is of great significance to in-depth understand the functions. In recent years, various experimental and computational methods have been proposed for this purpose. Unfortunately, existing methods cannot provide accurate and fast 6mA prediction. In this study, we present 6mAPred-FO, a bioinformatics tool that enables researchers to make predictions based on sequences only. To sufficiently capture the characteristics of 6mA sites, we integrate the sequence-order information with nucleotide positional specificity information for feature encoding, and further improve the feature representation capacity by analysis of variance-based feature optimization protocol. The experimental results show that using this feature protocol, we can significantly improve the predictive performance. Via further feature analysis, we found that the sequence-order information and positional specificity information are complementary to each other, contributing to the performance improvement. On the other hand, the improvement is also due to the use of the feature optimization protocol, which is capable of effectively capturing the most informative features from the original feature space. Moreover, benchmarking comparison results demonstrate that our 6mAPred-FO outperforms several existing predictors. Finally, we establish a web-server that implements the proposed method for convenience of researchers' use, which is currently available at http://server.malab.cn/6mAPred-FO.
Collapse
Affiliation(s)
- Jianhua Cai
- Fujian Provincial Key Laboratory of Information Processing and Intelligent Control, College of Computer and Control Engineering, Minjiang University, Fuzhou, China
- College of Mathematics and Computer Science, Fuzhou University, Fuzhou, China
| | - Donghua Wang
- Department of General Surgery, Heilongjiang Province Land Reclamation Headquarters General Hospital, Harbin, China
| | - Riqing Chen
- College of Computer and Information Sciences, Fujian Agriculture and Forestry University, Fuzhou, China
| | - Yuzhen Niu
- Fujian Provincial Key Laboratory of Information Processing and Intelligent Control, College of Computer and Control Engineering, Minjiang University, Fuzhou, China
| | - Xiucai Ye
- Department of Computer Science, University of Tsukuba, Tsukuba, Japan
| | - Ran Su
- College of Intelligence and Computing, Tianjin University, Tianjin, China
| | - Guobao Xiao
- Fujian Provincial Key Laboratory of Information Processing and Intelligent Control, College of Computer and Control Engineering, Minjiang University, Fuzhou, China
- *Correspondence: Guobao Xiao
| | - Leyi Wei
- Fujian Provincial Key Laboratory of Information Processing and Intelligent Control, College of Computer and Control Engineering, Minjiang University, Fuzhou, China
- School of Software, Shandong University, Jinan, China
- Leyi Wei
| |
Collapse
|
12
|
Lv Z, Jin S, Ding H, Zou Q. A Random Forest Sub-Golgi Protein Classifier Optimized via Dipeptide and Amino Acid Composition Features. Front Bioeng Biotechnol 2019; 7:215. [PMID: 31552241 PMCID: PMC6737778 DOI: 10.3389/fbioe.2019.00215] [Citation(s) in RCA: 75] [Impact Index Per Article: 12.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/23/2019] [Accepted: 08/22/2019] [Indexed: 02/01/2023] Open
Abstract
To gain insight into the malfunction of the Golgi apparatus and its relationship to various genetic and neurodegenerative diseases, the identification of sub-Golgi proteins, both cis-Golgi and trans-Golgi proteins, is of great significance. In this study, a state-of-art random forests sub-Golgi protein classifier, rfGPT, was developed. The rfGPT used 2-gap dipeptide and split amino acid composition for the feature vectors and was combined with the synthetic minority over-sampling technique (SMOTE) and an analysis of variance (ANOVA) feature selection method. The rfGPT was trained on a sub-Golgi protein sequence data set (137 sequences), with sequence identity less than 25%. For the optimal rfGPT classifier with 93 features, the accuracy (ACC) was 90.5%; the Matthews correlation coefficient (MCC) was 0.811; the sensitivity (Sn) was 92.6%; and the specificity (Sp) was 88.4%. The independent testing scores for the rfGPT were ACC = 90.6%; MCC = 0.696; Sn = 96.1%; and Sp = 69.2%. Although the independent testing accuracy was 4.4% lower than that for the best reported sub-Golgi classifier trained on a data set with 40% sequence identity (304 sequences), the rfGPT is currently the top sub-Golgi protein predictor utilizing feature vectors without any position-specific scoring matrix and its derivative features. Therefore, the rfGPT is a more practical tool, because no sequence alignment is required with tens of millions of protein sequences. To date, the rfGPT is the Golgi classifier with the best independent testing scores, optimized by training on smaller benchmark data sets. Feature importance analysis proves that the non-polar and aliphatic residues composition, the (aromatic residues) + (non-polar, aliphatic residues) dipeptide and aromatic residues composition between NH2-termial and COOH-terminal of protein sequences are the three top biological features for distinguishing the sub-Golgi proteins.
Collapse
Affiliation(s)
- Zhibin Lv
- Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, China
| | - Shunshan Jin
- Department of Neurology, Heilongjiang Province Land Reclamation Headquarters General Hospital, Harbin, China
| | - Hui Ding
- Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, China
| | - Quan Zou
- Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, China.,Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, China
| |
Collapse
|