1
|
Sultan MF, Karim T, Hossain Shaon MS, Azim SM, Dehzangi I, Akter MS, Ibrahim SM, Ali MM, Ahmed K, Bui FM. DHUpredET: A comparative computational approach for identification of dihydrouridine modification sites in RNA sequence. Anal Biochem 2025; 702:115828. [PMID: 40057221 DOI: 10.1016/j.ab.2025.115828] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/09/2025] [Revised: 02/23/2025] [Accepted: 03/04/2025] [Indexed: 03/17/2025]
Abstract
Laboratory-based detection of D sites is laborious and expensive. In this study, we developed effective machine learning models employing efficient feature encoding methods to identify D sites. Initially, we explored various state-of-the-art feature encoding approaches and 30 machine learning techniques for each and selected the top eight models based on their independent testing and cross-validation outcomes. Finally, we developed DHUpredET using the extra tree classifier methods for predicting DHU sites. The DHUpredET model demonstrated balanced performance across all evaluation criteria, outperforming state-of-the-art models by 8 % and 14 % in terms of accuracy and sensitivity, respectively, on an independent test set. Further analysis revealed that the model achieved higher accuracy with position-specific two nucleotide (PS2) features, leading us to conclude that PS2 features are the best suited for the DHUpredET model. Therefore, our proposed model emerges as the most favorite choice for predicting D sites. In addition, we conducted an in-depth analysis of local features and identified a particularly significant attribute with a feature score of 0.035 for PS2_299 attributes. This tool holds immense promise as an advantageous instrument for accelerating the discovery of D modification sites, which contributes too many targeting therapeutic and understanding RNA structure.
Collapse
Affiliation(s)
- Md Fahim Sultan
- Department of Computer Science and Engineering, Oakland University, Rochester, MI, 48309, USA.
| | - Tasmin Karim
- Department of Computer Science and Engineering, Oakland University, Rochester, MI, 48309, USA.
| | | | - Sayed Mehedi Azim
- Center for Computational and Integrative Biology, Rutgers University, Camden, NJ, 08102, USA.
| | - Iman Dehzangi
- Center for Computational and Integrative Biology, Rutgers University, Camden, NJ, 08102, USA; Department of Computer Science, Rutgers University, Camden, NJ, 08102, USA.
| | - Mst Shapna Akter
- Department of Computer Science and Engineering, Oakland University, Rochester, MI, 48309, USA.
| | - Sobhy M Ibrahim
- Department of Biochemistry, College of Science, King Saud University, P.O. Box: 2455, Riyadh, 11451, Saudi Arabia.
| | - Md Mamun Ali
- Division of Biomedical Engineering, University of Saskatchewan, 57 Campus Drive, Saskatoon, SK, S7N 5A9, Canada; Department of Software Engineering, Daffodil International University, Daffodil Smart City, Birulia, Dhaka, 1216, Bangladesh.
| | - Kawsar Ahmed
- Department of Electrical and Computer Engineering, University of Saskatchewan, 57 Campus Drive, Saskatoon, SK, S7N 5A9, Canada; Health Informatics Research Lab, Department of Computer Science and Engineering, Daffodil International University, Daffodil Smart City, Birulia, Dhaka, 1216, Bangladesh.
| | - Francis M Bui
- Department of Electrical and Computer Engineering, University of Saskatchewan, 57 Campus Drive, Saskatoon, SK, S7N 5A9, Canada.
| |
Collapse
|
2
|
Guo C, Chen Y, Ma C, Hao S, Song J. A Survey on AI-Driven Mouse Behavior Analysis Applications and Solutions. Bioengineering (Basel) 2024; 11:1121. [PMID: 39593781 PMCID: PMC11591614 DOI: 10.3390/bioengineering11111121] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/20/2024] [Revised: 10/30/2024] [Accepted: 11/04/2024] [Indexed: 11/28/2024] Open
Abstract
The physiological similarities between mice and humans make them vital animal models in biological and medical research. This paper explores the application of artificial intelligence (AI) in analyzing mice behavior, emphasizing AI's potential to identify and classify these behaviors. Traditional methods struggle to capture subtle behavioral features, whereas AI can automatically extract quantitative features from large datasets. Consequently, this study aims to leverage AI to enhance the efficiency and accuracy of mice behavior analysis. The paper reviews various applications of mice behavior analysis, categorizes deep learning tasks based on an AI pyramid, and summarizes AI methods for addressing these tasks. The findings indicate that AI technologies are increasingly applied in mice behavior analysis, including disease detection, assessment of external stimuli effects, social behavior analysis, and neurobehavioral assessment. The selection of AI methods is crucial and must align with specific applications. Despite AI's promising potential in mice behavior analysis, challenges such as insufficient datasets and benchmarks remain. Furthermore, there is a need for a more integrated AI platform, along with standardized datasets and benchmarks, to support these analyses and further advance AI-driven mice behavior analysis.
Collapse
Affiliation(s)
- Chaopeng Guo
- Software College, Northeastern University, Shenyang 110169, China; (C.G.); (Y.C.)
| | - Yuming Chen
- Software College, Northeastern University, Shenyang 110169, China; (C.G.); (Y.C.)
| | - Chengxia Ma
- College of Life and Health Sciences, Northeastern University, Shenyang 110169, China;
| | - Shuang Hao
- College of Life and Health Sciences, Northeastern University, Shenyang 110169, China;
| | - Jie Song
- Software College, Northeastern University, Shenyang 110169, China; (C.G.); (Y.C.)
| |
Collapse
|
3
|
Shaon MSH, Karim T, Ali MM, Ahmed K, Bui FM, Chen L, Moni MA. A robust deep learning approach for identification of RNA 5-methyluridine sites. Sci Rep 2024; 14:25688. [PMID: 39465261 PMCID: PMC11514282 DOI: 10.1038/s41598-024-76148-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/16/2024] [Accepted: 10/10/2024] [Indexed: 10/29/2024] Open
Abstract
RNA 5-methyluridine (m5U) sites play a significant role in understanding RNA modifications, which influence numerous biological processes such as gene expression and cellular functioning. Consequently, the identification of m5U sites can play a vital role in the integrity, structure, and function of RNA molecules. Therefore, this study introduces GRUpred-m5U, a novel deep learning-based framework based on a gated recurrent unit in mature RNA and full transcript RNA datasets. We used three descriptor groups: nucleic acid composition, pseudo nucleic acid composition, and physicochemical properties, which include five feature extraction methods ENAC, Kmer, DPCP, DPCP type 2, and PseDNC. Initially, we aggregated all the feature extraction methods and created a new merged set. Three hybrid models were developed employing deep-learning methods and evaluated through 10-fold cross-validation with seven evaluation metrics. After a comprehensive evaluation, the GRUpred-m5U model outperformed the other applied models, obtaining 98.41% and 96.70% accuracy on the two datasets, respectively. To our knowledge, the proposed model outperformed all the existing state-of-the-art technology. The proposed supervised machine learning model was evaluated using unsupervised machine learning techniques such as principal component analysis (PCA), and it was observed that the proposed method provided a valid performance for identifying m5U. Considering its multi-layered construction, the GRUpred-m5U model has tremendous potential for future applications in the biological industry. The model, which consisted of neurons processing complicated input, excelled at pattern recognition and produced reliable results. Despite its greater size, the model obtained accurate results, essential in detecting m5U.
Collapse
Affiliation(s)
| | - Tasmin Karim
- Department of Computer Science and Informatics, Oakland University, Rochester, MI, 48309, USA
| | - Md Mamun Ali
- Division of Biomedical Engineering, University of Saskatchewan, 57 Campus Drive, Saskatoon, SK, S7N 5A9, Canada
- Department of Software Engineering, Daffodil Smart City (DSC), Daffodil International University, Birulia, Savar, Dhaka, 1216, Bangladesh
| | - Kawsar Ahmed
- Department of Electrical and Computer Engineering, University of Saskatchewan, 57 Campus Drive, Saskatoon, SK, S7N 5A9, Canada.
- Group of Bio-photomatiχ, Department of Information and Communication Technology, Mawlana Bhashani Science and Technology University, Santosh, 1902, Tangail, Bangladesh.
- Health Informatics Research Lab, Department of Computer Science and Engineering, Daffodil International University, Daffodil Smart City, Dhaka, 1216, Birulia, Bangladesh.
| | - Francis M Bui
- Department of Electrical and Computer Engineering, University of Saskatchewan, 57 Campus Drive, Saskatoon, SK, S7N 5A9, Canada
| | - Li Chen
- Department of Electrical and Computer Engineering, University of Saskatchewan, 57 Campus Drive, Saskatoon, SK, S7N 5A9, Canada
| | - Mohammad Ali Moni
- AI & Digital Health Technology, Artificial Intelligence & Cyber Future Institute, Charles Sturt University, Bathurst, NSW, 2795, Australia.
- AI & Digital Health Technology, Rural Health Research Institute, Charles Sturt University, Orange, NSW, 2800, Australia.
| |
Collapse
|
4
|
Xie GB, Yu Y, Lin ZY, Chen RB, Xie JH, Liu ZG. 4 mC site recognition algorithm based on pruned pre-trained DNABert-Pruning model and fused artificial feature encoding. Anal Biochem 2024; 689:115492. [PMID: 38458307 DOI: 10.1016/j.ab.2024.115492] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/27/2023] [Accepted: 02/21/2024] [Indexed: 03/10/2024]
Abstract
DNA 4 mC plays a crucial role in the genetic expression process of organisms. However, existing deep learning algorithms have shortcomings in the ability to represent DNA sequence features. In this paper, we propose a 4 mC site identification algorithm, DNABert-4mC, based on a fusion of the pruned pre-training DNABert-Pruning model and artificial feature encoding to identify 4 mC sites. The algorithm prunes and compresses the DNABert model, resulting in the pruned pre-training model DNABert-Pruning. This model reduces the number of parameters and removes redundancy from output features, yielding more precise feature representations while upholding accuracy.Simultaneously, the algorithm constructs an artificial feature encoding module to assist the DNABert-Pruning model in feature representation, effectively supplementing the information that is missing from the pre-trained features. The algorithm also introduces the AFF-4mC fusion strategy, which combines artificial feature encoding with the DNABert-Pruning model, to improve the feature representation capability of DNA sequences in multi-semantic spaces and better extract 4 mC sites and the distribution of nucleotide importance within the sequence. In experiments on six independent test sets, the DNABert-4mC algorithm achieved an average AUC value of 93.81%, outperforming seven other advanced algorithms with improvements of 2.05%, 5.02%, 11.32%, 5.90%, 12.02%, 2.42% and 2.34%, respectively.
Collapse
Affiliation(s)
- Guo-Bo Xie
- Guangdong University of Technology, Guangzhou, 510000, China
| | - Yi Yu
- Guangdong University of Technology, Guangzhou, 510000, China
| | - Zhi-Yi Lin
- Guangdong University of Technology, Guangzhou, 510000, China.
| | - Rui-Bin Chen
- Guangdong University of Technology, Guangzhou, 510000, China
| | - Jian-Hui Xie
- Guangdong University of Technology, Guangzhou, 510000, China
| | - Zhen-Guo Liu
- Department of Thoracic Surgery, The First Affiliated Hospital of Sun Yat-sen University, 58 Zhongshan 2nd Road, Guangzhou, 510080, China.
| |
Collapse
|
5
|
Zhang S, Zhao Y, Liang Y. AACFlow: an end-to-end model based on attention augmented convolutional neural network and flow-attention mechanism for identification of anticancer peptides. Bioinformatics 2024; 40:btae142. [PMID: 38452348 PMCID: PMC10973939 DOI: 10.1093/bioinformatics/btae142] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/24/2023] [Revised: 03/01/2024] [Accepted: 03/06/2024] [Indexed: 03/09/2024] Open
Abstract
MOTIVATION Anticancer peptides (ACPs) have natural cationic properties and can act on the anionic cell membrane of cancer cells to kill cancer cells. Therefore, ACPs have become a potential anticancer drug with good research value and prospect. RESULTS In this article, we propose AACFlow, an end-to-end model for identification of ACPs based on deep learning. End-to-end models have more room to automatically adjust according to the data, making the overall fit better and reducing error propagation. The combination of attention augmented convolutional neural network (AAConv) and multi-layer convolutional neural network (CNN) forms a deep representation learning module, which is used to obtain global and local information on the sequence. Based on the concept of flow network, multi-head flow-attention mechanism is introduced to mine the deep features of the sequence to improve the efficiency of the model. On the independent test dataset, the ACC, Sn, Sp, and AUC values of AACFlow are 83.9%, 83.0%, 84.8%, and 0.892, respectively, which are 4.9%, 1.5%, 8.0%, and 0.016 higher than those of the baseline model. The MCC value is 67.85%. In addition, we visualize the features extracted by each module to enhance the interpretability of the model. Various experiments show that our model is more competitive in predicting ACPs.
Collapse
Affiliation(s)
- Shengli Zhang
- School of Mathematics and Statistics, Xidian University, Xi'an 710071, China
| | - Ya Zhao
- School of Mathematics and Statistics, Xidian University, Xi'an 710071, China
| | - Yunyun Liang
- School of Science, Xi’an Polytechnic University, Xi'an 710048, China
| |
Collapse
|
6
|
Charoenkwan P, Chumnanpuen P, Schaduangrat N, Shoombuatong W. Accelerating the identification of the allergenic potential of plant proteins using a stacked ensemble-learning framework. J Biomol Struct Dyn 2024:1-13. [PMID: 38385478 DOI: 10.1080/07391102.2024.2318482] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/07/2023] [Accepted: 02/08/2024] [Indexed: 02/23/2024]
Abstract
Plant-allergenic proteins (PAPs) have the potential to induce allergic reactions in certain individuals. While these proteins are generally innocuous for the majority of people, they can elicit an immune response in those with particular sensitivities. Thus, screening and prioritizing the allergenic potential of plant proteins is indispensable for the development of diagnostic tools, therapeutic interventions or medications to treat allergic reactions. However, investigating the allergenic potential of plant proteins based on experimental methods is costly and labour-intensive. Therefore, we develop StackPAP, a three-layer stacking ensemble framework for accurate large-scale identification of PAPs. In StackPAP, at the first layer, we conducted a comprehensive analysis of an extensive set of feature descriptors. Subsequently, we selected and fused five potential sequence-based feature descriptors, including amphiphilic pseudo-amino acid composition, dipeptide deviation from expected mean, amino acid composition, pseudo amino acid composition and dipeptide composition. Additionally, we applied an efficient genetic algorithm (GA-SAR) to determine informative feature sets. In the second layer, 12 powerful machine learning (ML) methods, in combination with all the informative feature sets, were employed to construct a pool of base classifiers. Finally, 13 potential base classifiers were selected using the GA-SAR method and combined to develop the final meta-classifier. Our experimental results revealed the promising prediction performance of StackPAP, with an accuracy, Matthew's correlation coefficient and AUC of 0.984, 0.969 and 0.993, respectively, as judged by the independent test dataset. In conclusion, both cross-validation and independent test results indicated the superior performance of StackPAP compared with several ML-based classifiers. To accelerate the identification of the allergenicity of plant proteins, we developed a user-friendly web server for StackPAP (https://pmlabqsar.pythonanywhere.com/StackPAP). We anticipate that StackPAP will be an efficient and useful tool for rapidly screening PAPs from a vast number of plant proteins.
Collapse
Affiliation(s)
- Phasit Charoenkwan
- Modern Management and Information Technology, College of Arts, Media and Technology, Chiang Mai University, Thailand
| | - Pramote Chumnanpuen
- Department of Zoology, Faculty of Science, Kasetsart University, Bangkok, Thailand
- Omics Center for Agriculture, Bioresources, Food, and Health, Kasetsart University (OmiKU), Bangkok, Thailand
| | - Nalini Schaduangrat
- Center for Research Innovation and Biomedical Informatics, Faculty of Medical Technology, Mahidol University, Bangkok, Thailand
| | - Watshara Shoombuatong
- Center for Research Innovation and Biomedical Informatics, Faculty of Medical Technology, Mahidol University, Bangkok, Thailand
| |
Collapse
|
7
|
Jia J, Deng Y, Yi M, Zhu Y. 4mCPred-GSIMP: Predicting DNA N4-methylcytosine sites in the mouse genome with multi-Scale adaptive features extraction and fusion. MATHEMATICAL BIOSCIENCES AND ENGINEERING : MBE 2024; 21:253-271. [PMID: 38303422 DOI: 10.3934/mbe.2024012] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 02/03/2024]
Abstract
The epigenetic modification of DNA N4-methylcytosine (4mC) is vital for controlling DNA replication and expression. It is crucial to pinpoint 4mC's location to comprehend its role in physiological and pathological processes. However, accurate 4mC detection is difficult to achieve due to technical constraints. In this paper, we propose a deep learning-based approach 4mCPred-GSIMP for predicting 4mC sites in the mouse genome. The approach encodes DNA sequences using four feature encoding methods and combines multi-scale convolution and improved selective kernel convolution to adaptively extract and fuse features from different scales, thereby improving feature representation and optimization effect. In addition, we also use convolutional residual connections, global response normalization and pointwise convolution techniques to optimize the model. On the independent test dataset, 4mCPred-GSIMP shows high sensitivity, specificity, accuracy, Matthews correlation coefficient and area under the curve, which are 0.7812, 0.9312, 0.8562, 0.7207 and 0.9233, respectively. Various experiments demonstrate that 4mCPred-GSIMP outperforms existing prediction tools.
Collapse
Affiliation(s)
- Jianhua Jia
- School of Information Engineering, Jingdezhen Ceramic University, Jingdezhen 333403, China
| | - Yu Deng
- School of Information Engineering, Jingdezhen Ceramic University, Jingdezhen 333403, China
| | - Mengyue Yi
- School of Information Engineering, Jingdezhen Ceramic University, Jingdezhen 333403, China
| | - Yuhui Zhu
- School of Information Engineering, Jingdezhen Ceramic University, Jingdezhen 333403, China
| |
Collapse
|
8
|
Zhang L, Xiao K, Kong L. A computational method for small molecule-RNA binding sites identification by utilizing position specificity and complex network information. Biosystems 2024; 235:105094. [PMID: 38056591 DOI: 10.1016/j.biosystems.2023.105094] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/03/2023] [Revised: 11/23/2023] [Accepted: 11/24/2023] [Indexed: 12/08/2023]
Abstract
Some computational methods have been given for small molecule-RNA binding site identification due to that it plays a significant role in revealing biology function researches. However, it is still challenging to design an accurate model, especially for MCC. We designed a feature extraction technology from two aspects (position specificity and complex network information). Specifically, complex network was employed to express the space topological structure and sequence position information for improving prediction effect. Then, the features fused position specificity and complex network information were input into random forest classifier for model construction. The AUC of 88.22%, 77.92% and 81.46% were obtained on three independent datasets (RB19, CS71, RB78). Compared with the existing method, the best MCC were obtained on three datasets, which were 8.19%, 0.59% and 4.35% higher than the state-of-the-art prediction methods, respectively. The outstanding performances show that our method is a powerful tool to identify RNA binding sites, helping to the design RNA-targeting small molecule drugs. The data and resource codes are available at https://github.com/Kangxiaoneuq/PCN_RNAsite.
Collapse
Affiliation(s)
- Lichao Zhang
- School of Mathematics and Statistics, Northeastern University at Qinhuangdao, Qinhuangdao, 066000, PR China; Hebei Innovation Center for Smart Perception and Applied Technology of Agricultural Data, Qinhuangdao, 066000, PR China.
| | - Kang Xiao
- School of Mathematics and Statistics, Northeastern University at Qinhuangdao, Qinhuangdao, 066000, PR China.
| | - Liang Kong
- Hebei Innovation Center for Smart Perception and Applied Technology of Agricultural Data, Qinhuangdao, 066000, PR China; School of Mathematics and Information Science & Technology, Hebei Normal University of Science & Technology, Qinhuangdao, 066000, PR China.
| |
Collapse
|
9
|
Wang L, Zhou Y. MRM-BERT: a novel deep neural network predictor of multiple RNA modifications by fusing BERT representation and sequence features. RNA Biol 2024; 21:1-10. [PMID: 38357904 PMCID: PMC10877979 DOI: 10.1080/15476286.2024.2315384] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Revised: 12/26/2023] [Accepted: 02/02/2024] [Indexed: 02/16/2024] Open
Abstract
RNA modifications play crucial roles in various biological processes and diseases. Accurate prediction of RNA modification sites is essential for understanding their functions. In this study, we propose a hybrid approach that fuses a pre-trained sequence representation with various sequence features to predict multiple types of RNA modifications in one combined prediction framework. We developed MRM-BERT, a deep learning method that combined the pre-trained DNABERT deep sequence representation module and the convolutional neural network (CNN) exploiting four traditional sequence feature encodings to improve the prediction performance. MRM-BERT was evaluated on multiple datasets of 12 commonly occurring RNA modifications, including m6A, m5C, m1A and so on. The results demonstrate that our hybrid model outperforms other models in terms of area under receiver operating characteristic curve (AUC) for all 12 types of RNA modifications. MRM-BERT is available as an online tool (http://117.122.208.21:8501) or source code (https://github.com/abhhba999/MRM-BERT), which allows users to predict RNA modification sites and visualize the results. Overall, our study provides an effective and efficient approach to predict multiple RNA modifications, contributing to the understanding of RNA biology and the development of therapeutic strategies.
Collapse
Affiliation(s)
- Linshu Wang
- Department of Biomedical Informatics, School of Basic Medical Sciences, Peking University, Beijing, China
| | - Yuan Zhou
- Department of Biomedical Informatics, School of Basic Medical Sciences, Peking University, Beijing, China
- Department of Biomedical Informatics, School of Basic Medical Sciences, State Key Laboratory of Vascular Homeostasis and Remodeling, Peking University, Beijing, China
| |
Collapse
|
10
|
Bai J, Yang H, Wu C. MLACNN: an attention mechanism-based CNN architecture for predicting genome-wide DNA methylation. Theory Biosci 2023; 142:359-370. [PMID: 37648910 PMCID: PMC10564812 DOI: 10.1007/s12064-023-00402-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/25/2022] [Accepted: 07/31/2023] [Indexed: 09/01/2023]
Abstract
Methylation is an important epigenetic regulation of methylation genes that plays a crucial role in regulating biological processes. While traditional methods for detecting methylation in biological experiments are constantly improving, the development of artificial intelligence has led to the emergence of deep learning and machine learning methods as a new trend. However, traditional machine learning-based methods rely heavily on manual feature extraction, and most deep learning methods for studying methylation extract fewer features due to their simple network structures. To address this, we propose a bottomneck network based on an attention mechanism and use new methods to ensure that the deep network can learn more effective features while minimizing overfitting. This approach enables the model to learn more features from nucleotide sequences and make better predictions of methylation. The model uses three coding methods to encode the original DNA sequence and then applies feature fusion based on attention mechanisms to obtain the best fusion method. Our results demonstrate that MLACNN outperforms previous methods and achieves more satisfactory performance.
Collapse
Affiliation(s)
- JianGuo Bai
- Shandong Jiaotong University, Jinan City, Shandong Province China
| | - Hai Yang
- Shandong Jiaotong University, Jinan City, Shandong Province China
| | - ChangDe Wu
- Shandong Jiaotong University, Jinan City, Shandong Province China
| |
Collapse
|
11
|
Liu N, Zhang Z, Wu Y, Wang Y, Liang Y. CRBSP:Prediction of CircRNA-RBP Binding Sites Based on Multimodal Intermediate Fusion. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2023; 20:2898-2906. [PMID: 37130249 DOI: 10.1109/tcbb.2023.3272400] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 05/04/2023]
Abstract
Circular RNA (CircRNA) is widely expressed and has physiological and pathological significance, regulating post-transcriptional processes via its protein-binding activity. However, whereas much work has been done on linear RNA and RNA binding protein (RBP), little is known about the binding sites of CircRNA. The current report is on the development of a medium-term multimodal data fusion strategy, CRBSP, to predict CircRNA-RBP binding sites. CRBSP represents the CircRNA trinucleotide semantic, location, composition and frequency information as the corresponding coding methods of Word to vector (Word2vec), Position-specific trinucleotide propensity (PSTNP), Pseudo trinucleotide composition (PseTNC) and Trinucleotide nucleotide composition (TNC), respectively. CNN (Convolution Neural Networks) was used to extract global information and BiLSTM (bidirectional Long- and Short-Term Memory network) encoder and LSTM (Long- and Short-Term Memory network) decoder for local sequence information. Enhancement of the contributions of key features by the self-attention mechanism was followed by mid-term fusion of the four enhanced features. Logistic Regression (LR) classifier showed that CRBSP gives a mean AUC value of 0.9362 through 5-fold Cross Validation of all 37 datasets, a performance which is superior to five current state-of-the-art models. Similar evaluation of linear RNA-RBP binding sites gave an AUC value of 0.7615 which is also higher than other prediction methods, demonstrating the robustness of CRBSP.
Collapse
|
12
|
Chen S, Liao Y, Zhao J, Bin Y, Zheng C. PACVP: Prediction of Anti-Coronavirus Peptides Using a Stacking Learning Strategy With Effective Feature Representation. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2023; 20:3106-3116. [PMID: 37022025 DOI: 10.1109/tcbb.2023.3238370] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/19/2023]
Abstract
Due to the global outbreak of COVID-19 and its variants, antiviral peptides with anti-coronavirus activity (ACVPs) represent a promising new drug candidate for the treatment of coronavirus infection. At present, several computational tools have been developed to identify ACVPs, but the overall prediction performance is still not enough to meet the actual therapeutic application. In this study, we constructed an efficient and reliable prediction model PACVP (Prediction of Anti-CoronaVirus Peptides) for identifying ACVPs based on effective feature representation and a two-layer stacking learning framework. In the first layer, we use nine feature encoding methods with different feature representation angles to characterize the rich sequence information and fuse them into a feature matrix. Secondly, data normalization and unbalanced data processing are carried out. Next, 12 baseline models are constructed by combining three feature selection methods and four machine learning classification algorithms. In the second layer, we input the optimal probability features into the logistic regression algorithm (LR) to train the final model PACVP. The experiments show that PACVP achieves favorable prediction performance on independent test dataset, with ACC of 0.9208 and AUC of 0.9465. We hope that PACVP will become a useful method for identifying, annotating and characterizing novel ACVPs.
Collapse
|
13
|
Dhakal P, Tayara H, Chong KT. An ensemble of stacking classifiers for improved prediction of miRNA-mRNA interactions. Comput Biol Med 2023; 164:107242. [PMID: 37473564 DOI: 10.1016/j.compbiomed.2023.107242] [Citation(s) in RCA: 6] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/01/2023] [Revised: 06/21/2023] [Accepted: 07/07/2023] [Indexed: 07/22/2023]
Abstract
MicroRNAs (miRNAs) are small non-coding RNA molecules that play a crucial role in regulating gene expression at the post-transcriptional level by binding to potential target sites of messenger RNAs (mRNAs), facilitated by the Argonaute family of proteins. Selecting the conservative candidate target sites (CTS) is a challenging step, considering that most of the existing computational algorithms primarily focus on canonical site types, which is a time-consuming and inefficient utilization of miRNA target site interactions. We developed a stacking classifier algorithm that addresses the CTS selection criteria using feature-encoding techniques that generates feature vectors, including k-mer nucleotide composition, dinucleotide composition, pseudo-nucleotide composition, and sequence order coupling. This innovative stacking classifier algorithm surpassed previous state-of-the-art algorithms in predicting functional miRNA targets. We evaluated the performance of the proposed model on 10 independent test datasets and obtained an average accuracy of 79.77%, which is a significant improvement of 7.26 % over previous models. This improvement shows that the proposed method has great potential for distinguishing highly functional miRNA targets and can serve as a valuable tool in biomedical and drug development research.
Collapse
Affiliation(s)
- Priyash Dhakal
- Department of Electronics and Information Engineering, Jeonbuk National University, Jeonju-si, 54896, Jeollabuk-do, Republic of Korea
| | - Hilal Tayara
- School of International Engineering and Science, Jeonbuk National University, Jeonju-si, 54896, Jeollabuk-do, Republic of Korea.
| | - Kil To Chong
- Department of Electronics and Information Engineering, Jeonbuk National University, Jeonju-si, 54896, Jeollabuk-do, Republic of Korea; Advanced Electronics and Information Research Center, Jeonbuk National University, Jeonju-si, 54896, Jeollabuk-do, Republic of Korea.
| |
Collapse
|
14
|
Ataş PK. A novel hybrid model to predict concomitant diseases for Hashimoto's thyroiditis. BMC Bioinformatics 2023; 24:319. [PMID: 37620755 PMCID: PMC10464155 DOI: 10.1186/s12859-023-05443-5] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/25/2023] [Accepted: 08/10/2023] [Indexed: 08/26/2023] Open
Abstract
Hashimoto's thyroiditis is an autoimmune disorder characterized by the destruction of thyroid cells through immune-mediated mechanisms involving cells and antibodies. The condition can trigger disturbances in metabolism, leading to the development of other autoimmune diseases, known as concomitant diseases. Multiple concomitant diseases may coexist in a single individual, making it challenging to diagnose and manage them effectively. This study aims to propose a novel hybrid algorithm that classifies concomitant diseases associated with Hashimoto's thyroiditis based on sequences. The approach involves building distinct prediction models for each class and using the output of one model as input for the subsequent one, resulting in a dynamic decision-making process. Genes associated with concomitant diseases were collected alongside those related to Hashimoto's thyroiditis, and their sequences were obtained from the NCBI site in fasta format. The hybrid algorithm was evaluated against common machine learning algorithms and their various combinations. The experimental results demonstrate that the proposed hybrid model outperforms existing classification methods in terms of performance metrics. The significance of this study lies in its two distinctive aspects. Firstly, it presents a new benchmarking dataset that has not been previously developed in this field, using diverse methods. Secondly, it proposes a more effective and efficient solution that accounts for the dynamic nature of the dataset. The hybrid approach holds promise in investigating the genetic heterogeneity of complex diseases such as Hashimoto's thyroiditis and identifying new autoimmune disease genes. Additionally, the results of this study may aid in the development of genetic screening tools and laboratory experiments targeting Hashimoto's thyroiditis genetic risk factors. New software, models, and techniques for computing, including systems biology, machine learning, and artificial intelligence, are used in our study.
Collapse
Affiliation(s)
- Pınar Karadayı Ataş
- Department of Computer Engineering, Istanbul Arel University, 34537, Buyukcekmece, Istanbul, Turkey.
| |
Collapse
|
15
|
Ju H, Bai J, Jiang J, Che Y, Chen X. Comparative evaluation and analysis of DNA N4-methylcytosine methylation sites using deep learning. Front Genet 2023; 14:1254827. [PMID: 37671040 PMCID: PMC10476523 DOI: 10.3389/fgene.2023.1254827] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/07/2023] [Accepted: 07/31/2023] [Indexed: 09/07/2023] Open
Abstract
DNA N4-methylcytosine (4mC) is significantly involved in biological processes, such as DNA expression, repair, and replication. Therefore, accurate prediction methods are urgently needed. Deep learning methods have transformed applications that previously require sequencing expertise into engineering challenges that do not require expertise to solve. Here, we compare a variety of state-of-the-art deep learning models on six benchmark datasets to evaluate their performance in 4mC methylation site detection. We visualize the statistical analysis of the datasets and the performance of different deep-learning models. We conclude that deep learning can greatly expand the potential of methylation site prediction.
Collapse
Affiliation(s)
- Hong Ju
- Heilongjiang Agricultural Engineering Vocational College, Harbin, China
| | - Jie Bai
- Engineering Research Center of Integration and Application of Digital Learning Technology, Ministry of Education, Hangzhou, China
| | - Jing Jiang
- Beidahuang Industry Group General Hospital, Harbin, China
| | - Yusheng Che
- Heilongjiang Agricultural Engineering Vocational College, Harbin, China
| | - Xin Chen
- Department of Neurosurgical Laboratory, The First Affiliated Hospital of Harbin Medical University, Harbin, China
| |
Collapse
|
16
|
Nguyen-Vo TH, Trinh QH, Nguyen L, Nguyen-Hoang PU, Rahardja S, Nguyen BP. i4mC-GRU: Identifying DNA N 4-Methylcytosine sites in mouse genomes using bidirectional gated recurrent unit and sequence-embedded features. Comput Struct Biotechnol J 2023; 21:3045-3053. [PMID: 37273848 PMCID: PMC10238585 DOI: 10.1016/j.csbj.2023.05.014] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/03/2022] [Revised: 05/12/2023] [Accepted: 05/12/2023] [Indexed: 06/06/2023] Open
Abstract
N4-methylcytosine (4mC) is one of the most common DNA methylation modifications found in both prokaryotic and eukaryotic genomes. Since the 4mC has various essential biological roles, determining its location helps reveal unexplored physiological and pathological pathways. In this study, we propose an effective computational method called i4mC-GRU using a gated recurrent unit and duplet sequence-embedded features to predict potential 4mC sites in mouse (Mus musculus) genomes. To fairly assess the performance of the model, we compared our method with several state-of-the-art methods using two different benchmark datasets. Our results showed that i4mC-GRU achieved area under the receiver operating characteristic curve values of 0.97 and 0.89 and area under the precision-recall curve values of 0.98 and 0.90 on the first and second benchmark datasets, respectively. Briefly, our method outperformed existing methods in predicting 4mC sites in mouse genomes. Also, we deployed i4mC-GRU as an online web server, supporting users in genomics studies.
Collapse
Affiliation(s)
- Thanh-Hoang Nguyen-Vo
- School of Mathematics and Statistics, Victoria University of Wellington, Wellington 6140, New Zealand
- School of Innovation, Design and Technology, Wellington Institute of Technology, Wellington 5012, New Zealand
| | - Quang H. Trinh
- School of Information and Communication Technology, Hanoi University of Science and Technology, Hanoi 100000, Vietnam
| | - Loc Nguyen
- School of Mathematics and Statistics, Victoria University of Wellington, Wellington 6140, New Zealand
| | - Phuong-Uyen Nguyen-Hoang
- Computational Biology Center, International University - VNU HCMC, Ho Chi Minh City 700000, Vietnam
| | - Susanto Rahardja
- School of Marine Science and Technology, Northwestern Polytechnical University, Xi’an 710072, China
- Infocomm Technology Cluster, Singapore Institute of Technology, Singapore 138683, Singapore
| | - Binh P. Nguyen
- School of Mathematics and Statistics, Victoria University of Wellington, Wellington 6140, New Zealand
| |
Collapse
|
17
|
Yang S, Yang Z, Yang J. 4mCBERT: A computing tool for the identification of DNA N4-methylcytosine sites by sequence- and chemical-derived information based on ensemble learning strategies. Int J Biol Macromol 2023; 231:123180. [PMID: 36646347 DOI: 10.1016/j.ijbiomac.2023.123180] [Citation(s) in RCA: 8] [Impact Index Per Article: 4.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/19/2022] [Revised: 11/26/2022] [Accepted: 12/30/2022] [Indexed: 01/15/2023]
Abstract
N4-methylcytosine (4mC) is an important DNA chemical modification pattern which is a new methylation modification discovered in recent years and plays critical roles in gene expression regulation, defense against invading genetic elements, genomic imprinting, and so on. Identifying 4mC site from DNA sequence segment contributes to discovering more novel modification patterns. In this paper, we present a model called 4mCBERT that encodes DNA sequence segments by sequence characteristics including one-hot, electron-ion interaction pseudopotential, nucleotide chemical property, word2vec and chemical information containing physicochemical properties (PCP), chemical bidirectional encoder representations from transformers (chemical BERT) and employs ensemble learning framework to develop a prediction model. PCP and chemical BERT features are firstly constructed and applied to predict 4mC sites and show positive contributions to identifying 4mC. For the Matthew's Correlation Coefficient, 4mCBERT significantly outperformed other state-of-the-art models on six independent benchmark datasets including A. thaliana, C. elegans, D. melanogaster, E. coli, G. Pickering, and G. subterraneous by 4.32 % to 24.39 %, 2.52 % to 31.65 %, 2 % to 16.49 %, 6.63 % to 35.15, 8.59 % to 61.85 %, and 8.45 % to 34.45 %. Moreover, 4mCBERT is designed to allow users to predict 4mC sites and retrain 4mC prediction models. In brief, 4mCBERT shows higher performance on six benchmark datasets by incorporating sequence- and chemical-driven information and is available at http://cczubio.top/4mCBERT and https://github.com/abcair/4mCBERT.
Collapse
Affiliation(s)
- Sen Yang
- School of Computer Science and Artificial Intelligence, Aliyun School of Big Data, School of Software, Changzhou 213164, China; The Affiliated Changzhou No 2 People's Hospital of Nanjing Medical University, Changzhou 213164, China.
| | - Zexi Yang
- School of Computer Science and Artificial Intelligence, Aliyun School of Big Data, School of Software, Changzhou 213164, China
| | - Jun Yang
- School of Educational Sciences, Yili Normal University, Yining 835000, China
| |
Collapse
|
18
|
Luo H, Shan W, Chen C, Ding P, Luo L. Improving language model of human genome for DNA-protein binding prediction based on task-specific pre-training. Interdiscip Sci 2023; 15:32-43. [PMID: 36136096 DOI: 10.1007/s12539-022-00537-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/31/2022] [Revised: 08/30/2022] [Accepted: 09/07/2022] [Indexed: 11/27/2022]
Abstract
The DNA-protein binding plays a pivotal role in regulating gene expression and evolution, and computational identification of DNA-protein has drawn more and more attention in bioinformatics. Recently, variants of BERT are also used to capture the semantic information of DNA sequences for predicting DNA-protein bindings. In this study, we leverage a task-specific pre-training strategy on BERT using large-scale multi-source DNA-protein binding data and present TFBert. TFBert treats DNA sequences as natural sentences and k-mer nucleotides as words. It can effectively extract upstream and downstream nucleotide context information by pre-training the 690 unlabeled ChIP-seq datasets. Experiments show that the pre-trained model can achieve promising performance on every single dataset in the 690 ChIP-seq datasets after simple fine tuning, especially on small datasets. The average AUC is 94.7%, outperforming existing popular methods. In conclusion, this study provides a variant of BERT based on pre-training and achieved state-of-the-art results in predicting DNA-protein bindings. We believe that TFBert can provide insights into other biological sequence classification problems.
Collapse
Affiliation(s)
- Hanyu Luo
- School of Computer Science, University of South China, Hengyang, Hunan, 421001, People's Republic of China
| | - Wenyu Shan
- School of Computer Science, University of South China, Hengyang, Hunan, 421001, People's Republic of China
| | - Cheng Chen
- School of Computer Science, University of South China, Hengyang, Hunan, 421001, People's Republic of China
| | - Pingjian Ding
- School of Computer Science, University of South China, Hengyang, Hunan, 421001, People's Republic of China
| | - Lingyun Luo
- School of Computer Science, University of South China, Hengyang, Hunan, 421001, People's Republic of China. .,Hunan Medical Big Data International Science and Technology Innovation Cooperation Base, Hengyang, Hunan, 421001, People's Republic of China.
| |
Collapse
|
19
|
MultiScale-CNN-4mCPred: a multi-scale CNN and adaptive embedding-based method for mouse genome DNA N4-methylcytosine prediction. BMC Bioinformatics 2023; 24:21. [PMID: 36653789 PMCID: PMC9847203 DOI: 10.1186/s12859-023-05135-0] [Citation(s) in RCA: 6] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/15/2022] [Accepted: 01/04/2023] [Indexed: 01/19/2023] Open
Abstract
N4-methylcytosine (4mC) is an important epigenetic mechanism, which regulates many cellular processes such as cell differentiation and gene expression. The knowledge about the 4mC sites is a key foundation to exploring its roles. Due to the limitation of techniques, precise detection of 4mC is still a challenging task. In this paper, we presented a multi-scale convolution neural network (CNN) and adaptive embedding-based computational method for predicting 4mC sites in mouse genome, which was referred to as MultiScale-CNN-4mCPred. The MultiScale-CNN-4mCPred used adaptive embedding to encode nucleotides, and then utilized multi-scale CNNs as well as long short-term memory to extract more in-depth local properties and contextual semantics in the sequences. The MultiScale-CNN-4mCPred is an end-to-end learning method, which requires no sophisticated feature design. The MultiScale-CNN-4mCPred reached an accuracy of 81.66% in the 10-fold cross-validation, and an accuracy of 84.69% in the independent test, outperforming state-of-the-art methods. We implemented the proposed method into a user-friendly web application which is freely available at: http://www.biolscience.cn/MultiScale-CNN-4mCPred/ .
Collapse
|
20
|
Zhou J, Wang X, Wei Z, Meng J, Huang D. 4acCPred: Weakly supervised prediction of N4-acetyldeoxycytosine DNA modification from sequences. MOLECULAR THERAPY - NUCLEIC ACIDS 2022; 30:337-345. [DOI: 10.1016/j.omtn.2022.10.004] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Subscribe] [Scholar Register] [Received: 08/09/2022] [Accepted: 10/12/2022] [Indexed: 11/06/2022]
|
21
|
Zhang P, Zhang H, Wu H. iPro-WAEL: a comprehensive and robust framework for identifying promoters in multiple species. Nucleic Acids Res 2022; 50:10278-10289. [PMID: 36161334 PMCID: PMC9561371 DOI: 10.1093/nar/gkac824] [Citation(s) in RCA: 37] [Impact Index Per Article: 12.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/18/2022] [Revised: 08/24/2022] [Accepted: 09/14/2022] [Indexed: 11/27/2022] Open
Abstract
Promoters are consensus DNA sequences located near the transcription start sites and they play an important role in transcription initiation. Due to their importance in biological processes, the identification of promoters is significantly important for characterizing the expression of the genes. Numerous computational methods have been proposed to predict promoters. However, it is difficult for these methods to achieve satisfactory performance in multiple species. In this study, we propose a novel weighted average ensemble learning model, termed iPro-WAEL, for identifying promoters in multiple species, including Human, Mouse, E.coli, Arabidopsis, B.amyloliquefaciens, B.subtilis and R.capsulatus. Extensive benchmarking experiments illustrate that iPro-WAEL has optimal performance and is superior to the current methods in promoter prediction. The experimental results also demonstrate a satisfactory prediction ability of iPro-WAEL on cross-cell lines, promoters annotated by other methods and distinguishing between promoters and enhancers. Moreover, we identify the most important transcription factor binding site (TFBS) motif in promoter regions to facilitate the study of identifying important motifs in the promoter regions. The source code of iPro-WAEL is freely available at https://github.com/HaoWuLab-Bioinformatics/iPro-WAEL.
Collapse
Affiliation(s)
- Pengyu Zhang
- School of Software, Shandong University, Jinan, 250101, Shandong, China.,College of Information Engineering, Northwest A&F University, Yangling, 712100, Shaanxi, China
| | - Hongming Zhang
- College of Information Engineering, Northwest A&F University, Yangling, 712100, Shaanxi, China
| | - Hao Wu
- School of Software, Shandong University, Jinan, 250101, Shandong, China
| |
Collapse
|
22
|
PSP-PJMI: An innovative feature representation algorithm for identifying DNA N4-methylcytosine sites. Inf Sci (N Y) 2022. [DOI: 10.1016/j.ins.2022.05.060] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/25/2022]
|
23
|
Chen Z, Liu X, Zhao P, Li C, Wang Y, Li F, Akutsu T, Bain C, Gasser RB, Li J, Yang Z, Gao X, Kurgan L, Song J. iFeatureOmega: an integrative platform for engineering, visualization and analysis of features from molecular sequences, structural and ligand data sets. Nucleic Acids Res 2022; 50:W434-W447. [PMID: 35524557 PMCID: PMC9252729 DOI: 10.1093/nar/gkac351] [Citation(s) in RCA: 42] [Impact Index Per Article: 14.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/21/2022] [Revised: 04/22/2022] [Accepted: 04/25/2022] [Indexed: 01/07/2023] Open
Abstract
The rapid accumulation of molecular data motivates development of innovative approaches to computationally characterize sequences, structures and functions of biological and chemical molecules in an efficient, accessible and accurate manner. Notwithstanding several computational tools that characterize protein or nucleic acids data, there are no one-stop computational toolkits that comprehensively characterize a wide range of biomolecules. We address this vital need by developing a holistic platform that generates features from sequence and structural data for a diverse collection of molecule types. Our freely available and easy-to-use iFeatureOmega platform generates, analyzes and visualizes 189 representations for biological sequences, structures and ligands. To the best of our knowledge, iFeatureOmega provides the largest scope when directly compared to the current solutions, in terms of the number of feature extraction and analysis approaches and coverage of different molecules. We release three versions of iFeatureOmega including a webserver, command line interface and graphical interface to satisfy needs of experienced bioinformaticians and less computer-savvy biologists and biochemists. With the assistance of iFeatureOmega, users can encode their molecular data into representations that facilitate construction of predictive models and analytical studies. We highlight benefits of iFeatureOmega based on three research applications, demonstrating how it can be used to accelerate and streamline research in bioinformatics, computational biology, and cheminformatics areas. The iFeatureOmega webserver is freely available at http://ifeatureomega.erc.monash.edu and the standalone versions can be downloaded from https://github.com/Superzchen/iFeatureOmega-GUI/ and https://github.com/Superzchen/iFeatureOmega-CLI/.
Collapse
Affiliation(s)
- Zhen Chen
- Collaborative Innovation Center of Henan Grain Crops, Henan Agricultural University, Zhengzhou 450046, China
- Center for Crop Genome Engineering, Henan Agricultural University, Zhengzhou 450046, China
| | - Xuhan Liu
- Drug Discovery and Safety, Leiden Academic Centre for Drug Research, Einsteinweg 55, Leiden 2333 CC, The Netherlands
| | - Pei Zhao
- State Key Laboratory of Cotton Biology, Institute of Cotton Research of Chinese Academy of Agricultural Sciences (CAAS), Anyang 455000, China
| | - Chen Li
- Monash Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Melbourne, Victoria 3800, Australia
| | - Yanan Wang
- Monash Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Melbourne, Victoria 3800, Australia
| | - Fuyi Li
- Monash Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Melbourne, Victoria 3800, Australia
| | - Tatsuya Akutsu
- Bioinformatics Center, Institute for Chemical Research, Kyoto University, Kyoto 611-0011, Japan
| | - Chris Bain
- Monash Data Future Institutes, Monash University, Melbourne, Victoria 3800, Australia
| | - Robin B Gasser
- Department of Veterinary Biosciences, Melbourne Veterinary School, The University of Melbourne, Parkville, Victoria 3010, Australia
| | - Junzhou Li
- Collaborative Innovation Center of Henan Grain Crops, Henan Agricultural University, Zhengzhou 450046, China
| | - Zuoren Yang
- State Key Laboratory of Cotton Biology, Institute of Cotton Research of Chinese Academy of Agricultural Sciences (CAAS), Anyang 455000, China
| | - Xin Gao
- Computational Bioscience Research Center (CBRC), Computer, Electrical and Mathematical Sciences and Engineering Division, King Abdullah University of Science and Technology (KAUST), Thuwal 23955, Saudi Arabia
| | - Lukasz Kurgan
- Department of Computer Science, Virginia Commonwealth University, Richmond, VA, USA
| | - Jiangning Song
- Monash Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Melbourne, Victoria 3800, Australia
- Monash Data Future Institutes, Monash University, Melbourne, Victoria 3800, Australia
| |
Collapse
|
24
|
Zhu L, Li W. Roles of Physicochemical and Structural Properties of RNA-Binding Proteins in Predicting the Activities of Trans-Acting Splicing Factors with Machine Learning. Int J Mol Sci 2022; 23:ijms23084426. [PMID: 35457243 PMCID: PMC9030803 DOI: 10.3390/ijms23084426] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/04/2022] [Revised: 04/13/2022] [Accepted: 04/14/2022] [Indexed: 02/06/2023] Open
Abstract
Trans-acting splicing factors play a pivotal role in modulating alternative splicing by specifically binding to cis-elements in pre-mRNAs. There are approximately 1500 RNA-binding proteins (RBPs) in the human genome, but the activities of these RBPs in alternative splicing are unknown. Since determining RBP activities through experimental methods is expensive and time consuming, the development of an efficient computational method for predicting the activities of RBPs in alternative splicing from their sequences is of great practical importance. Recently, a machine learning model for predicting the activities of splicing factors was built based on features of single and dual amino acid compositions. Here, we explored the role of physicochemical and structural properties in predicting their activities in alternative splicing using machine learning approaches and found that the prediction performance is significantly improved by including these properties. By combining the minimum redundancy–maximum relevance (mRMR) method and forward feature searching strategy, a promising feature subset with 24 features was obtained to predict the activities of RBPs. The feature subset consists of 16 dual amino acid compositions, 5 physicochemical features, and 3 structural features. The physicochemical and structural properties were as important as the sequence composition features for an accurate prediction of the activities of splicing factors. The hydrophobicity and distribution of coil are suggested to be the key physicochemical and structural features, respectively.
Collapse
Affiliation(s)
| | - Wenjin Li
- Correspondence: ; Tel.: +86-0755-26942336
| |
Collapse
|
25
|
Yu L, Zhang Y, Xue L, Liu F, Chen Q, Luo J, Jing R. Systematic Analysis and Accurate Identification of DNA N4-Methylcytosine Sites by Deep Learning. Front Microbiol 2022; 13:843425. [PMID: 35401453 PMCID: PMC8989013 DOI: 10.3389/fmicb.2022.843425] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/25/2021] [Accepted: 02/21/2022] [Indexed: 11/13/2022] Open
Abstract
DNA N4-methylcytosine (4mC) is a pivotal epigenetic modification that plays an essential role in DNA replication, repair, expression and differentiation. To gain insight into the biological functions of 4mC, it is critical to identify their modification sites in the genomics. Recently, deep learning has become increasingly popular in recent years and frequently employed for the 4mC site identification. However, a systematic analysis of how to build predictive models using deep learning techniques is still lacking. In this work, we first summarized all existing deep learning-based predictors and systematically analyzed their models, features and datasets, etc. Then, using a typical standard dataset with three species (A. thaliana, C. elegans, and D. melanogaster), we assessed the contribution of different model architectures, encoding methods and the attention mechanism in establishing a deep learning-based model for the 4mC site prediction. After a series of optimizations, convolutional-recurrent neural network architecture using the one-hot encoding and attention mechanism achieved the best overall prediction performance. Extensive comparison experiments were conducted based on the same dataset. This work will be helpful for researchers who would like to build the 4mC prediction models using deep learning in the future.
Collapse
Affiliation(s)
- Lezheng Yu
- School of Chemistry and Materials Science, Guizhou Education University, Guiyang, China
| | - Yonglin Zhang
- Department of Pharmacology, School of Pharmacy, Southwest Medical University, Luzhou, China
| | - Li Xue
- School of Public Health, Southwest Medical University, Luzhou, China
| | - Fengjuan Liu
- School of Geography and Resources, Guizhou Education University, Guiyang, China
| | - Qi Chen
- Department of Endocrinology and Metabolism, The Affiliated Hospital of Southwest Medical University, Luzhou, China
| | - Jiesi Luo
- Department of Pharmacology, School of Pharmacy, Southwest Medical University, Luzhou, China.,Department of Pharmacy, The Affiliated Hospital of Southwest Medical University, Luzhou, China
| | - Runyu Jing
- School of Cyber Science and Engineering, Sichuan University, Chengdu, China
| |
Collapse
|
26
|
Identification of D Modification Sites Using a Random Forest Model Based on Nucleotide Chemical Properties. Int J Mol Sci 2022; 23:ijms23063044. [PMID: 35328461 PMCID: PMC8950657 DOI: 10.3390/ijms23063044] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/08/2022] [Revised: 02/25/2022] [Accepted: 03/09/2022] [Indexed: 12/03/2022] Open
Abstract
Dihydrouridine (D) is an abundant post-transcriptional modification present in transfer RNA from eukaryotes, bacteria, and archaea. D has contributed to treatments for cancerous diseases. Therefore, the precise detection of D modification sites can enable further understanding of its functional roles. Traditional experimental techniques to identify D are laborious and time-consuming. In addition, there are few computational tools for such analysis. In this study, we utilized eleven sequence-derived feature extraction methods and implemented five popular machine algorithms to identify an optimal model. During data preprocessing, data were partitioned for training and testing. Oversampling was also adopted to reduce the effect of the imbalance between positive and negative samples. The best-performing model was obtained through a combination of random forest and nucleotide chemical property modeling. The optimized model presented high sensitivity and specificity values of 0.9688 and 0.9706 in independent tests, respectively. Our proposed model surpassed published tools in independent tests. Furthermore, a series of validations across several aspects was conducted in order to demonstrate the robustness and reliability of our model.
Collapse
|
27
|
Tsukiyama S, Hasan MM, Deng HW, Kurata H. BERT6mA: prediction of DNA N6-methyladenine site using deep learning-based approaches. Brief Bioinform 2022; 23:6539171. [PMID: 35225328 PMCID: PMC8921755 DOI: 10.1093/bib/bbac053] [Citation(s) in RCA: 21] [Impact Index Per Article: 7.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/08/2021] [Revised: 01/28/2022] [Accepted: 01/31/2022] [Indexed: 01/29/2023] Open
Abstract
N6-methyladenine (6mA) is associated with important roles in DNA replication, DNA repair, transcription, regulation of gene expression. Several experimental methods were used to identify DNA modifications. However, these experimental methods are costly and time-consuming. To detect the 6mA and complement these shortcomings of experimental methods, we proposed a novel, deep leaning approach called BERT6mA. To compare the BERT6mA with other deep learning approaches, we used the benchmark datasets including 11 species. The BERT6mA presented the highest AUCs in eight species in independent tests. Furthermore, BERT6mA showed higher and comparable performance with the state-of-the-art models while the BERT6mA showed poor performances in a few species with a small sample size. To overcome this issue, pretraining and fine-tuning between two species were applied to the BERT6mA. The pretrained and fine-tuned models on specific species presented higher performances than other models even for the species with a small sample size. In addition to the prediction, we analyzed the attention weights generated by BERT6mA to reveal how the BERT6mA model extracts critical features responsible for the 6mA prediction. To facilitate biological sciences, the BERT6mA online web server and its source codes are freely accessible at https://github.com/kuratahiroyuki/BERT6mA.git, respectively.
Collapse
Affiliation(s)
- Sho Tsukiyama
- Department of Bioscience and Bioinformatics, Kyushu Institute of Technology, 680-4 Kawazu, Iizuka, Fukuoka 820-8502, Japan
| | - Md Mehedi Hasan
- Tulane Center for Biomedical Informatics and Genomics, Division of Biomedical Informatics and Genomics, John W. Deming Department of Medicine, School of Medicine, Tulane University, New Orleans, LA 70112, USA
| | - Hong-Wen Deng
- Tulane Center for Biomedical Informatics and Genomics, Division of Biomedical Informatics and Genomics, John W. Deming Department of Medicine, School of Medicine, Tulane University, New Orleans, LA 70112, USA
| | - Hiroyuki Kurata
- Corresponding author: Hiroyuki Kurata, Department of Bioscience and Bioinformatics, Kyushu Institute of Technology, 680-4 Kawazu, Iizuka, Fukuoka 820-8502, Japan. Tel: 81-948-29-7828; E-mail:
| |
Collapse
|
28
|
Zulfiqar H, Huang QL, Lv H, Sun ZJ, Dao FY, Lin H. Deep-4mCGP: A Deep Learning Approach to Predict 4mC Sites in Geobacter pickeringii by Using Correlation-Based Feature Selection Technique. Int J Mol Sci 2022; 23:1251. [PMID: 35163174 PMCID: PMC8836036 DOI: 10.3390/ijms23031251] [Citation(s) in RCA: 24] [Impact Index Per Article: 8.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/24/2021] [Revised: 01/19/2022] [Accepted: 01/20/2022] [Indexed: 12/15/2022] Open
Abstract
4mC is a type of DNA alteration that has the ability to synchronize multiple biological movements, for example, DNA replication, gene expressions, and transcriptional regulations. Accurate prediction of 4mC sites can provide exact information to their hereditary functions. The purpose of this study was to establish a robust deep learning model to recognize 4mC sites in Geobacter pickeringii. In the anticipated model, two kinds of feature descriptors, namely, binary and k-mer composition were used to encode the DNA sequences of Geobacter pickeringii. The obtained features from their fusion were optimized by using correlation and gradient-boosting decision tree (GBDT)-based algorithm with incremental feature selection (IFS) method. Then, these optimized features were inserted into 1D convolutional neural network (CNN) to classify 4mC sites from non-4mC sites in Geobacter pickeringii. The performance of the anticipated model on independent data exhibited an accuracy of 0.868, which was 4.2% higher than the existing model.
Collapse
Affiliation(s)
| | | | | | | | | | - Hao Lin
- School of Life Science and Technology, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China; (H.Z.); (Q.-L.H.); (H.L.); (Z.-J.S.); (F.-Y.D.)
| |
Collapse
|
29
|
Mouse4mC-BGRU: deep learning for predicting DNA N4-methylcytosine sites in mouse genome. Methods 2022; 204:258-262. [DOI: 10.1016/j.ymeth.2022.01.009] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/14/2021] [Revised: 01/14/2022] [Accepted: 01/24/2022] [Indexed: 12/12/2022] Open
|
30
|
Rehman MU, Tayara H, Chong KT. DCNN-4mC: Densely connected neural network based N4-methylcytosine site prediction in multiple species. Comput Struct Biotechnol J 2021; 19:6009-6019. [PMID: 34849205 PMCID: PMC8605313 DOI: 10.1016/j.csbj.2021.10.034] [Citation(s) in RCA: 26] [Impact Index Per Article: 6.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/28/2021] [Revised: 10/27/2021] [Accepted: 10/28/2021] [Indexed: 01/17/2023] Open
Abstract
DNA N4-methylcytosine (4mC) being a significant genetic modification holds a dominant role in controlling different biological functions, i.e., DNA replication, DNA repair, gene regulations and gene expression levels. The identification of 4mC sites is important to get insight information regarding different organics mechanisms. However, getting modification prediction from experimental methods is a challenging task due to high expenses and time-consuming techniques. Therefore, computational tools can be a great option for modification identification. Various computational tools are proposed in literature but their generalization and prediction performance require improvement. For this motive, we have proposed a neural network based tool named DCNN-4mC for identifying 4mC sites. The proposed model involves a set of neural network layers with a skip connection which allows to share the shallow features with dense layers. Skip connection have allowed to gather crucial information regarding 4mC sites. In literature, different models are employed on different species hence in many cases different datasets are available for a single species. In this research, we have combined all available datasets to create a single benchmark dataset for every species. To the best of our knowledge, no model in literature is employed on more than six different species. To ensure the generalizability of DCNN-4mC we have used 12 different species for performance evaluation. The DCNN-4mC tool has attained 2% to 14% higher accuracy than state-of-the-art tools on all available datasets of different species. Furthermore, independent test datasets are also engaged and DCNN-4mC have overall yielded high performance in them as well.
Collapse
Affiliation(s)
- Mobeen Ur Rehman
- Department of Electronics and Information Engineering, Jeonbuk National University, Jeonju 54896, South Korea
- Department of Avionics Engineering, Air University, Islamabad 44000, Pakistan
| | - Hilal Tayara
- School of International Engineering and Science, Jeonbuk National University, Jeonju 54896, South Korea
- Corresponding author at: School of International Engineering and Science, Jeonbuk National University, Jeonju 54896, South Korea (Hilal Tayara); Department of Electronics and Information Engineering, Jeonbuk National University, Jeonju 54896, South Korea. (Kil To Chong)
| | - Kil To Chong
- Department of Electronics and Information Engineering, Jeonbuk National University, Jeonju 54896, South Korea
- Advances Electronics and Information Research Center, Jeonbuk National University, Jeonju 54896, South Korea
- Corresponding author at: School of International Engineering and Science, Jeonbuk National University, Jeonju 54896, South Korea (Hilal Tayara); Department of Electronics and Information Engineering, Jeonbuk National University, Jeonju 54896, South Korea. (Kil To Chong)
| |
Collapse
|
31
|
Guo Y, Hou L, Zhu W, Wang P. Prediction of Hormone-Binding Proteins Based on K-mer Feature Representation and Naive Bayes. Front Genet 2021; 12:797641. [PMID: 34887905 PMCID: PMC8650314 DOI: 10.3389/fgene.2021.797641] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/19/2021] [Accepted: 11/05/2021] [Indexed: 11/29/2022] Open
Abstract
Hormone binding protein (HBP) is a soluble carrier protein that interacts selectively with different types of hormones and has various effects on the body's life activities. HBPs play an important role in the growth process of organisms, but their specific role is still unclear. Therefore, correctly identifying HBPs is the first step towards understanding and studying their biological function. However, due to their high cost and long experimental period, it is difficult for traditional biochemical experiments to correctly identify HBPs from an increasing number of proteins, so the real characterization of HBPs has become a challenging task for researchers. To measure the effectiveness of HBPs, an accurate and reliable prediction model for their identification is desirable. In this paper, we construct the prediction model HBP_NB. First, HBPs data were collected from the UniProt database, and a dataset was established. Then, based on the established high-quality dataset, the k-mer (K = 3) feature representation method was used to extract features. Second, the feature selection algorithm was used to reduce the dimensionality of the extracted features and select the appropriate optimal feature set. Finally, the selected features are input into Naive Bayes to construct the prediction model, and the model is evaluated by using 10-fold cross-validation. The final results were 95.45% accuracy, 94.17% sensitivity and 96.73% specificity. These results indicate that our model is feasible and effective.
Collapse
Affiliation(s)
- Yuxin Guo
- Key Laboratory of Computational Science and Application of Hainan Province, Haikou, China
- Yangtze Delta Region Institute, University of Electronic Science and Technology of China, Quzhou, China
- Key Laboratory of Data Science and Intelligence Education, Hainan Normal University, Ministry of Education, Haikou, China
- School of Mathematics and Statistics, Hainan Normal University, Haikou, China
| | - Liping Hou
- Beidahuang Industry Group General Hospital, Harbin, China
| | - Wen Zhu
- Key Laboratory of Computational Science and Application of Hainan Province, Haikou, China
- Key Laboratory of Data Science and Intelligence Education, Hainan Normal University, Ministry of Education, Haikou, China
- School of Mathematics and Statistics, Hainan Normal University, Haikou, China
| | - Peng Wang
- Key Laboratory of Computational Science and Application of Hainan Province, Haikou, China
- Key Laboratory of Data Science and Intelligence Education, Hainan Normal University, Ministry of Education, Haikou, China
- School of Mathematics and Statistics, Hainan Normal University, Haikou, China
| |
Collapse
|
32
|
Zhao Q, Ma J, Wang Y, Xie F, Lv Z, Xu Y, Shi H, Han K. Mul-SNO: A novel prediction tool for S-nitrosylation sites based on deep learning methods. IEEE J Biomed Health Inform 2021; 26:2379-2387. [PMID: 34762593 DOI: 10.1109/jbhi.2021.3123503] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/09/2022]
Abstract
Protein s-nitrosylation (SNO is one of the most important post-translational modifications and is formed by the covalent modification of nitric oxide and cysteine residues. Extensive studies have shown that SNO plays a pivotal role in the plant immune response and treating various major human diseases. In recent years, SNO sites have become a hot research topic. Traditional biochemical methods for SNO site identification are time-consuming and costly. In this study, we developed an economical and efficient SNO site prediction tool named Mul-SNO. Mul-SNO ensembled current popular and powerful deep learning model bidirectional long short-term memory (BiLSTM and bidirectional encoder representations from Transformers (BERT . Compared with existing state-of-the-art methods, Mul-SNO obtained better ACC of 0.911 and 0.796 based on 10-fold cross-validation and independent data sets, respectively. The prediction server can be obtained for free at http://lab.malab.cn/~mjq/Mul-SNO/.
Collapse
|
33
|
Han K, Shen LC, Zhu YH, Xu J, Song J, Yu DJ. MAResNet: predicting transcription factor binding sites by combining multi-scale bottom-up and top-down attention and residual network. Brief Bioinform 2021; 23:6399874. [PMID: 34664074 DOI: 10.1093/bib/bbab445] [Citation(s) in RCA: 9] [Impact Index Per Article: 2.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/25/2021] [Revised: 09/06/2021] [Accepted: 09/28/2021] [Indexed: 11/14/2022] Open
Abstract
Accurate identification of transcription factor binding sites is of great significance in understanding gene expression, biological development and drug design. Although a variety of methods based on deep-learning models and large-scale data have been developed to predict transcription factor binding sites in DNA sequences, there is room for further improvement in prediction performance. In addition, effective interpretation of deep-learning models is greatly desirable. Here we present MAResNet, a new deep-learning method, for predicting transcription factor binding sites on 690 ChIP-seq datasets. More specifically, MAResNet combines the bottom-up and top-down attention mechanisms and a state-of-the-art feed-forward network (ResNet), which is constructed by stacking attention modules that generate attention-aware features. In particular, the multi-scale attention mechanism is utilized at the first stage to extract rich and representative sequence features. We further discuss the attention-aware features learned from different attention modules in accordance with the changes as the layers go deeper. The features learned by MAResNet are also visualized through the TMAP tool to illustrate that the method can extract the unique characteristics of transcription factor binding sites. The performance of MAResNet is extensively tested on 690 test subsets with an average AUC of 0.927, which is higher than that of the current state-of-the-art methods. Overall, this study provides a new and useful framework for the prediction of transcription factor binding sites by combining the funnel attention modules with the residual network.
Collapse
Affiliation(s)
- Ke Han
- School of Computer Science and Engineering, Nanjing University of Science and Technology, 200 Xiaolingwei, Nanjing, 210094, China
| | - Long-Chen Shen
- School of Computer Science and Engineering, Nanjing University of Science and Technology, 200 Xiaolingwei, Nanjing, 210094, China
| | - Yi-Heng Zhu
- School of Computer Science and Engineering, Nanjing University of Science and Technology, 200 Xiaolingwei, Nanjing, 210094, China
| | - Jian Xu
- School of Computer Science and Engineering, Nanjing University of Science and Technology, 200 Xiaolingwei, Nanjing, 210094, China
| | - Jiangning Song
- Monash Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Melbourne, VIC 3800, Australia.,Monash Centre for Data Science, Faculty of Information Technology, Monash University, Melbourne, VIC 3800, Australia
| | - Dong-Jun Yu
- School of Computer Science and Engineering, Nanjing University of Science and Technology, 200 Xiaolingwei, Nanjing, 210094, China
| |
Collapse
|
34
|
Ao C, Gao L, Yu L. Research progress in predicting DNA methylation modifications and the relation with human diseases. Curr Med Chem 2021; 29:822-836. [PMID: 34533438 DOI: 10.2174/0929867328666210917115733] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/24/2021] [Revised: 07/05/2021] [Accepted: 07/11/2021] [Indexed: 11/22/2022]
Abstract
DNA methylation is an important mode of regulation in epigenetic mechanisms, and it is one of the research foci in the field of epigenetics. DNA methylation modification affects a series of biological processes, such as eukaryotic cell growth, differentiation and transformation mechanisms, by regulating gene expression. In this review, we systematically summarized the DNA methylation databases, prediction tools for DNA methylation modification, machine learning algorithms for predicting DNA methylation modification, and the relationship between DNA methylation modification and diseases such as hypertension, Alzheimer's disease, diabetic nephropathy, and cancer. An in-depth understanding of DNA methylation mechanisms can promote accurate prediction of DNA methylation modifications and the treatment and diagnosis of related diseases.
Collapse
Affiliation(s)
- Chunyan Ao
- School of Computer Science and Technology, Xidian University, Xi'an, China
| | - Lin Gao
- School of Computer Science and Technology, Xidian University, Xi'an, China
| | - Liang Yu
- School of Computer Science and Technology, Xidian University, Xi'an, China
| |
Collapse
|
35
|
BERT-m7G: A Transformer Architecture Based on BERT and Stacking Ensemble to Identify RNA N7-Methylguanosine Sites from Sequence Information. COMPUTATIONAL AND MATHEMATICAL METHODS IN MEDICINE 2021; 2021:7764764. [PMID: 34484416 PMCID: PMC8413034 DOI: 10.1155/2021/7764764] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 07/16/2021] [Accepted: 08/13/2021] [Indexed: 01/19/2023]
Abstract
As one of the most prevalent posttranscriptional modifications of RNA, N7-methylguanosine (m7G) plays an essential role in the regulation of gene expression. Accurate identification of m7G sites in the transcriptome is invaluable for better revealing their potential functional mechanisms. Although high-throughput experimental methods can locate m7G sites precisely, they are overpriced and time-consuming. Hence, it is imperative to design an efficient computational method that can accurately identify the m7G sites. In this study, we propose a novel method via incorporating BERT-based multilingual model in bioinformatics to represent the information of RNA sequences. Firstly, we treat RNA sequences as natural sentences and then employ bidirectional encoder representations from transformers (BERT) model to transform them into fixed-length numerical matrices. Secondly, a feature selection scheme based on the elastic net method is constructed to eliminate redundant features and retain important features. Finally, the selected feature subset is input into a stacking ensemble classifier to predict m7G sites, and the hyperparameters of the classifier are tuned with tree-structured Parzen estimator (TPE) approach. By 10-fold cross-validation, the performance of BERT-m7G is measured with an ACC of 95.48% and an MCC of 0.9100. The experimental results indicate that the proposed method significantly outperforms state-of-the-art prediction methods in the identification of m7G modifications.
Collapse
|
36
|
Shen LC, Liu Y, Song J, Yu DJ. SAResNet: self-attention residual network for predicting DNA-protein binding. Brief Bioinform 2021; 22:bbab101. [PMID: 33837387 PMCID: PMC8579196 DOI: 10.1093/bib/bbab101] [Citation(s) in RCA: 14] [Impact Index Per Article: 3.5] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/22/2020] [Revised: 03/03/2021] [Accepted: 03/08/2021] [Indexed: 11/12/2022] Open
Abstract
Knowledge of the specificity of DNA-protein binding is crucial for understanding the mechanisms of gene expression, regulation and gene therapy. In recent years, deep-learning-based methods for predicting DNA-protein binding from sequence data have achieved significant success. Nevertheless, the current state-of-the-art computational methods have some drawbacks associated with the use of limited datasets with insufficient experimental data. To address this, we propose a novel transfer learning-based method, termed SAResNet, which combines the self-attention mechanism and residual network structure. More specifically, the attention-driven module captures the position information of the sequence, while the residual network structure guarantees that the high-level features of the binding site can be extracted. Meanwhile, the pre-training strategy used by SAResNet improves the learning ability of the network and accelerates the convergence speed of the network during transfer learning. The performance of SAResNet is extensively tested on 690 datasets from the ChIP-seq experiments with an average AUC of 92.0%, which is 4.4% higher than that of the best state-of-the-art method currently available. When tested on smaller datasets, the predictive performance is more clearly improved. Overall, we demonstrate that the superior performance of DNA-protein binding prediction on DNA sequences can be achieved by combining the attention mechanism and residual structure, and a novel pipeline is accordingly developed. The proposed methodology is generally applicable and can be used to address any other sequence classification problems.
Collapse
Affiliation(s)
- Long-Chen Shen
- School of Computer Science and Engineering, Nanjing University of Science and Technology, China
| | - Yan Liu
- School of Computer Science and Engineering, Nanjing University of Science and Technology, China
| | - Jiangning Song
- Monash Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Melbourne, Australia
| | - Dong-Jun Yu
- School of Computer Science and Engineering, Nanjing University of Science and Technology, China
| |
Collapse
|
37
|
Jiang M, Zhao B, Luo S, Wang Q, Chu Y, Chen T, Mao X, Liu Y, Wang Y, Jiang X, Wei DQ, Xiong Y. NeuroPpred-Fuse: an interpretable stacking model for prediction of neuropeptides by fusing sequence information and feature selection methods. Brief Bioinform 2021; 22:6350884. [PMID: 34396388 DOI: 10.1093/bib/bbab310] [Citation(s) in RCA: 29] [Impact Index Per Article: 7.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/16/2021] [Revised: 07/01/2021] [Accepted: 07/18/2021] [Indexed: 12/13/2022] Open
Abstract
Neuropeptides acting as signaling molecules in the nervous system of various animals play crucial roles in a wide range of physiological functions and hormone regulation behaviors. Neuropeptides offer many opportunities for the discovery of new drugs and targets for the treatment of neurological diseases. In recent years, there have been several data-driven computational predictors of various types of bioactive peptides, but the relevant work about neuropeptides is little at present. In this work, we developed an interpretable stacking model, named NeuroPpred-Fuse, for the prediction of neuropeptides through fusing a variety of sequence-derived features and feature selection methods. Specifically, we used six types of sequence-derived features to encode the peptide sequences and then combined them. In the first layer, we ensembled three base classifiers and four feature selection algorithms, which select non-redundant important features complementarily. In the second layer, the output of the first layer was merged and fed into logistic regression (LR) classifier to train the model. Moreover, we analyzed the selected features and explained the feasibility of the selected features. Experimental results show that our model achieved 90.6% accuracy and 95.8% AUC on the independent test set, outperforming the state-of-the-art models. In addition, we exhibited the distribution of selected features by these tree models and compared the results on the training set to that on the test set. These results fully showed that our model has a certain generalization ability. Therefore, we expect that our model would provide important advances in the discovery of neuropeptides as new drugs for the treatment of neurological diseases.
Collapse
Affiliation(s)
- Mingming Jiang
- State Key Laboratory of Microbial Metabolism, and School of Life Sciences and Biotechnology, Shanghai Jiao Tong University, Shanghai 200240, China
| | - Bowen Zhao
- State Key Laboratory of Microbial Metabolism, and School of Life Sciences and Biotechnology, Shanghai Jiao Tong University, Shanghai 200240, China
| | - Shenggan Luo
- State Key Laboratory of Microbial Metabolism, and School of Life Sciences and Biotechnology, Shanghai Jiao Tong University, Shanghai 200240, China
| | - Qiankun Wang
- State Key Laboratory of Microbial Metabolism, and School of Life Sciences and Biotechnology, Shanghai Jiao Tong University, Shanghai 200240, China
| | - Yanyi Chu
- State Key Laboratory of Microbial Metabolism, and School of Life Sciences and Biotechnology, Shanghai Jiao Tong University, Shanghai 200240, China
| | - Tianhang Chen
- State Key Laboratory of Microbial Metabolism, and School of Life Sciences and Biotechnology, Shanghai Jiao Tong University, Shanghai 200240, China
| | - Xueying Mao
- State Key Laboratory of Microbial Metabolism, and School of Life Sciences and Biotechnology, Shanghai Jiao Tong University, Shanghai 200240, China
| | - Yatong Liu
- State Key Laboratory of Microbial Metabolism, and School of Life Sciences and Biotechnology, Shanghai Jiao Tong University, Shanghai 200240, China
| | - Yanjing Wang
- State Key Laboratory of Microbial Metabolism, and School of Life Sciences and Biotechnology, Shanghai Jiao Tong University, Shanghai 200240, China
| | - Xue Jiang
- State Key Laboratory of Microbial Metabolism, and School of Life Sciences and Biotechnology, Shanghai Jiao Tong University, Shanghai 200240, China
| | - Dong-Qing Wei
- State Key Laboratory of Microbial Metabolism, and School of Life Sciences and Biotechnology, Shanghai Jiao Tong University, Shanghai 200240, China
| | - Yi Xiong
- State Key Laboratory of Microbial Metabolism, and School of Life Sciences and Biotechnology, Shanghai Jiao Tong University, Shanghai 200240, China
| |
Collapse
|
38
|
Charoenkwan P, Chiangjong W, Hasan MM, Nantasenamat C, Shoombuatong W. Review and comparative analysis of machine learning-based predictors for predicting and analyzing of anti-angiogenic peptides. Curr Med Chem 2021; 29:849-864. [PMID: 34375178 DOI: 10.2174/0929867328666210810145806] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Abstract] [Key Words] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/22/2021] [Revised: 06/17/2021] [Accepted: 06/22/2021] [Indexed: 11/22/2022]
Abstract
Cancer is one of the leading causes of death worldwide and underlying this is angiogenesis that represents one of the hallmarks of cancer. Ongoing effort is already under way in the discovery of anti-angiogenic peptides (AAPs) as a promising therapeutic route by tackling the formation of new blood vessels. As such, the identification of AAPs constitutes a viable path for understanding their mechanistic properties pertinent for the discovery of new anti-cancer drugs. In spite of the abundance of peptide sequences in public databases, experimental efforts in the identification of anti-angiogenic peptides have progressed very slowly owing to its high expenditures and laborious nature. Owing to its inherent ability to make sense of large volumes of data, machine learning (ML) represents a lucrative technique that can be harnessed for peptide-based drug discovery. In this review, we conducted a comprehensive and comparative analysis of ML-based AAP predictors in terms of their employed feature descriptors, ML algorithms, cross-validation methods and prediction performance. Moreover, the common framework of these AAP predictors and their inherent weaknesses are also discussed. Particularly, we explore future perspectives for improving the prediction accuracy and model interpretability, which represents an interesting avenue for overcoming some of the inherent weaknesses of existing AAP predictors. We anticipate that this review would assist researchers in the rapid screening and identification of promising AAPs for clinical use.
Collapse
Affiliation(s)
- Phasit Charoenkwan
- Modern Management and Information Technology, College of Arts, Media and Technology, Chiang Mai University, Chiang Mai, Thailand
| | - Wararat Chiangjong
- Pediatric Translational Research Unit, Department of Pediatrics, Faculty of Medicine, Ramathibodi Hospital, Mahidol University, Bangkok 10400, Thailand
| | - Md Mehedi Hasan
- Tulane Center for Biomedical Informatics and Genomics, Division of Biomedical Informatics and Genomics, John W. Deming Department of Medicine, School of Medicine, Tulane University, New Orleans, LA 70112, United States
| | - Chanin Nantasenamat
- Center of Data Mining and Biomedical Informatics, Faculty of Medical Technology, Mahidol University, Bangkok, Thailand
| | - Watshara Shoombuatong
- Center of Data Mining and Biomedical Informatics, Faculty of Medical Technology, Mahidol University, Bangkok, Thailand
| |
Collapse
|
39
|
Charoenkwan P, Anuwongcharoen N, Nantasenamat C, Hasan MM, Shoombuatong W. In Silico Approaches for the Prediction and Analysis of Antiviral Peptides: A Review. Curr Pharm Des 2021; 27:2180-2188. [PMID: 33138759 DOI: 10.2174/1381612826666201102105827] [Citation(s) in RCA: 15] [Impact Index Per Article: 3.8] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/10/2020] [Accepted: 08/20/2020] [Indexed: 11/22/2022]
Abstract
In light of the growing resistance toward current antiviral drugs, efforts to discover novel and effective antiviral therapeutic agents remain a pressing scientific effort. Antiviral peptides (AVPs) represent promising therapeutic agents due to their extraordinary advantages in terms of potency, efficacy and pharmacokinetic properties. The growing volume of newly discovered peptide sequences in the post-genomic era requires computational approaches for timely and accurate identification of AVPs. Machine learning (ML) methods such as random forest and support vector machine represent robust learning algorithms that are instrumental in successful peptide-based drug discovery. Therefore, this review summarizes the current state-of-the-art application of ML methods for identifying AVPs directly from the sequence information. We compare the efficiency of these methods in terms of the underlying characteristics of the dataset used along with feature encoding methods, ML algorithms, cross-validation methods and prediction performance. Finally, guidelines for the development of robust AVP models are also discussed. It is anticipated that this review will serve as a useful guide for the design and development of robust AVP and related therapeutic peptide predictors in the future.
Collapse
Affiliation(s)
- Phasit Charoenkwan
- Modern Management and Information Technology, College of Arts, Media and Technology, Chiang Mai University, Chiang Mai 50200, Thailand
| | - Nuttapat Anuwongcharoen
- Center of Data Mining and Biomedical Informatics, Faculty of Medical Technology, Mahidol University, Bangkok, 10700, Thailand
| | - Chanin Nantasenamat
- Center of Data Mining and Biomedical Informatics, Faculty of Medical Technology, Mahidol University, Bangkok, 10700, Thailand
| | - Md Mehedi Hasan
- Department of Bioscience and Bioinformatics, Kyushu Institute of Technology, 680-4 Kawazu, Iizuka, Fukuoka 820-8502, Japan
| | - Watshara Shoombuatong
- Center of Data Mining and Biomedical Informatics, Faculty of Medical Technology, Mahidol University, Bangkok, 10700, Thailand
| |
Collapse
|
40
|
Perpetuo L, Klein J, Ferreira R, Guedes S, Amado F, Leite-Moreira A, Silva AMS, Thongboonkerd V, Vitorino R. How can artificial intelligence be used for peptidomics? Expert Rev Proteomics 2021; 18:527-556. [PMID: 34343059 DOI: 10.1080/14789450.2021.1962303] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Abstract] [Key Words] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/01/2023]
Abstract
INTRODUCTION Peptidomics is an emerging field of omics sciences using advanced isolation, analysis, and computational techniques that enable qualitative and quantitative analyses of various peptides in biological samples. Peptides can act as useful biomarkers and as therapeutic molecules for diseases. AREAS COVERED The use of therapeutic peptides can be predicted quickly and efficiently using data-driven computational methods, particularly artificial intelligence (AI) approach. Various AI approaches are useful for peptide-based drug discovery, such as support vector machine, random forest, extremely randomized trees, and other more recently developed deep learning methods. AI methods are relatively new to the development of peptide-based therapies, but these techniques already become essential tools in protein science by dissecting novel therapeutic peptides and their functions (Figure 1).[Figure: see text]. EXPERT OPINION Researchers have shown that AI models can facilitate the development of peptidomics and selective peptide therapies in the field of peptide science. Biopeptide prediction is important for the discovery and development of successful peptide-based drugs. Due to their ability to predict therapeutic roles based on sequence details, many AI-dependent prediction tools have been developed (Figure 1).
Collapse
Affiliation(s)
- Luís Perpetuo
- iBiMED, Department of Medical Sciences, University of Aveiro, Aveiro
| | - Julie Klein
- Institut National de la Santé et de la Recherche Médicale (INSERM), U1297, Institute of Cardiovascular and Metabolic Disease, Université Toulouse III, Toulouse, France
| | - Rita Ferreira
- LAQV/REQUIMTE, Department of Chemistry, University of Aveiro, Aveiro
| | - Sofia Guedes
- LAQV/REQUIMTE, Department of Chemistry, University of Aveiro, Aveiro
| | - Francisco Amado
- LAQV/REQUIMTE, Department of Chemistry, University of Aveiro, Aveiro
| | - Adelino Leite-Moreira
- UnIC, Departamento de Cirurgia e Fisiologia, Faculdade de Medicina da Universidade do Porto, Porto
| | - Artur M S Silva
- LAQV/REQUIMTE, Department of Chemistry, University of Aveiro, Aveiro
| | - Visith Thongboonkerd
- Medical Proteomics Unit, Office for Research and Development, Faculty of Medicine Siriraj Hospital, Mahidol University, Bangkok 10700, Thailand
| | - Rui Vitorino
- iBiMED, Department of Medical Sciences, University of Aveiro, Aveiro.,LAQV/REQUIMTE, Department of Chemistry, University of Aveiro, Aveiro.,UnIC, Departamento de Cirurgia e Fisiologia, Faculdade de Medicina da Universidade do Porto, Porto
| |
Collapse
|
41
|
Zulfiqar H, Sun ZJ, Huang QL, Yuan SS, Lv H, Dao FY, Lin H, Li YW. Deep-4mCW2V: A sequence-based predictor to identify N4-methylcytosine sites in Escherichia coli. Methods 2021; 203:558-563. [PMID: 34352373 DOI: 10.1016/j.ymeth.2021.07.011] [Citation(s) in RCA: 27] [Impact Index Per Article: 6.8] [Reference Citation Analysis] [Abstract] [Key Words] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/01/2021] [Revised: 07/22/2021] [Accepted: 07/29/2021] [Indexed: 10/20/2022] Open
Abstract
N4-methylcytosine (4mC) is a type of DNA modification which could regulate several biological progressions such as transcription regulation, replication and gene expressions. Precisely recognizing 4mC sites in genomic sequences can provide specific knowledge about their genetic roles. This study aimed to develop a deep learning-based model to predict 4mC sites in the Escherichia coli. In the model, DNA sequences were encoded by word embedding technique 'word2vec'. The obtained features were inputted into 1-D convolutional neural network (CNN) to discriminate 4mC sites from non-4mC sites in Escherichia coli genome. The examination on independent dataset showed that our model could yield the overall accuracy of 0.861, which was about 4.3% higher than the existing model. To provide convenience to scholars, we provided the data and source code of the model which can be freely download from https://github.com/linDing-groups/Deep-4mCW2V.
Collapse
Affiliation(s)
- Hasan Zulfiqar
- Center for Informational Biology and School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu 610054, China
| | - Zi-Jie Sun
- Center for Informational Biology and School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu 610054, China
| | - Qin-Lai Huang
- Center for Informational Biology and School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu 610054, China
| | - Shi-Shi Yuan
- Center for Informational Biology and School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu 610054, China
| | - Hao Lv
- Center for Informational Biology and School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu 610054, China
| | - Fu-Ying Dao
- Center for Informational Biology and School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu 610054, China
| | - Hao Lin
- Center for Informational Biology and School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu 610054, China.
| | - Yan-Wen Li
- School of Information Science and Technology, Northeast Normal University, Changchun 130117, China; Key Laboratory of Intelligent Information Processing of Jilin Province, Northeast Normal University, Changchun 130117, China; Institute of Computational Biology, Northeast Normal University, Changchun 130117, China.
| |
Collapse
|
42
|
Chen Z, Zhao P, Li C, Li F, Xiang D, Chen YZ, Akutsu T, Daly RJ, Webb GI, Zhao Q, Kurgan L, Song J. iLearnPlus: a comprehensive and automated machine-learning platform for nucleic acid and protein sequence analysis, prediction and visualization. Nucleic Acids Res 2021; 49:e60. [PMID: 33660783 PMCID: PMC8191785 DOI: 10.1093/nar/gkab122] [Citation(s) in RCA: 156] [Impact Index Per Article: 39.0] [Reference Citation Analysis] [Abstract] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/26/2020] [Revised: 02/05/2021] [Accepted: 02/25/2021] [Indexed: 12/14/2022] Open
Abstract
Sequence-based analysis and prediction are fundamental bioinformatic tasks that facilitate understanding of the sequence(-structure)-function paradigm for DNAs, RNAs and proteins. Rapid accumulation of sequences requires equally pervasive development of new predictive models, which depends on the availability of effective tools that support these efforts. We introduce iLearnPlus, the first machine-learning platform with graphical- and web-based interfaces for the construction of machine-learning pipelines for analysis and predictions using nucleic acid and protein sequences. iLearnPlus provides a comprehensive set of algorithms and automates sequence-based feature extraction and analysis, construction and deployment of models, assessment of predictive performance, statistical analysis, and data visualization; all without programming. iLearnPlus includes a wide range of feature sets which encode information from the input sequences and over twenty machine-learning algorithms that cover several deep-learning approaches, outnumbering the current solutions by a wide margin. Our solution caters to experienced bioinformaticians, given the broad range of options, and biologists with no programming background, given the point-and-click interface and easy-to-follow design process. We showcase iLearnPlus with two case studies concerning prediction of long noncoding RNAs (lncRNAs) from RNA transcripts and prediction of crotonylation sites in protein chains. iLearnPlus is an open-source platform available at https://github.com/Superzchen/iLearnPlus/ with the webserver at http://ilearnplus.erc.monash.edu/.
Collapse
Affiliation(s)
- Zhen Chen
- Collaborative Innovation Center of Henan Grain Crops, Henan Agricultural University, Zhengzhou 450046, China
| | - Pei Zhao
- State Key Laboratory of Cotton Biology, Institute of Cotton Research of Chinese Academy of Agricultural Sciences (CAAS), Anyang 455000, China
| | - Chen Li
- Monash Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Melbourne, VIC 3800, Australia
| | - Fuyi Li
- Monash Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Melbourne, VIC 3800, Australia.,Monash Centre for Data Science, Faculty of Information Technology, Monash University, Melbourne, VIC 3800, Australia.,Department of Microbiology and Immunology, The Peter Doherty Institute for Infection and Immunity, The University of Melbourne, Melbourne, Victoria 3000, Australia
| | - Dongxu Xiang
- Monash Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Melbourne, VIC 3800, Australia.,Monash Centre for Data Science, Faculty of Information Technology, Monash University, Melbourne, VIC 3800, Australia
| | - Yong-Zi Chen
- Laboratory of Tumor Cell Biology, Key Laboratory of Cancer Prevention and Therapy, National Clinical Research Center for Cancer, Tianjin Medical University Cancer Institute and Hospital, Tianjin Medical University, Tianjin 300060, China
| | - Tatsuya Akutsu
- Bioinformatics Center, Institute for Chemical Research, Kyoto University, Kyoto 611-0011, Japan
| | - Roger J Daly
- Monash Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Melbourne, VIC 3800, Australia
| | - Geoffrey I Webb
- Monash Centre for Data Science, Faculty of Information Technology, Monash University, Melbourne, VIC 3800, Australia
| | - Quanzhi Zhao
- Collaborative Innovation Center of Henan Grain Crops, Henan Agricultural University, Zhengzhou 450046, China.,Key Laboratory of Rice Biology in Henan Province, Henan Agricultural University, Zhengzhou 450046, China
| | - Lukasz Kurgan
- Department of Computer Science, Virginia Commonwealth University, Richmond, VA, USA
| | - Jiangning Song
- Monash Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Melbourne, VIC 3800, Australia.,Monash Centre for Data Science, Faculty of Information Technology, Monash University, Melbourne, VIC 3800, Australia
| |
Collapse
|
43
|
Zhu W, Guo Y, Zou Q. Prediction of presynaptic and postsynaptic neurotoxins based on feature extraction. MATHEMATICAL BIOSCIENCES AND ENGINEERING : MBE 2021; 18:5943-5958. [PMID: 34517517 DOI: 10.3934/mbe.2021297] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/13/2023]
Abstract
A neurotoxin is essentially a protein that mainly acts on the nervous system; it has a selective toxic effect on the central nervous system and neuromuscular nodes, can cause muscle paralysis and respiratory paralysis, and has strong lethality. According to their principle of action, neurotoxins are divided into presynaptic neurotoxins and postsynaptic neurotoxins. Correctly identifying presynaptic and postsynaptic nerve toxins provides important clues for future drug development and the discovery of drug targets. Therefore, a predictive model, Neu_LR, was constructed in this paper. The monoMonokGap method was used to extract the frequency characteristics of presynaptic and postsynaptic neurotoxin sequences and carry out feature selection, then, based on the important features obtained after dimensionality reduction, the prediction model Neu_LR was constructed using a logistic regression algorithm, and ten-fold cross-validation and independent test set validation were used. The final accuracy rates were 99.6078 and 94.1176%, respectively, which proved that the Neu_LR model had good predictive performance and robustness, and could meet the prediction requirements of presynaptic and postsynaptic neurotoxins. The data and source code of the model can be freely download from https://github.com/gyx123681/.
Collapse
Affiliation(s)
- Wen Zhu
- Key Laboratory of Computational Science and Application of Hainan Province, Haikou, China
- Key Laboratory of Data Science and Intelligence Education, Hainan Normal University, Ministry of Education, Haikou, China
- School of Mathematics and Statistics, Hainan Normal University, Haikou, China
| | - Yuxin Guo
- Key Laboratory of Computational Science and Application of Hainan Province, Haikou, China
- Key Laboratory of Data Science and Intelligence Education, Hainan Normal University, Ministry of Education, Haikou, China
- School of Mathematics and Statistics, Hainan Normal University, Haikou, China
| | - Quan Zou
- Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou, China
| |
Collapse
|
44
|
Feng P, Feng L, Tang C. Comparison and Analysis of Computational Methods for Identifying N6-Methyladenosine Sites in Saccharomyces cerevisiae. Curr Pharm Des 2021; 27:1219-1229. [PMID: 33167827 DOI: 10.2174/1381612826666201109110703] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/09/2020] [Accepted: 07/20/2020] [Indexed: 11/22/2022]
Abstract
BACKGROUND N6-methyladenosine (m6A) plays critical roles in a broad range of biological processes. Knowledge about the precise location of m6A site in the transcriptome is vital for deciphering its biological functions. Although experimental techniques have made substantial contributions to identify m6A, they are still labor intensive and time consuming. As complement to experimental methods, in the past few years, a series of computational approaches have been proposed to identify m6A sites. METHODS In order to facilitate researchers to select appropriate methods for identifying m6A sites, it is necessary to conduct a comprehensive review and comparison of existing methods. RESULTS Since research works on m6A in Saccharomyces cerevisiae are relatively clear, in this review, we summarized recent progress of computational prediction of m6A sites in S. cerevisiae and assessed the performance of existing computational methods. Finally, future directions of computationally identifying m6A sites are presented. CONCLUSION Taken together, we anticipate that this review will serve as an important guide for computational analysis of m6A modifications.
Collapse
Affiliation(s)
- Pengmian Feng
- School of Basic Medical Sciences, Chengdu University of Traditional Chinese Medicine, Chengdu 611730, China
| | - Lijing Feng
- School of Sciences, North China University of Science and Technology, Tangshan 063000, China
| | - Chaohui Tang
- School of Basic Medical Sciences, Chengdu University of Traditional Chinese Medicine, Chengdu 611730, China
| |
Collapse
|
45
|
i4mC-EL: Identifying DNA N4-Methylcytosine Sites in the Mouse Genome Using Ensemble Learning. BIOMED RESEARCH INTERNATIONAL 2021; 2021:5515342. [PMID: 34159192 PMCID: PMC8187051 DOI: 10.1155/2021/5515342] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 02/09/2021] [Accepted: 05/21/2021] [Indexed: 12/03/2022]
Abstract
As one of important epigenetic modifications, DNA N4-methylcytosine (4mC) plays a crucial role in controlling gene replication, expression, cell cycle, DNA replication, and differentiation. The accurate identification of 4mC sites is necessary to understand biological functions. In the paper, we use ensemble learning to develop a model named i4mC-EL to identify 4mC sites in the mouse genome. Firstly, a multifeature encoding scheme consisting of Kmer and EIIP was adopted to describe the DNA sequences. Secondly, on the basis of the multifeature encoding scheme, we developed a stacked ensemble model, in which four machine learning algorithms, namely, BayesNet, NaiveBayes, LibSVM, and Voted Perceptron, were utilized to implement an ensemble of base classifiers that produce intermediate results as input of the metaclassifier, Logistic. The experimental results on the independent test dataset demonstrate that the overall rate of predictive accurate of i4mC-EL is 82.19%, which is better than the existing methods. The user-friendly website implementing i4mC-EL can be accessed freely at the following.
Collapse
|
46
|
Zulfiqar H, Khan RS, Hassan F, Hippe K, Hunt C, Ding H, Song XM, Cao R. Computational identification of N4-methylcytosine sites in the mouse genome with machine-learning method. MATHEMATICAL BIOSCIENCES AND ENGINEERING : MBE 2021; 18:3348-3363. [PMID: 34198389 DOI: 10.3934/mbe.2021167] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.5] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 04/24/2023]
Abstract
N4-methylcytosine (4mC) is a kind of DNA modification which could regulate multiple biological processes. Correctly identifying 4mC sites in genomic sequences can provide precise knowledge about their genetic roles. This study aimed to develop an ensemble model to predict 4mC sites in the mouse genome. In the proposed model, DNA sequences were encoded by k-mer, enhanced nucleic acid composition and composition of k-spaced nucleic acid pairs. Subsequently, these features were optimized by using minimum redundancy maximum relevance (mRMR) with incremental feature selection (IFS) and five-fold cross-validation. The obtained optimal features were inputted into random forest classifier for discriminating 4mC from non-4mC sites in mouse. On the independent dataset, our model could yield the overall accuracy of 85.41%, which was approximately 3.8% -6.3% higher than the two existing models, i4mC-Mouse and 4mCpred-EL respectively. The data and source code of the model can be freely download from https://github.com/linDing-groups/model_4mc.
Collapse
Affiliation(s)
- Hasan Zulfiqar
- School of Life Science and Technology and Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China
| | - Rida Sarwar Khan
- School of Life Science and Technology and Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China
| | - Farwa Hassan
- School of Life Science and Technology and Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China
| | - Kyle Hippe
- Department of Computer Science, Pacific Lutheran University, Tacoma 98447, USA
| | - Cassandra Hunt
- Department of Computer Science, Pacific Lutheran University, Tacoma 98447, USA
| | - Hui Ding
- School of Life Science and Technology and Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China
| | - Xiao-Ming Song
- School of Life Science and Technology and Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China
- School of Life Sciences, North China University of Science and Technology, Tangshan, Hebei 063210, China
| | - Renzhi Cao
- Department of Computer Science, Pacific Lutheran University, Tacoma 98447, USA
| |
Collapse
|
47
|
Yang X, Ye X, Li X, Wei L. iDNA-MT: Identification DNA Modification Sites in Multiple Species by Using Multi-Task Learning Based a Neural Network Tool. Front Genet 2021; 12:663572. [PMID: 33868390 PMCID: PMC8044371 DOI: 10.3389/fgene.2021.663572] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/03/2021] [Accepted: 03/02/2021] [Indexed: 02/04/2023] Open
Abstract
Motivation DNA N4-methylcytosine (4mC) and N6-methyladenine (6mA) are two important DNA modifications and play crucial roles in a variety of biological processes. Accurate identification of the modifications is essential to better understand their biological functions and mechanisms. However, existing methods to identify 4mA or 6mC sites are all single tasks, which demonstrates that they can identify only a certain modification in one species. Therefore, it is desirable to develop a novel computational method to identify the modification sites in multiple species simultaneously. Results In this study, we proposed a computational method, called iDNA-MT, to identify 4mC sites and 6mA sites in multiple species, respectively. The proposed iDNA-MT mainly employed multi-task learning coupled with the bidirectional gated recurrent units (BGRU) to capture the sharing information among different species directly from DNA primary sequences. Experimental comparative results on two benchmark datasets, containing different species respectively, show that either for identifying 4mA or for 6mC site in multiple species, the proposed iDNA-MT outperforms other state-of-the-art single-task methods. The promising results have demonstrated that iDNA-MT has great potential to be a powerful and practically useful tool to accurately identify DNA modifications.
Collapse
Affiliation(s)
- Xiao Yang
- School of Software, Shandong University, Jinan, China
| | - Xiucai Ye
- Department of Computer Science, University of Tsukuba, Tsukuba, Japan
| | - Xuehong Li
- Department of Rehabilitation, Heilongjiang Province Land Reclamation Headquarters General Hospital, Harbin, China
| | - Lesong Wei
- Department of Computer Science, University of Tsukuba, Tsukuba, Japan
| |
Collapse
|
48
|
Khanal J, Tayara H, Zou Q, Chong KT. Identifying DNA N4-methylcytosine sites in the rosaceae genome with a deep learning model relying on distributed feature representation. Comput Struct Biotechnol J 2021; 19:1612-1619. [PMID: 33868598 PMCID: PMC8042287 DOI: 10.1016/j.csbj.2021.03.015] [Citation(s) in RCA: 19] [Impact Index Per Article: 4.8] [Reference Citation Analysis] [Abstract] [Key Words] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/09/2021] [Revised: 03/12/2021] [Accepted: 03/13/2021] [Indexed: 12/11/2022] Open
Abstract
DNA N4-methylcytosine (4mC), an epigenetic modification found in prokaryotic and eukaryotic species, is involved in numerous biological functions, including host defense, transcription regulation, gene expression, and DNA replication. To identify 4mC sites, previous computational studies mostly focused on finding hand-crafted features. This area of research, therefore, would benefit from the development of a computational approach that relies on automatic feature selection to identify relevant sites. We here report 4mC-w2vec, a computational method that learned automatic feature discrimination in the Rosaceae genomes, especially in Rosa chinensis (R. chinensis) and Fragaria vesca (F. vesca), based on distributed feature representation and through the word embedding technique ‘word2vec’. While a few bioinformatics tools are currently employed to identify 4mC sites in these genomes, their prediction performance is inadequate. Our system processed 4mC and non-4mC sites through a word embedding process, including sub-word information of its biological words through k-mer, which then served as features that were fed into a double layer of convolutional neural network (CNN) to classify whether the sample sequences contained 4mCs or non-4mCs sites. Our tool demonstrated performance superior to current tools that use the same genomic datasets. Additionally, 4mC-w2vec is effective for balanced and imbalanced class datasets alike, and the online web-server is currently available at: http://nsclbio.jbnu.ac.kr/tools/4mC-w2vec/.
Collapse
Affiliation(s)
- Jhabindra Khanal
- Department of Electronics and Information Engineering, Jeonbuk National University, Jeonju 54896, South Korea
| | - Hilal Tayara
- School of international Engineering and Science, Jeonbuk National University, Jeonju 54896, South Korea
| | - Quan Zou
- Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu 610054, China
| | - Kil To Chong
- Department of Electronics and Information Engineering, Jeonbuk National University, Jeonju 54896, South Korea.,Advanced Electronics and Information Research Center, Jeonbuk National University, Jeonju 54896, South Korea
| |
Collapse
|
49
|
Zhang L, Qin X, Liu M, Xu Z, Liu G. DNN-m6A: A Cross-Species Method for Identifying RNA N6-Methyladenosine Sites Based on Deep Neural Network with Multi-Information Fusion. Genes (Basel) 2021; 12:354. [PMID: 33670877 PMCID: PMC7997228 DOI: 10.3390/genes12030354] [Citation(s) in RCA: 17] [Impact Index Per Article: 4.3] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/22/2021] [Revised: 02/22/2021] [Accepted: 02/25/2021] [Indexed: 12/16/2022] Open
Abstract
As a prevalent existing post-transcriptional modification of RNA, N6-methyladenosine (m6A) plays a crucial role in various biological processes. To better radically reveal its regulatory mechanism and provide new insights for drug design, the accurate identification of m6A sites in genome-wide is vital. As the traditional experimental methods are time-consuming and cost-prohibitive, it is necessary to design a more efficient computational method to detect the m6A sites. In this study, we propose a novel cross-species computational method DNN-m6A based on the deep neural network (DNN) to identify m6A sites in multiple tissues of human, mouse and rat. Firstly, binary encoding (BE), tri-nucleotide composition (TNC), enhanced nucleic acid composition (ENAC), K-spaced nucleotide pair frequencies (KSNPFs), nucleotide chemical property (NCP), pseudo dinucleotide composition (PseDNC), position-specific nucleotide propensity (PSNP) and position-specific dinucleotide propensity (PSDP) are employed to extract RNA sequence features which are subsequently fused to construct the initial feature vector set. Secondly, we use elastic net to eliminate redundant features while building the optimal feature subset. Finally, the hyper-parameters of DNN are tuned with Bayesian hyper-parameter optimization based on the selected feature subset. The five-fold cross-validation test on training datasets show that the proposed DNN-m6A method outperformed the state-of-the-art method for predicting m6A sites, with an accuracy (ACC) of 73.58%-83.38% and an area under the curve (AUC) of 81.39%-91.04%. Furthermore, the independent datasets achieved an ACC of 72.95%-83.04% and an AUC of 80.79%-91.09%, which shows an excellent generalization ability of our proposed method.
Collapse
Affiliation(s)
- Lu Zhang
- College of Information Engineering, Shanghai Maritime University, Shanghai 201306, China; (L.Z.); (X.Q.); (M.L.)
| | - Xinyi Qin
- College of Information Engineering, Shanghai Maritime University, Shanghai 201306, China; (L.Z.); (X.Q.); (M.L.)
| | - Min Liu
- College of Information Engineering, Shanghai Maritime University, Shanghai 201306, China; (L.Z.); (X.Q.); (M.L.)
| | - Ziwei Xu
- Polytech Nantes, Bâtiment Ireste, 44300 Nantes, France;
| | - Guangzhong Liu
- College of Information Engineering, Shanghai Maritime University, Shanghai 201306, China; (L.Z.); (X.Q.); (M.L.)
| |
Collapse
|
50
|
Abbas Z, Tayara H, Chong KT. 4mCPred-CNN-Prediction of DNA N4-Methylcytosine in the Mouse Genome Using a Convolutional Neural Network. Genes (Basel) 2021; 12:296. [PMID: 33672576 PMCID: PMC7924022 DOI: 10.3390/genes12020296] [Citation(s) in RCA: 14] [Impact Index Per Article: 3.5] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/24/2021] [Revised: 02/16/2021] [Accepted: 02/17/2021] [Indexed: 02/07/2023] Open
Abstract
Among DNA modifications, N4-methylcytosine (4mC) is one of the most significant ones, and it is linked to the development of cell proliferation and gene expression. To know different its biological functions, the accurate detection of 4mC sites is required. Although we have several techniques for the prediction of 4mC sites in different genomes based on both machine learning (ML) and convolutional neural networks (CNNs), there is no CNN-based tool for the identification of 4mC sites in the mouse genome. In this article, a CNN-based model named 4mCPred-CNN was developed to classify 4mC locations in the mouse genome. Until now, we had only two ML-based models for this purpose; they utilized several feature encoding schemes, and thus still had a lot of space available to improve the prediction accuracy. Utilizing only a single feature encoding scheme-one-hot encoding-we outperformed both of the previous ML-based techniques. In a ten-fold validation test, the proposed model, 4mCPred-CNN, achieved an accuracy of 85.71% and Matthews correlation coefficient (MCC) of 0.717. On an independent dataset, the achieved accuracy was 87.50% with an MCC value of 0.750. The attained results exhibit that the proposed model can be of great use for researchers in the fields of biology and bioinformatics.
Collapse
Affiliation(s)
- Zeeshan Abbas
- Department of Electronics and Information Engineering, Jeonbuk National University, Jeonju 54896, Korea;
- Institute of Avionics and Aeronautics (IAA), Air University, Islamabad 44000, Pakistan
| | - Hilal Tayara
- School of International Engineering and Science, Jeonbuk National University, Jeonju 54896, Korea
| | - Kil To Chong
- Department of Electronics and Information Engineering, Jeonbuk National University, Jeonju 54896, Korea;
- Advanced Electronics and Information Research Center, Jeonbuk National University, Jeonju 54896, Korea
| |
Collapse
|