1. Fu H, Ding Z, Wang W. Trans-m5C: A transformer-based model for predicting 5-methylcytosine (m5C) sites. Methods 2025; 234:178-186. [PMID: 39742984 DOI: 10.1016/j.ymeth.2024.12.010] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/24/2023] [Revised: 10/31/2024] [Accepted: 12/11/2024] [Indexed: 01/04/2025] Open
Abstract
5-Methylcytosine (m5C) plays a pivotal role in various RNA metabolic processes, including RNA localization, stability, and translation. Current high-throughput sequencing technologies for m5C site identification are resource-intensive in terms of cost, labor, and time. As such, there is a pressing need for efficient computational approaches. Many existing computational methods rely on intricate hand-crafted features, some of which may not be readily available, often leading to suboptimal prediction accuracy. To address these challenges, we introduce a novel deep-learning method, Trans-m5C. We first categorize m5C sites into NSUN2-dependent and NSUN6-dependent types for independent feature extraction. Subsequently, carefully designed transformer neural networks are employed to distill global features. The prediction of m5C sites is then accomplished using a discriminator built from a multi-layer perceptron. A rigorous evaluation of the performance of Trans-m5C on experimentally validated m5C data from human and mouse reveals that our method offers a competitive edge over both baseline and existing methods.
Affiliation(s)
- Haitao Fu
  - School of Artificial Intelligence, Hubei University, Wuhan, 430062, China
- Zewen Ding
  - University of Edinburgh, Centre for Discovery Brain Sciences, Edinburgh, EH89XD, United Kingdom
- Wen Wang
  - University of Edinburgh, Queen's Medical Research Institute, Edinburgh, EH164TJ, United Kingdom.
2. Tahir M, Hussain S, Alarfaj FK. An Integrated Multi-Model Framework Utilizing Convolutional Neural Networks Coupled with Feature Extraction for Identification of 4mC Sites in DNA Sequences. Comput Biol Med 2024; 183:109281. [PMID: 39461102 DOI: 10.1016/j.compbiomed.2024.109281] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/29/2024] [Revised: 09/19/2024] [Accepted: 10/14/2024] [Indexed: 10/29/2024]
Abstract
N4-methylcytosine (4mC) is a chemical modification that occurs on one of the four nucleotide bases in DNA and plays a vital role in DNA expression, repair, and replication. It also actively participates in the regulation of cell differentiation and gene expression. Consequently, understanding the role of 4mC in epigenetic regulation is important for revealing the complexities of gene expression and the associated cellular operations. However, the resource requirements and time constraints of the experimental procedure present challenges to the cellular culture process. While data-driven methodologies offer promising ways to reduce the need for extensive experimental effort, their performance depends on the availability of suitable, high-quality data. This study presents a multi-model framework that integrates a convolutional neural network (CNN) with distributed k-mer and embedding feature extraction techniques to enhance the identification of 4mC sites in DNA sequences. The integration of k-mers ensures an effective representation of local sequence patterns, while the embedding enables a more holistic encoding by considering the broader context and semantics of the sequence data. The resulting distributed representation of the DNA sequence is then fed into the CNN, where a set of learnable filters convolves across the sequence to detect important local patterns. The proposed framework was applied to six publicly available datasets and evaluated against the 4mCPred, 4mCCNN, iDNA4mC, Meta-4mCpred, DeepTorrent, 4mCPred-SVM, and DMKL-HFIS methods. The evaluation was based on accuracy, specificity, sensitivity, and Matthews correlation coefficient. The results demonstrated that the proposed multi-model framework outperformed the state-of-the-art methods, as well as one-hot encoding and the hybrid of one-hot and TNC features, in accurately identifying 4mC sites.
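As a rough illustration of the k-mer step this abstract describes, the sketch below splits a DNA sequence into overlapping k-mers and maps them to integer ids that an embedding layer could consume. The function names `kmer_tokens` and `build_vocab`, the choice of k = 3, and the stride of 1 are assumptions made for illustration; the abstract does not state the authors' settings.

```python
def kmer_tokens(sequence, k=3):
    """Split a DNA sequence into overlapping k-mers (sliding window, stride 1)."""
    if k <= 0:
        raise ValueError("k must be positive")
    sequence = sequence.upper()
    # A sequence shorter than k produces an empty list.
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def build_vocab(token_lists):
    """Assign each distinct k-mer an integer id in first-seen order."""
    vocab = {}
    for tokens in token_lists:
        for token in tokens:
            vocab.setdefault(token, len(vocab))
    return vocab


# Hypothetical toy sequence, not taken from any benchmark dataset.
tokens = kmer_tokens("ACGTAC", k=3)
vocab = build_vocab([tokens])
ids = [vocab[token] for token in tokens]
print(tokens, ids)
```

The integer ids would typically index rows of an embedding matrix whose output is passed to the convolutional layers.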
Affiliation(s)
- Muhammad Tahir
  - Department of Electrical and Computer Engineering, University of Manitoba, Winnipeg, Manitoba, R3T5V6, Canada; Department of Computer Science, Abdul Wali Khan University, Mardan, 23200, Pakistan.
- Shahid Hussain
  - Innovation Value Institute (IVI), School of Business, National University of Ireland Maynooth (NUIM), Maynooth, Co. Kildare, W23 F2H6, Ireland.
- Fawaz Khaled Alarfaj
  - Department of Management Information Systems (MIS), School of Business, King Faisal University (KFU), Al-Ahsa, 31982, Saudi Arabia.
3. Lu Q, Xu J, Zhang R, Liu H, Wang M, Liu X, Yue Z, Gao Y. RiceSNP-ABST: a deep learning approach to identify abiotic stress-associated single nucleotide polymorphisms in rice. Brief Bioinform 2024; 26:bbae702. [PMID: 39757606 PMCID: PMC11962596 DOI: 10.1093/bib/bbae702] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/02/2024] [Revised: 11/16/2024] [Accepted: 12/23/2024] [Indexed: 01/07/2025] Open
Abstract
Given the adverse effects faced by rice due to abiotic stresses, the precise and rapid identification of single nucleotide polymorphisms (SNPs) associated with abiotic stress traits (ABST-SNPs) in rice is crucial for developing resistant rice varieties. The scarcity of high-quality data related to abiotic stress in rice has hindered the development of computational models and constrained research efforts aimed at rice improvement and breeding. Genome-wide association studies provide a better statistical power to consider ABST-SNPs in rice. Meanwhile, deep learning methods have shown their capability in predicting disease- or phenotype-associated loci, but have primarily focused on human species. Therefore, developing predictive models for identifying ABST-SNPs in rice is both urgent and valuable. In this paper, a model called RiceSNP-ABST is proposed for predicting ABST-SNPs in rice. Firstly, six training datasets were generated using a novel strategy for negative sample construction. Secondly, four feature encoding methods were proposed based on DNA sequence fragments, followed by feature selection. Finally, convolutional neural networks with residual connections were used to determine whether the sequences contained rice ABST-SNPs. RiceSNP-ABST outperformed traditional machine learning and state-of-the-art methods on the benchmark dataset and demonstrated consistent generalization on an independent dataset and cross-species datasets. Notably, multi-granularity causal structure learning was employed to elucidate the relationships among DNA structural features, aiming to identify key genetic variants more effectively. The web-based tool for the RiceSNP-ABST can be accessed at http://rice-snp-abst.aielab.cc.
Affiliation(s)
- Quan Lu
  - School of Information and Artificial Intelligence, Anhui Provincial Engineering Research Center for Beidou Precision Agriculture Information, Anhui Agricultural University, 130, Changjiang West Road, Hefei, Anhui Province 230036, China
- Jiajun Xu
  - School of Information and Artificial Intelligence, Anhui Provincial Engineering Research Center for Beidou Precision Agriculture Information, Anhui Agricultural University, 130, Changjiang West Road, Hefei, Anhui Province 230036, China
- Renyi Zhang
  - School of Information and Artificial Intelligence, Anhui Provincial Engineering Research Center for Beidou Precision Agriculture Information, Anhui Agricultural University, 130, Changjiang West Road, Hefei, Anhui Province 230036, China
- Hangcheng Liu
  - School of Information and Artificial Intelligence, Anhui Provincial Engineering Research Center for Beidou Precision Agriculture Information, Anhui Agricultural University, 130, Changjiang West Road, Hefei, Anhui Province 230036, China
- Meng Wang
  - School of Information and Artificial Intelligence, Anhui Provincial Engineering Research Center for Beidou Precision Agriculture Information, Anhui Agricultural University, 130, Changjiang West Road, Hefei, Anhui Province 230036, China
- Xiaoshuang Liu
  - Research Center for Biological Breeding Technology, Advance Academy, Anhui Agricultural University, 130, Changjiang West Road, Hefei, Anhui Province 230036, China
- Zhenyu Yue
  - School of Information and Artificial Intelligence, Anhui Provincial Engineering Research Center for Beidou Precision Agriculture Information, Anhui Agricultural University, 130, Changjiang West Road, Hefei, Anhui Province 230036, China
  - Research Center for Biological Breeding Technology, Advance Academy, Anhui Agricultural University, 130, Changjiang West Road, Hefei, Anhui Province 230036, China
- Yujia Gao
  - School of Information and Artificial Intelligence, Anhui Provincial Engineering Research Center for Beidou Precision Agriculture Information, Anhui Agricultural University, 130, Changjiang West Road, Hefei, Anhui Province 230036, China
  - Research Center for Biological Breeding Technology, Advance Academy, Anhui Agricultural University, 130, Changjiang West Road, Hefei, Anhui Province 230036, China
4. Jin J, Feng J. iDHS-RGME: Identification of DNase I hypersensitive sites by integrating information on nucleotide composition and physicochemical properties. Biochem Biophys Res Commun 2024; 734:150618. [PMID: 39222575 DOI: 10.1016/j.bbrc.2024.150618] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/18/2024] [Revised: 08/19/2024] [Accepted: 08/28/2024] [Indexed: 09/04/2024]
Abstract
As pivotal markers of chromatin accessibility, DNase I hypersensitive sites (DHSs) intimately link to fundamental biological processes encompassing gene expression regulation and disease pathogenesis. Developing efficient and precise algorithms for DHSs identification holds paramount importance for unraveling genome functionality and elucidating disease mechanisms. This study innovatively presents iDHS-RGME, an Extremely Randomized Trees (Extra-Trees)-based algorithm that integrates unique feature extraction techniques for enhanced DHSs prediction. Specifically, iDHS-RGME utilizes two feature extraction approaches: Reverse Complementary Kmer (RCKmer) and Geary Spatial Autocorrelation (GSA), which comprehensively capture sequence attributes from diverse angles, bolstering information richness and accuracy. To address data imbalance, Borderline-SMOTE is employed, followed by Maximum Information Coefficient (MIC) for meticulous feature selection. Comparative evaluations underscored the superiority of the Extra-Trees classifier, which was subsequently adopted for model prediction. Through rigorous five-fold cross-validation, iDHS-RGME achieved remarkable accuracies of 94.71 % and 95.07 % on two independent datasets, outperforming previous models in terms of both precision and effectiveness.
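The Reverse Complementary Kmer (RCKmer) encoding named in this abstract can be sketched as follows, assuming the commonly used definition in which each k-mer is merged with its reverse complement before frequencies are computed. The abstract does not give the authors' exact formulation, so the function names, the default k = 2, and the handling of ambiguous bases are illustrative assumptions.

```python
from collections import Counter
from itertools import product

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an A/C/G/T string."""
    return seq.translate(_COMPLEMENT)[::-1]


def canonical(kmer):
    """Pick one representative for a k-mer and its reverse complement."""
    return min(kmer, reverse_complement(kmer))


def rckmer_features(sequence, k=2):
    """Normalized frequencies of canonical k-mers, keyed in sorted order."""
    sequence = sequence.upper()
    keys = sorted({canonical("".join(p)) for p in product("ACGT", repeat=k)})
    counts = Counter(
        canonical(sequence[i:i + k])
        for i in range(len(sequence) - k + 1)
        # Assumption: windows containing non-ACGT symbols (e.g. N) are skipped.
        if set(sequence[i:i + k]) <= set("ACGT")
    )
    total = sum(counts.values())
    return {key: (counts[key] / total if total else 0.0) for key in keys}


features = rckmer_features("ACGT", k=2)
print(len(features), features["AC"], features["CG"])
```

With k = 2 the 16 dinucleotides collapse to 10 canonical features, because four of them (AT, TA, CG, GC) are their own reverse complements.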
Affiliation(s)
- Jian Jin
  - School of Science, Minzu University of China, Beijing, 100081, China
- Jie Feng
  - School of Science, Minzu University of China, Beijing, 100081, China.
5. Yu X, Yani C, Wang Z, Long H, Zeng R, Liu X, Anas B, Ren J. iDNA-ITLM: An interpretable and transferable learning model for identifying DNA methylation. PLoS One 2024; 19:e0301791. [PMID: 39480834 PMCID: PMC11527195 DOI: 10.1371/journal.pone.0301791] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/16/2023] [Accepted: 03/20/2024] [Indexed: 11/02/2024] Open
Abstract
In this study, we approach the identification of DNA methylation sites from an image-processing perspective and propose the iDNA-ITLM model. It uses a novel data enhancement strategy that repeatedly self-replicates a short DNA sequence into a longer one and then embeds it into a high-dimensional matrix to enlarge the receptive field. Our model consistently outperforms the current state-of-the-art sequence-based DNA methylation site recognition methods when evaluated on 17 benchmark datasets that cover multiple species and include three DNA methylation modifications (4mC, 5hmC, and 6mA). The experimental results demonstrate the robustness and superior performance of our model across these datasets. In addition, our model transfers to RNA methylation sequences and produces good results without modification of its hyperparameters. The proposed iDNA-ITLM model can be considered a universal predictor across DNA and RNA methylation species.
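One plausible reading of the self-replication strategy in this abstract is sketched below: a short sequence is tiled cyclically until it fills a fixed-size grid, giving a two-dimensional input that image-style models can process. The integer base coding, the grid shape, and the function name `replicate_to_matrix` are assumptions for illustration only; the abstract does not specify how the authors build or embed the matrix.

```python
from itertools import cycle, islice

# Assumed integer coding of bases; the paper's embedding is not described here.
BASE_TO_INT = {"A": 0, "C": 1, "G": 2, "T": 3}


def replicate_to_matrix(sequence, rows, cols):
    """Tile a short sequence cyclically and reshape it into a rows x cols grid."""
    if not sequence:
        raise ValueError("sequence must be non-empty")
    codes = [BASE_TO_INT[base] for base in sequence.upper()]
    # Repeat the sequence end to end until rows * cols positions are filled.
    tiled = list(islice(cycle(codes), rows * cols))
    return [tiled[r * cols:(r + 1) * cols] for r in range(rows)]


# Hypothetical 4-base input expanded to a 3 x 6 grid.
matrix = replicate_to_matrix("ACGT", rows=3, cols=6)
for row in matrix:
    print(row)
```

Because the tiling wraps around, positions that are far apart in the original sequence can end up adjacent in the grid, which is one way a larger receptive field could arise.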
Affiliation(s)
- Xia Yu
  - School of Information and Communication Engineering, Hainan University, Haikou, Hainan, China
  - Key Laboratory of Data Science and Smart Education, Ministry of Education, Hainan Normal University, Haikou, Hainan, China
- Cui Yani
  - School of Information and Communication Engineering, Hainan University, Haikou, Hainan, China
- Zhichao Wang
  - Unit 32033, The People’s Liberation Army, Beijing, China
- Haixia Long
  - Key Laboratory of Data Science and Smart Education, Ministry of Education, Hainan Normal University, Haikou, Hainan, China
- Rao Zeng
  - Key Laboratory of Data Science and Smart Education, Ministry of Education, Hainan Normal University, Haikou, Hainan, China
- Xiling Liu
  - Key Laboratory of Data Science and Smart Education, Ministry of Education, Hainan Normal University, Haikou, Hainan, China
- Bilal Anas
  - Key Laboratory of Data Science and Smart Education, Ministry of Education, Hainan Normal University, Haikou, Hainan, China
- Jia Ren
  - School of Information and Communication Engineering, Hainan University, Haikou, Hainan, China
6. Han K, Wang J, Chu Y, Liao Q, Ding Y, Zheng D, Wan J, Guo X, Zou Q. Deep learning based method for predicting DNA N6-methyladenosine sites. Methods 2024; 230:91-98. [PMID: 39097179 DOI: 10.1016/j.ymeth.2024.07.012] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/23/2024] [Revised: 07/22/2024] [Accepted: 07/29/2024] [Indexed: 08/05/2024] Open
Abstract
DNA N6-methyladenine (6mA) plays an important role in many biological processes, and accurately identifying its sites helps one to understand its biological effects more comprehensively. Traditional experimental methods are very labor-intensive, and traditional machine learning methods appear increasingly insufficient as 6mA methylation databases grow progressively larger. We therefore propose a deep learning-based method, a multi-scale convolutional model based on global response normalization (CG6mA), to address the prediction of 6mA sites. The method is compared with other methods on three different kinds of benchmark datasets, and the results show that our model achieves better prediction results.
Affiliation(s)
- Ke Han
  - School of Computer and Information Engineering, Harbin University of Commerce, Harbin 150028, China
- Jianchun Wang
  - School of Computer and Information Engineering, Harbin University of Commerce, Harbin 150028, China
- Ying Chu
  - School of Computer and Information Engineering, Harbin University of Commerce, Harbin 150028, China
- Qian Liao
  - School of Computer and Information Engineering, Harbin University of Commerce, Harbin 150028, China
- Yijie Ding
  - Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou 324000, China
- Dequan Zheng
  - School of Computer and Information Engineering, Harbin University of Commerce, Harbin 150028, China
- Jie Wan
  - Laboratory for Space Environment and Physical Sciences, Harbin Institute of Technology, Harbin 150001, China
- Xiaoyi Guo
  - Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou 324000, China.
- Quan Zou
  - Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou 324000, China.
7. Nguyen VN, Ho TT, Doan TD, Le NQK. Using a hybrid neural network architecture for DNA sequence representation: A study on N4-methylcytosine sites. Comput Biol Med 2024; 178:108664. [PMID: 38875905 DOI: 10.1016/j.compbiomed.2024.108664] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/05/2024] [Revised: 05/11/2024] [Accepted: 05/26/2024] [Indexed: 06/16/2024]
Abstract
N4-methylcytosine (4mC) is a modified form of cytosine found in DNA, contributing to epigenetic regulation. It exists in various genomes, including the Rosaceae family encompassing significant fruit crops like apples, cherries, and roses. Previous investigations have examined the distribution and functional implications of 4mC sites within the Rosaceae genome, focusing on their potential roles in gene expression regulation, environmental adaptation, and evolution. This research aims to improve the accuracy of predicting 4mC sites within the genome of Fragaria vesca, a Rosaceae plant species. Building upon the original 4mc-w2vec method, which combines word embedding processing and a convolutional neural network (CNN), we have incorporated additional feature encoding techniques and leveraged pre-trained natural language processing (NLP) models with different deep learning architectures including different forms of CNN, recurrent neural networks (RNN) and long short-term memory (LSTM). Our assessments have shown that the best model is derived from a CNN model using fastText encoding. This model demonstrates enhanced performance, achieving a sensitivity of 0.909, specificity of 0.77, and accuracy of 0.879 on an independent dataset. Furthermore, our model surpasses previously published works on the same dataset, thus showcasing its superior predictive capabilities.
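For readers less familiar with the metrics quoted in this abstract, the sketch below computes sensitivity, specificity, and accuracy from confusion-matrix counts, together with the Matthews correlation coefficient that other entries in this list report. The counts in the example are hypothetical and are not taken from the paper.

```python
import math


def binary_metrics(tp, fp, tn, fn):
    """Standard binary-classification metrics from confusion-matrix counts."""
    total = tp + fp + tn + fn
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0  # true positive rate
    specificity = tn / (tn + fp) if (tn + fp) else 0.0  # true negative rate
    accuracy = (tp + tn) / total if total else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is conventionally set to 0 when any marginal total is zero.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": accuracy,
        "mcc": mcc,
    }


# Hypothetical counts chosen only to illustrate the formulas.
metrics = binary_metrics(tp=90, fp=20, tn=80, fn=10)
print(metrics)
```

Accuracy is a prevalence-weighted mix of sensitivity and specificity, so on an imbalanced test set it can sit much closer to one of the two than to the other.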
Affiliation(s)
- Van-Nui Nguyen
  - University of Information and Communication Technology, Thai Nguyen University, Thai Nguyen, Viet Nam
- Trang-Thi Ho
  - Department of Computer Science and Information Engineering, TamKang University, New Taipei, 251301, Taiwan
- Thu-Dung Doan
  - International Degree Program in Animal Vaccine Technology, International College, National Pingtung University of Science and Technology, Pingtung, Taiwan
- Nguyen Quoc Khanh Le
  - Professional Master Program in Artificial Intelligence in Medicine, College of Medicine, Taipei Medical University, Taipei, 110, Taiwan; Research Center for Artificial Intelligence in Medicine, Taipei Medical University, Taipei, 110, Taiwan; AIBioMed Research Group, Taipei Medical University, Taipei, 110, Taiwan; Translational Imaging Research Center, Taipei Medical University Hospital, Taipei, 110, Taiwan.
8. Xie GB, Yu Y, Lin ZY, Chen RB, Xie JH, Liu ZG. 4 mC site recognition algorithm based on pruned pre-trained DNABert-Pruning model and fused artificial feature encoding. Anal Biochem 2024; 689:115492. [PMID: 38458307 DOI: 10.1016/j.ab.2024.115492] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/27/2023] [Accepted: 02/21/2024] [Indexed: 03/10/2024]
Abstract
DNA 4 mC plays a crucial role in the genetic expression process of organisms. However, existing deep learning algorithms have shortcomings in the ability to represent DNA sequence features. In this paper, we propose a 4 mC site identification algorithm, DNABert-4mC, based on a fusion of the pruned pre-training DNABert-Pruning model and artificial feature encoding to identify 4 mC sites. The algorithm prunes and compresses the DNABert model, resulting in the pruned pre-training model DNABert-Pruning. This model reduces the number of parameters and removes redundancy from output features, yielding more precise feature representations while upholding accuracy. Simultaneously, the algorithm constructs an artificial feature encoding module to assist the DNABert-Pruning model in feature representation, effectively supplementing the information that is missing from the pre-trained features. The algorithm also introduces the AFF-4mC fusion strategy, which combines artificial feature encoding with the DNABert-Pruning model, to improve the feature representation capability of DNA sequences in multi-semantic spaces and better extract 4 mC sites and the distribution of nucleotide importance within the sequence. In experiments on six independent test sets, the DNABert-4mC algorithm achieved an average AUC value of 93.81%, outperforming seven other advanced algorithms with improvements of 2.05%, 5.02%, 11.32%, 5.90%, 12.02%, 2.42% and 2.34%, respectively.
Affiliation(s)
- Guo-Bo Xie
  - Guangdong University of Technology, Guangzhou, 510000, China
- Yi Yu
  - Guangdong University of Technology, Guangzhou, 510000, China
- Zhi-Yi Lin
  - Guangdong University of Technology, Guangzhou, 510000, China.
- Rui-Bin Chen
  - Guangdong University of Technology, Guangzhou, 510000, China
- Jian-Hui Xie
  - Guangdong University of Technology, Guangzhou, 510000, China
- Zhen-Guo Liu
  - Department of Thoracic Surgery, The First Affiliated Hospital of Sun Yat-sen University, 58 Zhongshan 2nd Road, Guangzhou, 510080, China.
9. Yu X, Ren J, Long H, Zeng R, Zhang G, Bilal A, Cui Y. iDNA-OpenPrompt: OpenPrompt learning model for identifying DNA methylation. Front Genet 2024; 15:1377285. [PMID: 38689652 PMCID: PMC11058834 DOI: 10.3389/fgene.2024.1377285] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/27/2024] [Accepted: 03/07/2024] [Indexed: 05/02/2024] Open
Abstract
Introduction: DNA methylation is a critical epigenetic modification involving the addition of a methyl group to the DNA molecule, playing a key role in regulating gene expression without changing the DNA sequence. The main difficulty in identifying DNA methylation sites lies in the subtle and complex nature of methylation patterns, which may vary across different tissues, developmental stages, and environmental conditions. Traditional methods for methylation site identification, such as bisulfite sequencing, are typically labor-intensive, costly, and require large amounts of DNA, hindering high-throughput analysis. Moreover, these methods may not always provide the resolution needed to detect methylation at specific sites, especially in genomic regions that are rich in repetitive sequences or have low levels of methylation. Furthermore, current deep learning approaches generally lack sufficient accuracy. Methods: This study introduces the iDNA-OpenPrompt model, leveraging the novel OpenPrompt learning framework. The model combines a prompt template, prompt verbalizer, and Pre-trained Language Model (PLM) to construct the prompt-learning framework for DNA methylation sequences. Moreover, a DNA vocabulary library, BERT tokenizer, and specific label words are also introduced into the model to enable accurate identification of DNA methylation sites. Results and Discussion: An extensive analysis is conducted to evaluate the predictive performance, reliability, and consistency of the iDNA-OpenPrompt model. The experimental outcomes, covering 17 benchmark datasets that include various species and three DNA methylation modifications (4mC, 5hmC, 6mA), consistently indicate that our model surpasses existing approaches in performance and robustness.
Affiliation(s)
- Xia Yu
  - School of Information and Communication Engineering, Hainan University, Haikou, Hainan, China
  - School of Information Science and Technology, Hainan Normal University, Haikou, Hainan, China
- Jia Ren
  - School of Information and Communication Engineering, Hainan University, Haikou, Hainan, China
- Haixia Long
  - School of Information Science and Technology, Hainan Normal University, Haikou, Hainan, China
- Rao Zeng
  - School of Information Science and Technology, Hainan Normal University, Haikou, Hainan, China
- Guoqiang Zhang
  - School of Information Science and Technology, Hainan Normal University, Haikou, Hainan, China
- Anas Bilal
  - School of Information Science and Technology, Hainan Normal University, Haikou, Hainan, China
- Yani Cui
  - School of Information and Communication Engineering, Hainan University, Haikou, Hainan, China
10. Xin R, Zhang F, Zheng J, Zhang Y, Yu C, Feng X. SDBA: Score Domain-Based Attention for DNA N4-Methylcytosine Site Prediction from Multiperspectives. J Chem Inf Model 2024; 64:2839-2853. [PMID: 37646411 DOI: 10.1021/acs.jcim.3c00688] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Indexed: 09/01/2023]
Abstract
In tasks related to DNA sequence classification, choosing appropriate encoding methods is challenging. Some methods encode sequences based on prior knowledge, which limits the ability of the model to obtain multiperspective information from the sequences. We introduce SDBA (Score Domain-Based Attention), a new trainable ensemble method based on the attention mechanism. Unlike other methods, we feed task-independent encoding results into the models and dynamically ensemble features from different perspectives using the SDBA mechanism. This approach allows the model to acquire and weight sequence features autonomously. SDBA is conceptually general and empirically powerful. It has achieved new state-of-the-art results on the benchmark data sets associated with DNA N4-methylcytosine site prediction.
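The general idea of attention-weighted ensembling over several encodings can be sketched as below: each encoding contributes a feature vector, a score per encoding is turned into a weight with softmax, and the vectors are combined as a weighted sum. This is a generic illustration and not the SDBA mechanism itself, whose details are not given in the abstract; in a trained model the scores would be learned, whereas here they are fixed inputs, and all names are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_ensemble(features, scores):
    """Fuse equally sized feature vectors using softmax(scores) as weights."""
    if not features or len(features) != len(scores):
        raise ValueError("need one score per feature vector")
    dim = len(features[0])
    if any(len(f) != dim for f in features):
        raise ValueError("feature vectors must share one dimension")
    weights = softmax(scores)
    fused = [sum(w * f[i] for w, f in zip(weights, features)) for i in range(dim)]
    return fused, weights


# Two hypothetical encodings of the same sequence with equal scores.
fused, weights = attention_ensemble([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])
print(fused, weights)
```

Raising the score of one encoding shifts the fused vector toward that encoding, which is how a model could learn to favor the most informative perspective for a given task.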
Affiliation(s)
- Ruihao Xin
  - College of Information and Control Engineering, Jilin Institute of Chemical Technology, Jilin 130000, P.R. China
  - College of Computer Science and Technology, and Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun 130012, P.R. China
- Fan Zhang
  - College of Information and Control Engineering, Jilin Institute of Chemical Technology, Jilin 130000, P.R. China
- Jiaxin Zheng
  - College of Computer Science and Technology, and Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun 130012, P.R. China
- Yangyi Zhang
  - University of Melbourne Centre for Cancer Research, Victorian Comprehensive Cancer Centre, University of Melbourne, Parkville, Victoria 3050, Australia
- Cuinan Yu
  - College of Computer Science and Technology, and Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun 130012, P.R. China
- Xin Feng
  - School of Science, Jilin Institute of Chemical Technology, Jilin 130000, P.R. China
  - State Key Laboratory of Inorganic Synthesis and Preparative Chemistry, College of Chemistry, Jilin University, Changchun 130012, P.R. China
11. Yao Z, Li F, Xie W, Chen J, Wu J, Zhan Y, Wu X, Wang Z, Zhang G. DeepSF-4mC: A deep learning model for predicting DNA cytosine 4mC methylation sites leveraging sequence features. Comput Biol Med 2024; 171:108166. [PMID: 38382385 DOI: 10.1016/j.compbiomed.2024.108166] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/06/2023] [Revised: 02/15/2024] [Accepted: 02/15/2024] [Indexed: 02/23/2024]
Abstract
N4-methylcytosine (4mC) is a DNA modification involving the addition of a methyl group to the fourth nitrogen atom of the cytosine base. This modification may influence gene regulation, providing potential insights into gene control mechanisms. Traditional laboratory methods for detecting 4mC DNA methylation have limitations, but the rise of artificial intelligence has introduced efficient computational strategies for 4mC site prediction. Despite this progress, challenges persist in terms of model performance and interpretability. To tackle these challenges, we propose DeepSF-4mC, a deep learning model specifically designed for predicting DNA cytosine 4mC methylation sites by leveraging sequence features. Our approach incorporates multiple encoding techniques to enhance prediction accuracy, increase model stability, and reduce the computational resources needed. Leveraging transfer learning, we harness existing models to enhance performance through learned representations or fine-tuning. Ensemble learning techniques combine predictions from multiple models, boosting robustness and accuracy. This research contributes to DNA methylation analysis and lays the groundwork for understanding 4mC's multifaceted role in biological processes. The web server for DeepSF-4mC is accessible at: http://deepsf-4mc.top/ and the original code can be found at: https://github.com/754131799/DeepSF-4mC.
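The abstract says that predictions from multiple models are combined but does not say how. One common option, shown here purely as an assumption, is soft voting: average the per-sample probabilities from each model and threshold the mean. The function name `soft_vote`, the 0.5 threshold, and the example probabilities are all hypothetical.

```python
def soft_vote(prob_lists, threshold=0.5):
    """Average per-model probabilities sample by sample, then threshold the mean."""
    if not prob_lists:
        raise ValueError("need predictions from at least one model")
    n_samples = len(prob_lists[0])
    if any(len(probs) != n_samples for probs in prob_lists):
        raise ValueError("every model must score the same samples")
    n_models = len(prob_lists)
    # zip(*...) groups the probabilities that different models gave one sample.
    mean_probs = [sum(column) / n_models for column in zip(*prob_lists)]
    labels = [int(p >= threshold) for p in mean_probs]
    return mean_probs, labels


# Hypothetical outputs of three models on two candidate sites.
mean_probs, labels = soft_vote([[0.9, 0.2], [0.7, 0.4], [0.8, 0.3]])
print(mean_probs, labels)
```

Averaging probabilities tends to smooth out the errors of individual models, which is consistent with the robustness gain the abstract attributes to ensembling.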
Collapse
Affiliation(s)
- Zhaomin Yao
- Department of Nuclear Medicine, General Hospital of Northern Theater Command, Shenyang, Liaoning, 110016, China; College of Medicine and Biological Information Engineering, Northeastern University, Shenyang, Liaoning, 110167, China
| | - Fei Li
- College of Computer Science and Technology, Jilin University, Changchun, Jilin, 130012, China
| | - Weiming Xie
- Department of Nuclear Medicine, General Hospital of Northern Theater Command, Shenyang, Liaoning, 110016, China; College of Medicine and Biological Information Engineering, Northeastern University, Shenyang, Liaoning, 110167, China
| | - Jiaming Chen
- Department of Nuclear Medicine, General Hospital of Northern Theater Command, Shenyang, Liaoning, 110016, China; College of Medicine and Biological Information Engineering, Northeastern University, Shenyang, Liaoning, 110167, China
| | - Jiezhang Wu
- Department of Nuclear Medicine, General Hospital of Northern Theater Command, Shenyang, Liaoning, 110016, China; College of Medicine and Biological Information Engineering, Northeastern University, Shenyang, Liaoning, 110167, China
| | - Ying Zhan
- Department of Nuclear Medicine, General Hospital of Northern Theater Command, Shenyang, Liaoning, 110016, China; College of Medicine and Biological Information Engineering, Northeastern University, Shenyang, Liaoning, 110167, China
| | - Xiaodan Wu
- Department of Nuclear Medicine, General Hospital of Northern Theater Command, Shenyang, Liaoning, 110016, China
| | - Zhiguo Wang
- Department of Nuclear Medicine, General Hospital of Northern Theater Command, Shenyang, Liaoning, 110016, China; College of Medicine and Biological Information Engineering, Northeastern University, Shenyang, Liaoning, 110167, China.
| | - Guoxu Zhang
- Department of Nuclear Medicine, General Hospital of Northern Theater Command, Shenyang, Liaoning, 110016, China; College of Medicine and Biological Information Engineering, Northeastern University, Shenyang, Liaoning, 110167, China.
|
12
|
Chen S, Semenov I, Zhang F, Yang Y, Geng J, Feng X, Meng Q, Lei K. An effective framework for predicting drug-drug interactions based on molecular substructures and knowledge graph neural network. Comput Biol Med 2024; 169:107900. [PMID: 38199213 DOI: 10.1016/j.compbiomed.2023.107900] [Citation(s) in RCA: 7] [Impact Index Per Article: 7.0] [Received: 09/11/2023] [Revised: 11/27/2023] [Accepted: 12/23/2023] [Indexed: 01/12/2024]
Abstract
Drug-drug interactions (DDIs) play a central role in drug research, as the simultaneous administration of multiple drugs can have harmful or beneficial effects. Harmful interactions lead to adverse reactions, some of which can be life-threatening, while beneficial interactions can promote efficacy. Therefore, it is crucial for physicians, patients, and the research community to identify potential DDIs. Although many AI-based techniques have been proposed for predicting DDIs, most existing computational models primarily focus on integrating multiple data sources or combining popular embedding methods. Researchers often overlook the valuable information within the molecular structure of drugs, or consider only the structural information of drugs and neglect the relationship or topological information between drugs and other biological objects. In this study, we propose MSKG-DDI, a two-component framework that incorporates a Drug Chemical Structure Graph-based component and a Drug Knowledge Graph-based component to capture multimodal characteristics of drugs. Subsequently, a multimodal fusion neural layer is utilized to explore the complementarity between multimodal representations of drugs. Extensive experiments were conducted using two real-world datasets, and the results demonstrate that MSKG-DDI outperforms other state-of-the-art models in binary-class, multi-class, and multi-label prediction tasks under both transductive and inductive settings. Furthermore, the ablation analysis confirms the practical usefulness of MSKG-DDI.
Affiliation(s)
- Siqi Chen
- School of Information Science and Engineering, Chongqing Jiaotong University, Chongqing, 400074, China
| | - Ivan Semenov
- College of Intelligence and Computing, Tianjin University, Tianjin, 300072, China
| | - Fengyun Zhang
- College of Intelligence and Computing, Tianjin University, Tianjin, 300072, China
| | - Yang Yang
- College of Intelligence and Computing, Tianjin University, Tianjin, 300072, China
| | - Jie Geng
- TianJin Chest Hospital, Tianjin University, Tianjin, 300222, China
| | - Xuequan Feng
- Tianjin First Central Hospital, Tianjin, 300192, China.
| | - Qinghua Meng
- Tianjin Key Laboratory of Sports Physiology and Sports Medicine, Tianjin University of Sport, Tianjin, 301617, China
| | - Kaiyou Lei
- College of Computer and Information Science, Southwest University, Chongqing, 400715, China
|
13
|
Zhang L, Xiao K, Kong L. A computational method for small molecule-RNA binding sites identification by utilizing position specificity and complex network information. Biosystems 2024; 235:105094. [PMID: 38056591 DOI: 10.1016/j.biosystems.2023.105094] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/03/2023] [Revised: 11/23/2023] [Accepted: 11/24/2023] [Indexed: 12/08/2023]
Abstract
Several computational methods have been proposed for identifying small molecule-RNA binding sites, which play a significant role in revealing biological function. However, it is still challenging to design an accurate model, especially as measured by the Matthews correlation coefficient (MCC). We designed a feature extraction technique based on two aspects: position specificity and complex network information. Specifically, a complex network was employed to express the spatial topological structure and sequence position information in order to improve prediction. The features fusing position specificity and complex network information were then input into a random forest classifier for model construction. AUC values of 88.22%, 77.92% and 81.46% were obtained on three independent datasets (RB19, CS71, RB78). The best MCC values were obtained on all three datasets, 8.19%, 0.59% and 4.35% higher, respectively, than those of the state-of-the-art prediction methods. This performance shows that our method is a powerful tool for identifying RNA binding sites, which can help in the design of RNA-targeting small molecule drugs. The data and source code are available at https://github.com/Kangxiaoneuq/PCN_RNAsite.
Affiliation(s)
- Lichao Zhang
- School of Mathematics and Statistics, Northeastern University at Qinhuangdao, Qinhuangdao, 066000, PR China; Hebei Innovation Center for Smart Perception and Applied Technology of Agricultural Data, Qinhuangdao, 066000, PR China.
| | - Kang Xiao
- School of Mathematics and Statistics, Northeastern University at Qinhuangdao, Qinhuangdao, 066000, PR China.
| | - Liang Kong
- Hebei Innovation Center for Smart Perception and Applied Technology of Agricultural Data, Qinhuangdao, 066000, PR China; School of Mathematics and Information Science & Technology, Hebei Normal University of Science & Technology, Qinhuangdao, 066000, PR China.
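The abstract above reports its headline gains in terms of MCC, the Matthews correlation coefficient. As a reminder of what that metric measures, here is a minimal sketch that computes it from the four cells of a binary confusion matrix. The function name `mcc` and the choice to return 0.0 when the denominator vanishes are illustrative assumptions, not taken from the cited paper or its code.

```python
import math


def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient from binary confusion-matrix counts.

    Returns a value in [-1, 1]. By a common convention (assumed here),
    0.0 is returned when any marginal total is zero.
    """
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0
    return (tp * tn - fp * fn) / denominator


if __name__ == "__main__":
    # Hypothetical counts, for illustration only.
    print(mcc(tp=50, tn=50, fp=0, fn=0))    # perfect prediction
    print(mcc(tp=25, tn=25, fp=25, fn=25))  # no better than chance
```

Unlike accuracy, MCC uses all four cells, which is why it is often preferred for imbalanced binding-site datasets where negatives greatly outnumber positives.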
|
14
|
Sultana A, Mitu SJ, Pathan MN, Uddin MN, Uddin MA, Aryal S. 4mC-CGRU: Identification of N4-Methylcytosine (4mC) sites using convolution gated recurrent unit in Rosaceae genome. Comput Biol Chem 2023; 107:107974. [PMID: 37944386 DOI: 10.1016/j.compbiolchem.2023.107974] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/02/2023] [Revised: 09/22/2023] [Accepted: 10/24/2023] [Indexed: 11/12/2023]
Abstract
DNA N4-methylcytosine (4mC) is an epigenetic modification that affects several biological functions without altering the DNA nucleotides, including DNA conformation, cell development, replication, stability, and DNA structural changes. 4mC plays a critical role in restriction-modification functions, preventing restriction enzymes from damaging self-DNA. Existing studies mainly focused on finding hand-crafted features to identify 4mC locations, but these methods are inefficient because they are time-consuming and costly. In this work, we propose 4mC-CGRU, a deep learning-based computational model with a standard encoding method that learns feature selection autonomously to identify 4mC sites from DNA sequences in the Rosaceae genome, particularly in Rosa chinensis (R. chinensis) and Fragaria vesca (F. vesca). The proposed model combines a convolutional neural network (CNN) and a gated recurrent unit (GRU) network. The CNN extracts useful features from the datasets and the GRU classifies the DNA sequences. Thus, our approach can automatically extract important features to detect relevant sites from DNA sequences. The performance analysis shows that the proposed model consistently outperforms state-of-the-art works in detecting 4mC sites.
Affiliation(s)
- Abida Sultana
- Department of Computer Science and Engineering, Green University of Bangladesh, Dhaka, Bangladesh.
| | - Sadia Jannat Mitu
- Department of Computer Science and Engineering, Jagannath University, Dhaka, Bangladesh.
| | - Md Naimul Pathan
- Department of Computer Science and Engineering, Green University of Bangladesh, Dhaka, Bangladesh.
| | - Mohammed Nasir Uddin
- Department of Computer Science and Engineering, Jagannath University, Dhaka, Bangladesh.
| | - Md Ashraf Uddin
- School of Information Technology, Deakin University Geelong, Australia.
| | - Sunil Aryal
- School of Information Technology, Deakin University Geelong, Australia.
|
15
|
Ding Y, Zhou H, Zou Q, Yuan L. Identification of drug-side effect association via correntropy-loss based matrix factorization with neural tangent kernel. Methods 2023; 219:73-81. [PMID: 37783242 DOI: 10.1016/j.ymeth.2023.09.008] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/26/2023] [Revised: 09/18/2023] [Accepted: 09/20/2023] [Indexed: 10/04/2023] Open
Abstract
Adverse drug reactions include side effects, allergic reactions, and secondary infections. Severe adverse reactions can cause cancer, deformity, or mutation. The monitoring of drug side effects is an important support for post-marketing safety supervision of drugs and an important basis for revising drug instructions. Its purpose is to detect and control drug safety risks in a timely manner. Traditional methods are time-consuming. To accelerate the discovery of side effects, we propose a machine learning-based method, called correntropy-loss based matrix factorization with neural tangent kernel (CLMF-NTK), to predict drug side effects. Our method and other computational methods are tested on three benchmark datasets, and the results show that our method achieves the best predictive performance.
Affiliation(s)
- Yijie Ding
- Key Laboratory of Computational Science and Application of Hainan Province, Hainan Normal University, Haikou 571158, China; Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou 324000, China; School of Electronic and Information Engineering, Suzhou University of Science and Technology, Suzhou 215009, China
| | - Hongmei Zhou
- Beidahuang Industry Group General Hospital, Harbin 150001, China
| | - Quan Zou
- Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou 324000, China.
| | - Lei Yuan
- Department of Hepatobiliary Surgery, Quzhou People's Hospital, 100# Minjiang Main Road, Quzhou 324000, China.
|
16
|
Jia J, Wei Z, Cao X. EMDL-ac4C: identifying N4-acetylcytidine based on ensemble two-branch residual connection DenseNet and attention. Front Genet 2023; 14:1232038. [PMID: 37519885 PMCID: PMC10372626 DOI: 10.3389/fgene.2023.1232038] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Received: 05/31/2023] [Accepted: 06/29/2023] [Indexed: 08/01/2023] Open
Abstract
Introduction: N4-acetylcytidine (ac4C) is a critical acetylation modification that has an essential function in protein translation and is associated with a number of human diseases. Methods: Identifying ac4C sites by biological experiments is cumbersome and costly, and the performance of several existing computational models needs to be improved. Therefore, we propose a new deep learning tool, EMDL-ac4C, to predict ac4C sites. It applies a simple one-hot encoding to an unbalanced dataset and uses a downsampled ensemble deep learning network to extract important features for identifying ac4C sites. The base learner of this ensemble model consists of a modified DenseNet and Squeeze-and-Excitation Networks. In addition, we add a convolutional residual structure in parallel with the dense block to achieve two-layer feature extraction. Results: The average accuracy (Acc), Matthews correlation coefficient (MCC), and area under the curve (AUC) of EMDL-ac4C on ten independent testing sets are 80.84%, 61.77%, and 87.94%, respectively. Discussion: Multiple experimental comparisons indicate that EMDL-ac4C outperforms existing predictors and greatly improves the predictive performance for ac4C sites. EMDL-ac4C could also provide a valuable reference for the next part of the study. The source code and experimental data are available at: https://github.com/13133989982/EMDLac4C.
Affiliation(s)
- Jianhua Jia
- *Correspondence: Jianhua Jia; Zhangying Wei
|
17
|
Nguyen-Vo TH, Trinh QH, Nguyen L, Nguyen-Hoang PU, Rahardja S, Nguyen BP. i4mC-GRU: Identifying DNA N4-Methylcytosine sites in mouse genomes using bidirectional gated recurrent unit and sequence-embedded features. Comput Struct Biotechnol J 2023; 21:3045-3053. [PMID: 37273848 PMCID: PMC10238585 DOI: 10.1016/j.csbj.2023.05.014] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Received: 08/03/2022] [Revised: 05/12/2023] [Accepted: 05/12/2023] [Indexed: 06/06/2023] Open
Abstract
N4-methylcytosine (4mC) is one of the most common DNA methylation modifications found in both prokaryotic and eukaryotic genomes. Since the 4mC has various essential biological roles, determining its location helps reveal unexplored physiological and pathological pathways. In this study, we propose an effective computational method called i4mC-GRU using a gated recurrent unit and duplet sequence-embedded features to predict potential 4mC sites in mouse (Mus musculus) genomes. To fairly assess the performance of the model, we compared our method with several state-of-the-art methods using two different benchmark datasets. Our results showed that i4mC-GRU achieved area under the receiver operating characteristic curve values of 0.97 and 0.89 and area under the precision-recall curve values of 0.98 and 0.90 on the first and second benchmark datasets, respectively. Briefly, our method outperformed existing methods in predicting 4mC sites in mouse genomes. Also, we deployed i4mC-GRU as an online web server, supporting users in genomics studies.
Affiliation(s)
- Thanh-Hoang Nguyen-Vo
- School of Mathematics and Statistics, Victoria University of Wellington, Wellington 6140, New Zealand
- School of Innovation, Design and Technology, Wellington Institute of Technology, Wellington 5012, New Zealand
| | - Quang H. Trinh
- School of Information and Communication Technology, Hanoi University of Science and Technology, Hanoi 100000, Vietnam
| | - Loc Nguyen
- School of Mathematics and Statistics, Victoria University of Wellington, Wellington 6140, New Zealand
| | - Phuong-Uyen Nguyen-Hoang
- Computational Biology Center, International University - VNU HCMC, Ho Chi Minh City 700000, Vietnam
| | - Susanto Rahardja
- School of Marine Science and Technology, Northwestern Polytechnical University, Xi’an 710072, China
- Infocomm Technology Cluster, Singapore Institute of Technology, Singapore 138683, Singapore
| | - Binh P. Nguyen
- School of Mathematics and Statistics, Victoria University of Wellington, Wellington 6140, New Zealand
|
18
|
Li F, Liu S, Li K, Zhang Y, Duan M, Yao Z, Zhu G, Guo Y, Wang Y, Huang L, Zhou F. EpiTEAmDNA: Sequence feature representation via transfer learning and ensemble learning for identifying multiple DNA epigenetic modification types across species. Comput Biol Med 2023; 160:107030. [PMID: 37196456 DOI: 10.1016/j.compbiomed.2023.107030] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Received: 01/02/2023] [Revised: 04/21/2023] [Accepted: 05/10/2023] [Indexed: 05/19/2023]
Abstract
Methylation is a major DNA epigenetic modification that regulates biological processes without altering the DNA sequence, and multiple types of DNA methylation have been discovered, including 6mA, 5hmC, and 4mC. Multiple computational approaches have been developed to automatically identify DNA methylation residues using machine learning or deep learning algorithms. Machine learning (ML) based methods are difficult to transfer to other DNA methylation site prediction tasks using additional knowledge. Deep learning (DL) methods may facilitate the transfer of knowledge from similar tasks, but they are often ineffective on small datasets. This study proposes an integrated feature representation framework, EpiTEAmDNA, based on the strategies of transfer learning and ensemble learning, which is evaluated on multiple DNA methylation types across 15 species. EpiTEAmDNA integrates a convolutional neural network (CNN) and conventional machine learning methods, and performs better than the existing DL-based methods on small datasets when no additional knowledge is available. The experimental data suggest that the EpiTEAmDNA models may be further improved via transfer learning based on additional knowledge. The evaluation experiments on the independent test datasets also suggest that the proposed EpiTEAmDNA framework outperforms the existing models in most prediction tasks of the 3 DNA methylation types across 15 species. The source code, pre-trained global model, and the EpiTEAmDNA feature representation framework are freely available at http://www.healthinformaticslab.org/supp/.
Affiliation(s)
- Fei Li
- Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun, Jilin, 130012, China; College of Computer Science and Technology, Jilin University, Changchun, Jilin, 130012, China
| | - Shuai Liu
- Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun, Jilin, 130012, China; College of Computer Science and Technology, Jilin University, Changchun, Jilin, 130012, China
| | - Kewei Li
- Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun, Jilin, 130012, China; College of Computer Science and Technology, Jilin University, Changchun, Jilin, 130012, China
| | - Yaqi Zhang
- Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun, Jilin, 130012, China; College of Computer Science and Technology, Jilin University, Changchun, Jilin, 130012, China
| | - Meiyu Duan
- Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun, Jilin, 130012, China; College of Computer Science and Technology, Jilin University, Changchun, Jilin, 130012, China.
| | - Zhaomin Yao
- College of Medicine and Biological Information Engineering, Northeastern University, Shenyang, Liaoning, 110167, China
| | - Gancheng Zhu
- Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun, Jilin, 130012, China; College of Computer Science and Technology, Jilin University, Changchun, Jilin, 130012, China
| | - Yutong Guo
- College of Life Sciences, Jilin University, Changchun, Jilin, 130012, China
| | - Ying Wang
- Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun, Jilin, 130012, China; College of Computer Science and Technology, Jilin University, Changchun, Jilin, 130012, China
| | - Lan Huang
- Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun, Jilin, 130012, China; College of Computer Science and Technology, Jilin University, Changchun, Jilin, 130012, China
| | - Fengfeng Zhou
- Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun, Jilin, 130012, China; College of Computer Science and Technology, Jilin University, Changchun, Jilin, 130012, China.
|
19
|
Yu X, Ren J, Cui Y, Zeng R, Long H, Ma C. DRSN4mCPred: accurately predicting sites of DNA N4-methylcytosine using deep residual shrinkage network for diagnosis and treatment of gastrointestinal cancer in the precision medicine era. Front Med (Lausanne) 2023; 10:1187430. [PMID: 37215722 PMCID: PMC10192687 DOI: 10.3389/fmed.2023.1187430] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 03/16/2023] [Accepted: 04/05/2023] [Indexed: 05/24/2023] Open
Abstract
Introduction: DNA N4-methylcytosine (4mC) site levels were higher in patients with digestive system cancers, and the pathogenesis of digestive system cancers may also be related to changes in DNA 4mC levels. Identifying DNA 4mC sites is a very important step in analyzing biological function and predicting cancer. Extracting accurate features from DNA sequences is the key to establishing an effective prediction model for DNA 4mC sites. This study sought to develop a new predictive model, DRSN4mCPred, which aimed to improve the performance of DNA 4mC site prediction. Methods: The model adopted multi-scale channel attention to extract features and used attention feature fusion (AFF) to fuse features. In order to capture feature information more accurately and effectively, this model utilized a Deep Residual Shrinkage Network with Channel-Wise thresholds (DRSN-CW) to eliminate noise-related features and achieve a more precise feature representation, thereby distinguishing 4mC sites from non-4mC sites in DNA. Additionally, the predictive model incorporated an inverted residual block, a Multi-scale Channel Attention Module (MS-CAM), a Bi-directional Long Short Term Memory Network (Bi-LSTM), AFF, and DRSN-CW. Results and Discussion: The results indicated that the predictive model DRSN4mCPred had extremely good performance in predicting DNA 4mC sites across different species. This paper will potentially provide support for the diagnosis and treatment of gastrointestinal cancer based on artificial intelligence in the precision medicine era.
Affiliation(s)
- Xia Yu
- School of Information and Communication Engineering, Hainan University, Haikou, Hainan, China
- School of Information Science and Technology, Hainan Normal University, Haikou, Hainan, China
| | - Jia Ren
- Industrial Design School, Shandong University of ART and Design, Jinan, Shandong, China
| | - Yani Cui
- School of Information and Communication Engineering, Hainan University, Haikou, Hainan, China
| | - Rao Zeng
- School of Information Science and Technology, Hainan Normal University, Haikou, Hainan, China
| | - Haixia Long
- School of Information Science and Technology, Hainan Normal University, Haikou, Hainan, China
| | - Cuihua Ma
- School of Information Science and Technology, Hainan Normal University, Haikou, Hainan, China
|
20
|
Zhao S, Meng J, Wekesa JS, Luan Y. Identification of small open reading frames in plant lncRNA using class-imbalance learning. Comput Biol Med 2023; 157:106773. [PMID: 36924731 DOI: 10.1016/j.compbiomed.2023.106773] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/01/2023] [Revised: 02/21/2023] [Accepted: 03/09/2023] [Indexed: 03/12/2023]
Abstract
Recently, small open reading frames (sORFs) in long noncoding RNA (lncRNA) have been demonstrated to encode small peptides that can help study the mechanisms of growth and development in organisms. Since machine learning-based computational methods are less costly than biological experiments, they can be used to identify sORFs and provide a basis for biological experiments. However, few computational methods and data resources have been exploited for identifying sORFs in plant lncRNA. Besides, machine learning models produce underperforming classifiers when faced with a class-imbalance problem. In this study, SMOTE based on weighted cosine distance (WCDSMOTE), which enables interaction with feature selection, is put forward to synthesize minority class samples, and weighted edited nearest neighbor (WENN) is applied to clean up majority class samples. Together they form the hybrid sampling method WCDSMOTE-ENN, proposed to deal with imbalanced datasets with multi-angle features. A heterogeneous classifier ensemble is introduced to complete the classification task. The result is a novel computational method based on class-imbalance learning to identify sORFs with coding potential in plant lncRNA (sORFplnc). Experimental results show that sORFplnc outperforms existing computational methods in identifying sORFs with coding potential. We anticipate that the proposed work can be a reference for relevant research and contribute to agriculture and biomedicine.
Affiliation(s)
- Siyuan Zhao
- School of Computer Science and Technology, Dalian University of Technology, Dalian, Liaoning, 116024, China
| | - Jun Meng
- School of Computer Science and Technology, Dalian University of Technology, Dalian, Liaoning, 116024, China.
| | - Jael Sanyanda Wekesa
- Department of Information Technology, Jomo Kenyatta University of Agriculture and Technology, Nairobi, 62000-00200, Kenya
| | - Yushi Luan
- School of Bioengineering, Dalian University of Technology, Dalian, Liaoning, 116024, China
|
21
|
MSINGB: A Novel Computational Method Based on NGBoost for Identifying Microsatellite Instability Status from Tumor Mutation Annotation Data. Interdiscip Sci 2023; 15:100-110. [PMID: 36350503 DOI: 10.1007/s12539-022-00544-w] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 04/05/2022] [Revised: 10/19/2022] [Accepted: 10/22/2022] [Indexed: 11/11/2022]
Abstract
Microsatellite instability (MSI), a vital mutator phenotype caused by DNA mismatch repair deficiency, is frequently observed in several tumors. MSI is recognized as a critical molecular biomarker for diagnosis, prognosis, and therapeutic selection in several cancers. Identifying MSI status with current gold-standard methods based on experimental analysis is laborious, time-consuming, and costly. Although several computational methods based on machine learning have been proposed to identify MSI status, we need to further understand which machine learning model favors MSI identification and which feature subset is strongly related to MSI. On this basis, more effective machine learning-based methods can be developed to improve the performance of MSI status identification. In this work, we present MSINGB, an NGBoost-based method for identifying MSI status from tumor somatic mutation annotation data. MSINGB first evaluates the prediction performance of 11 popular machine learning algorithms and 9 deep learning models to identify MSI. Among the 20 models, NGBoost, a novel natural gradient boosting method, achieves the overall best performance. MSINGB then introduces two feature selection strategies to find a compact feature subset that is strongly related to MSI, and employs the SHAP approach to interpret how the selected features impact the model prediction. MSINGB achieves better prediction performance on both the tenfold cross-validation test and the independent test compared with state-of-the-art methods.
|
22
|
Kaur A, Chauhan APS, Aggarwal AK. Prediction of Enhancers in DNA Sequence Data using a Hybrid CNN-DLSTM Model. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2023; 20:1327-1336. [PMID: 35417351 DOI: 10.1109/tcbb.2022.3167090] [Citation(s) in RCA: 15] [Impact Index Per Article: 7.5] [Indexed: 05/04/2023]
Abstract
An enhancer is a distal cis-regulatory element that controls gene expression. Experimental identification of enhancer elements is time-consuming and expensive. Consequently, various inexpensive and fast deep learning-based methods have been developed for predicting enhancers and determining their strength. In this paper, we propose a two-stage deep learning-based framework leveraging DNA structural features, natural language processing, a convolutional neural network, and long short-term memory to accurately predict enhancer elements in genomics data. In the first stage, we extract features from DNA sequence data using three feature representation techniques: k-mer based feature extraction along with word2vec-based interpretation of underlying patterns, one-hot encoding, and the DNAshape technique. In the second stage, the strength of enhancers is predicted from the extracted features using a hybrid deep learning model. The method can adapt to datasets of varying sizes. Also, as the proposed model can capture long-range sequence patterns, the method remains robust to minor variations in the genomic sequence. The method outperforms other state-of-the-art methods at both stages in terms of prediction accuracy, specificity, Matthews correlation coefficient, and area under the ROC curve. In summary, the proposed method is a reliable method for enhancer prediction.
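Two of the feature representations named in the abstract above, one-hot encoding and k-mer extraction, are simple enough to sketch directly. The following is a minimal illustration only: the function names `one_hot` and `kmers`, the A/C/G/T column order, and the all-zero row for ambiguous bases are assumptions made here, not details from the cited paper.

```python
# Column order for the one-hot vectors (an assumed convention).
BASES = "ACGT"


def one_hot(sequence: str) -> list[list[int]]:
    """Encode each nucleotide as a 4-element indicator vector.

    Characters outside A/C/G/T (for example N) map to an all-zero row.
    """
    return [[1 if base == b else 0 for b in BASES] for base in sequence.upper()]


def kmers(sequence: str, k: int = 3) -> list[str]:
    """Return the overlapping k-mers of a sequence, sliding one base at a time."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


if __name__ == "__main__":
    print(one_hot("ACGN"))
    print(kmers("ACGTA", k=3))
```

The k-mer list is the kind of "sentence of words" that a word2vec-style embedding would consume, while the one-hot matrix can be fed to a convolutional layer as is.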
|
23
|
Ju Z, Wang SY. Prediction of lysine HMGylation sites using multiple feature extraction and fuzzy support vector machine. Anal Biochem 2023; 663:115032. [PMID: 36592921 DOI: 10.1016/j.ab.2022.115032] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/30/2022] [Accepted: 12/25/2022] [Indexed: 12/31/2022]
Abstract
Protein 3-hydroxyl-3-methylglutarylation (HMGylation) is a newly discovered lysine acylation modification in the mitochondrion. The accurate identification of HMGylation sites is the premise and key to further exploring the molecular mechanisms of HMGylation. In this study, a novel bioinformatics tool named HMGPred is developed to predict HMGylation sites. Multiple effective features, including amino acid composition, amino acid factors, binary encoding, and the composition of k-spaced amino acid pairs, are integrated to encode HMGylation sites. F-score feature ranking with incremental feature selection is used to eliminate redundant features. Moreover, a fuzzy support vector machine algorithm is used to reduce the influence of noise by assigning different fuzzy membership degrees to different samples. As illustrated by 10-fold cross-validation, HMGPred achieves a satisfactory performance with an area under the receiver operating characteristic curve of 0.9110. Feature analysis indicates that some k-spaced amino acid pair features, such as 'KxxxT' and 'DxxxE', play a critical role in the prediction of HMGylation sites. The results of prediction and analysis might be helpful for investigating the mechanisms of HMGylation. For the convenience of experimental researchers, HMGPred is implemented as a web server at http://123.206.31.171/HMGPred/.
Affiliation(s)
- Zhe Ju
- College of Science, Shenyang Aerospace University, 110136, People's Republic of China.
| | - Shi-Yun Wang
- College of Science, Shenyang Aerospace University, 110136, People's Republic of China
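The "composition of k-spaced amino acid pairs" feature mentioned in the abstract above can be illustrated with a short sketch. It assumes that a pattern such as 'KxxxT' denotes K and T separated by three arbitrary residues (k = 3), and it normalizes by the number of valid pairs found; the function name and that normalization choice are illustrative assumptions, not taken from HMGPred.

```python
from itertools import product

# The 20 standard amino acids, one-letter codes.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def k_spaced_pair_composition(peptide: str, k: int) -> dict[str, float]:
    """Frequency of every residue pair separated by exactly k positions.

    Returns 400 features keyed by the two-letter pair, e.g. "KT" for KxxxT
    when k = 3. Pairs containing non-standard characters are skipped.
    """
    counts = {a + b: 0 for a, b in product(AMINO_ACIDS, repeat=2)}
    total = 0
    for i in range(len(peptide) - k - 1):
        pair = peptide[i] + peptide[i + k + 1]
        if pair in counts:
            counts[pair] += 1
            total += 1
    if total == 0:
        return {pair: 0.0 for pair in counts}
    return {pair: n / total for pair, n in counts.items()}


if __name__ == "__main__":
    # Hypothetical peptide fragment, for illustration only.
    features = k_spaced_pair_composition("KAAATDAAAE", k=3)
    print(features["KT"], features["DE"])
```

Computing this vector for several values of k and concatenating the results gives a fixed-length encoding of a peptide window, which is the kind of input a feature-ranking step and an SVM can work with.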
|
24
|
A deep multiple kernel learning-based higher-order fuzzy inference system for identifying DNA N4-methylcytosine sites. Inf Sci (N Y) 2023. [DOI: 10.1016/j.ins.2023.01.149] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 02/12/2023]
|
25
|
Nabeel Asim M, Ali Ibrahim M, Fazeel A, Dengel A, Ahmed S. DNA-MP: a generalized DNA modifications predictor for multiple species based on powerful sequence encoding method. Brief Bioinform 2023; 24:6931721. [PMID: 36528802 DOI: 10.1093/bib/bbac546] [Citation(s) in RCA: 9] [Impact Index Per Article: 4.5] [Received: 06/14/2022] [Revised: 11/06/2022] [Accepted: 11/12/2022] [Indexed: 12/23/2022] Open
Abstract
Accurate prediction of deoxyribonucleic acid (DNA) modifications is essential to explore and discern the process of cell differentiation, gene expression and epigenetic regulation. Several computational approaches have been proposed for type-specific DNA modification prediction. Two recent generalized computational predictors are capable of detecting three different types of DNA modifications; however, both type-specific and generalized modification predictors show limited performance across multiple species, mainly due to the use of ineffective sequence encoding methods. This paper presents a generalized computational approach, "DNA-MP", that predicts three different DNA modifications more precisely across multiple species. The proposed DNA-MP approach uses a powerful encoding method, "position specific nucleotides occurrence based on modification and non-modification class densities normalized difference" (POCD-ND), to generate statistical representations of DNA sequences, and a deep forest classifier for modification prediction. The POCD-ND encoder generates statistical representations by extracting position-specific distributional information of nucleotides in the DNA sequences. We perform a comprehensive intrinsic and extrinsic evaluation of the proposed encoder and compare its performance with the 32 most widely used encoding methods on 17 benchmark DNA modification prediction datasets of 12 different species using 10 different machine learning classifiers. Overall, with all classifiers, the proposed POCD-ND encoder outperforms the 32 existing encoders. Furthermore, combined over 5-fold cross-validation benchmark datasets and independent test sets, the proposed DNA-MP predictor outperforms state-of-the-art type-specific and generalized modification predictors by an average accuracy of 7% across 4mC datasets, 1.35% across 5hmC datasets and 10% for 6mA datasets. To facilitate the scientific community, the DNA-MP web application is available at https://sds_genetic_analysis.opendfki.de/DNA_Modifications/.
Affiliation(s)
- Muhammad Nabeel Asim
- Department of Computer Science, Technical University of Kaiserslautern, Kaiserslautern 67663, Germany.,German Research Center for Artificial Intelligence GmbH, Kaiserslautern 67663, Germany
| | - Muhammad Ali Ibrahim
- Department of Computer Science, Technical University of Kaiserslautern, Kaiserslautern 67663, Germany.,German Research Center for Artificial Intelligence GmbH, Kaiserslautern 67663, Germany
| | - Ahtisham Fazeel
- Department of Computer Science, Technical University of Kaiserslautern, Kaiserslautern 67663, Germany.,German Research Center for Artificial Intelligence GmbH, Kaiserslautern 67663, Germany
| | - Andreas Dengel
- Department of Computer Science, Technical University of Kaiserslautern, Kaiserslautern 67663, Germany.,German Research Center for Artificial Intelligence GmbH, Kaiserslautern 67663, Germany
| | - Sheraz Ahmed
- German Research Center for Artificial Intelligence GmbH, Kaiserslautern 67663, Germany
| |
|
26
|
Ding Y, He W, Tang J, Zou Q, Guo F. Laplacian Regularized Sparse Representation Based Classifier for Identifying DNA N4-Methylcytosine Sites via L2,1/2-Matrix Norm. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2023; 20:500-511. [PMID: 34882559 DOI: 10.1109/tcbb.2021.3133309] [Citation(s) in RCA: 11] [Impact Index Per Article: 5.5] [Indexed: 05/04/2023]
Abstract
N4-methylcytosine (4mC) is one of the important epigenetic modifications in DNA sequences. Detecting 4mC sites experimentally is time-consuming, and computational methods based on machine learning have provided effective help for identifying 4mC. To further improve prediction performance, we propose a Laplacian Regularized Sparse Representation based Classifier with L2,1/2-matrix norm (LapRSRC). We also use the kernel trick to derive the kernel LapRSRC for nonlinear modeling. Matrix factorization is employed to solve for the sparse representation coefficients of all test samples in the training set, and an efficient iterative algorithm is proposed to solve the objective function. We evaluate our model on six benchmark datasets of 4mC and eight UCI datasets. The results show that the performance of our method is better or comparable.
|
27
|
Ao C, Jiao S, Wang Y, Yu L, Zou Q. Biological Sequence Classification: A Review on Data and General Methods. RESEARCH (WASHINGTON, D.C.) 2022; 2022:0011. [PMID: 39285948 PMCID: PMC11404319 DOI: 10.34133/research.0011] [Citation(s) in RCA: 47] [Impact Index Per Article: 15.7] [Received: 07/25/2022] [Accepted: 10/25/2022] [Indexed: 09/19/2024]
Abstract
With the rapid development of biotechnology, the number of biological sequences has grown exponentially. The continuous expansion of biological sequence data promotes the application of machine learning in biological sequences to construct predictive models for mining biological sequence information. There are many branches of biological sequence classification research. In this review, we mainly focus on the function and modification classification of biological sequences based on machine learning. Sequence-based prediction and analysis are the basic tasks to understand the biological functions of DNA, RNA, proteins, and peptides. However, there are hundreds of classification models developed for biological sequences, and the quite varied specific methods seem dizzying at first glance. Here, we aim to establish a long-term support website (http://lab.malab.cn/~acy/BioseqData/home.html), which provides readers with detailed information on the classification method and download links to relevant datasets. We briefly introduce the steps to build an effective model framework for biological sequence data. In addition, a brief introduction to single-cell sequencing data analysis methods and applications in biology is also included. Finally, we discuss the current challenges and future perspectives of biological sequence classification research.
Affiliation(s)
- Chunyan Ao
- School of Computer Science and Technology, Xidian University, Xi'an, China
- Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou, China
- Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, China
| | - Shihu Jiao
- Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou, China
| | - Yansu Wang
- Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, China
| | - Liang Yu
- School of Computer Science and Technology, Xidian University, Xi'an, China
| | - Quan Zou
- Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou, China
- Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, China
| |
|
28
|
Accurate Prediction of Anti-hypertensive Peptides Based on Convolutional Neural Network and Gated Recurrent Unit. Interdiscip Sci 2022; 14:879-894. [PMID: 35474167 DOI: 10.1007/s12539-022-00521-3] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Received: 12/05/2021] [Revised: 03/30/2022] [Accepted: 04/06/2022] [Indexed: 12/30/2022]
Abstract
Hypertension (HT) is a common disease and one of the major causes of cardiovascular disease. High blood pressure can cause several conditions, including impairment of heart and kidney function, cerebral hemorrhage and myocardial infarction. Due to the limitations of laboratory methods, bioactive peptides for the treatment of HT take a long time to identify. Therefore, the identification of anti-hypertensive peptides (AHTPs) is of great immediate significance. With the prevalence of machine learning, it has been suggested as a supplementary method for AHTP classification. We therefore develop a new model to identify AHTPs based on multiple features and deep learning. The deep model is constructed by combining a convolutional neural network (CNN) and a gated recurrent unit (GRU). The convolution structure is used to reduce the feature dimension and running time. The data processed by the CNN are input into the recurrent GRU structure, and important information is filtered out through the reset gate and update gate. Finally, the output layer adopts a sigmoid activation function. For feature extraction, we use Kmer, the deviation between the dipeptide frequency and the expected mean (DDE), encoding based on grouped weight (EBGW), enhanced grouped amino acid composition (EGAAC) and dipeptide binary profile and frequency (DBPF). Kmer, DDE, EBGW and EGAAC are widely used in the field of protein research. DBPF is a new feature representation method designed by us. It maps dipeptides to binary numbers and produces a binary coding file and a frequency file. These features are then concatenated and input into our proposed model for prediction and analysis. In a tenfold cross-validation test, this model is more competitive than previous methods, with accuracies of 96.23% and 99.10%, respectively. These results represent a considerable improvement over previous methods and show that the combination of convolution and recurrent structure has a positive impact on the classification of AHTPs. The results show that this method is a feasible, efficient and competitive sequence analysis tool for AHTPs. We also provide a user-friendly online prediction tool, freely accessible at http://ahtps.zhanglab.site/.
|
29
|
Wang H, Li H, Gao W, Xie J. PrUb-EL: A hybrid framework based on deep learning for identifying ubiquitination sites in Arabidopsis thaliana using ensemble learning strategy. Anal Biochem 2022; 658:114935. [PMID: 36206844 DOI: 10.1016/j.ab.2022.114935] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/08/2022] [Revised: 09/25/2022] [Accepted: 09/26/2022] [Indexed: 12/30/2022]
Abstract
Identification of ubiquitination sites is central to many biological experiments. Ubiquitination is a kind of post-translational protein modification (PTM). It is a key mechanism for increasing protein diversity and plays a vital role in regulating cell function. In recent years, many models have been developed to predict ubiquitination sites in humans, mice and yeast. However, few studies have predicted ubiquitination sites in Arabidopsis thaliana. In view of this, a deep network model named PrUb-EL is proposed to predict ubiquitination sites in Arabidopsis thaliana. Firstly, six features based on the protein sequence are extracted with amino acid index database (AAindex), dipeptide deviates from the expected mean (DDE), dipeptide composition (DPC), blocks substitution matrix (BLOSUM62), enhanced amino acid composition (EAAC) and binary encoding. Secondly, the synthetic minority over-sampling technique (SMOTE) is utilized to process the imbalanced data set. Then a new classifier named DG is presented, which includes Dense block, Residual block and Gated recurrent unit (GRU) block. Finally, each of six feature extraction methods is integrated into the DG model, and the ensemble learning strategy is used to gain the final prediction result. Experimental results show that PrUb-EL has good predictive ability with the accuracy (ACC) and area under the ROC curve (auROC) values of 91.00% and 97.70% using 5-fold cross-validation, respectively. Note that the values of ACC and auROC are 88.58% and 96.09% in the independent test, respectively. Compared with previous studies, our model has significantly improved performance thus it is an excellent method for identifying ubiquitination sites in Arabidopsis thaliana. The datasets and code used for the article are available at https://github.com/Tom-Wangy/PreUb-EL.git.
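The SMOTE step mentioned in this abstract balances a data set by creating synthetic minority samples on the line segment between a minority point and one of its nearest minority neighbours. The sketch below shows that interpolation idea only; the function name `smote_sample`, the Euclidean distance, and the parameter defaults are assumptions made for illustration, not the configuration used by PrUb-EL.

```python
import random


def smote_sample(minority, k=2, n_new=4, seed=0):
    """Generate n_new synthetic points from a list of minority-class vectors.

    Requires at least two minority samples.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base point (excluding itself),
        # ranked by squared Euclidean distance.
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)),
        )[:k]
        neighbour = rng.choice(neighbours)
        lam = rng.random()  # interpolation factor in [0, 1)
        synthetic.append([a + lam * (b - a) for a, b in zip(base, neighbour)])
    return synthetic


minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
for point in smote_sample(minority):
    print(point)
```

Because every synthetic point is a convex combination of two real minority samples, it always lies inside the region already spanned by the minority class.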
Affiliation(s)
- Houqiang Wang
- School of Mathematics and Statistics, Xidian University, Xi'an, 710071, PR China
| | - Hong Li
- School of Mathematics and Statistics, Xidian University, Xi'an, 710071, PR China.
| | - Weifeng Gao
- School of Mathematics and Statistics, Xidian University, Xi'an, 710071, PR China
| | - Jin Xie
- School of Mathematics and Statistics, Xidian University, Xi'an, 710071, PR China
| |
|
30
|
Zhou J, Wang X, Wei Z, Meng J, Huang D. 4acCPred: Weakly supervised prediction of N4-acetyldeoxycytosine DNA modification from sequences. MOLECULAR THERAPY - NUCLEIC ACIDS 2022; 30:337-345. [DOI: 10.1016/j.omtn.2022.10.004] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/09/2022] [Accepted: 10/12/2022] [Indexed: 11/06/2022]
|
31
|
Liu C, Song J, Ogata H, Akutsu T. MSNet-4mC: learning effective multi-scale representations for identifying DNA N4-methylcytosine sites. Bioinformatics 2022; 38:5160-5167. [PMID: 36205602 DOI: 10.1093/bioinformatics/btac671] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.7] [Received: 07/20/2022] [Revised: 09/09/2022] [Accepted: 10/05/2022] [Indexed: 12/24/2022] Open
Abstract
MOTIVATION N4-methylcytosine (4mC) is an essential kind of epigenetic modification that regulates a wide range of biological processes. However, experimental methods for detecting 4mC sites are time-consuming and labor-intensive. As an alternative, computational methods that are capable of automatically identifying 4mC with data analysis techniques become a reasonable option. A major challenge is how to develop effective methods to fully exploit the complex interactions within the DNA sequences to improve the predictive capability. RESULTS In this work, we propose MSNet-4mC, a lightweight neural network building upon convolutional operations with multi-scale receptive fields to perceive cross-element relationships over both short and long ranges of given DNA sequences. With strong imbalances in the number of candidates in different species in mind, we compute and apply class weights in the cross-entropy loss to balance the training process. Extensive benchmarking experiments show that our method achieves a significant performance improvement and outperforms other state-of-the-art methods. AVAILABILITY AND IMPLEMENTATION The source code and models are freely available for download at https://github.com/LIU-CT/MSNet-4mC, implemented in Python and supported on Linux and Windows. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
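The abstract describes applying class weights in the cross-entropy loss to counter class imbalance. A minimal sketch of that idea follows, assuming the common inverse-frequency heuristic n / (n_classes * count), which is also what scikit-learn's "balanced" mode computes; the abstract does not state which weighting formula MSNet-4mC uses, and both function names are hypothetical.

```python
import math
from collections import Counter


def balanced_class_weights(labels):
    """Inverse-frequency weights: rarer classes get larger weights."""
    counts = Counter(labels)
    n, c = len(labels), len(counts)
    return {cls: n / (c * cnt) for cls, cnt in counts.items()}


def weighted_cross_entropy(probs, labels, weights):
    """Weighted mean of -log p(true class).

    probs[i][cls] is the predicted probability of class cls for sample i.
    """
    total = sum(weights[y] * -math.log(p[y]) for p, y in zip(probs, labels))
    norm = sum(weights[y] for y in labels)
    return total / norm


labels = [0, 0, 0, 1]
weights = balanced_class_weights(labels)
probs = [[0.9, 0.1], [0.8, 0.2], [0.7, 0.3], [0.4, 0.6]]
print(weights, weighted_cross_entropy(probs, labels, weights))
```

With these weights the single positive sample contributes as much to the loss as the three negatives combined, so the optimiser cannot lower the loss simply by favouring the majority class.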
Affiliation(s)
- Chunting Liu
- Department of Intelligence Science and Technology, Graduate School of Informatics, Kyoto University, Kyoto, Kyoto 606-8501, Japan.,Bioinformatics Center, Institute for Chemical Research, Kyoto University, Uji, Kyoto 611-0011, Japan
| | - Jiangning Song
- Monash Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Melbourne, VIC 3800, Australia.,Monash Data Futures Institute, Monash University, Melbourne, VIC 3800, Australia
| | - Hiroyuki Ogata
- Bioinformatics Center, Institute for Chemical Research, Kyoto University, Uji, Kyoto 611-0011, Japan
| | - Tatsuya Akutsu
- Department of Intelligence Science and Technology, Graduate School of Informatics, Kyoto University, Kyoto, Kyoto 606-8501, Japan.,Bioinformatics Center, Institute for Chemical Research, Kyoto University, Uji, Kyoto 611-0011, Japan
| |
|
32
|
Yan W, Li Z, Pian C, Wu Y. PlantBind: an attention-based multi-label neural network for predicting plant transcription factor binding sites. Brief Bioinform 2022; 23:6713513. [PMID: 36155619 DOI: 10.1093/bib/bbac425] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 06/18/2022] [Revised: 08/29/2022] [Accepted: 08/31/2022] [Indexed: 12/14/2022] Open
Abstract
Identification of transcription factor binding sites (TFBSs) is essential to understanding of gene regulation. Designing computational models for accurate prediction of TFBSs is crucial because it is not feasible to experimentally assay all transcription factors (TFs) in all sequenced eukaryotic genomes. Although many methods have been proposed for the identification of TFBSs in humans, methods designed for plants are comparatively underdeveloped. Here, we present PlantBind, a method for integrated prediction and interpretation of TFBSs based on DNA sequences and DNA shape profiles. Built on an attention-based multi-label deep learning framework, PlantBind not only simultaneously predicts the potential binding sites of 315 TFs, but also identifies the motifs bound by transcription factors. During the training process, this model revealed a strong similarity among TF family members with respect to target binding sequences. Trans-species prediction performance using four Zea mays TFs demonstrated the suitability of this model for transfer learning. Overall, this study provides an effective solution for identifying plant TFBSs, which will promote greater understanding of transcriptional regulatory mechanisms in plants.
Affiliation(s)
| | - Zutan Li
- Nanjing Agricultural University
| | - Cong Pian
- College of Sciences at Nanjing Agricultural University
| | - Yufeng Wu
- State Key Laboratory for Crop Genetics and Germplasm Enhancement, Bioinformatics Center, College of Agriculture, Academy for Advanced Interdisciplinary Studies at Nanjing Agricultural University
| |
|
33
|
Chen M, Zhang X, Ju Y, Liu Q, Ding Y. iPseU-TWSVM: Identification of RNA pseudouridine sites based on TWSVM. MATHEMATICAL BIOSCIENCES AND ENGINEERING : MBE 2022; 19:13829-13850. [PMID: 36654069 DOI: 10.3934/mbe.2022644] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 06/17/2023]
Abstract
Biological sequence analysis is an important basic research work in the field of bioinformatics. With the explosive growth of data, machine learning methods play an increasingly important role in biological sequence analysis. By constructing a classifier for prediction, the input sequence feature vector is predicted and evaluated, and the knowledge of gene structure, function and evolution is obtained from a large amount of sequence information, which lays a foundation for researchers to carry out in-depth research. At present, many machine learning methods have been applied to biological sequence analysis such as RNA gene recognition and protein secondary structure prediction. As a biological sequence, RNA plays an important biological role in the encoding, decoding, regulation and expression of genes. The analysis of RNA data is currently carried out from the aspects of structure and function, including secondary structure prediction, non-coding RNA identification and functional site prediction. Pseudouridine (Ψ) is the most widespread and rich RNA modification and has been discovered in a variety of RNAs. It is highly essential for the study of related functional mechanisms and disease diagnosis to accurately identify Ψ sites in RNA sequences. At present, several computational approaches have been suggested as an alternative to experimental methods to detect Ψ sites, but there is still potential for improvement in their performance. In this study, we present a model based on twin support vector machine (TWSVM) for Ψ site identification. The model combines a variety of feature representation techniques and uses the max-relevance and min-redundancy methods to obtain the optimum feature subset for training. The independent testing accuracy is improved by 3.4% in comparison to current advanced Ψ site predictors. The outcomes demonstrate that our model has better generalization performance and improves the accuracy of Ψ site identification. iPseU-TWSVM can be a helpful tool to identify Ψ sites.
Affiliation(s)
- Mingshuai Chen
- Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, China
- Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou, Zhejiang, China
| | - Xin Zhang
- Beidahuang Industry Group General Hospital, Harbin, China
| | - Ying Ju
- School of Informatics, Xiamen University, Xiamen, China
| | - Qing Liu
- Department of Anesthesiology, Hospital (T.C.M) Affiliated to Southwest Medical University, Luzhou, China
| | - Yijie Ding
- Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou, Zhejiang, China
| |
|
34
|
Li X, Zhang S, Shi H. An improved residual network using deep fusion for identifying RNA 5-methylcytosine sites. Bioinformatics 2022; 38:4271-4277. [PMID: 35866985 DOI: 10.1093/bioinformatics/btac532] [Citation(s) in RCA: 7] [Impact Index Per Article: 2.3] [Received: 05/11/2022] [Revised: 06/30/2022] [Accepted: 07/21/2022] [Indexed: 12/24/2022] Open
Abstract
MOTIVATION 5-Methylcytosine (m5C) is a crucial post-transcriptional modification. With the development of technology, it has been found widely in various RNAs. Numerous studies have indicated that m5C plays an essential role in various activities of organisms, such as tRNA recognition, stabilization of RNA structure and RNA metabolism. Traditional identification by wet-lab experiments is costly and time-consuming. Therefore, computational models are commonly used to identify m5C sites. Given the computing advantages of deep learning, it is feasible to construct the predictive model with deep learning algorithms. RESULTS In this study, we construct a model to identify m5C based on a deep fusion approach with an improved residual network. First, sequence features are extracted from the RNA sequences using Kmer, K-tuple nucleotide frequency component (KNFC), Pseudo dinucleotide composition (PseDNC) and Physical and chemical property (PCP). Kmer and KNFC extract information from a statistical point of view. PseDNC and PCP extract information from the physicochemical properties of RNA sequences. Then, the two parts of information are fused into new features using bidirectional long short-term memory and attention mechanisms, respectively. The fused features are then fed into the improved residual network for classification. Finally, 10-fold cross-validation and independent set testing are used to verify the credibility of the model. The results show that the accuracy reaches 91.87%, 95.55%, 92.27% and 95.60% on the training sets and independent test sets of Arabidopsis thaliana and M. musculus, respectively. This is a considerable improvement compared to previous studies and demonstrates the robust performance of our model. AVAILABILITY AND IMPLEMENTATION The data and code related to the study are available at https://github.com/alivelxj/m5c-DFRESG.
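The Kmer feature named in this abstract is, in its usual form, the normalised frequency of every length-k substring of the sequence. The sketch below illustrates that statistical encoding under the assumption of an A/C/G/U alphabet and a fixed lexicographic ordering of k-mers; the function name `kmer_frequencies` is hypothetical and the exact variant used in the paper may differ.

```python
from itertools import product


def kmer_frequencies(seq, k, alphabet="ACGU"):
    """Return a fixed-length vector of k-mer frequencies (4**k entries)."""
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    counted = 0
    # Slide a window of width k over the sequence.
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if kmer in counts:  # ignore windows with ambiguous symbols such as N
            counts[kmer] += 1
            counted += 1
    if counted == 0:
        return [0.0] * len(kmers)
    return [counts[kmer] / counted for kmer in kmers]


vector = kmer_frequencies("ACGUAC", k=2)
print(len(vector), vector[1])  # vector[1] is the frequency of 'AC'
```

Because the vector length depends only on k and the alphabet, sequences of different lengths map to comparable inputs for a downstream classifier.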
Affiliation(s)
- Xinjie Li
- School of Mathematics and Statistics, Xidian University, Xi'an 710071, P. R. China
| | - Shengli Zhang
- School of Mathematics and Statistics, Xidian University, Xi'an 710071, P. R. China
| | - Hongyan Shi
- School of Mathematics and Statistics, Xidian University, Xi'an 710071, P. R. China
| |
|
35
|
Identification of DNA-binding proteins via Multi-view LSSVM with independence criterion. Methods 2022; 207:29-37. [PMID: 36087888 DOI: 10.1016/j.ymeth.2022.08.015] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/16/2022] [Revised: 08/06/2022] [Accepted: 08/25/2022] [Indexed: 11/24/2022] Open
Abstract
DNA-binding proteins actively participate in life activities such as DNA replication, recombination, gene expression and regulation, and play a prominent role in these processes. As more DNA-binding proteins continue to be discovered, it is imperative to design an efficient and accurate identification tool. Considering that traditional experimental techniques are time-consuming and expensive, and that computational methods based on structural information suffer from an insufficient number of samples, we propose a machine learning algorithm based on sequence information to identify DNA-binding proteins, named multi-view Least Squares Support Vector Machine via Hilbert-Schmidt Independence Criterion (multi-view LSSVM via HSIC). This method takes 6 feature sets as multi-view input and trains each single view with the LSSVM algorithm. We then integrate HSIC into LSSVM as a regularization term to reduce the dependence between views and explore the complementary information of multiple views. Subsequently, we train and coordinate the submodels and finally combine them in weighted form to obtain the final prediction model. On the training set PDB1075, the prediction results of our model are better than those of most existing methods. Independent tests are conducted on the datasets PDB186 and PDB2272, where the prediction accuracy is 85.5% and 79.36%, respectively. This result exceeds the current state-of-the-art methods, which shows that the multi-view LSSVM via HSIC can be used as an efficient predictor.
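The Hilbert-Schmidt Independence Criterion used here as a regularization term has a standard empirical estimator, tr(KHLH) / (n - 1)^2, where K and L are Gram matrices of the two views and H = I - (1/n)11^T is the centering matrix. The sketch below computes only that statistic, assuming a linear kernel for brevity; the function names are hypothetical, and it does not reproduce how the paper couples HSIC with LSSVM training.

```python
def linear_gram(X):
    """Gram matrix of a list of feature vectors under a linear kernel."""
    return [[sum(a * b for a, b in zip(x, y)) for y in X] for x in X]


def center(K):
    """Double-center a Gram matrix, i.e. compute H K H."""
    n = len(K)
    row = [sum(r) / n for r in K]
    col = [sum(K[i][j] for i in range(n)) / n for j in range(n)]
    grand = sum(row) / n
    return [[K[i][j] - row[i] - col[j] + grand for j in range(n)]
            for i in range(n)]


def hsic(X, Y):
    """Empirical HSIC between two views of the same n samples."""
    n = len(X)
    Kc = center(linear_gram(X))
    L = linear_gram(Y)
    # tr(K H L H) = tr((H K H) L) by cyclicity of the trace.
    trace = sum(Kc[i][j] * L[j][i] for i in range(n) for j in range(n))
    return trace / (n - 1) ** 2


view_a = [[1.0], [2.0], [3.0], [4.0]]
view_b = [[5.0], [5.0], [5.0], [5.0]]
print(hsic(view_a, view_a), hsic(view_a, view_b))
```

A value near zero indicates that the two views carry little shared (here, linear) dependence, which is the property a multi-view regularizer of this kind rewards.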
|
36
|
PSP-PJMI: An innovative feature representation algorithm for identifying DNA N4-methylcytosine sites. Inf Sci (N Y) 2022. [DOI: 10.1016/j.ins.2022.05.060] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 12/25/2022]
|
37
|
Abbas Z, Tayara H, Chong KT. ZayyuNet - A Unified Deep Learning Model for the Identification of Epigenetic Modifications Using Raw Genomic Sequences. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2022; 19:2533-2544. [PMID: 34038365 DOI: 10.1109/tcbb.2021.3083789] [Citation(s) in RCA: 14] [Impact Index Per Article: 4.7] [Indexed: 06/12/2023]
Abstract
Epigenetic modifications have a vital role in gene expression and are linked to cellular processes such as differentiation, development, and tumorigenesis. Thus, the availability of reliable and accurate methods for identifying and defining these changes facilitates greater insights into the regulatory mechanisms that rely on epigenetic modifications. The current experimental methods provide a genome-wide identification of epigenetic modifications; however, they are expensive and time-consuming. To date, several machine learning methods have been proposed for identifying modifications such as DNA N6-Methyladenine (6mA), RNA N6-Methyladenosine (m6A), DNA N4-methylcytosine (4mC), and RNA pseudouridine (Ψ). However, these methods are task-specific computational tools and require different encoding representations of DNA/RNA sequences. In this study, we propose a unified deep learning model, called ZayyuNet, for the identification of various epigenetic modifications. The proposed model is based on an architecture called SpinalNet, which is inspired by the human somatosensory system and can efficiently receive large inputs and achieve better performance. The proposed model has been evaluated on various epigenetic modifications such as 6mA, m6A, 4mC, and Ψ, and the results achieved outperform current state-of-the-art models. A user-friendly web server has been built and made freely available at http://nsclbio.jbnu.ac.kr/tools/ZayyuNet/.
|
38
|
Liang Y, Wu Y, Zhang Z, Liu N, Peng J, Tang J. Hyb4mC: a hybrid DNA2vec-based model for DNA N4-methylcytosine sites prediction. BMC Bioinformatics 2022; 23:258. [PMID: 35768759 PMCID: PMC9241225 DOI: 10.1186/s12859-022-04789-6] [Citation(s) in RCA: 13] [Impact Index Per Article: 4.3] [Received: 12/10/2021] [Accepted: 06/10/2022] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND DNA N4-methylcytosine is part of the restriction-modification system, which works by regulating some biological processes, for example, the initiation of DNA replication, mismatch repair and inactivation of transposons. However, using experimental methods to detect 4mC sites is time-consuming and expensive. Besides, considering the huge differences in the number of 4mC samples among different species, it is challenging to achieve a robust multi-species 4mC site prediction performance. Hence, it is of great significance to develop effective computational tools to identify 4mC sites. RESULTS This work proposes a flexible deep learning-based framework to predict 4mC sites, called Hyb4mC. Hyb4mC adopts the DNA2vec method for sequence embedding, which captures more efficient and comprehensive information compared with the sequence-based feature method. Then, two different subnets are used for further analysis: Hyb_Caps and Hyb_Conv. Hyb_Caps is composed of a capsule neural network and can generalize from fewer samples. Hyb_Conv combines the attention mechanism with a text convolutional neural network for further feature learning. CONCLUSIONS Extensive benchmark tests have shown that Hyb4mC can significantly enhance the performance of predicting 4mC sites compared with the recently proposed methods.
Affiliation(s)
- Ying Liang
- College of Computer and Information Engineering, Jiangxi Agricultural University, Nanchang, China.
| | - Yanan Wu
- College of Computer and Information Engineering, Jiangxi Agricultural University, Nanchang, China
| | - Zequn Zhang
- College of Computer and Information Engineering, Jiangxi Agricultural University, Nanchang, China
| | - Niannian Liu
- College of Computer and Information Engineering, Jiangxi Agricultural University, Nanchang, China
| | - Jun Peng
- College of Computer and Information Engineering, Jiangxi Agricultural University, Nanchang, China
| | - Jianjun Tang
- College of Computer and Information Engineering, Jiangxi Agricultural University, Nanchang, China
| |
|
39
|
EMDLP: Ensemble multiscale deep learning model for RNA methylation site prediction. BMC Bioinformatics 2022; 23:221. [PMID: 35676633 PMCID: PMC9178860 DOI: 10.1186/s12859-022-04756-1] [Citation(s) in RCA: 19] [Impact Index Per Article: 6.3] [Received: 01/22/2022] [Accepted: 05/27/2022] [Indexed: 01/27/2023] Open
Abstract
BACKGROUND Recent research suggests that epitranscriptome regulation through post-transcriptional RNA modifications is essential for all kinds of RNA. Accurate identification of RNA modifications is vital for understanding their functions and regulatory mechanisms. However, traditional experimental methods for identifying RNA modification sites are relatively complicated, time-consuming, and laborious. Machine learning approaches have been applied to RNA sequence feature extraction and classification, and may supplement experimental approaches more efficiently. Recently, convolutional neural networks (CNN) and long short-term memory (LSTM) networks have achieved good results in modification site prediction owing to their strength in representation learning. However, a CNN can learn local responses from spatial data but cannot learn sequential correlations, whereas an LSTM is specialized for sequential modeling and can access contextual representations but lacks the spatial feature extraction of a CNN. For these reasons, there is strong motivation to construct a prediction framework using natural language processing (NLP) and deep learning (DL).
RESULTS This study presents an ensemble multiscale deep learning predictor (EMDLP) to identify RNA methylation sites using NLP and DL. It combines dilated convolution and bidirectional LSTM (BiLSTM), which helps to take better advantage of local and global information for site prediction. The first step of EMDLP is to represent the RNA sequences in an NLP way. Three encodings are adopted to decipher sites from the viewpoints of local and global information: RNA word embedding, one-hot encoding, and RGloVe, an improved method for learning word vector representations based on GloVe. Then, a dilated convolutional bidirectional LSTM network (DCB) model is constructed, with a dilated convolutional neural network (DCNN) followed by a BiLSTM, to extract potentially contributing features for methylation site prediction. Finally, the predictions from the three encoding methods are integrated by soft voting to obtain better predictive performance. Experimental results on m1A and m6A show that EMDLP achieves an area under the receiver operating characteristic curve (AUROC) of 95.56% and 85.24%, respectively, and outperforms the state-of-the-art models. A user-friendly web server for EMDLP was made publicly available at http://www.labiip.net/EMDLP/index.php ( http://47.104.130.81/EMDLP/index.php ).
CONCLUSIONS We developed a predictor for m1A and m6A methylation sites.
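The soft-vote step described in this abstract can be illustrated with a minimal sketch. It assumes each of the three encoding-specific models outputs one probability per candidate site, and that soft voting means an unweighted average followed by a 0.5 threshold; the abstract does not state the weights or the threshold, and the function name `soft_vote` and all scores below are hypothetical.

```python
from statistics import fmean


def soft_vote(prob_lists, threshold=0.5):
    """Average per-site probabilities across models, then threshold.

    prob_lists: one list of probabilities per model, all the same length.
    The equal weighting and the 0.5 threshold are assumptions.
    """
    if not prob_lists:
        raise ValueError("need at least one model's predictions")
    n_sites = len(prob_lists[0])
    if any(len(p) != n_sites for p in prob_lists):
        raise ValueError("all models must score the same sites")
    # zip(*prob_lists) groups the scores that belong to the same site.
    averaged = [fmean(site_probs) for site_probs in zip(*prob_lists)]
    labels = [int(p >= threshold) for p in averaged]
    return averaged, labels


# Hypothetical scores for four candidate sites from the three encodings.
embedding_probs = [0.9, 0.4, 0.2, 0.6]
one_hot_probs = [0.8, 0.6, 0.1, 0.3]
rglove_probs = [0.7, 0.7, 0.3, 0.3]

averaged, labels = soft_vote([embedding_probs, one_hot_probs, rglove_probs])
print(labels)
```

Averaging probabilities rather than counting hard votes lets a confident model outweigh two uncertain ones, which is the usual reason for preferring soft over majority voting.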
|
40
|
Li Z, Pan X, Cai YD. Identification of Type 2 Diabetes Biomarkers From Mixed Single-Cell Sequencing Data With Feature Selection Methods. Front Bioeng Biotechnol 2022; 10:890901. [PMID: 35721855 PMCID: PMC9201257 DOI: 10.3389/fbioe.2022.890901] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/07/2022] [Accepted: 04/04/2022] [Indexed: 11/18/2022]
Abstract
Diabetes is one of the most common diseases and a major threat to human health. Type 2 diabetes (T2D) makes up about 90% of all cases. With the development of high-throughput sequencing technologies, more and more of the fundamental pathogenesis of T2D at the genetic and transcriptomic levels has been revealed. Recent single-cell sequencing can further reveal the cellular heterogeneity of complex diseases in an unprecedented way. To explore the molecular basis of T2D across multiple cell types, we investigated the expression profiles of more than 1,600 single cells (949 cells from T2D patients and 651 cells from normal controls) and identified differential expression patterns and characteristics at the transcriptomic level that can distinguish these two groups of cells at the single-cell level. The expression profiles were analyzed with several machine learning algorithms, including Monte Carlo feature selection, support vector machine, and repeated incremental pruning to produce error reduction (RIPPER). On one hand, some T2D-associated genes (MTND4P24, MTND2P28, and LOC100128906) were discovered. On the other hand, we revealed novel potential pathogenic mechanisms in the form of rules. These mechanisms involve newly recognized genes and are missed by traditional bulk sequencing techniques. In particular, the newly identified T2D genes were shown to follow specific quantitative rules with potential for diabetes prediction, and these rules further indicated several potential functional crosstalk relationships involved in T2D.
Affiliation(s)
- Zhandong Li
- College of Biological and Food Engineering, Jilin Engineering Normal University, Changchun, China
| | - Xiaoyong Pan
- Key Laboratory of System Control and Information Processing, Institute of Image Processing and Pattern Recognition, Ministry of Education of China, Shanghai Jiao Tong University, Shanghai, China
| | - Yu-Dong Cai
- School of Life Sciences, Shanghai University, Shanghai, China
- *Correspondence: Yu-Dong Cai,
| |
|
41
|
|
42
|
Wang Y, Fu Y, Luo X. Identification of Pathogenetic Brain Regions via Neuroimaging Data for Diagnosis of Autism Spectrum Disorders. Front Neurosci 2022; 16:900330. [PMID: 35655751 PMCID: PMC9152096 DOI: 10.3389/fnins.2022.900330] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 03/20/2022] [Accepted: 04/11/2022] [Indexed: 11/13/2022]
Abstract
Autism spectrum disorder (ASD) is a neurodevelopmental disorder that often occurs in children and has a hidden onset. Patients usually show delayed development of communication ability and social behavior and thus suffer an unhealthy physical and mental state. Evidence has indicated that diseases related to ASD have commonalities in brain imaging characteristics. This study aims to investigate the pathogenesis of ASD based on brain imaging data in order to locate ASD-related brain regions. Specifically, we collected functional magnetic resonance imaging data from 479 patients with ASD and 478 normal subjects matched in age and gender, and used a machine-learning framework named random support vector machine cluster to extract distinctive brain regions from the preprocessed data. According to the experimental results, the method used in this study distinguishes patients from normal individuals based on brain imaging data more accurately than other existing approaches. This study also found that the development of ASD was highly correlated with certain brain regions, e.g., the lingual gyrus, superior frontal gyrus, medial gyrus, insular lobe, and olfactory cortex. This study explores the effectiveness of a novel machine-learning approach in the study of ASD brain imaging and provides reference brain areas for the medical research and clinical treatment of ASD.
Affiliation(s)
- Yu Wang
- Hunan Provincial Key Laboratory of Intelligent Computing and Language Information Processing, Hunan Normal University, Changsha, China
- College of Information Science and Engineering, Hunan Normal University, Changsha, China
- Hunan Xiangjiang Artificial Intelligence Academy, Changsha, China
| | - Yu Fu
- Hunan Provincial Key Laboratory of Intelligent Computing and Language Information Processing, Hunan Normal University, Changsha, China
- College of Information Science and Engineering, Hunan Normal University, Changsha, China
- Hunan Xiangjiang Artificial Intelligence Academy, Changsha, China
- *Correspondence: Yu Fu
| | - Xun Luo
- Hunan Provincial Key Laboratory of Intelligent Computing and Language Information Processing, Hunan Normal University, Changsha, China
- College of Information Science and Engineering, Hunan Normal University, Changsha, China
- Hunan Xiangjiang Artificial Intelligence Academy, Changsha, China
| |
|
43
|
Zhang S, Yao Y, Wang J, Liang Y. Identification of DNA N4-methylcytosine sites based on multi-source features and gradient boosting decision tree. Anal Biochem 2022; 652:114746. [DOI: 10.1016/j.ab.2022.114746] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/25/2022] [Revised: 05/13/2022] [Accepted: 05/18/2022] [Indexed: 11/16/2022]
|
44
|
Yu B, Zhang Y, Wang X, Gao H, Sun J, Gao X. Identification of DNA modification sites based on elastic net and bidirectional gated recurrent unit with convolutional neural network. Biomed Signal Process Control 2022. [DOI: 10.1016/j.bspc.2022.103566] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Indexed: 02/06/2023]
|
45
|
Yu L, Zhang Y, Xue L, Liu F, Chen Q, Luo J, Jing R. Systematic Analysis and Accurate Identification of DNA N4-Methylcytosine Sites by Deep Learning. Front Microbiol 2022; 13:843425. [PMID: 35401453 PMCID: PMC8989013 DOI: 10.3389/fmicb.2022.843425] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/25/2021] [Accepted: 02/21/2022] [Indexed: 11/13/2022]
Abstract
DNA N4-methylcytosine (4mC) is a pivotal epigenetic modification that plays an essential role in DNA replication, repair, expression and differentiation. To gain insight into the biological functions of 4mC, it is critical to identify its modification sites in the genome. In recent years, deep learning has become increasingly popular and is frequently employed for 4mC site identification. However, a systematic analysis of how to build predictive models using deep learning techniques is still lacking. In this work, we first summarized all existing deep learning-based predictors and systematically analyzed their models, features and datasets. Then, using a typical standard dataset with three species (A. thaliana, C. elegans, and D. melanogaster), we assessed the contribution of different model architectures, encoding methods and the attention mechanism in establishing a deep learning-based model for 4mC site prediction. After a series of optimizations, a convolutional-recurrent neural network architecture using one-hot encoding and the attention mechanism achieved the best overall prediction performance. Extensive comparison experiments were conducted based on the same dataset. This work will be helpful for researchers who would like to build 4mC prediction models using deep learning in the future.
Affiliation(s)
- Lezheng Yu
- School of Chemistry and Materials Science, Guizhou Education University, Guiyang, China
| | - Yonglin Zhang
- Department of Pharmacology, School of Pharmacy, Southwest Medical University, Luzhou, China
| | - Li Xue
- School of Public Health, Southwest Medical University, Luzhou, China
| | - Fengjuan Liu
- School of Geography and Resources, Guizhou Education University, Guiyang, China
| | - Qi Chen
- Department of Endocrinology and Metabolism, The Affiliated Hospital of Southwest Medical University, Luzhou, China
| | - Jiesi Luo
- Department of Pharmacology, School of Pharmacy, Southwest Medical University, Luzhou, China; Department of Pharmacy, The Affiliated Hospital of Southwest Medical University, Luzhou, China
| | - Runyu Jing
- School of Cyber Science and Engineering, Sichuan University, Chengdu, China
| |
|
46
|
Jiao S, Chen Z, Zhang L, Zhou X, Shi L. ATGPred-FL: sequence-based prediction of autophagy proteins with feature representation learning. Amino Acids 2022; 54:799-809. [PMID: 35286461 DOI: 10.1007/s00726-022-03145-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/26/2021] [Accepted: 01/28/2022] [Indexed: 11/26/2022]
Abstract
Autophagy plays an important role in biological evolution and is regulated by many autophagy proteins. Accurate identification of autophagy proteins is crucially important to reveal their biological functions. Because experimental methods are expensive and labor-intensive, there is an urgent need for automated, accurate and reliable sequence-based computational tools that can identify novel autophagy proteins among numerous proteins and peptides. For this purpose, a new predictor named ATGPred-FL was proposed for the efficient identification of autophagy proteins. We investigated various sequence-based feature descriptors and adopted a feature learning method to generate corresponding, more informative probability features. Then, a two-step feature selection strategy based on accuracy was used to remove irrelevant and redundant features, leading to the most discriminative 14-dimensional feature set. The final predictor was built using a support vector machine classifier, which performed favorably on both the training and testing sets, with accuracy values of 94.40% and 90.50%, respectively. ATGPred-FL is the first ATG machine learning predictor based on protein primary sequences. We envision that ATGPred-FL will be an effective and useful tool for autophagy protein identification. It is available for free at http://lab.malab.cn/~acy/ATGPred-FL , and the source code and datasets are accessible at https://github.com/jiaoshihu/ATGPred .
Affiliation(s)
- Shihu Jiao
- Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou, China
| | - Zheng Chen
- School of Applied Chemistry and Biological Technology, Shenzhen Polytechnic, 7098 Liuxian Street, Shenzhen, 518055, China
- Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, No.4 Block 2 North Jianshe Road, Chengdu, 61005, China
| | - Lichao Zhang
- School of Intelligent Manufacturing and Equipment, Shenzhen Institute of Information Technology, Shenzhen, 518172, China
| | - Xun Zhou
- Beidahuang Industry Group General Hospital, Harbin, 150001, China.
| | - Lei Shi
- Department of Spine Surgery, Changzheng Hospital, Naval Medical University, No 415, Fengyang Road, Huangpu District, Shanghai, 210000, China.
| |
|
47
|
iEnhancer-Deep: A Computational Predictor for Enhancer Sites and Their Strength Using Deep Learning. APPLIED SCIENCES-BASEL 2022. [DOI: 10.3390/app12042120] [Citation(s) in RCA: 11] [Impact Index Per Article: 3.7] [Indexed: 01/27/2023]
Abstract
Enhancers are short motifs whose positions are highly variable and freely scattered. Identifying these non-coding DNA fragments and their strength is vital because they play an important role in gene regulation. Enhancer identification is more complicated than that of other genetic elements because of this free scattering and very high locational variation. Several computational tools have been created in bioinformatics over the last few years to address this problem, but current learning models are still lacking. To overcome these limitations, we introduce iEnhancer-Deep, a deep learning-based framework that uses one-hot encoding and a convolutional neural network for model construction, primarily for the identification of enhancers and secondarily for the classification of their strength. Comparisons between iEnhancer-Deep and existing state-of-the-art methods were drawn to evaluate the performance of the proposed model. Furthermore, a cross-species test was carried out to assess the generalizability of the proposed model. In general, the results show that the proposed model produced results comparable with the state-of-the-art models.
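The one-hot encoding mentioned in this abstract can be sketched as follows. This is a generic illustration rather than the authors' code: the column order A, C, G, T and the choice to map any other symbol (such as N) to an all-zero row are assumptions, and the function name `one_hot` is hypothetical.

```python
ALPHABET = "ACGT"  # assumed column order; the abstract does not specify one


def one_hot(sequence):
    """Encode a DNA string as a list of 4-element 0/1 rows.

    Each row has a single 1 in the column of its base. Symbols outside
    ALPHABET (for example 'N') become an all-zero row, which is an
    assumption made for this sketch.
    """
    rows = []
    for base in sequence.upper():
        rows.append([1 if base == letter else 0 for letter in ALPHABET])
    return rows


# A short made-up fragment; the result has one row per nucleotide.
encoded = one_hot("ACGTN")
for base, row in zip("ACGTN", encoded):
    print(base, row)
```

The resulting matrix of shape (sequence length, 4) is the kind of input a 1D convolutional network typically consumes, with the four columns treated as channels.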
|
48
|
Cai J, Xiao G, Su R. GC6mA-Pred: A deep learning approach to identify DNA N6-methyladenine sites in the rice genome. Methods 2022; 204:14-21. [PMID: 35149214 DOI: 10.1016/j.ymeth.2022.02.001] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Received: 01/09/2022] [Revised: 01/31/2022] [Accepted: 02/05/2022] [Indexed: 12/11/2022]
Abstract
MOTIVATION DNA N6-methyladenine (6mA) is a pivotal DNA modification for various biological processes. More accurate prediction of 6mA methylation sites is essential for understanding the mechanisms behind the related biological activities. However, existing prediction methods extract information from only a single dimension, which limits them. It is therefore necessary to obtain information about 6mA sites from different dimensions in order to establish a reliable prediction method. RESULTS In this study, a neural network-based bioinformatics model named GC6mA-Pred is proposed to predict N6-methyladenine modifications in DNA sequences. GC6mA-Pred extracts significant information at both the sequence level and the graph level. At the sequence level, GC6mA-Pred uses a three-layer convolutional neural network (CNN) model to represent the sequence. At the graph level, GC6mA-Pred employs a graph neural network (GNN) to integrate the information contained in the chemical molecular formula corresponding to the DNA sequence. On our newly built dataset, GC6mA-Pred shows better performance than other existing models. The results of comparative experiments illustrate that GC6mA-Pred can accurately identify DNA 6mA modifications.
Affiliation(s)
- Jianhua Cai
- Fujian Provincial Key Laboratory of Information Processing and Intelligent Control, College of Computer and Control Engineering, Minjiang University, Fuzhou, China; College of Mathematics and Computer Science, Fuzhou University, Fuzhou, PR China
| | - Guobao Xiao
- Fujian Provincial Key Laboratory of Information Processing and Intelligent Control, College of Computer and Control Engineering, Minjiang University, Fuzhou, China.
| | - Ran Su
- College of Intelligence and Computing, Tianjin University, Tianjin, China.
| |
|
49
|
Lin C, Wang L, Shi L. AAPred-CNN: accurate predictor based on deep convolution neural network for identification of anti-angiogenic peptides. Methods 2022; 204:442-448. [PMID: 35031486 DOI: 10.1016/j.ymeth.2022.01.004] [Citation(s) in RCA: 10] [Impact Index Per Article: 3.3] [Received: 11/24/2021] [Revised: 12/28/2021] [Accepted: 01/09/2022] [Indexed: 12/13/2022]
Abstract
Recently, deep learning techniques have been developed for various bioactive peptide prediction tasks. However, only conventional machine learning-based methods exist for the prediction of anti-angiogenic peptides (AAP), which play an important role in cancer treatment. The main reason no deep learning method has been applied in this field is that there are too few experimentally validated AAPs to support the training of deep models, and researchers have believed that deep learning depends heavily on the amount of labeled data. In this paper, as a tentative work, we try to predict AAP by constructing different classical deep learning models and propose the first deep convolution neural network-based predictor (AAPred-CNN) for AAP. Contrary to intuition, the experimental results show that deep learning models can achieve performance superior or comparable to the state-of-the-art model, even though they are given only a few labeled sequences for training. We also examine the influence of hyper-parameters and training samples on the performance of deep learning models to help understand how the model works. Furthermore, we visualize the learned embeddings by dimension reduction to increase model interpretability, and reveal the residue propensity of AAP through the statistics of convolutional features for different residues. In summary, this work demonstrates the powerful representation ability of AAPred-CNN for AAP prediction, further improving the prediction accuracy of AAP.
Affiliation(s)
- Changhang Lin
- School of Big Data and Artificial Intelligence, Fujian Polytechnic Normal University, Fuzhou, China
| | - Lei Wang
- Beidahuang Industry Group General Hospital, Harbin, China.
| | - Lei Shi
- Department of Spine Surgery, Changzheng Hospital, Naval Medical University, Shanghai, China.
| |
|
50
|
Mouse4mC-BGRU: deep learning for predicting DNA N4-methylcytosine sites in mouse genome. Methods 2022; 204:258-262. [DOI: 10.1016/j.ymeth.2022.01.009] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 12/14/2021] [Revised: 01/14/2022] [Accepted: 01/24/2022] [Indexed: 12/12/2022]
|