1
|
Du K, Xia Y, Wu Q, Yin M, Zhao H, Chen XW. Analysis of whole transcriptome reveals the immune response to porcine reproductive and respiratory syndrome virus infection and tylvalosin tartrate treatment in the porcine alveolar macrophages. Front Immunol 2025; 15:1506371. [PMID: 39872536 PMCID: PMC11769836 DOI: 10.3389/fimmu.2024.1506371] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/05/2024] [Accepted: 12/23/2024] [Indexed: 01/30/2025] Open
Abstract
Introduction Porcine reproductive and respiratory syndrome virus (PRRSV) is a major pathogen that has caused severe economic losses in the swine industry. Screening key host immune-related genetic factors in the porcine alveolar macrophages (PAMs) is critical to improve the anti-virial ability in pigs. Methods In this study, an in vivo model was set to evaluate the anti-PRRSV effect of tylvalosin tartrates. Then, strand-specific RNA-sequencing (ssRNA-seq) and miRNA-sequencing (miRNA-seq) were carried out to profile the whole transcriptome of PAMs in the negative control, PRRSV-infected, and tylvalosin tartrates-treatment group. Results The ssRNA-seq identified 11740 long non-coding RNAs in PAMs. Based on our attention mechanism-improved graph convolutional network, 41.07% and 28.59% lncRNAs were predicted to be located in the nucleus and cytoplasm, respectively. The miRNA-seq revealed that tylvalosin tartrates-enhanced miRNAs might play roles in regulating angiogenesis and innate immune-related functions, and it rescued the expression of three anti-inflammation miRNAs (ssc-miR-30a-5p, ssc-miR-218-5p, and ssc-miR-218) that were downregulated due to PRRSV infection. The cytoplasmic lncRNAs enhanced by tylvalosin tartrates might form ceRNA networks with miRNAs to regulate PAM chemotaxis. While cytoplasmic lncRNAs that were rescued by tylvalosin tartrates might protect PAMs via efferocytosis-related ceRNA networks. On the other hand, the tylvalosin tartrates-rescued nuclear lncRNAs might negatively regulate T cell apoptosis and bind to key anti-inflammation factor IL37 to protect the lungs by cis- and trans-regulation. Conclusions Our data provides a catalog of key non-coding RNAs in response to PRRSV and tylvalosin tartrates and might enrich the genetic basis for future PRRSV prevention and control.
Collapse
Affiliation(s)
| | | | | | | | | | - Xi-wen Chen
- Animal Disease Prevention and Control and Healthy Breeding Engineering Technology Research Centre, Mianyang Normal University, Mianyang, China
| |
Collapse
|
2
|
Yang Z, Shao W, Matsuda Y, Song L. iResNetDM: An interpretable deep learning approach for four types of DNA methylation modification prediction. Comput Struct Biotechnol J 2024; 23:4214-4221. [PMID: 39650332 PMCID: PMC11621598 DOI: 10.1016/j.csbj.2024.11.006] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/26/2024] [Revised: 10/11/2024] [Accepted: 11/02/2024] [Indexed: 12/11/2024] Open
Abstract
Motivation Although several computational methods for predicting DNA methylation modifications have been developed, two main limitations persist: 1) All of the models are currently confined to binary predictors, which merely determine the presence or absence of DNA methylation modifications and thus prevent comprehensive analyses of the interrelations among varied modification types. Multi-class classification models for RNA modifications have been developed, and a comparable approach for DNA is essential. 2) Few previous studies offer adequate explanations of how models make decisions, instead relying on the extraction and visualization of attention matrices, which have identified few motifs and do not provide sufficient insights into the model decision-making process. Result In this study, we introduce the task of DNA methylation modification prediction as a multi-class classification problem for the first time. We present iResNetDM, a deep learning model that integrates Residual Networks (ResNet) with self-attention mechanisms. To the best of our knowledge, iResNetDM is the first model capable of distinguishing between four types of DNA methylation modifications. Our model not only demonstrates good performance across various DNA methylation modifications but can also capture relationships between different types of modifications. We used the integrated gradients technique to enhance the interpretability of the iResNetDM. This method can effectively elucidate the model's decision-making process, thus enabling the successful identification of multiple motifs. Notably, our model displays remarkable robustness, and can effectively identify unique motifs across different methylation modifications. We also compared the motifs discovered in various modifications and found that some had notable sequence similarities, suggesting that they may be subject to different types of modifications. This finding highlights the potential importance of these motifs in gene regulation.
Collapse
Affiliation(s)
- Zerui Yang
- Department of Chemistry, City University of Hong Kong, Hong Kong
- City University of Hong Kong Shenzhen Research Institute
| | - Wei Shao
- Department of Computer Science, City University of Hong Kong, Hong Kong
| | - Yudai Matsuda
- Department of Chemistry, City University of Hong Kong, Hong Kong
| | - Linqi Song
- City University of Hong Kong Shenzhen Research Institute
- Department of Computer Science, City University of Hong Kong, Hong Kong
| |
Collapse
|
3
|
Xia Y, Zhang Y, Liu D, Zhu YH, Wang Z, Song J, Yu DJ. BLAM6A-Merge: Leveraging Attention Mechanisms and Feature Fusion Strategies to Improve the Identification of RNA N6-Methyladenosine Sites. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2024; 21:1803-1815. [PMID: 38913512 DOI: 10.1109/tcbb.2024.3418490] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/26/2024]
Abstract
RNA N6-methyladenosine is a prevalent and abundant type of RNA modification that exerts significant influence on diverse biological processes. To date, numerous computational approaches have been developed for predicting methylation, with most of them ignoring the correlations of different encoding strategies and failing to explore the adaptability of various attention mechanisms for methylation identification. To solve the above issues, we proposed an innovative framework for predicting RNA m6A modification site, termed BLAM6A-Merge. Specifically, it utilized a multimodal feature fusion strategy to combine the classification results of four features and Blastn tool. Apart from this, different attention mechanisms were employed for extracting higher-level features on specific features after the screening process. Extensive experiments on 12 benchmarking datasets demonstrated that BLAM6A-Merge achieved superior performance (average AUC: 0.849 for the full transcript mode and 0.784 for the mature mRNA mode). Notably, the Blastn tool was employed for the first time in the identification of methylation sites.
Collapse
|
4
|
Fan X, Lin B, Hu J, Guo Z. Ense-i6mA: Identification of DNA N 6-Methyladenine Sites Using XGB-RFE Feature Selection and Ensemble Machine Learning. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2024; 21:1842-1854. [PMID: 38949938 DOI: 10.1109/tcbb.2024.3421228] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 07/03/2024]
Abstract
DNA N6-methyladenine (6mA) is an important epigenetic modification that plays a vital role in various cellular processes. Accurate identification of the 6mA sites is fundamental to elucidate the biological functions and mechanisms of modification. However, experimental methods for detecting 6mA sites are high-priced and time-consuming. In this study, we propose a novel computational method, called Ense-i6mA, to predict 6mA sites. Firstly, five encoding schemes, i.e., one-hot encoding, gcContent, Z-Curve, K-mer nucleotide frequency, and K-mer nucleotide frequency with gap, are employed to extract DNA sequence features. Secondly, eXtreme gradient boosting coupled with recursive feature elimination is applied to remove noisy features for avoiding over-fitting, reducing computing time and complexity. Then, the best subset of features is fed into base-classifiers composed of Extra Trees, eXtreme Gradient Boosting, Light Gradient Boosting Machine, and Support Vector Machine. Finally, to minimize generalization errors, the prediction probabilities of the base-classifiers are aggregated by averaging for inferring the final 6mA sites results. We conduct experiments on two species, i.e., Arabidopsis thaliana and Drosophila melanogaster, to compare the performance of Ense-i6mA against the recent 6mA sites prediction methods. The experimental results demonstrate that the proposed Ense-i6mA achieves area under the receiver operating characteristic curve values of 0.967 and 0.968, accuracies of 91.4% and 92.0%, and Mathew's correlation coefficient values of 0.829 and 0.842 on two benchmark datasets, respectively, and outperforms several existing state-of-the-art methods.
Collapse
|
5
|
Yu X, Yani C, Wang Z, Long H, Zeng R, Liu X, Anas B, Ren J. iDNA-ITLM: An interpretable and transferable learning model for identifying DNA methylation. PLoS One 2024; 19:e0301791. [PMID: 39480834 PMCID: PMC11527195 DOI: 10.1371/journal.pone.0301791] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/16/2023] [Accepted: 03/20/2024] [Indexed: 11/02/2024] Open
Abstract
In this study, from the perspective of image processing, we propose the iDNA-ITLM model, using a novel data enhance strategy by continuously self-replicating a short DNA sequence into a longer DNA sequence and then embedding it into a high-dimensional matrix to enlarge the receptive field, for identifying DNA methylation sites. Our model consistently outperforms the current state-of-the-art sequence-based DNA methylation site recognition methods when evaluated on 17 benchmark datasets that cover multiple species and include three DNA methylation modifications (4mC, 5hmC, and 6mA). The experimental results demonstrate the robustness and superior performance of our model across these datasets. In addition, our model can transfer learning to RNA methylation sequences and produce good results without modifying the hyperparameters in the model. The proposed iDNA-ITLM model can be considered a universal predictor across DNA and RNA methylation species.
Collapse
Affiliation(s)
- Xia Yu
- School of Information and Communication Engineering, Hainan University, Haikou, Hainan, China
- Key Laboratory of Data Science and Smart Education, Ministry of Education, Hainan Normal University, Haikou, Hainan, China
| | - Cui Yani
- School of Information and Communication Engineering, Hainan University, Haikou, Hainan, China
| | - Zhichao Wang
- Unit 32033, The People’s Liberation Army, Beijing, China
| | - Haixia Long
- Key Laboratory of Data Science and Smart Education, Ministry of Education, Hainan Normal University, Haikou, Hainan, China
| | - Rao Zeng
- Key Laboratory of Data Science and Smart Education, Ministry of Education, Hainan Normal University, Haikou, Hainan, China
| | - Xiling Liu
- Key Laboratory of Data Science and Smart Education, Ministry of Education, Hainan Normal University, Haikou, Hainan, China
| | - Bilal Anas
- Key Laboratory of Data Science and Smart Education, Ministry of Education, Hainan Normal University, Haikou, Hainan, China
| | - Jia Ren
- School of Information and Communication Engineering, Hainan University, Haikou, Hainan, China
| |
Collapse
|
6
|
Zhang M, Wang X, Xu S, Ge F, Paixao IC, Song J, Yu DJ. MetalTrans: A Biological Language Model-Based Approach for Predicting Disease-Associated Mutations in Protein Metal-Binding Sites. J Chem Inf Model 2024; 64:6216-6229. [PMID: 39092854 DOI: 10.1021/acs.jcim.4c00739] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 08/04/2024]
Abstract
The critical importance of accurately predicting mutations in protein metal-binding sites for advancing drug discovery and enhancing disease diagnostic processes cannot be overstated. In response to this imperative, MetalTrans emerges as an accurate predictor for disease-associated mutations in protein metal-binding sites. The core innovation of MetalTrans lies in its seamless integration of multifeature splicing with the Transformer framework, a strategy that ensures exhaustive feature extraction. Central to MetalTrans's effectiveness is its deep feature combination strategy, which merges evolutionary-scale modeling amino acid embeddings with ProtTrans embeddings, thus shedding light on the biochemical properties of proteins. Employing the Transformer component, MetalTrans leverages the self-attention mechanism to delve into higher-level representations. Utilizing mutation site information for feature fusion not only enriches the feature set but also sidesteps the common pitfall of overestimation linked to protein sequence-based predictions. This nuanced approach to feature fusion is a key differentiator, enabling MetalTrans to outperform existing methods significantly, as evidenced by comparative analyses. Our evaluations across varied metal binding site data sets (specifically Zn, Ca, Mg, and Mix) underscore MetalTrans's superior performance, which achieved the average AUC values of 0.971, 0.965, 0.980, and 0.945 on multiple 5-fold cross-validation, respectively. Remarkably, against the multichannel convolutional neural network method on a benchmark independent test set, MetalTrans demonstrated unparalleled robustness and superiority, boasting the AUC score of 0.998 on multiple 5-fold cross-validation. Our comprehensive examination of the predicted outcomes further confirms the effectiveness of the model. The source codes, data sets, and prediction results for MetalTrans can be accessed for academic usage at https://github.com/EduardWang/MetalTrans.
Collapse
Affiliation(s)
- Ming Zhang
- School of Computer, Jiangsu University of Science and Technology, 666 Changhui Road, Zhenjiang 212100, China
| | - Xiaohua Wang
- School of Computer, Jiangsu University of Science and Technology, 666 Changhui Road, Zhenjiang 212100, China
| | - Shanruo Xu
- Duke Kunshan University, Duke Avenue, Kunshan, Jiangsu 215316, China
| | - Fang Ge
- State Key Laboratory of Organic Electronics and Information Displays & Institute of Advanced Materials (IAM), Nanjing University of Posts & Telecommunications, 9 Wenyuan Road, Nanjing 210023, China
| | - Ian Costa Paixao
- Monash Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Melbourne, Victoria 3800, Australia
- Monash Data Futures Institute, Monash University, Melbourne, Victoria 3800, Australia
| | - Jiangning Song
- Monash Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Melbourne, Victoria 3800, Australia
- Monash Data Futures Institute, Monash University, Melbourne, Victoria 3800, Australia
| | - Dong-Jun Yu
- School of Computer Science and Engineering, Nanjing University of Science and Technology, 200 Xiaolingwei, Nanjing 210094, China
| |
Collapse
|
7
|
Wang X, Li F, Zhang Y, Imoto S, Shen HH, Li S, Guo Y, Yang J, Song J. Deep learning approaches for non-coding genetic variant effect prediction: current progress and future prospects. Brief Bioinform 2024; 25:bbae446. [PMID: 39276327 PMCID: PMC11401448 DOI: 10.1093/bib/bbae446] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/02/2024] [Revised: 08/08/2024] [Accepted: 08/27/2024] [Indexed: 09/16/2024] Open
Abstract
Recent advancements in high-throughput sequencing technologies have significantly enhanced our ability to unravel the intricacies of gene regulatory processes. A critical challenge in this endeavor is the identification of variant effects, a key factor in comprehending the mechanisms underlying gene regulation. Non-coding variants, constituting over 90% of all variants, have garnered increasing attention in recent years. The exploration of gene variant impacts and regulatory mechanisms has spurred the development of various deep learning approaches, providing new insights into the global regulatory landscape through the analysis of extensive genetic data. Here, we provide a comprehensive overview of the development of the non-coding variants models based on bulk and single-cell sequencing data and their model-based interpretation and downstream tasks. This review delineates the popular sequencing technologies for epigenetic profiling and deep learning approaches for discerning the effects of non-coding variants. Additionally, we summarize the limitations of current approaches in variant effect prediction research and outline opportunities for improvement. We anticipate that our study will offer a practical and useful guide for the bioinformatic community to further advance the unraveling of genetic variant effects.
Collapse
Affiliation(s)
- Xiaoyu Wang
- Monash Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Melbourne, VIC 3800, Australia
- Monash Data Futures Institute, Monash University, Melbourne, VIC 3800, Australia
| | - Fuyi Li
- South Australian immunoGENomics Cancer Institute (SAiGENCI), Faculty of Health and Medical Sciences, The University of Adelaide, Adelaide, SA 5005, Australia
| | - Yiwen Zhang
- School of Public Health and Preventive Medicine, Monash University, Melbourne, VIC 3004, Australia
| | - Seiya Imoto
- Genome Center, Institute of Medical Science, The University of Tokyo, Minato-ku, Tokyo 108-8639, Japan
- Collaborative Research Institute for Innovative Microbiology, The University of Tokyo, Bunkyo-ku, Tokyo 113-8657, Japan
| | - Hsin-Hui Shen
- Department of Materials Science and Engineering, Faculty of Engineering, Monash University, Clayton, VIC 3800, Australia
| | - Shanshan Li
- School of Public Health and Preventive Medicine, Monash University, Melbourne, VIC 3004, Australia
| | - Yuming Guo
- School of Public Health and Preventive Medicine, Monash University, Melbourne, VIC 3004, Australia
| | - Jian Yang
- School of Life Sciences, Westlake University, Hangzhou, Zhejiang 310030, China
- Westlake Laboratory of Life Sciences and Biomedicine, Hangzhou, Zhejiang 310024, China
| | - Jiangning Song
- Monash Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Melbourne, VIC 3800, Australia
- Monash Data Futures Institute, Monash University, Melbourne, VIC 3800, Australia
| |
Collapse
|
8
|
Yu X, Ren J, Long H, Zeng R, Zhang G, Bilal A, Cui Y. iDNA-OpenPrompt: OpenPrompt learning model for identifying DNA methylation. Front Genet 2024; 15:1377285. [PMID: 38689652 PMCID: PMC11058834 DOI: 10.3389/fgene.2024.1377285] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/27/2024] [Accepted: 03/07/2024] [Indexed: 05/02/2024] Open
Abstract
Introduction: DNA methylation is a critical epigenetic modification involving the addition of a methyl group to the DNA molecule, playing a key role in regulating gene expression without changing the DNA sequence. The main difficulty in identifying DNA methylation sites lies in the subtle and complex nature of methylation patterns, which may vary across different tissues, developmental stages, and environmental conditions. Traditional methods for methylation site identification, such as bisulfite sequencing, are typically labor-intensive, costly, and require large amounts of DNA, hindering high-throughput analysis. Moreover, these methods may not always provide the resolution needed to detect methylation at specific sites, especially in genomic regions that are rich in repetitive sequences or have low levels of methylation. Furthermore, current deep learning approaches generally lack sufficient accuracy. Methods: This study introduces the iDNA-OpenPrompt model, leveraging the novel OpenPrompt learning framework. The model combines a prompt template, prompt verbalizer, and Pre-trained Language Model (PLM) to construct the prompt-learning framework for DNA methylation sequences. Moreover, a DNA vocabulary library, BERT tokenizer, and specific label words are also introduced into the model to enable accurate identification of DNA methylation sites. Results and Discussion: An extensive analysis is conducted to evaluate the predictive, reliability, and consistency capabilities of the iDNA-OpenPrompt model. The experimental outcomes, covering 17 benchmark datasets that include various species and three DNA methylation modifications (4mC, 5hmC, 6mA), consistently indicate that our model surpasses outstanding performance and robustness approaches.
Collapse
Affiliation(s)
- Xia Yu
- School of Information and Communication Engineering, Hainan University, Haikou, Hainan, China
- School of Information Science and Technology, Hainan Normal University, Haikou, Hainan, China
| | - Jia Ren
- School of Information and Communication Engineering, Hainan University, Haikou, Hainan, China
| | - Haixia Long
- School of Information Science and Technology, Hainan Normal University, Haikou, Hainan, China
| | - Rao Zeng
- School of Information Science and Technology, Hainan Normal University, Haikou, Hainan, China
| | - Guoqiang Zhang
- School of Information Science and Technology, Hainan Normal University, Haikou, Hainan, China
| | - Anas Bilal
- School of Information Science and Technology, Hainan Normal University, Haikou, Hainan, China
| | - Yani Cui
- School of Information and Communication Engineering, Hainan University, Haikou, Hainan, China
| |
Collapse
|
9
|
KafiKang M, Abeysiriwardana C, Singh VK, Young Koh C, Prichard J, Mor SK, Hendawi A. Analysis of Emerging Variants of Turkey Reovirus using Machine Learning. Brief Bioinform 2024; 25:bbae224. [PMID: 38752857 PMCID: PMC11097603 DOI: 10.1093/bib/bbae224] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/27/2023] [Revised: 03/26/2024] [Accepted: 04/25/2024] [Indexed: 05/19/2024] Open
Abstract
Avian reoviruses continue to cause disease in turkeys with varied pathogenicity and tissue tropism. Turkey enteric reovirus has been identified as a causative agent of enteritis or inapparent infections in turkeys. The new emerging variants of turkey reovirus, tentatively named turkey arthritis reovirus (TARV) and turkey hepatitis reovirus (THRV), are linked to tenosynovitis/arthritis and hepatitis, respectively. Turkey arthritis and hepatitis reoviruses are causing significant economic losses to the turkey industry. These infections can lead to poor weight gain, uneven growth, poor feed conversion, increased morbidity and mortality and reduced marketability of commercial turkeys. To combat these issues, detecting and classifying the types of reoviruses in turkey populations is essential. This research aims to employ clustering methods, specifically K-means and Hierarchical clustering, to differentiate three types of turkey reoviruses and identify novel emerging variants. Additionally, it focuses on classifying variants of turkey reoviruses by leveraging various machine learning algorithms such as Support Vector Machines, Naive Bayes, Random Forest, Decision Tree, and deep learning algorithms, including convolutional neural networks (CNNs). The experiments use real turkey reovirus sequence data, allowing for robust analysis and evaluation of the proposed methods. The results indicate that machine learning methods achieve an average accuracy of 92%, F1-Macro of 93% and F1-Weighted of 92% scores in classifying reovirus types. In contrast, the CNN model demonstrates an average accuracy of 85%, F1-Macro of 71% and F1-Weighted of 84% scores in the same classification task. The superior performance of the machine learning classifiers provides valuable insights into reovirus evolution and mutation, aiding in detecting emerging variants of pathogenic TARVs and THRVs.
Collapse
Affiliation(s)
- Maryam KafiKang
- Computer Science Department, University of Rhode Island, Kingston, 02881, RI, USA
| | | | - Vikash K Singh
- Department of Veterinary Population Medicine, and Veterinary Diagnostic Laboratory, University of Minnesota, Saint Paul, 55108, MN, USA
| | - Chan Young Koh
- Computer Science Department, University of Rhode Island, Kingston, 02881, RI, USA
| | - Janet Prichard
- Department of Information Systems and Analytics, Bryant University, Smithfield, 02917, RI, USA
| | - Sunil K Mor
- Department of Veterinary Population Medicine, and Veterinary Diagnostic Laboratory, University of Minnesota, Saint Paul, 55108, MN, USA
- Department of Veterinary and Biomedical Sciences and Animal Disease Research & Diagnostic Laboratory, South Dakota State University, Brookings, 57007, SD, USA
| | - Abdeltawab Hendawi
- Computer Science Department, University of Rhode Island, Kingston, 02881, RI, USA
- Faculty of Computers and Artificial Intelligence, Cairo University, Giza, Egypt
| |
Collapse
|
10
|
Wang S, Liu Y, Liu Y, Zhang Y, Zhu X. BERT-5mC: an interpretable model for predicting 5-methylcytosine sites of DNA based on BERT. PeerJ 2023; 11:e16600. [PMID: 38089911 PMCID: PMC10712318 DOI: 10.7717/peerj.16600] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/04/2023] [Accepted: 11/15/2023] [Indexed: 12/18/2023] Open
Abstract
DNA 5-methylcytosine (5mC) is widely present in multicellular eukaryotes, which plays important roles in various developmental and physiological processes and a wide range of human diseases. Thus, it is essential to accurately detect the 5mC sites. Although current sequencing technologies can map genome-wide 5mC sites, these experimental methods are both costly and time-consuming. To achieve a fast and accurate prediction of 5mC sites, we propose a new computational approach, BERT-5mC. First, we pre-trained a domain-specific BERT (bidirectional encoder representations from transformers) model by using human promoter sequences as language corpus. BERT is a deep two-way language representation model based on Transformer. Second, we fine-tuned the domain-specific BERT model based on the 5mC training dataset to build the model. The cross-validation results show that our model achieves an AUROC of 0.966 which is higher than other state-of-the-art methods such as iPromoter-5mC, 5mC_Pred, and BiLSTM-5mC. Furthermore, our model was evaluated on the independent test set, which shows that our model achieves an AUROC of 0.966 that is also higher than other state-of-the-art methods. Moreover, we analyzed the attention weights generated by BERT to identify a number of nucleotide distributions that are closely associated with 5mC modifications. To facilitate the use of our model, we built a webserver which can be freely accessed at: http://5mc-pred.zhulab.org.cn.
Collapse
Affiliation(s)
- Shuyu Wang
- School of Sciences, Anhui Agricultural University, Hefei, Anhui, China
| | - Yinbo Liu
- School of Sciences, Anhui Agricultural University, Hefei, Anhui, China
| | - Yufeng Liu
- School of Sciences, Anhui Agricultural University, Hefei, Anhui, China
| | - Yong Zhang
- School of Sciences, Anhui Agricultural University, Hefei, Anhui, China
| | - Xiaolei Zhu
- School of Sciences, Anhui Agricultural University, Hefei, Anhui, China
| |
Collapse
|
11
|
Huang G, Huang X, Luo W. 6mA-StackingCV: an improved stacking ensemble model for predicting DNA N6-methyladenine site. BioData Min 2023; 16:34. [PMID: 38012796 PMCID: PMC10680251 DOI: 10.1186/s13040-023-00348-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/13/2023] [Accepted: 11/04/2023] [Indexed: 11/29/2023] Open
Abstract
DNA N6-adenine methylation (N6-methyladenine, 6mA) plays a key regulating role in the cellular processes. Precisely recognizing 6mA sites is of importance to further explore its biological functions. Although there are many developed computational methods for 6mA site prediction over the past decades, there is a large root left to improve. We presented a cross validation-based stacking ensemble model for 6mA site prediction, called 6mA-StackingCV. The 6mA-StackingCV is a type of meta-learning algorithm, which uses output of cross validation as input to the final classifier. The 6mA-StackingCV reached the state of the art performances in the Rosaceae independent test. Extensive tests demonstrated the stability and the flexibility of the 6mA-StackingCV. We implemented the 6mA-StackingCV as a user-friendly web application, which allows one to restrictively choose representations or learning algorithms. This application is freely available at http://www.biolscience.cn/6mA-stackingCV/ . The source code and experimental data is available at https://github.com/Xiaohong-source/6mA-stackingCV .
Collapse
Affiliation(s)
- Guohua Huang
- School of Information Technology and Administration, Hunan University of Finance and Economics, Changsha, China.
- College of Information Science and Engineering, Shaoyang University, Shaoyang, Hunan, 422000, China.
| | - Xiaohong Huang
- College of Information Science and Engineering, Shaoyang University, Shaoyang, Hunan, 422000, China
| | - Wei Luo
- College of Information Science and Engineering, Shaoyang University, Shaoyang, Hunan, 422000, China
| |
Collapse
|
12
|
Pham NT, Rakkiyapan R, Park J, Malik A, Manavalan B. H2Opred: a robust and efficient hybrid deep learning model for predicting 2'-O-methylation sites in human RNA. Brief Bioinform 2023; 25:bbad476. [PMID: 38180830 PMCID: PMC10768780 DOI: 10.1093/bib/bbad476] [Citation(s) in RCA: 16] [Impact Index Per Article: 8.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/30/2023] [Revised: 11/22/2023] [Accepted: 11/28/2023] [Indexed: 01/07/2024] Open
Abstract
2'-O-methylation (2OM) is the most common post-transcriptional modification of RNA. It plays a crucial role in RNA splicing, RNA stability and innate immunity. Despite advances in high-throughput detection, the chemical stability of 2OM makes it difficult to detect and map in messenger RNA. Therefore, bioinformatics tools have been developed using machine learning (ML) algorithms to identify 2OM sites. These tools have made significant progress, but their performances remain unsatisfactory and need further improvement. In this study, we introduced H2Opred, a novel hybrid deep learning (HDL) model for accurately identifying 2OM sites in human RNA. Notably, this is the first application of HDL in developing four nucleotide-specific models [adenine (A2OM), cytosine (C2OM), guanine (G2OM) and uracil (U2OM)] as well as a generic model (N2OM). H2Opred incorporated both stacked 1D convolutional neural network (1D-CNN) blocks and stacked attention-based bidirectional gated recurrent unit (Bi-GRU-Att) blocks. 1D-CNN blocks learned effective feature representations from 14 conventional descriptors, while Bi-GRU-Att blocks learned feature representations from five natural language processing-based embeddings extracted from RNA sequences. H2Opred integrated these feature representations to make the final prediction. Rigorous cross-validation analysis demonstrated that H2Opred consistently outperforms conventional ML-based single-feature models on five different datasets. Moreover, the generic model of H2Opred demonstrated a remarkable performance on both training and testing datasets, significantly outperforming the existing predictor and other four nucleotide-specific H2Opred models. To enhance accessibility and usability, we have deployed a user-friendly web server for H2Opred, accessible at https://balalab-skku.org/H2Opred/. This platform will serve as an invaluable tool for accurately predicting 2OM sites within human RNA, thereby facilitating broader applications in relevant research endeavors.
Collapse
Affiliation(s)
- Nhat Truong Pham
- Department of Integrative Biotechnology, College of Biotechnology and Bioengineering, Sungkyunkwan University, Suwon, 16419, Republic of Korea
| | - Rajan Rakkiyapan
- Department of Mathematics, Bharathiar University, Coimbatore - 641046, Tamil Nadu, India
| | - Jongsun Park
- InfoBoss inc. and InfoBoss Research Center, Gangnam-gu, Seoul 06278, Republic of Korea
| | - Adeel Malik
- Institute of Intelligence Informatics Technology, Sangmyung University, Seoul, 03016, Republic of Korea
| | - Balachandran Manavalan
- Department of Integrative Biotechnology, College of Biotechnology and Bioengineering, Sungkyunkwan University, Suwon, 16419, Republic of Korea
| |
Collapse
|
13
|
Yu X, Hu J, Zhang Y. SNN6mA: Improved DNA N6-methyladenine site prediction using Siamese network-based feature embedding. Comput Biol Med 2023; 166:107533. [PMID: 37793205 DOI: 10.1016/j.compbiomed.2023.107533] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/13/2023] [Revised: 09/01/2023] [Accepted: 09/27/2023] [Indexed: 10/06/2023]
Abstract
DNA N6-methyladenine (6mA) is one of the most common and abundant modifications, which plays essential roles in various biological processes and cellular functions. Therefore, the accurate identification of DNA 6mA sites is of great importance for a better understanding of its regulatory mechanisms and biological functions. Although significant progress has been made, there still has room for further improvement in 6mA site prediction in DNA sequences. In this study, we report a smart but accurate 6mA predictor, termed as SNN6mA, using Siamese network. To be specific, DNA segments are firstly encoded into feature vectors using the one-hot encoding scheme; then, these original feature vectors are mapped to a low-dimensional embedding space derived from Siamese network to capture more discriminative features; finally, the obtained low-dimensional features are fed to a fully connected neural network to perform final prediction. Stringent benchmarking tests on the datasets of two species demonstrated that the proposed SNN6mA is superior to the state-of-the-art 6mA predictors. Detailed data analyses show that the major advantage of SNN6mA lies in the utilization of Siamese network, which can map the original features into a low-dimensional embedding space with more discriminative capability. In summary, the proposed SNN6mA is the first attempt to use Siamese network for 6mA site prediction and could be easily extended to predict other types of modifications. The codes and datasets used in the study are freely available at https://github.com/YuXuan-Glasgow/SNN6mA for academic use.
Collapse
Affiliation(s)
- Xuan Yu
- Glasgow College, University of Electronic Science and Technology of China, Chengdu, 611731, China
| | - Jun Hu
- College of Information Engineering, Zhejiang University of Technology, Hangzhou, 310023, China
| | - Ying Zhang
- School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing, 210094, China.
| |
Collapse
|
14
|
Wang Z, Liang S, Liu S, Meng Z, Wang J, Liang S. Sequence pre-training-based graph neural network for predicting lncRNA-miRNA associations. Brief Bioinform 2023; 24:bbad317. [PMID: 37651605 DOI: 10.1093/bib/bbad317] [Citation(s) in RCA: 12] [Impact Index Per Article: 6.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/05/2023] [Revised: 06/28/2023] [Accepted: 08/15/2023] [Indexed: 09/02/2023] Open
Abstract
MicroRNAs (miRNAs) silence genes by binding to messenger RNAs, whereas long non-coding RNAs (lncRNAs) act as competitive endogenous RNAs (ceRNAs) that can relieve miRNA silencing effects and upregulate target gene expression. The ceRNA association between lncRNAs and miRNAs has been a research hotspot due to its medical importance, but it is challenging to verify experimentally. In this paper, we propose a novel deep learning scheme, i.e. sequence pre-training-based graph neural network (SPGNN), that combines pre-training and fine-tuning stages to predict lncRNA-miRNA associations from RNA sequences and the existing interactions represented as a graph. First, we utilize a sequence-to-vector technique to generate pre-trained embeddings based on the sequences of all RNAs during the pre-training stage. In the fine-tuning stage, we use Graph Neural Network to learn node representations from the heterogeneous graph constructed using lncRNA-miRNA association information. We evaluate our proposed scheme SPGNN on our newly collected animal lncRNA-miRNA association dataset and demonstrate that combining the $k$-mer technique and Doc2vec model for pre-training with the Simple Graph Convolution Network for fine-tuning is effective in predicting lncRNA-miRNA associations. Our approach outperforms state-of-the-art baselines across various evaluation metrics. We also conduct an ablation study and hyperparameter analysis to verify the effectiveness of each component and parameter of our scheme. The complete code and dataset are available on GitHub: https://github.com/zixwang/SPGNN.
Collapse
Affiliation(s)
- Zixiao Wang
- Mohamed bin Zayed University of Artificial Intelligence, Masdar City, UAE
| | - Shiyang Liang
- Department of Gastroenterology, Tangdu Hospital, Air Force Medical University, 569 Xinsi Road, Xi'an 710038, Shaanxi, China
| | - Siwei Liu
- Mohamed bin Zayed University of Artificial Intelligence, Masdar City, UAE
| | | | - Jingjie Wang
- Department of Gastroenterology, Tangdu Hospital, Air Force Medical University, 569 Xinsi Road, Xi'an 710038, Shaanxi, China
| | - Shangsong Liang
- Mohamed bin Zayed University of Artificial Intelligence, Masdar City, UAE
| |
Collapse
|
15
|
Hu J, Tang YX, Zhou Y, Li Z, Rao B, Zhang GJ. Improving DNA 6mA Site Prediction via Integrating Bidirectional Long Short-Term Memory, Convolutional Neural Network, and Self-Attention Mechanism. J Chem Inf Model 2023; 63:5689-5700. [PMID: 37603823 DOI: 10.1021/acs.jcim.3c00698] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 08/23/2023]
Abstract
Identifying DNA N6-methyladenine (6mA) sites is significantly important to understanding the function of DNA. Many deep learning-based methods have been developed to improve the performance of 6mA site prediction. In this study, to further improve the performance of 6mA site prediction, we propose a new meta method, called Co6mA, to integrate bidirectional long short-term memory (BiLSTM), convolutional neural networks (CNNs), and self-attention mechanisms (SAM) via assembling two different deep learning-based models. The first model developed in this study is called CBi6mA, which is composed of CNN, BiLSTM, and fully connected modules. The second model is borrowed from LA6mA, which is an existing 6mA prediction method based on BiLSTM and SAM modules. Experimental results on two independent testing sets of different model organisms, i.e., Arabidopsis thaliana and Drosophila melanogaster, demonstrate that Co6mA can achieve an average accuracy of 91.8%, covering 89% of all 6mA samples while achieving an average Matthews correlation coefficient value (0.839), which is higher than the second-best method DeepM6A.
Collapse
Affiliation(s)
- Jun Hu
- College of Information Engineering, Zhejiang University of Technology, Hangzhou 310023, China
| | - Yu-Xuan Tang
- College of Information Engineering, Zhejiang University of Technology, Hangzhou 310023, China
| | - Yu Zhou
- College of Information Engineering, Zhejiang University of Technology, Hangzhou 310023, China
| | - Zhe Li
- College of Information Engineering, Zhejiang University of Technology, Hangzhou 310023, China
| | - Bing Rao
- School of Information and Electrical Engineering, Hangzhou City University, Hangzhou City University, Hangzhou 310015, China
| | - Gui-Jun Zhang
- College of Information Engineering, Zhejiang University of Technology, Hangzhou 310023, China
| |
Collapse
|
16
|
Zhang Y, Ge F, Li F, Yang X, Song J, Yu DJ. Prediction of Multiple Types of RNA Modifications via Biological Language Model. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2023; 20:3205-3214. [PMID: 37289599 DOI: 10.1109/tcbb.2023.3283985] [Citation(s) in RCA: 7] [Impact Index Per Article: 3.5] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/10/2023]
Abstract
It has been demonstrated that RNA modifications play essential roles in multiple biological processes. Accurate identification of RNA modifications in the transcriptome is critical for providing insights into the biological functions and mechanisms. Many tools have been developed for predicting RNA modifications at single-base resolution, which employ conventional feature engineering methods that focus on feature design and feature selection processes that require extensive biological expertise and may introduce redundant information. With the rapid development of artificial intelligence technologies, end-to-end methods are favorably received by researchers. Nevertheless, each well-trained model is only suitable for a specific RNA methylation modification type for nearly all of these approaches. In this study, we present MRM-BERT by feeding task-specific sequences into the powerful BERT (Bidirectional Encoder Representations from Transformers) model and implementing fine-tuning, which exhibits competitive performance to the state-of-the-art methods. MRM-BERT avoids repeated de novo training of the model and can predict multiple RNA modifications such as pseudouridine, m6A, m5C, and m1A in Mus musculus, Arabidopsis thaliana, and Saccharomyces cerevisiae. In addition, we analyse the attention heads to provide high attention regions for the prediction, and conduct saturated in silico mutagenesis of the input sequences to discover potential changes of RNA modifications, which can better assist researchers in their follow-up research.
Collapse
|
17
|
Zhang Z, Li F, Zhao J, Zheng C. CapsNetYY1: identifying YY1-mediated chromatin loops based on a capsule network architecture. BMC Genomics 2023; 24:448. [PMID: 37559017 PMCID: PMC10410878 DOI: 10.1186/s12864-023-09217-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/29/2022] [Accepted: 02/28/2023] [Indexed: 08/11/2023] Open
Abstract
BACKGROUND Previous studies have identified that chromosome structure plays a very important role in gene control. The transcription factor Yin Yang 1 (YY1), a multifunctional DNA binding protein, could form a dimer to mediate chromatin loops and active enhancer-promoter interactions. The deletion of YY1 or point mutations at the YY1 binding sites significantly inhibit the enhancer-promoter interactions and affect gene expression. To date, only a few computational methods are available for identifying YY1-mediated chromatin loops. RESULTS We proposed a novel model named CapsNetYY1, which was based on capsule network architecture to identify whether a pair of YY1 motifs can form a chromatin loop. Firstly, we encode the DNA sequence using one-hot encoding method. Secondly, multi-scale convolution layer is used to extract local features of the sequence, and bidirectional gated recurrent unit is used to learn the features across time steps. Finally, capsule networks (convolution capsule layer and digital capsule layer) used to extract higher level features and recognize YY1-mediated chromatin loops. Compared with DeepYY1, the only prediction for YY1-mediated chromatin loops, our model CapsNetYY1 achieved the better performance on the independent datasets (AUC [Formula: see text]). CONCLUSION The results indicate that CapsNetYY1 is an excellent method for identifying YY1-mediated chromatin loops. We believe that the CapsNetYY1 method will be used for predictive classification of other DNA sequences.
Collapse
Affiliation(s)
- Zhimin Zhang
- College of Mathematics and System Sciences, Xinjiang University, Urumqi, China
| | - Fenglin Li
- College of Mathematics and System Sciences, Xinjiang University, Urumqi, China
| | - Jianping Zhao
- College of Mathematics and System Sciences, Xinjiang University, Urumqi, China.
| | - Chunhou Zheng
- Key Laboratory of Intelligent Computing and Signal Processing of Ministry of Education, Information Materials and Intelligent Sensing Laboratory of Anhui Province, and School of Artificial Intelligence, Anhui University, Hefei, China.
| |
Collapse
|
18
|
Wang LS, Sun ZL. iDHS-FFLG: Identifying DNase I Hypersensitive Sites by Feature Fusion and Local-Global Feature Extraction Network. Interdiscip Sci 2023; 15:155-170. [PMID: 36166165 DOI: 10.1007/s12539-022-00538-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/07/2022] [Revised: 09/12/2022] [Accepted: 09/12/2022] [Indexed: 05/01/2023]
Abstract
The DNase I hypersensitive sites (DHSs) are active regions on chromatin that have been found to be highly sensitive to DNase I. These regions contain various cis-regulatory elements, including promoters, enhancers and silencers. Accurate identification of DHSs helps researchers better understand the transcriptional machinery of DNA and deepen the knowledge of functional DNA elements in non-coding sequences. Researchers have developed many methods based on traditional experiments and machine learning to identify DHSs. However, low prediction accuracy and robustness limit their application in genetics research. In this paper, a novel computational approach based on deep learning is proposed by feature fusion and local-global feature extraction network to identify DHSs in mouse, named iDHS-FFLG. First of all, multiple binary features of nucleotides are fused to better express sequence information. Then, a network consisting of the convolutional neural network (CNN), bidirectional long short-term memory (BiLSTM) and self-attention mechanism is designed to extract local features and global contextual associations. In the end, the prediction module is applied to distinguish between DHSs and non-DHSs. The results of several experiments demonstrate the superior performances of iDHS-FFLG compared to the latest methods.
Collapse
Affiliation(s)
- Lei-Shan Wang
- Information Materials and Intelligent Sensing Laboratory of Anhui Province, Anhui University, Hefei, 230601, Anhui, China
- School of Electrical Engineering and Automation, Anhui University, Hefei, 230601, Anhui, China
| | - Zhan-Li Sun
- Information Materials and Intelligent Sensing Laboratory of Anhui Province, Anhui University, Hefei, 230601, Anhui, China.
- School of Electrical Engineering and Automation, Anhui University, Hefei, 230601, Anhui, China.
| |
Collapse
|
19
|
Fan XQ, Lin B, Hu J, Guo ZY. I-DNAN6mA: Accurate Identification of DNA N 6-Methyladenine Sites Using the Base-Pairing Map and Deep Learning. J Chem Inf Model 2023; 63:1076-1086. [PMID: 36722621 DOI: 10.1021/acs.jcim.2c01465] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 02/02/2023]
Abstract
The recent discovery of numerous DNA N6-methyladenine (6mA) sites has transformed our perception about the roles of 6mA in living organisms. However, our ability to understand them is hampered by our inability to identify 6mA sites rapidly and cost-efficiently by existing experimental methods. Developing a novel method to quickly and accurately identify 6mA sites is critical for speeding up the progress of its function detection and understanding. In this study, we propose a novel computational method, called I-DNAN6mA, to identify 6mA sites and complement experimental methods well, by leveraging the base-pairing rules and a well-designed three-stage deep learning model with pairwise inputs. The performance of our proposed method is benchmarked and evaluated on four species, i.e., Arabidopsis thaliana, Drosophila melanogaster, Rice, and Rosaceae. The experimental results demonstrate that I-DNAN6mA achieves area under the receiver operating characteristic curve values of 0.967, 0.963, 0.947, 0.976, and 0.990, accuracies of 91.5, 92.7, 88.2, 0.938, and 96.2%, and Mathew's correlation coefficient values of 0.855, 0.831, 0.763, 0.877, and 0.924 on five benchmark data sets, respectively, and outperforms several existing state-of-the-art methods. To our knowledge, I-DNAN6mA is the first approach to identify 6mA sites using a novel image-like representation of DNA sequences and a deep learning model with pairwise inputs. I-DNAN6mA is expected to be useful for locating functional regions of DNA.
Collapse
Affiliation(s)
- Xue-Qiang Fan
- School of Computer and Information, Hefei University of Technology, Hefei230009, China
| | - Bing Lin
- School of Computer and Information, Hefei University of Technology, Hefei230009, China
| | - Jun Hu
- College of Information Engineering, Zhejiang University of Technology, Hangzhou310023, China
| | - Zhong-Yi Guo
- School of Computer and Information, Hefei University of Technology, Hefei230009, China
| |
Collapse
|
20
|
Zeng W, Gautam A, Huson DH. MuLan-Methyl-multiple transformer-based language models for accurate DNA methylation prediction. Gigascience 2022; 12:giad054. [PMID: 37489753 PMCID: PMC10367125 DOI: 10.1093/gigascience/giad054] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/19/2023] [Revised: 05/09/2023] [Accepted: 07/18/2023] [Indexed: 07/26/2023] Open
Abstract
Transformer-based language models are successfully used to address massive text-related tasks. DNA methylation is an important epigenetic mechanism, and its analysis provides valuable insights into gene regulation and biomarker identification. Several deep learning-based methods have been proposed to identify DNA methylation, and each seeks to strike a balance between computational effort and accuracy. Here, we introduce MuLan-Methyl, a deep learning framework for predicting DNA methylation sites, which is based on 5 popular transformer-based language models. The framework identifies methylation sites for 3 different types of DNA methylation: N6-adenine, N4-cytosine, and 5-hydroxymethylcytosine. Each of the employed language models is adapted to the task using the "pretrain and fine-tune" paradigm. Pretraining is performed on a custom corpus of DNA fragments and taxonomy lineages using self-supervised learning. Fine-tuning aims at predicting the DNA methylation status of each type. The 5 models are used to collectively predict the DNA methylation status. We report excellent performance of MuLan-Methyl on a benchmark dataset. Moreover, we argue that the model captures characteristic differences between different species that are relevant for methylation. This work demonstrates that language models can be successfully adapted to applications in biological sequence analysis and that joint utilization of different language models improves model performance. Mulan-Methyl is open source, and we provide a web server that implements the approach.
Collapse
Affiliation(s)
- Wenhuan Zeng
- Algorithms in Bioinformatics, Institute for Bioinformatics and Medical Informatics, University of Tübingen, 72076 Tübingen, Germany
| | - Anupam Gautam
- Algorithms in Bioinformatics, Institute for Bioinformatics and Medical Informatics, University of Tübingen, 72076 Tübingen, Germany
- International Max Planck Research School “From Molecules to Organisms”, Max Planck Institute for Biology Tübingen, 72076 Tübingen, Germany
- Cluster of Excellence: EXC 2124: Controlling Microbes to Fight Infection, University of Tübingen, 72076 Tübingen, Germany
| | - Daniel H Huson
- Algorithms in Bioinformatics, Institute for Bioinformatics and Medical Informatics, University of Tübingen, 72076 Tübingen, Germany
- International Max Planck Research School “From Molecules to Organisms”, Max Planck Institute for Biology Tübingen, 72076 Tübingen, Germany
- Cluster of Excellence: EXC 2124: Controlling Microbes to Fight Infection, University of Tübingen, 72076 Tübingen, Germany
| |
Collapse
|
21
|
Yang X, Wu P, Wang Z, Su X, Wu Z, Ma X, Wu F, Zhang D. Constructed the ceRNA network and predicted a FEZF1-AS1/miR-92b-3p/ZIC5 axis in colon cancer. Mol Cell Biochem 2022; 478:1083-1097. [PMID: 36219353 DOI: 10.1007/s11010-022-04578-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/16/2022] [Accepted: 09/26/2022] [Indexed: 10/17/2022]
Abstract
The purpose of this study was to identify the role of FEZF1-AS1 in colon cancer and predicted the underlying mechanism. We extracted sequencing data of colon cancer patients from The Cancer Genome Atlas database, identified the differential expression of long noncoding RNA, microRNA, and messenger RNA, constructed a competitive endogenous RNA network, and then analyzed prognosis. Then, we used the enrichment analysis databases for functional analysis. Finally, we studied the FEZF1-AS1/miR-92b-3p/ZIC5 axis. We detected the expression of FEZF1-AS1, miR-92b-3p, and ZIC5 via quantitative reverse transcription-PCR, transfected colon cancer cell RKO with lentivirus and conducted FEZF1-AS1 knockdown, and performed cancer-related functional assays. It indicated that many RNA in the competitive endogenous RNA network, such as ZIC5, were predicted to be related to overall survival of colon cancer patients, and enrichment analysis showed cancer-related signaling pathways, such as PI3K/AKT signaling pathway. The expression of FEZF1-AS1 and ZIC5 was significantly higher and that of miR-92b-3p was lower in the colon cancer than in the normal colon tissues. FEZF1-AS1 promoted the migration, proliferation, as well as invasion of RKO. According to the prediction, FEZF1-AS1 and ZIC5 might competitively bind to miR-92b-3p, leading to the weakening of the inhibitory impact of miR-92b-3p on ZIC5 and increasing expression of ZIC5, thus further activating the PI3K/AKT signaling pathway, which led to the occurrence and development of colon cancer. The study suggested that FEZF1-AS1 might be an effective diagnosis biomarker for colon cancer.
Collapse
Affiliation(s)
- Xiaoping Yang
- Key Laboratory of Digestive Diseases of Gansu Province, Lanzhou University Second Hospital, Lanzhou, 730030, China.,The Second Clinical Medical College, Lanzhou University, Lanzhou, 730030, China
| | - Pingfan Wu
- Department of Pathology, The 940th Hospital of the Joint Logistic Support of the People's Liberation Army, Lanzhou, 730050, China
| | - Zirui Wang
- The Second Clinical Medical College, Lanzhou University, Lanzhou, 730030, China
| | - Xiaolu Su
- Department of Pathology, Lanzhou University Second Hospital, Lanzhou, 730030, China
| | - Zhiping Wu
- Department of Gastroenterology, Lanzhou University Second Hospital, Lanzhou, 730030, China
| | - Xueni Ma
- Key Laboratory of Digestive Diseases of Gansu Province, Lanzhou University Second Hospital, Lanzhou, 730030, China.,The Second Clinical Medical College, Lanzhou University, Lanzhou, 730030, China
| | - Fanqi Wu
- Key Laboratory of Digestive Diseases of Gansu Province, Lanzhou University Second Hospital, Lanzhou, 730030, China.,Department of Respiratory, Lanzhou University Second Hospital, Lanzhou, 730030, China
| | - Dekui Zhang
- Key Laboratory of Digestive Diseases of Gansu Province, Lanzhou University Second Hospital, Lanzhou, 730030, China. .,Department of Gastroenterology, Lanzhou University Second Hospital, Lanzhou, 730030, China.
| |
Collapse
|
22
|
Shao Z, Wang T, Qiao J, Zhang Y, Huang S, Zeng P. A comprehensive comparison of multilocus association methods with summary statistics in genome-wide association studies. BMC Bioinformatics 2022; 23:359. [PMID: 36042399 PMCID: PMC9429742 DOI: 10.1186/s12859-022-04897-3] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/27/2022] [Accepted: 08/22/2022] [Indexed: 02/07/2023] Open
Abstract
BACKGROUND Multilocus analysis on a set of single nucleotide polymorphisms (SNPs) pre-assigned within a gene constitutes a valuable complement to single-marker analysis by aggregating data on complex traits in a biologically meaningful way. However, despite the existence of a wide variety of SNP-set methods, few comprehensive comparison studies have been previously performed to evaluate the effectiveness of these methods. RESULTS We herein sought to fill this knowledge gap by conducting a comprehensive empirical comparison for 22 commonly-used summary-statistics based SNP-set methods. We showed that only seven methods could effectively control the type I error, and that these well-calibrated approaches had varying power performance under the simulation scenarios. Overall, we confirmed that the burden test was generally underpowered and score-based variance component tests (e.g., sequence kernel association test) were much powerful under the polygenic genetic architecture in both common and rare variant association analyses. We further revealed that two linkage-disequilibrium-free P value combination methods (e.g., harmonic mean P value method and aggregated Cauchy association test) behaved very well under the sparse genetic architecture in simulations and real-data applications to common and rare variant association analyses as well as in expression quantitative trait loci weighted integrative analysis. We also assessed the scalability of these approaches by recording computational time and found that all these methods can be scalable to biobank-scale data although some might be relatively slow. CONCLUSION In conclusion, we hope that our findings can offer an important guidance on how to choose appropriate multilocus association analysis methods in post-GWAS era. All the SNP-set methods are implemented in the R package called MCA, which is freely available at https://github.com/biostatpzeng/ .
Collapse
Affiliation(s)
- Zhonghe Shao
- Department of Biostatistics, School of Public Health, Xuzhou Medical University, Xuzhou, 221004, Jiangsu, China
| | - Ting Wang
- Department of Biostatistics, School of Public Health, Xuzhou Medical University, Xuzhou, 221004, Jiangsu, China
| | - Jiahao Qiao
- Department of Biostatistics, School of Public Health, Xuzhou Medical University, Xuzhou, 221004, Jiangsu, China
| | - Yuchen Zhang
- Department of Biostatistics, School of Public Health, Xuzhou Medical University, Xuzhou, 221004, Jiangsu, China
| | - Shuiping Huang
- Department of Biostatistics, School of Public Health, Xuzhou Medical University, Xuzhou, 221004, Jiangsu, China
- Center for Medical Statistics and Data Analysis, Xuzhou Medical University, Xuzhou, 221004, Jiangsu, China
- Key Laboratory of Human Genetics and Environmental Medicine, Xuzhou Medical University, Xuzhou, 221004, Jiangsu, China
- Key Laboratory of Environment and Health, Xuzhou Medical University, Xuzhou, 221004, Jiangsu, China
- Engineering Research Innovation Center of Biological Data Mining and Healthcare Transformation, Xuzhou Medical University, Xuzhou, 221004, Jiangsu, China
| | - Ping Zeng
- Department of Biostatistics, School of Public Health, Xuzhou Medical University, Xuzhou, 221004, Jiangsu, China.
- Center for Medical Statistics and Data Analysis, Xuzhou Medical University, Xuzhou, 221004, Jiangsu, China.
- Key Laboratory of Human Genetics and Environmental Medicine, Xuzhou Medical University, Xuzhou, 221004, Jiangsu, China.
- Key Laboratory of Environment and Health, Xuzhou Medical University, Xuzhou, 221004, Jiangsu, China.
- Engineering Research Innovation Center of Biological Data Mining and Healthcare Transformation, Xuzhou Medical University, Xuzhou, 221004, Jiangsu, China.
| |
Collapse
|
23
|
Kan Y, Feng L, Si Y, Zhou Z, Wang W, Yang J. Pathogenesis and Therapeutic Targets of Focal Cortical Dysplasia Based on Bioinformatics Analysis. Neurochem Res 2022; 47:3506-3521. [PMID: 35945307 DOI: 10.1007/s11064-022-03715-9] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/28/2022] [Revised: 07/18/2022] [Accepted: 07/29/2022] [Indexed: 11/30/2022]
Abstract
Focal cortical dysplasia (FCD), a malformation of cortical development, is the most common cause of intractable epilepsy in children. However, the causes and underlying molecular events of FCD need further investigation. The microarray dataset GSE62019 and GSE97365 were obtained from Gene Expression Omnibus. To examine critical genes and signaling pathways, bioinformatics analysis tools such as protein-protein interaction (PPI) networks, miRNA-mRNA interaction networks, and immune infiltration in FCD samples were used to fully elucidate the pathogenesis of FCD. A total of 534 differentially expressed genes (DEGs) and 71 differentially expressed miRNAs (DEMs) were obtained. The DEGs obtained were enriched in ribosomal, protein targeting, and pathways of neurodegeneration multiple diseases, whereas the target genes of DEMs were enriched in signaling pathways such as transforming growth factor beta, Wnt, PI3K-Akt, etc. Finally, four hub genes (RPL11, FAU, RPS20, RPL27) and five key miRNAs (hsa-let-7b, hsa-miR-185, hsa-miR-23b, hsa-miR-222 and hsa-miR-92b) were obtained by PPI network, miRNA-mRNA network, and ROC analysis. The immune infiltration results showed that the infiltration levels of five immune cells (MDSC, regulatory T cells, activated CD8+ T cells, macrophage and effector memory CD8+ T cells) were slightly higher in FCD samples than in control samples. Moreover, the gene expressions of RPS19, RPL19, and RPS24 were highly correlated with the infiltration levels and immune characteristics of 28 immune cells. It broadens the understanding of the molecular mechanisms underlying the development of FCD and enlightens the identification of molecular targets and diagnostic biomarkers for FCD.
Collapse
Affiliation(s)
- Ying Kan
- Department of Nuclear Medicine, Beijing Friendship Hospital, Capital Medical University, Beijing, 100050, China
| | - Lijuan Feng
- Department of Nuclear Medicine, Beijing Friendship Hospital, Capital Medical University, Beijing, 100050, China
| | - Yukun Si
- Department of Nuclear Medicine, Beijing Friendship Hospital, Capital Medical University, Beijing, 100050, China
| | - Ziang Zhou
- Department of Nuclear Medicine, Beijing Friendship Hospital, Capital Medical University, Beijing, 100050, China
| | - Wei Wang
- Department of Nuclear Medicine, Beijing Friendship Hospital, Capital Medical University, Beijing, 100050, China
| | - Jigang Yang
- Department of Nuclear Medicine, Beijing Friendship Hospital, Capital Medical University, Beijing, 100050, China.
| |
Collapse
|
24
|
Predicting RNA solvent accessibility from multi-scale context feature via multi-shot neural network. Anal Biochem 2022; 654:114802. [PMID: 35809650 DOI: 10.1016/j.ab.2022.114802] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/21/2022] [Revised: 06/11/2022] [Accepted: 06/28/2022] [Indexed: 11/24/2022]
Abstract
Knowledge of RNA solvent accessibility has recently become attractive due to the increasing awareness of its importance for key biological process. Accurately predicting the solvent accessibility of RNA is crucial for understanding its 3D structure and biological function. In this study, we develop a novel computational method, termed M2pred, for accurately predicting the solvent accessibility of RNA from sequence-based multi-scale context feature. In M2pred, three single-view features, i.e., base-pairing probabilities, position-specific frequency matrix, and a binary one-hot encoding, are first generated as three feature sources, and immediately concatenated to engender a super feature. Secondly, for the super feature, the matrix-format features of each nucleotide are extracted using an initialized sliding window technique, and regularly stacked into a cube-format feature. Then, using multi-scale context feature extraction strategy, a pyramid feature constructed of contextual feature of four scales related to target nucleotides is extracted from the cube-format feature. Finally, a customized multi-shot neural network framework, which is equipped with four different scales of receptive fields mainly integrating several residual attention blocks, is designed to dig discrimination information from the contextual pyramid feature. Experimental results demonstrate that the proposed M2pred achieve a high prediction performance and outperforms existing state-of-the-art prediction methods of RNA solvent accessibility.
Collapse
|
25
|
Machine learning in neuro-oncology: toward novel development fields. J Neurooncol 2022; 159:333-346. [PMID: 35761160 DOI: 10.1007/s11060-022-04068-7] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/17/2022] [Accepted: 06/11/2022] [Indexed: 10/17/2022]
Abstract
PURPOSE Artificial Intelligence (AI) involves several and different techniques able to elaborate a large amount of data responding to a specific planned outcome. There are several possible applications of this technology in neuro-oncology. METHODS We reviewed, according to PRISMA guidelines, available studies adopting AI in different fields of neuro-oncology including neuro-radiology, pathology, surgery, radiation therapy, and systemic treatments. RESULTS Neuro-radiology presented the major number of studies assessing AI. However, this technology is being successfully tested also in other operative settings including surgery and radiation therapy. In this context, AI shows to significantly reduce resources and costs maintaining an elevated qualitative standard. Pathological diagnosis and development of novel systemic treatments are other two fields in which AI showed promising preliminary data. CONCLUSION It is likely that AI will be quickly included in some aspects of daily clinical practice. Possible applications of these techniques are impressive and cover all aspects of neuro-oncology.
Collapse
|
26
|
Li H, Zhang N, Wang Y, Xia S, Zhu Y, Xing C, Tian X, Du Y. DNA N6-Methyladenine Modification in Eukaryotic Genome. Front Genet 2022; 13:914404. [PMID: 35812743 PMCID: PMC9263368 DOI: 10.3389/fgene.2022.914404] [Citation(s) in RCA: 16] [Impact Index Per Article: 5.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/06/2022] [Accepted: 06/08/2022] [Indexed: 11/18/2022] Open
Abstract
DNA methylation is treated as an important epigenetic mark in various biological activities. In the past, a large number of articles focused on 5 mC while lacking attention to N6-methyladenine (6 mA). The presence of 6 mA modification was previously discovered only in prokaryotes. Recently, with the development of detection technologies, 6 mA has been found in several eukaryotes, including protozoans, metazoans, plants, and fungi. The importance of 6 mA in prokaryotes and single-celled eukaryotes has been widely accepted. However, due to the incredibly low density of 6 mA and restrictions on detection technologies, the prevalence of 6 mA and its role in biological processes in eukaryotic organisms are highly debated. In this review, we first summarize the advantages and disadvantages of 6 mA detection methods. Then, we conclude existing reports on the prevalence of 6 mA in eukaryotic organisms. Next, we highlight possible methyltransferases, demethylases, and the recognition proteins of 6 mA. In addition, we summarize the functions of 6 mA in eukaryotes. Last but not least, we summarize our point of view and put forward the problems that need further research.
Collapse
Affiliation(s)
- Hao Li
- School of Basic Medical Sciences, Anhui Medical University, Hefei, China
- First School of Clinical Medicine, Anhui Medical University, Hefei, China
- First Affiliated Hospital of Anhui Medical University, Hefei, China
| | - Ning Zhang
- School of Basic Medical Sciences, Anhui Medical University, Hefei, China
- First School of Clinical Medicine, Anhui Medical University, Hefei, China
- First Affiliated Hospital of Anhui Medical University, Hefei, China
| | - Yuechen Wang
- School of Basic Medical Sciences, Anhui Medical University, Hefei, China
- Second School of Clinical Medicine, Anhui Medical University, Hefei, China
| | - Siyuan Xia
- School of Basic Medical Sciences, Anhui Medical University, Hefei, China
- Second School of Clinical Medicine, Anhui Medical University, Hefei, China
| | - Yating Zhu
- School of Basic Medical Sciences, Anhui Medical University, Hefei, China
| | - Chen Xing
- School of Basic Medical Sciences, Anhui Medical University, Hefei, China
| | - Xuefeng Tian
- School of Basic Medical Sciences, Anhui Medical University, Hefei, China
- First School of Clinical Medicine, Anhui Medical University, Hefei, China
| | - Yinan Du
- School of Basic Medical Sciences, Anhui Medical University, Hefei, China
- *Correspondence: Yinan Du,
| |
Collapse
|
27
|
Monteiro NRC, Simões CJV, Ávila HV, Abbasi M, Oliveira JL, Arrais JP. Explainable deep drug-target representations for binding affinity prediction. BMC Bioinformatics 2022; 23:237. [PMID: 35715734 PMCID: PMC9204982 DOI: 10.1186/s12859-022-04767-y] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/02/2022] [Accepted: 05/25/2022] [Indexed: 11/10/2022] Open
Abstract
Background Several computational advances have been achieved in the drug discovery field, promoting the identification of novel drug–target interactions and new leads. However, most of these methodologies have been overlooking the importance of providing explanations to the decision-making process of deep learning architectures. In this research study, we explore the reliability of convolutional neural networks (CNNs) at identifying relevant regions for binding, specifically binding sites and motifs, and the significance of the deep representations extracted by providing explanations to the model’s decisions based on the identification of the input regions that contributed the most to the prediction. We make use of an end-to-end deep learning architecture to predict binding affinity, where CNNs are exploited in their capacity to automatically identify and extract discriminating deep representations from 1D sequential and structural data. Results The results demonstrate the effectiveness of the deep representations extracted from CNNs in the prediction of drug–target interactions. CNNs were found to identify and extract features from regions relevant for the interaction, where the weight associated with these spots was in the range of those with the highest positive influence given by the CNNs in the prediction. The end-to-end deep learning model achieved the highest performance both in the prediction of the binding affinity and on the ability to correctly distinguish the interaction strength rank order when compared to baseline approaches. Conclusions This research study validates the potential applicability of an end-to-end deep learning architecture in the context of drug discovery beyond the confined space of proteins and ligands with determined 3D structure. Furthermore, it shows the reliability of the deep representations extracted from the CNNs by providing explainability to the decision-making process. Supplementary Information The online version contains supplementary material available at 10.1186/s12859-022-04767-y.
Collapse
Affiliation(s)
- Nelson R C Monteiro
- Univ Coimbra, Centre for Informatics and Systems of the University of Coimbra, Department of Informatics Engineering, Coimbra, Portugal.
| | | | - Henrique V Ávila
- Univ Coimbra, Centre for Informatics and Systems of the University of Coimbra, Department of Informatics Engineering, Coimbra, Portugal
| | - Maryam Abbasi
- Univ Coimbra, Centre for Informatics and Systems of the University of Coimbra, Department of Informatics Engineering, Coimbra, Portugal
| | - José L Oliveira
- IEETA, Department of Electronics, Telecommunications and Informatics, University of Aveiro, Aveiro, Portugal
| | - Joel P Arrais
- Univ Coimbra, Centre for Informatics and Systems of the University of Coimbra, Department of Informatics Engineering, Coimbra, Portugal
| |
Collapse
|
28
|
Network pharmacology combined with GEO database identifying the mechanisms and molecular targets of Polygoni Cuspidati Rhizoma on Peri-implants. Sci Rep 2022; 12:8227. [PMID: 35581339 PMCID: PMC9114011 DOI: 10.1038/s41598-022-12366-3] [Citation(s) in RCA: 7] [Impact Index Per Article: 2.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/03/2022] [Accepted: 05/10/2022] [Indexed: 11/08/2022] Open
Abstract
Peri-implants is a chronic disease leads to the bone resorption and loss of implants. Polygoni Cuspidati Rhizoma (PCRER), a traditional Chinese herbal has been used to treat diseases of bone metabolism. However, its mechanism of anti-bone absorption still remains unknown. We aimed to identify its molecular target and the mechanism involved in PCRER potential treatment theory to Peri-implants by network pharmacology. The active ingredients of PCRER and potential disease-related targets were retrieved from TCMSP, Swiss Target Prediction, SEA databases and then combined with the Peri-implants disease differential genes obtained in the GEO microarray database. The crossed genes were used to protein–protein interaction (PPI) construction and Gene Ontology (GO) and KEGG enrichment analysis. Using STRING database and Cytoscape plug-in to build protein interaction network and screen the hub genes and verified through molecular docking by AutoDock vina software. A total of 13 active compounds and 90 cross targets of PCRER were selected for analysis. The GO and KEGG enrichment analysis indicated that the anti-Peri-implants targets of PCRER mainly play a role in the response in IL-17 signaling, Calcium signaling pathway, Toll-like receptor signaling pathway, TNF signaling pathway among others. And CytoHubba screened ten hub genes (MMP9, IL6, MPO, IL1B, SELL, IFNG, CXCL8, CXCL2, PTPRC, PECAM1). Finally, the molecular docking results indicated the good binding ability with active compounds and hub genes. PCRER’s core components are expected to be effective drugs to treat Peri-implants by anti-inflammation, promotes bone metabolism. Our study provides new thoughts into the development of natural medicine for the prevention and treatment of Peri-implants.
Collapse
|
29
|
iProm-Zea: A two-layer model to identify plant promoters and their types using convolutional neural network. Genomics 2022; 114:110384. [PMID: 35533969 DOI: 10.1016/j.ygeno.2022.110384] [Citation(s) in RCA: 9] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/14/2022] [Revised: 04/18/2022] [Accepted: 05/02/2022] [Indexed: 01/14/2023]
Abstract
A promoter is a short DNA sequence near the start codon, responsible for initiating the transcription of a specific gene in the genome. The accurate recognition of promoters is important for achieving a better understanding of transcriptional regulation. Because of their importance in the process of biological transcriptional regulation, there is an urgent need to develop in silico tools to identify promoters and their types in a timely and accurate manner. A number of prediction methods have been developed in this regard; however, almost all of them are merely used for identifying promoters and their strength or sigma types. The TATA box region in TATA promoter influences the post-transcriptional processes; therefore, in the current study, we developed a two-layer predictor called "iProm-Zea" using the convolutional neural network (CNN) for identify TATA and TATA less promoters. The first layer can be used to identify a given DNA sequence as a promoter or non-promoter. The second layer can be used to identify whether the recognized promoter is the TATA promoter. To find an optimal feature encoding scheme and model, we employed four feature encoding schemes on different machine learning and CNN algorithms, and based on the evaluation results, we selected a one-hot encoding scheme and a CNN model for iProm-Zea. The 5-fold cross validation testing results demonstrated that the constructed predictor showed great potential for identifying promoters and classifying them as TATA and TATA less promoters. Furthermore, we performed cross-species analysis of iProm-Zea to evaluate its performance in other species. Moreover, to make it easier for other experimental scientists to obtain the results they need, we established a freely accessible and user-friendly web server at http://nsclbio.jbnu.ac.kr/tools/iProm-Zea/.
Collapse
|
30
|
Guan X, Wang Y, Shao W, Li Z, Huang S, Zhang D. S2Snet: deep learning for low molecular weight RNA identification with nanopore. Brief Bioinform 2022; 23:6562681. [DOI: 10.1093/bib/bbac098] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/11/2021] [Revised: 02/24/2022] [Accepted: 02/25/2022] [Indexed: 11/12/2022] Open
Abstract
Abstract
Ribonucleic acid (RNA) is a pivotal nucleic acid that plays a crucial role in regulating many biological activities. Recently, one study utilized a machine learning algorithm to automatically classify RNA structural events generated by a Mycobacterium smegmatis porin A nanopore trap. Although it can achieve desirable classification results, compared with deep learning (DL) methods, this classic machine learning requires domain knowledge to manually extract features, which is sophisticated, labor-intensive and time-consuming. Meanwhile, the generated original RNA structural events are not strictly equal in length, which is incompatible with the input requirements of DL models. To alleviate this issue, we propose a sequence-to-sequence (S2S) module that transforms the unequal length sequence (UELS) to the equal length sequence. Furthermore, to automatically extract features from the RNA structural events, we propose a sequence-to-sequence neural network based on DL. In addition, we add an attention mechanism to capture vital information for classification, such as dwell time and blockage amplitude. Through quantitative and qualitative analysis, the experimental results have achieved about a 2% performance increase (accuracy) compared to the previous method. The proposed method can also be applied to other nanopore platforms, such as the famous Oxford nanopore. It is worth noting that the proposed method is not only aimed at pursuing state-of-the-art performance but also provides an overall idea to process nanopore data with UELS.
Collapse
Affiliation(s)
- Xiaoyu Guan
- College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics, MIIT Key Laboratory of Pattern Analysis and Machine Intelligence, Nanjing, China
| | - Yuqin Wang
- State Key Laboratory of Analytical Chemistry for Life Sciences, School of Chemistry and Chemical Engineering, Nanjing University, 210023, Nanjing, China
- Chemistry and Biomedicine Innovation Center (ChemBIC), Nanjing University, 210023, Nanjing, China
| | - Wei Shao
- College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics, MIIT Key Laboratory of Pattern Analysis and Machine Intelligence, Nanjing, China
| | - Zhongnian Li
- College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics, MIIT Key Laboratory of Pattern Analysis and Machine Intelligence, Nanjing, China
| | - Shuo Huang
- State Key Laboratory of Analytical Chemistry for Life Sciences, School of Chemistry and Chemical Engineering, Nanjing University, 210023, Nanjing, China
- Chemistry and Biomedicine Innovation Center (ChemBIC), Nanjing University, 210023, Nanjing, China
| | - Daoqiang Zhang
- College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics, MIIT Key Laboratory of Pattern Analysis and Machine Intelligence, Nanjing, China
| |
Collapse
|
31
|
Jiao S, Chen Z, Zhang L, Zhou X, Shi L. ATGPred-FL: sequence-based prediction of autophagy proteins with feature representation learning. Amino Acids 2022; 54:799-809. [PMID: 35286461 DOI: 10.1007/s00726-022-03145-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/26/2021] [Accepted: 01/28/2022] [Indexed: 11/26/2022]
Abstract
Autophagy plays an important role in biological evolution and is regulated by many autophagy proteins. Accurate identification of autophagy proteins is crucially important to reveal their biological functions. Due to the expense and labor cost of experimental methods, it is urgent to develop automated, accurate and reliable sequence-based computational tools to enable the identification of novel autophagy proteins among numerous proteins and peptides. For this purpose, a new predictor named ATGPred-FL was proposed for the efficient identification of autophagy proteins. We investigated various sequence-based feature descriptors and adopted the feature learning method to generate corresponding, more informative probability features. Then, a two-step feature selection strategy based on accuracy was utilized to remove irrelevant and redundant features, leading to the most discriminative 14-dimensional feature set. The final predictor was built using a support vector machine classifier, which performed favorably on both the training and testing sets with accuracy values of 94.40% and 90.50%, respectively. ATGPred-FL is the first ATG machine learning predictor based on protein primary sequences. We envision that ATGPred-FL will be an effective and useful tool for autophagy protein identification, and it is available for free at http://lab.malab.cn/~acy/ATGPred-FL , the source code and datasets are accessible at https://github.com/jiaoshihu/ATGPred .
Collapse
Affiliation(s)
- Shihu Jiao
- Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou, China
| | - Zheng Chen
- School of Applied Chemistry and Biological Technology, Shenzhen Polytechnic, 7098 Liuxian Street, Shenzhen, 518055, China
- Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, No.4 Block 2 North Jianshe Road, Chengdu, 61005, China
| | - Lichao Zhang
- School of Intelligent Manufacturing and Equipment, Shenzhen Institute of Information Technology, Shenzhen, 518172, China
| | - Xun Zhou
- Beidahuang Industry Group General Hospital, Harbin, 150001, China.
| | - Lei Shi
- Department of Spine Surgery, Changzheng Hospital, Naval Medical University, No 415, Fengyang Road, Huangpu District, Shanghai, 210000, China.
| |
Collapse
|
32
|
Zhang N, Zang T. A multi-network integration approach for measuring disease similarity based on ncRNA regulation and heterogeneous information. BMC Bioinformatics 2022; 23:89. [PMID: 35255810 PMCID: PMC8902705 DOI: 10.1186/s12859-022-04613-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/10/2022] [Accepted: 02/14/2022] [Indexed: 11/28/2022] Open
Abstract
Background Measuring similarity between complex diseases has significant implications for revealing the pathogenesis of diseases and development in the domain of biomedicine. It has been consentaneous that functional associations between disease-related genes and semantic associations can be applied to calculate disease similarity. Currently, more and more studies have demonstrated the profound involvement of non-coding RNA in the regulation of genome organization and gene expression. Thus, taking ncRNA into account can be useful in measuring disease similarities. However, existing methods ignore the regulation functions of ncRNA in biological process. In this study, we proposed a novel deep-learning method to deduce disease similarity. Results In this article, we proposed a novel method, ImpAESim, a framework integrating multiple networks embedding to learn compact feature representations and disease similarity calculation. We first utilize three different disease-related information networks to build up a heterogeneous network, after a network diffusion process, RWR, a compact feature learning model composed of classic Auto Encoder (AE) and improved AE model is proposed to extract constraints and low-dimensional feature representations. We finally obtain an accurate and low-dimensional feature representation of diseases, then we employed the cosine distance as the measurement of disease similarity. Conclusion ImpAESim focuses on extracting a low-dimensional vector representation of features based on ncRNA regulation, and gene–gene interaction network. Our method can significantly reduce the calculation bias resulted from the sparse disease associations which are derived from semantic associations.
Collapse
Affiliation(s)
- Ningyi Zhang
- Department of Computer Science and Technology, Harbin Institute of Technology, Harbin, China
| | - Tianyi Zang
- Department of Computer Science and Technology, Harbin Institute of Technology, Harbin, China.
| |
Collapse
|
33
|
Cheng X, Wang J, Li Q, Liu T. BiLSTM-5mC: A Bidirectional Long Short-Term Memory-Based Approach for Predicting 5-Methylcytosine Sites in Genome-Wide DNA Promoters. Molecules 2021; 26:7414. [PMID: 34946497 PMCID: PMC8704614 DOI: 10.3390/molecules26247414] [Citation(s) in RCA: 21] [Impact Index Per Article: 5.3] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/30/2021] [Accepted: 12/04/2021] [Indexed: 12/04/2022] Open
Abstract
An important reason of cancer proliferation is the change in DNA methylation patterns, characterized by the localized hypermethylation of the promoters of tumor-suppressor genes together with an overall decrease in the level of 5-methylcytosine (5mC). Therefore, identifying the 5mC sites in the promoters is a critical step towards further understanding the diverse functions of DNA methylation in genetic diseases such as cancers and aging. However, most wet-lab experimental techniques are often time consuming and laborious for detecting 5mC sites. In this study, we proposed a deep learning-based approach, called BiLSTM-5mC, for accurately identifying 5mC sites in genome-wide DNA promoters. First, we randomly divided the negative samples into 11 subsets of equal size, one of which can form the balance subset by combining with the positive samples in the same amount. Then, two types of feature vectors encoded by the one-hot method, and the nucleotide property and frequency (NPF) methods were fed into a bidirectional long short-term memory (BiLSTM) network and a full connection layer to train the 22 submodels. Finally, the outputs of these models were integrated to predict 5mC sites by using the majority vote strategy. Our experimental results demonstrated that BiLSTM-5mC outperformed existing methods based on the same independent dataset.
Collapse
Affiliation(s)
- Xin Cheng
- College of Information Technology, Shanghai Ocean University, Shanghai 201306, China; (X.C.); (Q.L.)
| | - Jun Wang
- School of Software Technology, Zhejiang University, Ningbo 315048, China;
| | - Qianyue Li
- College of Information Technology, Shanghai Ocean University, Shanghai 201306, China; (X.C.); (Q.L.)
| | - Taigang Liu
- College of Information Technology, Shanghai Ocean University, Shanghai 201306, China; (X.C.); (Q.L.)
| |
Collapse
|
34
|
Zhang S, Wang J, Pei L, Liu K, Gao Y, Fang H, Zhang R, Zhao L, Sun S, Wu J, Song B, Dai H, Li R, Xu Y. Interpretability analysis of one-year mortality prediction for stroke patients based on deep neural network. IEEE J Biomed Health Inform 2021; 26:1903-1910. [PMID: 34714758 DOI: 10.1109/jbhi.2021.3123657] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/09/2022]
Abstract
Clinically, physicians collect the benchmark medical data to establish archives for a stroke patient and then add the follow up data regularly. It has great significance on prognosis prediction for stroke patients. In this paper, we present an interpretable deep learning model to predict the one-year mortality risk on stroke. We design sub-modules to reconstruct features from original clinical data that highlight the dissimilarity and temporality of different variables. The model consists of Bidirectional Long Short-Term Memory (Bi-LSTM), in which a novel correlation attention module is proposed that takes the correlation of variables into consideration. In experiments, datasets are collected clinically from the department of neurology in a local AAA hospital. It consists of 2,275 stroke patients hospitalized in the department of neurology from 2014 to 2016. Our model achieves a precision of 0.9414, a recall of 0.9502 and an F1-score of 0.9415. In addition, we provide the analysis of the interpretability by visualizations with reference to clinical professional guidelines.
Collapse
|