1
|
Lei R, Jia J, Qin L. im7G-DCT: A two-branch strategy model based on improved DenseNet and transformer for m7G site prediction. Comput Biol Chem 2025; 118:108473. [PMID: 40245811 DOI: 10.1016/j.compbiolchem.2025.108473] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/23/2025] [Revised: 03/29/2025] [Accepted: 04/08/2025] [Indexed: 04/19/2025]
Abstract
N-7 methylguanosine (m7G) is an important RNA modification that plays a key role in regulating gene expression and cellular physiological functions. Medical research has shown that m7G is closely associated with the development of a variety of diseases, including cancer, neurodegenerative diseases and viral infections. Therefore, accurate identification of m7G sites in mRNA is important for clinical applications and development of therapeutic strategies. With the rapid development of computational methods, deep learning prediction models are widely used in the field of m7G site. We developed a model called im7G-DCT based on the improved Densely Connected Convolutional Network and Transformer, which employs a two-branching strategy to extract local and global features from the original feature code in parallel. We found that this innovative strategy can mine the potential feature information of m7G locus sequences in a deeper way, which enhances the richness of features and improves the accuracy of prediction. As an innovative deep learning method, the im7G-DCT model shows important research value in the field of bioinformatics and has broad application prospects. In the results of independent test experiments, the im7G-DCT model achieved 81.68 %, 90.52 %, 86.10 %, and 72.48 % in sensitivity, specificity, accuracy, and Matthews' correlation coefficient, respectively. The im7G-DCT model performs well in all evaluation metrics and it significantly outperforms other existing prediction models.
Collapse
Affiliation(s)
- Rufeng Lei
- School of Mathematics, Northwest University, Xi'an 710127, China
| | - Jian Jia
- School of Mathematics, Northwest University, Xi'an 710127, China.
| | - Lulu Qin
- School of Artificial Intelligence, Guilin University of Electronic Technology, Guilin 541004, China
| |
Collapse
|
2
|
Luo R, Liu J, Guan L, Li M. HybProm: An attention-assisted hybrid CNN-BiLSTM model for the interpretable prediction of DNA promoter. Methods 2025; 235:71-80. [PMID: 39929298 DOI: 10.1016/j.ymeth.2025.02.001] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/21/2024] [Revised: 01/18/2025] [Accepted: 02/03/2025] [Indexed: 02/13/2025] Open
Abstract
Promoter prediction is essential for analyzing gene structures, understanding regulatory networks, transcription mechanisms, and precisely controlling gene expression. Recently, computational and deep learning methods for promoter prediction have gained attention. However, there is still room to improve their accuracy. To address this, we propose the HybProm model, which uses DNA2Vec to transform DNA sequences into low-dimensional vectors, followed by a CNN-BiLSTM-Attention architecture to extract features and predict promoters across species, including E. coli, humans, mice, and plants. Experiments show that HybProm consistently achieves high accuracy (90%-99%) and offers good interpretability by identifying key sequence patterns and positions that drive predictions.
Collapse
Affiliation(s)
- Rentao Luo
- College of Physics and Electronic Information, Gannan Normal University, Ganzhou 341000 Jiangxi, China
| | - Jiawei Liu
- College of Physics and Electronic Information, Gannan Normal University, Ganzhou 341000 Jiangxi, China
| | - Lixin Guan
- College of Physics and Electronic Information, Gannan Normal University, Ganzhou 341000 Jiangxi, China
| | - Mengshan Li
- College of Physics and Electronic Information, Gannan Normal University, Ganzhou 341000 Jiangxi, China.
| |
Collapse
|
3
|
Zou H. iHBPs-VWDC: variable-length window-based dynamic connectivity approach for identifying hormone-binding proteins. J Biomol Struct Dyn 2025; 43:550-559. [PMID: 37978902 DOI: 10.1080/07391102.2023.2283150] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/18/2023] [Accepted: 11/08/2023] [Indexed: 11/19/2023]
Abstract
Hormone-binding proteins (HBPs) are soluble carrier proteins that play a vital role in the growth and development of living organisms. Identifying HBPs accurately is crucial for understanding their functions. However, traditional wet lab experimental methods are labor intensive and cost ineffective. Therefore, there is a need for computational methods to efficiently identify HBPs. In this study, a machine learning method based on support vector machine (SVM) was proposed for the accurate and efficient identification of HBPs. The encoding of protein sequences involved using fifty different physicochemical (PC) properties. A variable-length window-based dynamic connectivity method was applied to capture the connection information between two different PC properties through two distinct strategies. The canonical correlation analysis algorithm was then used to fuse features obtained from these approaches. Feature selection was performed using the F-score approach to choose the most discriminative features. Finally, these selected features were fed into the SVM to discriminate between HBPs and non-HBPs. The proposed method achieved high classification accuracies of 99.19%, 96.77%, and 94.57% on the main dataset and two independent datasets, respectively, as demonstrated in the jackknife test. Comparative results showed that our proposed method outperforms existing approaches on the same datasets, indicating its potential as a useful tool for identifying HBPs. The Matlab codes and datasets used in the current study are freely available at https://figshare.com/articles/online_resource/iHBPs-VWDC/23559834.Communicated by Ramaswamy H. Sarma.
Collapse
Affiliation(s)
- Hongliang Zou
- School of Communications and Electronics, Jiangxi Science and Technology Normal University, Nanchang, China
- Jiangxi Engineering Research Center of Unattended Perception System and Artificial Intelligence Technology, Jiangxi Science and Technology Normal University, Nanchang, China
| |
Collapse
|
4
|
Zhao C, Guan Y, Yan S, Li J. Exploring the Promoter Generation and Prediction of Halomonas spp. Based on GAN and Multi-Model Fusion Methods. Int J Mol Sci 2024; 25:13137. [PMID: 39684846 DOI: 10.3390/ijms252313137] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/24/2024] [Revised: 12/04/2024] [Accepted: 12/05/2024] [Indexed: 12/18/2024] Open
Abstract
Promoters, as core elements in the regulation of gene expression, play a pivotal role in genetic engineering and synthetic biology. The accurate prediction and optimization of promoter strength are essential for advancing these fields. Here, we present the first promoter strength database tailored to Halomonas, an extremophilic microorganism, and propose a novel promoter design and prediction method based on generative adversarial networks (GANs) and multi-model fusion. The GAN model effectively learns the key features of Halomonas promoter sequences, such as the GC content and Moran's coefficients, to generate biologically plausible promoter sequences. To enhance prediction accuracy, we developed a multi-model fusion framework integrating deep learning and machine learning approaches. Deep learning models, incorporating BiLSTM and CNN architectures, capture k-mer and PSSM features, whereas machine learning models utilize engineered string and non-string features to construct comprehensive feature matrices for the multidimensional analysis and prediction of promoter strength. Using the proposed framework, newly generated promoters via mutation were predicted, and their functional validity was experimentally confirmed. The integration of multiple models significantly reduced the experimental validation space through an intersection-based strategy, achieving a notable improvement in top quantile prediction accuracy, particularly within the top five quantiles. The robustness and applicability of this model were further validated on diverse datasets, including test sets and out-of-sample promoters. This study not only introduces an innovative approach for promoter design and prediction in Halomonas but also lays a foundation for advancing industrial biotechnology. Additionally, the proposed strategy of GAN-based generation coupled with multi-model prediction demonstrates versatility, offering a valuable reference for promoter design and strength prediction in other extremophiles. Our findings highlight the promising synergy between artificial intelligence and synthetic biology, underscoring their profound academic and practical implications.
Collapse
Affiliation(s)
- Cuihuan Zhao
- Center for Synthetic and Systems Biology, School of Life Sciences, Tsinghua University, Beijing 100084, China
| | - Yuying Guan
- Center for Synthetic and Systems Biology, School of Life Sciences, Tsinghua University, Beijing 100084, China
| | - Shuan Yan
- Department of Engineering Physics, Institute of Public Safety Research, Tsinghua University, Beijing 100084, China
| | - Jiahang Li
- School of Mathematical Sciences, Nankai University, Tianjin 300071, China
| |
Collapse
|
5
|
Du Q, Guo Y, Zhang J, Lu F, Peng C, Zhou C. Predicting Promoters in Multiple Prokaryotes with Prompt. Interdiscip Sci 2024; 16:814-828. [PMID: 39110340 DOI: 10.1007/s12539-024-00637-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/25/2023] [Revised: 05/17/2024] [Accepted: 05/21/2024] [Indexed: 10/27/2024]
Abstract
Promoters are important cis-regulatory elements for the regulation of gene expression, and their accurate predictions are crucial for elucidating the biological functions and potential mechanisms of genes. Many previous prokaryotic promoter prediction methods are encouraging in terms of the prediction performance, but most of them focus on the recognition of promoters in only one or a few bacterial species. Moreover, due to ignoring the promoter sequence motifs, the interpretability of predictions with existing methods is limited. In this work, we present a generalized method Prompt (Promoters in multiple prokaryotes) to predict promoters in 16 prokaryotes and improve the interpretability of prediction results. Prompt integrates three methods including RSK (Regression based on Selected k-mer), CL (Contrastive Learning) and MLP (Multilayer Perception), and employs a voting strategy to divide the datasets into high-confidence and low-confidence categories. Results on the promoter prediction tasks in 16 prokaryotes show that the accuracy (Accuracy, Matthews correlation coefficient) of Prompt is greater than 80% in highly credible datasets of 16 prokaryotes, and is greater than 90% in 12 prokaryotes, and Prompt performs the best compared with other existing methods. Moreover, by identifying promoter sequence motifs, Prompt can improve the interpretability of the predictions. Prompt is freely available at https://github.com/duqimeng/PromptPrompt , and will contribute to the research of promoters in prokaryote.
Collapse
Affiliation(s)
- Qimeng Du
- School of Engineering, Air-Space-Ground Integrated Intelligence and Big Data Application Engineering Research Center of Yunnan Provincial Department of Education, Dali University, Dali, 671003, China
| | - Yixue Guo
- College of Biotechnology, Tianjin University of Science & Technology, Tianjin, 300457, China
| | - Junpeng Zhang
- School of Engineering, Air-Space-Ground Integrated Intelligence and Big Data Application Engineering Research Center of Yunnan Provincial Department of Education, Dali University, Dali, 671003, China
| | - Fuping Lu
- College of Biotechnology, Tianjin University of Science & Technology, Tianjin, 300457, China
| | - Chong Peng
- College of Biotechnology, Tianjin University of Science & Technology, Tianjin, 300457, China.
| | - Chichun Zhou
- School of Engineering, Air-Space-Ground Integrated Intelligence and Big Data Application Engineering Research Center of Yunnan Provincial Department of Education, Dali University, Dali, 671003, China.
| |
Collapse
|
6
|
Artemyev V, Gubaeva A, Paremskaia AI, Dzhioeva AA, Deviatkin A, Feoktistova SG, Mityaeva O, Volchkov PY. Synthetic Promoters in Gene Therapy: Design Approaches, Features and Applications. Cells 2024; 13:1963. [PMID: 39682712 PMCID: PMC11640742 DOI: 10.3390/cells13231963] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/24/2024] [Revised: 11/22/2024] [Accepted: 11/24/2024] [Indexed: 12/18/2024] Open
Abstract
Gene therapy is a promising approach to the treatment of various inherited diseases, but its development is complicated by a number of limitations of the natural promoters used. The currently used strong ubiquitous natural promoters do not allow for the specificity of expression, while natural tissue-specific promoters have lowactivity. These limitations of natural promoters can be addressed by creating new synthetic promoters that achieve high levels of tissue-specific target gene expression. This review discusses recent advances in the development of synthetic promoters that provide a more precise regulation of gene expression. Approaches to the design of synthetic promoters are reviewed, including manual design and bioinformatic methods using machine learning. Examples of successful applications of synthetic promoters in the therapy of hereditary diseases and cancer are presented, as well as prospects for their clinical use.
Collapse
Affiliation(s)
- Valentin Artemyev
- Federal Research Center for Innovator and Emerging Biomedical and Pharmaceutical Technologies, 125315 Moscow, Russia; (A.G.); (A.D.); (O.M.); (P.Y.V.)
- Moscow Center for Advanced Studies, Kulakova Str. 20, 123592 Moscow, Russia;
| | - Anna Gubaeva
- Federal Research Center for Innovator and Emerging Biomedical and Pharmaceutical Technologies, 125315 Moscow, Russia; (A.G.); (A.D.); (O.M.); (P.Y.V.)
| | - Anastasiia Iu. Paremskaia
- Federal Research Center for Innovator and Emerging Biomedical and Pharmaceutical Technologies, 125315 Moscow, Russia; (A.G.); (A.D.); (O.M.); (P.Y.V.)
| | - Amina A. Dzhioeva
- Moscow Center for Advanced Studies, Kulakova Str. 20, 123592 Moscow, Russia;
| | - Andrei Deviatkin
- Federal Research Center for Innovator and Emerging Biomedical and Pharmaceutical Technologies, 125315 Moscow, Russia; (A.G.); (A.D.); (O.M.); (P.Y.V.)
| | - Sofya G. Feoktistova
- Federal Research Center for Innovator and Emerging Biomedical and Pharmaceutical Technologies, 125315 Moscow, Russia; (A.G.); (A.D.); (O.M.); (P.Y.V.)
| | - Olga Mityaeva
- Federal Research Center for Innovator and Emerging Biomedical and Pharmaceutical Technologies, 125315 Moscow, Russia; (A.G.); (A.D.); (O.M.); (P.Y.V.)
- Moscow Center for Advanced Studies, Kulakova Str. 20, 123592 Moscow, Russia;
- Faculty of Fundamental Medicine, Moscow State University, Lomonosovsky Pr., 27, 119991 Moscow, Russia
| | - Pavel Yu. Volchkov
- Federal Research Center for Innovator and Emerging Biomedical and Pharmaceutical Technologies, 125315 Moscow, Russia; (A.G.); (A.D.); (O.M.); (P.Y.V.)
- Faculty of Fundamental Medicine, Moscow State University, Lomonosovsky Pr., 27, 119991 Moscow, Russia
- Moscow Clinical Scientific Center N.A. A.S. Loginov, 111123 Moscow, Russia
| |
Collapse
|
7
|
Malebary SJ, Alromema N. iDLB-Pred: identification of disordered lipid binding residues in protein sequences using convolutional neural network. Sci Rep 2024; 14:24724. [PMID: 39433833 PMCID: PMC11494137 DOI: 10.1038/s41598-024-75700-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/25/2024] [Accepted: 10/08/2024] [Indexed: 10/23/2024] Open
Abstract
Proteins, nucleic acids, and lipids all interact with intrinsically disordered protein areas. Lipid-binding regions are involved in a variety of biological processes as well as a number of human illnesses. The expanding body of experimental evidence for these interactions and the dearth of techniques to anticipate them from the protein sequence serve as driving forces. Although large-scale laboratory techniques are considered to be essential for equipment for studying binding residues, they are time consuming and costly, making it challenging for researchers to predict lipid binding residues. As a result, computational techniques are being looked at as a different strategy to overcome this difficulty. To predict disordered lipid-binding residues (DLBRs), we proposed iDLB-Pred predictor utilizing benchmark dataset to compute feature through extraction techniques to identify relevant patterns and information. Various classification techniques, including deep learning methods such as Convolutional Neural Networks (CNNs), Deep Neural Networks (DNNs), Multilayer Perceptrons (MLPs), Recurrent Neural Networks (RNNs), Long Short-Term Memory (LSTM) networks, and Gated Recurrent Units (GRUs), were employed for model training. The proposed model, iDLB-Pred, was rigorously validated using metrics such as accuracy, sensitivity, specificity, and Matthew's correlation coefficient. The results demonstrate the predictor's exceptional performance, achieving accuracy rates of 81% on an independent dataset and 86% in 10-fold cross-validation.
Collapse
Affiliation(s)
- Sharaf J Malebary
- Department of Information Technology, Faculty of Computing and Information Technology-Rabigh, King Abdulaziz University, P.O. Box 344, 21911, Rabigh, Saudi Arabia.
| | - Nashwan Alromema
- Department of Computer Science, Faculty of Computing and Information Technology-Rabigh, King Abdulaziz University, P.O. Box 344, 21911, Rabigh, Saudi Arabia
| |
Collapse
|
8
|
Amjad A, Ahmed S, Kabir M, Arif M, Alam T. A novel deep learning identifier for promoters and their strength using heterogeneous features. Methods 2024; 230:119-128. [PMID: 39168294 DOI: 10.1016/j.ymeth.2024.08.005] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/25/2024] [Revised: 07/24/2024] [Accepted: 08/17/2024] [Indexed: 08/23/2024] Open
Abstract
Promoters, which are short (50-1500 base-pair) in DNA regions, have emerged to play a critical role in the regulation of gene transcription. Numerous dangerous diseases, likewise cancer, cardiovascular, and inflammatory bowel diseases, are caused by genetic variations in promoters. Consequently, the correct identification and characterization of promoters are significant for the discovery of drugs. However, experimental approaches to recognizing promoters and their strengths are challenging in terms of cost, time, and resources. Therefore, computational techniques are highly desirable for the correct characterization of promoters from unannotated genomic data. Here, we designed a powerful bi-layer deep-learning based predictor named "PROCABLES", which discriminates DNA samples as promoters in the first-phase and strong or weak promoters in the second-phase respectively. The proposed method utilizes five distinct features, such as word2vec, k-spaced nucleotide pairs, trinucleotide propensity-based features, trinucleotide composition, and electron-ion interaction pseudopotentials, to extract the hidden patterns from the DNA sequence. Afterwards, a stacked framework is formed by integrating a convolutional neural network (CNN) with bidirectional long-short-term memory (LSTM) using multi-view attributes to train the proposed model. The PROCABLES model achieved an accuracy of 0.971 and 0.920 and the MCC 0.940 and 0.840 for the first and second-layer using the ten-fold cross-validation test, respectively. The predicted results anticipate that the proposed PROCABLES protocol outperformed the advanced computational predictors targeting promoters and their types. In summary, this research will provide useful hints for the recognition of large-scale promoters in particular and other DNA problems in general.
Collapse
Affiliation(s)
- Aqsa Amjad
- School of Systems and Technology, University of Management and Technology, Lahore 54770, Pakistan
| | - Saeed Ahmed
- School of Systems and Technology, University of Management and Technology, Lahore 54770, Pakistan
| | - Muhammad Kabir
- School of Systems and Technology, University of Management and Technology, Lahore 54770, Pakistan.
| | - Muhammad Arif
- College of Science and Engineering, Hamad Bin Khalifa University, Doha 34110, Qatar.
| | - Tanvir Alam
- College of Science and Engineering, Hamad Bin Khalifa University, Doha 34110, Qatar.
| |
Collapse
|
9
|
Luo Z, Yu L, Xu Z, Liu K, Gu L. Comprehensive Review and Assessment of Computational Methods for Prediction of N6-Methyladenosine Sites. BIOLOGY 2024; 13:777. [PMID: 39452086 PMCID: PMC11504118 DOI: 10.3390/biology13100777] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Received: 08/30/2024] [Revised: 09/19/2024] [Accepted: 09/23/2024] [Indexed: 10/26/2024]
Abstract
N6-methyladenosine (m6A) plays a crucial regulatory role in the control of cellular functions and gene expression. Recent advances in sequencing techniques for transcriptome-wide m6A mapping have accelerated the accumulation of m6A site information at a single-nucleotide level, providing more high-confidence training data to develop computational approaches for m6A site prediction. However, it is still a major challenge to precisely predict m6A sites using in silico approaches. To advance the computational support for m6A site identification, here, we curated 13 up-to-date benchmark datasets from nine different species (i.e., H. sapiens, M. musculus, Rat, S. cerevisiae, Zebrafish, A. thaliana, Pig, Rhesus, and Chimpanzee). This will assist the research community in conducting an unbiased evaluation of alternative approaches and support future research on m6A modification. We revisited 52 computational approaches published since 2015 for m6A site identification, including 30 traditional machine learning-based, 14 deep learning-based, and 8 ensemble learning-based methods. We comprehensively reviewed these computational approaches in terms of their training datasets, calculated features, computational methodologies, performance evaluation strategy, and webserver/software usability. Using these benchmark datasets, we benchmarked nine predictors with available online websites or stand-alone software and assessed their prediction performance. We found that deep learning and traditional machine learning approaches generally outperformed scoring function-based approaches. In summary, the curated benchmark dataset repository and the systematic assessment in this study serve to inform the design and implementation of state-of-the-art computational approaches for m6A identification and facilitate more rigorous comparisons of new methods in the future.
Collapse
Affiliation(s)
- Zhengtao Luo
- School of Information and Artificial Intelligence, Anhui Agricultural University, Hefei 230036, China;
- Anhui Provincial Key Laboratory of Smart Agriculture Technology and Equipment, Anhui Agricultural University, Hefei 230036, China
| | - Liyi Yu
- Computer Department, Jingdezhen Ceramic University, Jingdezhen 333403, China; (L.Y.); (Z.X.)
| | - Zhaochun Xu
- Computer Department, Jingdezhen Ceramic University, Jingdezhen 333403, China; (L.Y.); (Z.X.)
- School for Interdisciplinary Medicine and Engineering, Harbin Medical University, Harbin 150076, China
| | - Kening Liu
- Computer Department, Jingdezhen Ceramic University, Jingdezhen 333403, China; (L.Y.); (Z.X.)
| | - Lichuan Gu
- School of Information and Artificial Intelligence, Anhui Agricultural University, Hefei 230036, China;
- Anhui Provincial Key Laboratory of Smart Agriculture Technology and Equipment, Anhui Agricultural University, Hefei 230036, China
| |
Collapse
|
10
|
Ren R, Yu H, Teng J, Mao S, Bian Z, Tao Y, Yau SST. CAPE: a deep learning framework with Chaos-Attention net for Promoter Evolution. Brief Bioinform 2024; 25:bbae398. [PMID: 39120645 PMCID: PMC11311715 DOI: 10.1093/bib/bbae398] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/15/2024] [Revised: 07/13/2024] [Accepted: 07/27/2024] [Indexed: 08/10/2024] Open
Abstract
Predicting the strength of promoters and guiding their directed evolution is a crucial task in synthetic biology. This approach significantly reduces the experimental costs in conventional promoter engineering. Previous studies employing machine learning or deep learning methods have shown some success in this task, but their outcomes were not satisfactory enough, primarily due to the neglect of evolutionary information. In this paper, we introduce the Chaos-Attention net for Promoter Evolution (CAPE) to address the limitations of existing methods. We comprehensively extract evolutionary information within promoters using merged chaos game representation and process the overall information with modified DenseNet and Transformer structures. Our model achieves state-of-the-art results on two kinds of distinct tasks related to prokaryotic promoter strength prediction. The incorporation of evolutionary information enhances the model's accuracy, with transfer learning further extending its adaptability. Furthermore, experimental results confirm CAPE's efficacy in simulating in silico directed evolution of promoters, marking a significant advancement in predictive modeling for prokaryotic promoter strength. Our paper also presents a user-friendly website for the practical implementation of in silico directed evolution on promoters. The source code implemented in this study and the instructions on accessing the website can be found in our GitHub repository https://github.com/BobYHY/CAPE.
Collapse
Affiliation(s)
- Ruohan Ren
- Zhili College, Tsinghua University, Beijing 100084, China
| | - Hongyu Yu
- Department of Mathematical Sciences, Tsinghua University, Beijing 100084, China
| | - Jiahao Teng
- School of Life Sciences, Tsinghua University, Beijing 100084, China
| | - Sihui Mao
- Zhili College, Tsinghua University, Beijing 100084, China
| | - Zixuan Bian
- Weiyang College, Tsinghua University, Beijing 100084, China
| | - Yangtianze Tao
- Department of Mathematical Sciences, Tsinghua University, Beijing 100084, China
| | - Stephen S-T Yau
- Department of Mathematical Sciences, Tsinghua University, Beijing 100084, China
- Beijing Institute of Mathematical Sciences and Applications (Bimsa), Beijing 101408, China
| |
Collapse
|
11
|
Peng B, Sun G, Fan Y. iProL: identifying DNA promoters from sequence information based on Longformer pre-trained model. BMC Bioinformatics 2024; 25:224. [PMID: 38918692 PMCID: PMC11201334 DOI: 10.1186/s12859-024-05849-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/29/2024] [Accepted: 06/19/2024] [Indexed: 06/27/2024] Open
Abstract
Promoters are essential elements of DNA sequence, usually located in the immediate region of the gene transcription start sites, and play a critical role in the regulation of gene transcription. Its importance in molecular biology and genetics has attracted the research interest of researchers, and it has become a consensus to seek a computational method to efficiently identify promoters. Still, existing methods suffer from imbalanced recognition capabilities for positive and negative samples, and their recognition effect can still be further improved. We conducted research on E. coli promoters and proposed a more advanced prediction model, iProL, based on the Longformer pre-trained model in the field of natural language processing. iProL does not rely on prior biological knowledge but simply uses promoter DNA sequences as plain text to identify promoters. It also combines one-dimensional convolutional neural networks and bidirectional long short-term memory to extract both local and global features. Experimental results show that iProL has a more balanced and superior performance than currently published methods. Additionally, we constructed a novel independent test set following the previous specification and compared iProL with three existing methods on this independent test set.
Collapse
Affiliation(s)
- Binchao Peng
- School of Computer Science and Information Security, Guilin University of Electronic Technology, Guilin, 541004, China
| | - Guicong Sun
- School of Computer Science and Information Security, Guilin University of Electronic Technology, Guilin, 541004, China
| | - Yongxian Fan
- School of Computer Science and Information Security, Guilin University of Electronic Technology, Guilin, 541004, China.
| |
Collapse
|
12
|
Li Y, Wei X, Yang Q, Xiong A, Li X, Zou Q, Cui F, Zhang Z. msBERT-Promoter: a multi-scale ensemble predictor based on BERT pre-trained model for the two-stage prediction of DNA promoters and their strengths. BMC Biol 2024; 22:126. [PMID: 38816885 PMCID: PMC11555825 DOI: 10.1186/s12915-024-01923-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/09/2024] [Accepted: 05/21/2024] [Indexed: 06/01/2024] Open
Abstract
BACKGROUND A promoter is a specific sequence in DNA that has transcriptional regulatory functions, playing a role in initiating gene expression. Identifying promoters and their strengths can provide valuable information related to human diseases. In recent years, computational methods have gained prominence as an effective means for identifying promoter, offering a more efficient alternative to labor-intensive biological approaches. RESULTS In this study, a two-stage integrated predictor called "msBERT-Promoter" is proposed for identifying promoters and predicting their strengths. The model incorporates multi-scale sequence information through a tokenization strategy and fine-tunes the DNABERT model. Soft voting is then used to fuse the multi-scale information, effectively addressing the issue of insufficient DNA sequence information extraction in traditional models. To the best of our knowledge, this is the first time an integrated approach has been used in the DNABERT model for promoter identification and strength prediction. Our model achieves accuracy rates of 96.2% for promoter identification and 79.8% for promoter strength prediction, significantly outperforming existing methods. Furthermore, through attention mechanism analysis, we demonstrate that our model can effectively combine local and global sequence information, enhancing its interpretability. CONCLUSIONS msBERT-Promoter provides an effective tool that successfully captures sequence-related attributes of DNA promoters and can accurately identify promoters and predict their strengths. This work paves a new path for the application of artificial intelligence in traditional biology.
Collapse
Affiliation(s)
- Yazi Li
- School of Mathematics and Statistics, Hainan University, Haikou, 570228, China
| | - Xiaoman Wei
- School of Computer Science and Technology, Hainan University, Haikou, 570228, China
| | - Qinglin Yang
- School of Computer Science and Technology, Hainan University, Haikou, 570228, China
| | - An Xiong
- School of Computer Science and Technology, Hainan University, Haikou, 570228, China
| | - Xingfeng Li
- School of Computer Science and Technology, Hainan University, Haikou, 570228, China
| | - Quan Zou
- Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, 610054, China
- Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou, 324000, China
| | - Feifei Cui
- School of Computer Science and Technology, Hainan University, Haikou, 570228, China.
| | - Zilong Zhang
- School of Computer Science and Technology, Hainan University, Haikou, 570228, China.
| |
Collapse
|
13
|
Lei R, Jia J, Qin L, Wei X. iPro2L-DG: Hybrid network based on improved densenet and global attention mechanism for identifying promoter sequences. Heliyon 2024; 10:e27364. [PMID: 38510021 PMCID: PMC10950492 DOI: 10.1016/j.heliyon.2024.e27364] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/26/2023] [Revised: 02/24/2024] [Accepted: 02/28/2024] [Indexed: 03/22/2024] Open
Abstract
The promoter is a key DNA sequence whose primary function is to control the initiation time and the degree of expression of gene transcription. Accurate identification of promoters is essential for understanding gene expression studies. Traditional sequencing techniques for identifying promoters are costly and time-consuming. Therefore, the development of computational methods to identify promoters has become critical. Since deep learning methods show great potential in identifying promoters, this study proposes a new promoter prediction model, called iPro2L-DG. The iPro2L-DG predictor, based on an improved Densely Connected Convolutional Network (DenseNet) and a Global Attention Mechanism (GAM), is constructed to achieve the prediction of promoters. The promoter sequences are combined feature encoding using C2 encoding and nucleotide chemical property (NCP) encoding. An improved DenseNet extracts advanced feature information from the combined feature encoding. GAM evaluates the importance of advanced feature information in terms of channel and spatial dimensions, and finally uses a Full Connect Neural Network (FNN) to derive prediction probabilities. The experimental results showed that the accuracy of iPro2L-DG in the first layer (promoter identification) was 94.10% with Matthews correlation coefficient value of 0.8833. In the second layer (promoter strength prediction), the accuracy was 89.42% with Matthews correlation coefficient value of 0.7915. The iPro2L-DG predictor significantly outperforms other existing predictors in promoter identification and promoter strength prediction. Therefore, our proposed model iPro2L-DG is the most advanced promoter prediction tool. The source code of the iPro2L-DG model can be found in https://github.com/leirufeng/iPro2L-DG.
Collapse
Affiliation(s)
- Rufeng Lei
- School of Information Engineering, Jingdezhen Ceramic University, Jingdezhen, 333403, China
| | - Jianhua Jia
- School of Information Engineering, Jingdezhen Ceramic University, Jingdezhen, 333403, China
| | - Lulu Qin
- School of Information Engineering, Jingdezhen Ceramic University, Jingdezhen, 333403, China
| | - Xin Wei
- Business School, Jiangxi Institute of Fashion Technology, Nanchang, 330044, China
| |
Collapse
|
14
|
Zou H. iDPPIV-SI: identifying dipeptidyl peptidase IV inhibitory peptides by using multiple sequence information. J Biomol Struct Dyn 2024; 42:2144-2152. [PMID: 37125813 DOI: 10.1080/07391102.2023.2203257] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/25/2022] [Accepted: 04/10/2023] [Indexed: 05/02/2023]
Abstract
Currently, diabetes has become a great threaten for people's health in the world. Recent study shows that dipeptidyl peptidase IV (DPP-IV) inhibitory peptides may be a potential pharmaceutical agent to treat diabetes. Thus, there is a need to discriminate DPP-IV inhibitory peptides from non-DPP-IV inhibitory peptides. To address this issue, a novel computational model called iDPPIV-SI was developed in this study. In the first, 50 different types of physicochemical (PC) properties were employed to denote the peptide sequences. Three different feature descriptors including the 1-order, 2-order correlation methods and discrete wavelet transform were applied to collect useful information from the PC matrix. Furthermore, the least absolute shrinkage and selection operator (LASSO) algorithm was employed to select these most discriminative features. All of these chosen features were fed into support vector machine (SVM) for identifying DPP-IV inhibitory peptides. The iDPPIV-SI achieved 91.26% and 98.12% classification accuracies on the training and independent dataset, respectively. There is a significantly improvement in the classification performance by the proposed method, as compared with the state-of-the-art predictors. The datasets and MATLAB codes (based on MATLAB2015b) used in current study are available at https://figshare.com/articles/online_resource/iDPPIV-SI/20085878.Communicated by Ramaswamy H. Sarma.
Collapse
Affiliation(s)
- Hongliang Zou
- School of Communications and Electronics, Jiangxi Science and Technology Normal University, Nanchang, China
| |
Collapse
|
15
|
Wang X, Xu K, Tan Y, Yu S, Zhao X, Zhou J. Deep Learning-Assisted Design of Novel Promoters in Escherichia coli. ADVANCED GENETICS (HOBOKEN, N.J.) 2023; 4:2300184. [PMID: 38099247 PMCID: PMC10716054 DOI: 10.1002/ggn2.202300184] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Figures] [Subscribe] [Scholar Register] [Received: 06/13/2023] [Revised: 10/09/2023] [Indexed: 12/17/2023]
Abstract
Deep learning (DL) approaches have the ability to accurately recognize promoter regions and predict their strength. Here, the potential for controllably designing active Escherichia coli promoter is explored by combining multiple deep learning models. First, "DRSAdesign," which relies on a diffusion model to generate different types of novel promoters is created, followed by predicting whether they are real or fake and strength. Experimental validation showed that 45 out of 50 generated promoters are active with high diversity, but most promoters have relatively low activity. Next, "Ndesign," which relies on generating random sequences carrying functional -35 and -10 motifs of the sigma70 promoter is introduced, and their strength is predicted using the designed DL model. The DL model is trained and validated using 200 and 50 generated promoters, and displays Pearson correlation coefficients of 0.49 and 0.43, respectively. Taking advantage of the DL models developed in this work, possible 6-mers are predicted as key functional motifs of the sigma70 promoter, suggesting that promoter recognition and strength prediction mainly rely on the accommodation of functional motifs. This work provides DL tools to design promoters and assess their functions, paving the way for DL-assisted metabolic engineering.
Collapse
Affiliation(s)
- Xinglong Wang
- Engineering Research Center of Ministry of Education on Food Synthetic Biotechnology and School of BiotechnologyJiangnan University1800 Lihu RoadWuxiJiangsu214122China
- Science Center for Future FoodsJiangnan University1800 Lihu RoadWuxiJiangsu214122China
| | - Kangjie Xu
- Engineering Research Center of Ministry of Education on Food Synthetic Biotechnology and School of BiotechnologyJiangnan University1800 Lihu RoadWuxiJiangsu214122China
- Science Center for Future FoodsJiangnan University1800 Lihu RoadWuxiJiangsu214122China
| | - Yameng Tan
- Engineering Research Center of Ministry of Education on Food Synthetic Biotechnology and School of BiotechnologyJiangnan University1800 Lihu RoadWuxiJiangsu214122China
- Science Center for Future FoodsJiangnan University1800 Lihu RoadWuxiJiangsu214122China
| | - Shangyang Yu
- Engineering Research Center of Ministry of Education on Food Synthetic Biotechnology and School of BiotechnologyJiangnan University1800 Lihu RoadWuxiJiangsu214122China
- Science Center for Future FoodsJiangnan University1800 Lihu RoadWuxiJiangsu214122China
| | - Xinyi Zhao
- Engineering Research Center of Ministry of Education on Food Synthetic Biotechnology and School of BiotechnologyJiangnan University1800 Lihu RoadWuxiJiangsu214122China
- Science Center for Future FoodsJiangnan University1800 Lihu RoadWuxiJiangsu214122China
| | - Jingwen Zhou
- Engineering Research Center of Ministry of Education on Food Synthetic Biotechnology and School of BiotechnologyJiangnan University1800 Lihu RoadWuxiJiangsu214122China
- Science Center for Future FoodsJiangnan University1800 Lihu RoadWuxiJiangsu214122China
- Jiangsu Province Engineering Research Center of Food Synthetic BiotechnologyJiangnan UniversityWuxi214122China
| |
Collapse
|
16
|
Yu Z, Yin Z, Zou H. iAMY-RECMFF: Identifying amyloidgenic peptides by using residue pairwise energy content matrix and features fusion algorithm. J Bioinform Comput Biol 2023; 21:2350023. [PMID: 37899353 DOI: 10.1142/s0219720023500233] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/31/2023]
Abstract
Various diseases, including Huntington's disease, Alzheimer's disease, and Parkinson's disease, have been reported to be linked to amyloid. Therefore, it is crucial to distinguish amyloid from non-amyloid proteins or peptides. While experimental approaches are typically preferred, they are costly and time-consuming. In this study, we have developed a machine learning framework called iAMY-RECMFF to discriminate amyloidgenic from non-amyloidgenic peptides. In our model, we first encoded the peptide sequences using the residue pairwise energy content matrix. We then utilized Pearson's correlation coefficient and distance correlation to extract useful information from this matrix. Additionally, we employed an improved similarity network fusion algorithm to integrate features from different perspectives. The Fisher approach was adopted to select the optimal feature subset. Finally, the selected features were inputted into a support vector machine for identifying amyloidgenic peptides. Experimental results demonstrate that our proposed method significantly improves the identification of amyloidgenic peptides compared to existing predictors. This suggests that our method may serve as a powerful tool in identifying amyloidgenic peptides. To facilitate academic use, the dataset and codes used in the current study are accessible at https://figshare.com/articles/online_resource/iAMY-RECMFF/22816916.
Collapse
Affiliation(s)
- Zizheng Yu
- School of Communications and Electronics Jiangxi, Science and Technology Normal University, Nanchang 330013, P. R. China
| | - Zhijian Yin
- School of Communications and Electronics Jiangxi, Science and Technology Normal University, Nanchang 330013, P. R. China
- Jiangxi Engineering Research Center of Unattended Perception System and Artificial Intelligence Technology Jiangxi Science and Technology Normal University, Jiangxi 330088, P. R. China
| | - Hongliang Zou
- School of Communications and Electronics Jiangxi, Science and Technology Normal University, Nanchang 330013, P. R. China
- Jiangxi Engineering Research Center of Unattended Perception System and Artificial Intelligence Technology Jiangxi Science and Technology Normal University, Jiangxi 330088, P. R. China
| |
Collapse
|
17
|
Zou H, Yu W. Integrating Low-Order and High-Order Correlation Information for Identifying Phage Virion Proteins. J Comput Biol 2023; 30:1131-1143. [PMID: 37729064 DOI: 10.1089/cmb.2022.0237] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 09/22/2023] Open
Abstract
Phage virion proteins (PVPs) play an important role in the host cell. Fast and accurate identification of PVPs is beneficial for the discovery and development of related drugs. Although wet experimental approaches are the first choice to identify PVPs, they are costly and time-consuming. Thus, researchers have turned their attention to computational models, which can speed up related studies. Therefore, we proposed a novel machine-learning model to identify PVPs in the current study. First, 50 different types of physicochemical properties were used to denote protein sequences. Next, two different approaches, including Pearson's correlation coefficient (PCC) and maximal information coefficient (MIC), were employed to extract discriminative information. Further, to capture the high-order correlation information, we used PCC and MIC once again. After that, we adopted the least absolute shrinkage and selection operator algorithm to select the optimal feature subset. Finally, these chosen features were fed into a support vector machine to discriminate PVPs from phage non-virion proteins. We performed experiments on two different datasets to validate the effectiveness of our proposed method. Experimental results showed a significant improvement in performance compared with state-of-the-art approaches. It indicates that the proposed computational model may become a powerful predictor in identifying PVPs.
Collapse
Affiliation(s)
- Hongliang Zou
- School of Communications and Electronics, Jiangxi Science and Technology Normal University, Nanchang, China
| | - Wanting Yu
- College of Animal Science and Technology, Jiangxi Agricultural University, Nanchang, China
| |
Collapse
|
18
|
Ni CE, Doan DP, Chiu YJ, Huang YH. TSSUNet-MB - ab initio identification of σ 70 promoter transcription start sites in Escherichia coli using deep multitask learning. Comput Biol Chem 2023; 105:107904. [PMID: 37327560 DOI: 10.1016/j.compbiolchem.2023.107904] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/16/2022] [Revised: 03/22/2023] [Accepted: 06/09/2023] [Indexed: 06/18/2023]
Abstract
MOTIVATION Computational promoter prediction (CPP) tools designed to classify prokaryotic promoter regions usually assume that a transcription start site (TSS) is located at a predefined position within each promoter region. Such CPP tools are sensitive to any positional shifting of the TSS in a windowed region, and they are unsuitable for determining the boundaries of prokaryotic promoters. RESULTS TSSUNet-MB is a deep learning model developed to identify the TSSs of σ70 promoters. Mononucleotide and bendability were used to encode input sequences. TSSUNet-MB outperforms other CPP tools when assessed using the sequences obtained from the neighborhood of real promoters. TSSUNet-MB achieved a sensitivity of 0.839 and specificity of 0.768 on sliding sequences, while other CPP tool cannot maintain both sensitivities and specificities in a compatible range. Furthermore, TSSUNet-MB can precisely predict the TSS position of σ70 promoter-containing regions with a 10-base accuracy of 77.6%. By leveraging the sliding window scanning approach, we further computed the confidence score of each predicted TSS, which allows for more accurately identifying TSS locations. Our results suggest that TSSUNet-MB is a robust tool for finding σ70 promoters and identifying TSSs.
Collapse
Affiliation(s)
- Chung-En Ni
- Institute of Biomedical Informatics, National Yang Ming Chiao Tung University, Taipei, Taiwan
| | - Duy-Phuong Doan
- Institute of Biomedical Informatics, National Yang Ming Chiao Tung University, Taipei, Taiwan
| | - Yen-Jung Chiu
- Institute of Biomedical Informatics, National Yang Ming Chiao Tung University, Taipei, Taiwan
| | - Yen-Hua Huang
- Institute of Biomedical Informatics, National Yang Ming Chiao Tung University, Taipei, Taiwan; Center for Systems and Synthetic Biology, National Yang Ming Chiao Tung University, Taipei, Taiwan.
| |
Collapse
|
19
|
Li H, Zhang J, Zhao Y, Yang W. Predicting Corynebacterium glutamicum promoters based on novel feature descriptor and feature selection technique. Front Microbiol 2023; 14:1141227. [PMID: 36937275 PMCID: PMC10018189 DOI: 10.3389/fmicb.2023.1141227] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/10/2023] [Accepted: 02/10/2023] [Indexed: 03/06/2023] Open
Abstract
The promoter is an important noncoding DNA regulatory element, which combines with RNA polymerase to activate the expression of downstream genes. In industry, artificial arginine is mainly synthesized by Corynebacterium glutamicum. Replication of specific promoter regions can increase arginine production. Therefore, it is necessary to accurately locate the promoter in C. glutamicum. In the wet experiment, promoter identification depends on sigma factors and DNA splicing technology, this is a laborious job. To quickly and conveniently identify the promoters in C. glutamicum, we have developed a method based on novel feature representation and feature selection to complete this task, describing the DNA sequences through statistical parameters of multiple physicochemical properties, filtering redundant features by combining analysis of variance and hierarchical clustering, the prediction accuracy of the which is as high as 91.6%, the sensitivity of 91.9% can effectively identify promoters, and the specificity of 91.2% can accurately identify non-promoters. In addition, our model can correctly identify 181 promoters and 174 non-promoters among 400 independent samples, which proves that the developed prediction model has excellent robustness.
Collapse
Affiliation(s)
- HongFei Li
- College of Life Science, Northeast Forestry University, Harbin, China
- College of Information and Computer Engineering, Northeast Forestry University, Harbin, China
| | - Jingyu Zhang
- Department of Neurology, The Fourth Affiliated Hospital of Harbin Medical University, Harbin, China
| | - Yuming Zhao
- College of Life Science, Northeast Forestry University, Harbin, China
- College of Information and Computer Engineering, Northeast Forestry University, Harbin, China
- *Correspondence: Yuming Zhao, ; Wen Yang,
| | - Wen Yang
- International Medical Center, Shenzhen University General Hospital, Shenzhen, China
- *Correspondence: Yuming Zhao, ; Wen Yang,
| |
Collapse
|
20
|
Jia J, Lei R, Qin L, Wu G, Wei X. iEnhancer-DCSV: Predicting enhancers and their strength based on DenseNet and improved convolutional block attention module. Front Genet 2023; 14:1132018. [PMID: 36936423 PMCID: PMC10014624 DOI: 10.3389/fgene.2023.1132018] [Citation(s) in RCA: 9] [Impact Index Per Article: 4.5] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/26/2022] [Accepted: 02/13/2023] [Indexed: 03/06/2023] Open
Abstract
Enhancers play a crucial role in controlling gene transcription and expression. Therefore, bioinformatics puts many emphases on predicting enhancers and their strength. It is vital to create quick and accurate calculating techniques because conventional biomedical tests take too long time and are too expensive. This paper proposed a new predictor called iEnhancer-DCSV built on a modified densely connected convolutional network (DenseNet) and an improved convolutional block attention module (CBAM). Coding was performed using one-hot and nucleotide chemical property (NCP). DenseNet was used to extract advanced features from raw coding. The channel attention and spatial attention modules were used to evaluate the significance of the advanced features and then input into a fully connected neural network to yield the prediction probabilities. Finally, ensemble learning was employed on the final categorization findings via voting. According to the experimental results on the test set, the first layer of enhancer recognition achieved an accuracy of 78.95%, and the Matthews correlation coefficient value was 0.5809. The second layer of enhancer strength prediction achieved an accuracy of 80.70%, and the Matthews correlation coefficient value was 0.6609. The iEnhancer-DCSV method can be found at https://github.com/leirufeng/iEnhancer-DCSV. It is easy to obtain the desired results without using the complex mathematical formulas involved.
Collapse
Affiliation(s)
- Jianhua Jia
- School of Information Engineering, Jingdezhen Ceramic University, Jingdezhen, China
- *Correspondence: Jianhua Jia, ; Rufeng Lei,
| | - Rufeng Lei
- School of Information Engineering, Jingdezhen Ceramic University, Jingdezhen, China
- *Correspondence: Jianhua Jia, ; Rufeng Lei,
| | - Lulu Qin
- School of Information Engineering, Jingdezhen Ceramic University, Jingdezhen, China
| | - Genqiang Wu
- School of Information Engineering, Jingdezhen Ceramic University, Jingdezhen, China
| | - Xin Wei
- Business School, Jiangxi Institute of Fashion Technology, Nanchang, China
| |
Collapse
|
21
|
Luo Z, Lou L, Qiu W, Xu Z, Xiao X. Predicting N6-Methyladenosine Sites in Multiple Tissues of Mammals through Ensemble Deep Learning. Int J Mol Sci 2022; 23:15490. [PMID: 36555143 PMCID: PMC9778682 DOI: 10.3390/ijms232415490] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/03/2022] [Revised: 12/03/2022] [Accepted: 12/05/2022] [Indexed: 12/13/2022] Open
Abstract
N6-methyladenosine (m6A) is the most abundant within eukaryotic messenger RNA modification, which plays an essential regulatory role in the control of cellular functions and gene expression. However, it remains an outstanding challenge to detect mRNA m6A transcriptome-wide at base resolution via experimental approaches, which are generally time-consuming and expensive. Developing computational methods is a good strategy for accurate in silico detection of m6A modification sites from the large amount of RNA sequence data. Unfortunately, the existing computational models are usually only for m6A site prediction in a single species, without considering the tissue level of species, while most of them are constructed based on low-confidence level data generated by an m6A antibody immunoprecipitation (IP)-based sequencing method, thereby restricting reliability and generalizability of proposed models. Here, we review recent advances in computational prediction of m6A sites and construct a new computational approach named im6APred using ensemble deep learning to accurately identify m6A sites based on high-confidence level data in multiple tissues of mammals. Our model im6APred builds upon a comprehensive evaluation of multiple classification methods, including four traditional classification algorithms and three deep learning methods and their ensembles. The optimal base-classifier combinations are then chosen by five-fold cross-validation test to achieve an effective stacked model. Our model im6APred can produce the area under the receiver operating characteristic curve (AUROC) in the range of 0.82-0.91 on independent tests, indicating that our model has the ability to learn general methylation rules on RNA bases and generalize to m6A transcriptome-wide identification. Moreover, AUROCs in the range of 0.77-0.96 were achieved using cross-species/tissues validation on the benchmark dataset, demonstrating differences in predictive performance at the tissue level and the need for constructing tissue-specific models for m6A site prediction.
Collapse
Affiliation(s)
| | | | | | - Zhaochun Xu
- Computer Department, Jingdezhen Ceramic University, Jingdezhen 333403, China
| | - Xuan Xiao
- Computer Department, Jingdezhen Ceramic University, Jingdezhen 333403, China
| |
Collapse
|
22
|
Mai DHA, Nguyen LT, Lee EY. TSSNote-CyaPromBERT: Development of an integrated platform for highly accurate promoter prediction and visualization of Synechococcus sp. and Synechocystis sp. through a state-of-the-art natural language processing model BERT. Front Genet 2022; 13:1067562. [PMID: 36523764 PMCID: PMC9745317 DOI: 10.3389/fgene.2022.1067562] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/12/2022] [Accepted: 11/17/2022] [Indexed: 07/30/2023] Open
Abstract
Since the introduction of the first transformer model with a unique self-attention mechanism, natural language processing (NLP) models have attained state-of-the-art (SOTA) performance on various tasks. As DNA is the blueprint of life, it can be viewed as an unusual language, with its characteristic lexicon and grammar. Therefore, NLP models may provide insights into the meaning of the sequential structure of DNA. In the current study, we employed and compared the performance of popular SOTA NLP models (i.e., XLNET, BERT, and a variant DNABERT trained on the human genome) to predict and analyze the promoters in freshwater cyanobacterium Synechocystis sp. PCC 6803 and the fastest growing cyanobacterium Synechococcus elongatus sp. UTEX 2973. These freshwater cyanobacteria are promising hosts for phototrophically producing value-added compounds from CO2. Through a custom pipeline, promoters and non-promoters from Synechococcus elongatus sp. UTEX 2973 were used to train the model. The trained model achieved an AUROC score of 0.97 and F1 score of 0.92. During cross-validation with promoters from Synechocystis sp. PCC 6803, the model achieved an AUROC score of 0.96 and F1 score of 0.91. To increase accessibility, we developed an integrated platform (TSSNote-CyaPromBERT) to facilitate large dataset extraction, model training, and promoter prediction from public dRNA-seq datasets. Furthermore, various visualization tools have been incorporated to address the "black box" issue of deep learning and feature analysis. The learning transfer ability of large language models may help identify and analyze promoter regions for newly isolated strains with similar lineages.
Collapse
|
23
|
Bernardino M, Beiko R. Genome-scale prediction of bacterial promoters. Biosystems 2022; 221:104771. [PMID: 36099980 DOI: 10.1016/j.biosystems.2022.104771] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/31/2022] [Revised: 08/18/2022] [Accepted: 08/27/2022] [Indexed: 11/02/2022]
Abstract
A key step in the transcription of RNA is the binding of the RNA polymerase protein complex to a short promoter sequence that is typically upstream of the gene to be expressed. Automated identification of promoters would serve as a valuable complement to experimental validation in determining which genes are likely to be expressed and when; however, promoter sequences are short and highly variable, which makes them very difficult to accurately classify. The many tools developed to identify promoters in DNA have generally been tested on small and balanced subsets of genomic sequence, and the results may not reflect their expected performance on genomes with millions of DNA base pairs where promoters are likely to comprise less than ∼1% of the sequence. Here we introduce Expositor, a neural-network-based method that uses different types of DNA encodings and tunable sensitivity and specificity parameters. Expositor showed higher sensitivity and precision on the E. coli K-12 MG1655 chromosome than other tested approaches. Expositor predictions were more consistent in the homologous subset of sequence from a strain of Salmonella than they were with another strain of E. coli. We also examined the accuracy of Expositor in distinguishing different classes of promoters and found that misclassification between classes was consistent with the biological similarity between promoters.
Collapse
Affiliation(s)
- Miria Bernardino
- Faculty of Computer Science, Dalhousie University, Halifax, Canada.
| | - Robert Beiko
- Faculty of Computer Science, Dalhousie University, Halifax, Canada.
| |
Collapse
|
24
|
Suleman MT, Alkhalifah T, Alturise F, Khan YD. DHU-Pred: accurate prediction of dihydrouridine sites using position and composition variant features on diverse classifiers. PeerJ 2022; 10:e14104. [PMID: 36320563 PMCID: PMC9618264 DOI: 10.7717/peerj.14104] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/26/2022] [Accepted: 09/01/2022] [Indexed: 01/21/2023] Open
Abstract
Background Dihydrouridine (D) is a modified transfer RNA post-transcriptional modification (PTM) that occurs abundantly in bacteria, eukaryotes, and archaea. The D modification assists in the stability and conformational flexibility of tRNA. The D modification is also responsible for pulmonary carcinogenesis in humans. Objective For the detection of D sites, mass spectrometry and site-directed mutagenesis have been developed. However, both are labor-intensive and time-consuming methods. The availability of sequence data has provided the opportunity to build computational models for enhancing the identification of D sites. Based on the sequence data, the DHU-Pred model was proposed in this study to find possible D sites. Methodology The model was built by employing comprehensive machine learning and feature extraction approaches. It was then validated using in-demand evaluation metrics and rigorous experimentation and testing approaches. Results The DHU-Pred revealed an accuracy score of 96.9%, which was considerably higher compared to the existing D site predictors. Availability and Implementation A user-friendly web server for the proposed model was also developed and is freely available for the researchers.
Collapse
Affiliation(s)
- Muhammad Taseer Suleman
- Department of Computer Science, School of Systems and Technology, University of Management & Technology, Lahore, Pakistan
| | - Tamim Alkhalifah
- Department of Computer, College of Science and Arts in Ar Rass Qassim University, Ar Rass, Qassim, Saudi Arabia
| | - Fahad Alturise
- Department of Computer, College of Science and Arts in Ar Rass Qassim University, Ar Rass, Qassim, Saudi Arabia
| | - Yaser Daanial Khan
- Department of Computer Science, School of Systems and Technology, University of Management & Technology, Lahore, Pakistan
| |
Collapse
|
25
|
Nguyen-Vo TH, Trinh QH, Nguyen L, Nguyen-Hoang PU, Rahardja S, Nguyen BP. iPromoter-Seqvec: identifying promoters using bidirectional long short-term memory and sequence-embedded features. BMC Genomics 2022; 23:681. [PMID: 36192696 PMCID: PMC9531353 DOI: 10.1186/s12864-022-08829-6] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/01/2022] [Accepted: 08/08/2022] [Indexed: 11/30/2022] Open
Abstract
BACKGROUND Promoters, non-coding DNA sequences located at upstream regions of the transcription start site of genes/gene clusters, are essential regulatory elements for the initiation and regulation of transcriptional processes. Furthermore, identifying promoters in DNA sequences and genomes significantly contributes to discovering entire structures of genes of interest. Therefore, exploration of promoter regions is one of the most imperative topics in molecular genetics and biology. Besides experimental techniques, computational methods have been developed to predict promoters. In this study, we propose iPromoter-Seqvec - an efficient computational model to predict TATA and non-TATA promoters in human and mouse genomes using bidirectional long short-term memory neural networks in combination with sequence-embedded features extracted from input sequences. The promoter and non-promoter sequences were retrieved from the Eukaryotic Promoter database and then were refined to create four benchmark datasets. RESULTS The area under the receiver operating characteristic curve (AUCROC) and the area under the precision-recall curve (AUCPR) were used as two key metrics to evaluate model performance. Results on independent test sets showed that iPromoter-Seqvec outperformed other state-of-the-art methods with AUCROC values ranging from 0.85 to 0.99 and AUCPR values ranging from 0.86 to 0.99. Models predicting TATA promoters in both species had slightly higher predictive power compared to those predicting non-TATA promoters. With a novel idea of constructing artificial non-promoter sequences based on promoter sequences, our models were able to learn highly specific characteristics discriminating promoters from non-promoters to improve predictive efficiency. CONCLUSIONS iPromoter-Seqvec is a stable and robust model for predicting both TATA and non-TATA promoters in human and mouse genomes. Our proposed method was also deployed as an online web server with a user-friendly interface to support research communities. Links to our source codes and web server are available at https://github.com/mldlproject/2022-iPromoter-Seqvec .
Collapse
Affiliation(s)
- Thanh-Hoang Nguyen-Vo
- School of Mathematics and Statistics, Victoria University of Wellington, Gate 7, Kelburn Parade, 6140 Wellington, New Zealand
| | - Quang H. Trinh
- School of Information and Communication Technology, Hanoi University of Science and Technology, 1 Dai Co Viet, 100000 Hanoi, Vietnam
| | - Loc Nguyen
- School of Mathematics and Statistics, Victoria University of Wellington, Gate 7, Kelburn Parade, 6140 Wellington, New Zealand
| | - Phuong-Uyen Nguyen-Hoang
- Computational Biology Center, International University - VNU HCMC, Quarter 6, Linh Trung Ward, Thu Duc District, 700000 Ho Chi Minh City, Vietnam
| | - Susanto Rahardja
- School of Marine Science and Technology, Northwestern Polytechnical University, 127 West Youyi Road, 710072 Xi’an, China
- Infocomm Technology Cluster, Singapore Institute of Technology, 10 Dover Drive, 138683 Singapore, Singapore
| | - Binh P. Nguyen
- School of Mathematics and Statistics, Victoria University of Wellington, Gate 7, Kelburn Parade, 6140 Wellington, New Zealand
| |
Collapse
|
26
|
Zhang P, Zhang H, Wu H. iPro-WAEL: a comprehensive and robust framework for identifying promoters in multiple species. Nucleic Acids Res 2022; 50:10278-10289. [PMID: 36161334 PMCID: PMC9561371 DOI: 10.1093/nar/gkac824] [Citation(s) in RCA: 37] [Impact Index Per Article: 12.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/18/2022] [Revised: 08/24/2022] [Accepted: 09/14/2022] [Indexed: 11/27/2022] Open
Abstract
Promoters are consensus DNA sequences located near the transcription start sites and they play an important role in transcription initiation. Due to their importance in biological processes, the identification of promoters is significantly important for characterizing the expression of the genes. Numerous computational methods have been proposed to predict promoters. However, it is difficult for these methods to achieve satisfactory performance in multiple species. In this study, we propose a novel weighted average ensemble learning model, termed iPro-WAEL, for identifying promoters in multiple species, including Human, Mouse, E.coli, Arabidopsis, B.amyloliquefaciens, B.subtilis and R.capsulatus. Extensive benchmarking experiments illustrate that iPro-WAEL has optimal performance and is superior to the current methods in promoter prediction. The experimental results also demonstrate a satisfactory prediction ability of iPro-WAEL on cross-cell lines, promoters annotated by other methods and distinguishing between promoters and enhancers. Moreover, we identify the most important transcription factor binding site (TFBS) motif in promoter regions to facilitate the study of identifying important motifs in the promoter regions. The source code of iPro-WAEL is freely available at https://github.com/HaoWuLab-Bioinformatics/iPro-WAEL.
Collapse
Affiliation(s)
- Pengyu Zhang
- School of Software, Shandong University, Jinan, 250101, Shandong, China.,College of Information Engineering, Northwest A&F University, Yangling, 712100, Shaanxi, China
| | - Hongming Zhang
- College of Information Engineering, Northwest A&F University, Yangling, 712100, Shaanxi, China
| | - Hao Wu
- School of Software, Shandong University, Jinan, 250101, Shandong, China
| |
Collapse
|
27
|
Wang M, Li F, Wu H, Liu Q, Li S. PredPromoter-MF(2L): A Novel Approach of Promoter Prediction Based on Multi-source Feature Fusion and Deep Forest. Interdiscip Sci 2022; 14:697-711. [PMID: 35488998 DOI: 10.1007/s12539-022-00520-4] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/19/2021] [Revised: 04/05/2022] [Accepted: 04/05/2022] [Indexed: 12/12/2022]
Abstract
Promoters short DNA sequences play vital roles in initiating gene transcription. However, it remains a challenge to identify promoters using conventional experiment techniques in a high-throughput manner. To this end, several computational predictors based on machine learning models have been developed, while their performance is unsatisfactory. In this study, we proposed a novel two-layer predictor, called PredPromoter-MF(2L), based on multi-source feature fusion and ensemble learning. PredPromoter-MF(2L) was developed based on various deep features learned by a pre-trained deep learning network model and sequence-derived features. Feature selection based on XGBoost was applied to reduce fused features dimensions, and a cascade deep forest model was trained on the selected feature subset for promoter prediction. The results both fivefold cross-validation and independent test demonstrated that PredPromoter-MF(2L) outperformed state-of-the-art methods.
Collapse
Affiliation(s)
- Miao Wang
- College of Information Engineering, Northwest A&F University, Yangling, 712100, Shanxi, China
| | - Fuyi Li
- Department of Microbiology and Immunology, The Peter Doherty Institute for Infection and Immunity, The University of Melbourne, 792 Elizabeth Street, Melbourne, VIC, 3000, Australia
| | - Hao Wu
- School of Software, Shandong University, Jinan, 250100, Shandong, China
| | - Quanzhong Liu
- College of Information Engineering, Northwest A&F University, Yangling, 712100, Shanxi, China.
| | - Shuqin Li
- College of Information Engineering, Northwest A&F University, Yangling, 712100, Shanxi, China.
| |
Collapse
|
28
|
Database of Potential Promoter Sequences in the Capsicum annuum Genome. BIOLOGY 2022; 11:biology11081117. [PMID: 35892972 PMCID: PMC9332048 DOI: 10.3390/biology11081117] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 04/01/2022] [Revised: 07/19/2022] [Accepted: 07/23/2022] [Indexed: 11/16/2022]
Abstract
In this study, we used a mathematical method for the multiple alignment of highly divergent sequences (MAHDS) to create a database of potential promoter sequences (PPSs) in the Capsicum annuum genome. To search for PPSs, 20 statistically significant classes of sequences located in the range from −499 to +100 nucleotides near the annotated genes were calculated. For each class, a position–weight matrix (PWM) was computed and then used to identify PPSs in the C. annuum genome. In total, 825,136 PPSs were detected, with a false positive rate of 0.13%. The PPSs obtained with the MAHDS method were tested using TSSFinder, which detects transcription start sites. The databank of the found PPSs provides their coordinates in chromosomes, the alignment of each PPS with the PWM, and the level of statistical significance as a normal distribution argument, and can be used in genetic engineering and biotechnology.
Collapse
|
29
|
BERT-Promoter: An improved sequence-based predictor of DNA promoter using BERT pre-trained model and SHAP feature selection. Comput Biol Chem 2022; 99:107732. [PMID: 35863177 DOI: 10.1016/j.compbiolchem.2022.107732] [Citation(s) in RCA: 35] [Impact Index Per Article: 11.7] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/16/2022] [Accepted: 07/12/2022] [Indexed: 02/01/2023]
Abstract
A promoter is a sequence of DNA that initializes the process of transcription and regulates whenever and wherever genes are expressed in the organism. Because of its importance in molecular biology, identifying DNA promoters are challenging to provide useful information related to its functions and related diseases. Several computational models have been developed to early predict promoters from high-throughput sequencing over the past decade. Although some useful predictors have been proposed, there remains short-falls in those models and there is an urgent need to enhance the predictive performance to meet the practice requirements. In this study, we proposed a novel architecture that incorporated transformer natural language processing (NLP) and explainable machine learning to address this problem. More specifically, a pre-trained Bidirectional Encoder Representations from Transformers (BERT) model was employed to encode DNA sequences, and SHapley Additive exPlanations (SHAP) analysis served as a feature selection step to look at the top-rank BERT encodings. At the last stage, different machine learning classifiers were implemented to learn the top features and produce the prediction outcomes. This study not only predicted the DNA promoters but also their activities (strong or weak promoters). Overall, several experiments showed an accuracy of 85.5 % and 76.9 % for these two levels, respectively. Our performance showed a superiority to previously published predictors on the same dataset in most measurement metrics. We named our predictor as BERT-Promoter and it is freely available at https://github.com/khanhlee/bert-promoter.
Collapse
|
30
|
Zou H. iRNA5hmC-HOC: High-order correlation information for identifying RNA 5-hydroxymethylcytosine modification. J Bioinform Comput Biol 2022; 20:2250017. [DOI: 10.1142/s0219720022500172] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/18/2022]
|
31
|
Naseer S, Hussain W, Khan YD, Rasool N. iPhosS(Deep)-PseAAC: Identification of Phosphoserine Sites in Proteins Using Deep Learning on General Pseudo Amino Acid Compositions. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2022; 19:1703-1714. [PMID: 33242308 DOI: 10.1109/tcbb.2020.3040747] [Citation(s) in RCA: 13] [Impact Index Per Article: 4.3] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/11/2023]
Abstract
Among all the PTMs, the protein phosphorylation is pivotal for various pathological and physiological processes. About 30 percent of eukaryotic proteins undergo the phosphorylation modification, leading to various changes in conformation, function, stability, localization, and so forth. In eukaryotic proteins, phosphorylation occurs on serine (S), Threonine (T) and Tyrosine (Y) residues. Among these all, serine phosphorylation has its own importance as it is associated with various importance biological processes, including energy metabolism, signal transduction pathways, cell cycling, and apoptosis. Thus, its identification is important, however, the in vitro, ex vivo and in vivo identification can be laborious, time-taking and costly. There is a dire need of an efficient and accurate computational model to help researchers and biologists identifying these sites, in an easy manner. Herein, we propose a novel predictor for identification of Phosphoserine sites (PhosS) in proteins, by integrating the Chou's Pseudo Amino Acid Composition (PseAAC) with deep features. We used well-known DNNs for both the tasks of learning a feature representation of peptide sequences and performing classifications. Among different DNNs, the best score is shown by Covolutional Neural Network based model which renders CNN based prediction model the best for Phosphoserine prediction. Based on these results, it is concluded that the proposed model can help to identify PhosS sites in a very efficient and accurate manner which can help scientists understand the mechanism of this modification in proteins.
Collapse
|
32
|
Zou H, Yang F, Yin Z. Identification of tumor homing peptides by utilizing hybrid feature representation. J Biomol Struct Dyn 2022; 41:3405-3412. [PMID: 35262448 DOI: 10.1080/07391102.2022.2049368] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/31/2022]
Abstract
Cancer is one of the serious diseases, recent studies reported that tumor homing peptides (THPs) play a key role in treatment of cancer. Due to the experimental methods are time-consuming and expensive, it is urgent to develop automatic computational approaches to identify THPs. Hence, in this study, we proposed a novel machine learning methods to distinguish THPs from non-THPs, in which the peptide sequences firstly encoded by pseudo residue pairwise energy content matrix (PseRECM) and pseudo physicochemical property (PsePC). Moreover, the least absolute shrinkage and selection operator (LAASO) was employed to select optimal features from the extracted features. All of these selected features were fed into support vector machine (SVM) for identifying THPs. We achieved 89.02%, 88.49%, and 94.58% classification accuracy on the Main, Small, and Main90 dataset, respectively. Experimental results showed that our proposed method outperforms the existing predictors on the same benchmark datasets. It indicates that the proposed method may be a useful tool in identifying THPs. The datasets and codes used in current study are available at https://figshare.com/articles/online_resource/iTHPs/16778770.Communicated by Ramaswamy H. Sarma.
Collapse
Affiliation(s)
- Hongliang Zou
- School of Communications and Electronics, Jiangxi Science and Technology Normal University, Nanchang, China
| | - Fan Yang
- School of Communications and Electronics, Jiangxi Science and Technology Normal University, Nanchang, China
| | - Zhijian Yin
- School of Communications and Electronics, Jiangxi Science and Technology Normal University, Nanchang, China
| |
Collapse
|
33
|
Zou H, Yang F, Yin Z. iTTCA-MFF: identifying tumor T cell antigens based on multiple feature fusion. Immunogenetics 2022; 74:447-454. [PMID: 35246701 DOI: 10.1007/s00251-022-01258-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/04/2022] [Accepted: 02/26/2022] [Indexed: 11/05/2022]
Abstract
Cancer is a terrible disease, recent studies reported that tumor T cell antigens (TTCAs) may play a promising role in cancer treatment. Since experimental methods are still expensive and time-consuming, it is highly desirable to develop automatic computational methods to identify tumor T cell antigens from the huge amount of natural and synthetic peptides. Hence, in this study, a novel computational model called iTTCA-MFF was proposed to identify TTCAs. In order to describe the sequence effectively, the physicochemical (PC) properties of amino acid and residue pairwise energy content matrix (RECM) were firstly employed to encode peptide sequences. Then, two different approaches including covariance and Pearson's correlation coefficient (PCC) were used to collect discriminative information from PC and RECM matrixes. Next, an effective feature selection approach called the least absolute shrinkage and selection operator (LAASO) was adopted to select the optimal features. These selected optimal features were fed into support vector machine (SVM) for identifying TTCAs. We performed experiments on two different datasets, experimental results indicated that the proposed method is promising and may play a complementary role to the existing methods for identifying TTCAs. The datasets and codes can be available at https://figshare.com/articles/online_resource/iTTCA-MFF/17636120 .
Collapse
Affiliation(s)
- Hongliang Zou
- School of Communications and Electronics, Jiangxi Science and Technology Normal University, Nanchang, 330003, China.
| | - Fan Yang
- School of Communications and Electronics, Jiangxi Science and Technology Normal University, Nanchang, 330003, China
| | - Zhijian Yin
- School of Communications and Electronics, Jiangxi Science and Technology Normal University, Nanchang, 330003, China
| |
Collapse
|
34
|
Qiao H, Zhang S, Xue T, Wang J, Wang B. iPro-GAN: A novel model based on generative adversarial learning for identifying promoters and their strength. COMPUTER METHODS AND PROGRAMS IN BIOMEDICINE 2022; 215:106625. [PMID: 35038653 DOI: 10.1016/j.cmpb.2022.106625] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Received: 06/11/2021] [Revised: 12/13/2021] [Accepted: 01/06/2022] [Indexed: 06/14/2023]
Abstract
BACKGROUND AND OBJECTIVE Promoter is a component of the gene, which can specifically bind with RNA polymerase and determine where transcription starts, and also determine the transcription efficiency of the gene. Promoters can be divided into strong promoters and weak promoters because their structures and the interaction time interval are quite different. The functional variation of the promoter can lead to a variety of diseases. Therefore, identifying promoters and their strength is necessary and has important biological significance. A novel and promising model based on deep learning is proposed to achieve it. METHODS In this work, we build a power model named iPro-GAN for identification of promoters and their strength. First, we collect benchmark datasets and independent datasets for training and testing. Then, Moran-based spatial auto-cross correlation method is used as feature extraction method. Finally, deep convolution generative adversarial network with 10-fold cross validation is applied for classifying. The first layer of the model is used to identify the promoter and the second layer is used to determine its type. RESULTS On the benchmark data set, the accuracy of the first layer predictor is 93.15%, and the accuracy of the second layer predictor is 92.30%. On the independent data set, the accuracy of the first layer predictor is 86.77%, and the accuracy of the second layer predictor is 91.66%. In particular, breakthrough progress has been made in the identification of promoters' strength. CONCLUSIONS These results are far higher than the existing best predictor, which indicate that our model is serviceable and practicable to identify promoters and their strength. Furthermore, the datasets and source codes are available from this link: https://github.com/Bovbene/iPro-GAN.
Collapse
Affiliation(s)
- Huijuan Qiao
- School of Mathematics and Statistics, Xidian University, Xi'an, 710071, PR China
| | - Shengli Zhang
- School of Mathematics and Statistics, Xidian University, Xi'an, 710071, PR China.
| | - Tian Xue
- School of Mathematics and Statistics, Xidian University, Xi'an, 710071, PR China
| | - Jinyue Wang
- School of Mathematics and Statistics, Xidian University, Xi'an, 710071, PR China
| | - Bowei Wang
- School of Mathematics and Statistics, Xidian University, Xi'an, 710071, PR China
| |
Collapse
|
35
|
Ma D, Chen Z, He Z, Huang X. A SNARE Protein Identification Method Based on iLearnPlus to Efficiently Solve the Data Imbalance Problem. Front Genet 2022; 12:818841. [PMID: 35154261 PMCID: PMC8832978 DOI: 10.3389/fgene.2021.818841] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/20/2021] [Accepted: 12/14/2021] [Indexed: 11/13/2022] Open
Abstract
Machine learning has been widely used to solve complex problems in engineering applications and scientific fields, and many machine learning-based methods have achieved good results in different fields. SNAREs are key elements of membrane fusion and required for the fusion process of stable intermediates. They are also associated with the formation of some psychiatric disorders. This study processes the original sequence data with the synthetic minority oversampling technique (SMOTE) to solve the problem of data imbalance and produces the most suitable machine learning model with the iLearnPlus platform for the identification of SNARE proteins. Ultimately, a sensitivity of 66.67%, specificity of 93.63%, accuracy of 91.33%, and MCC of 0.528 were obtained in the cross-validation dataset, and a sensitivity of 66.67%, specificity of 93.63%, accuracy of 91.33%, and MCC of 0.528 were obtained in the independent dataset (the adaptive skip dipeptide composition descriptor was used for feature extraction, and LightGBM with proper parameters was used as the classifier). These results demonstrate that this combination can perform well in the classification of SNARE proteins and is superior to other methods.
Collapse
|
36
|
Zhang M, Jia C, Li F, Li C, Zhu Y, Akutsu T, Webb GI, Zou Q, Coin LJM, Song J. Critical assessment of computational tools for prokaryotic and eukaryotic promoter prediction. Brief Bioinform 2022; 23:6502561. [PMID: 35021193 PMCID: PMC8921625 DOI: 10.1093/bib/bbab551] [Citation(s) in RCA: 10] [Impact Index Per Article: 3.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/09/2021] [Revised: 11/12/2021] [Accepted: 11/30/2021] [Indexed: 01/13/2023] Open
Abstract
Promoters are crucial regulatory DNA regions for gene transcriptional activation. Rapid advances in next-generation sequencing technologies have accelerated the accumulation of genome sequences, providing increased training data to inform computational approaches for both prokaryotic and eukaryotic promoter prediction. However, it remains a significant challenge to accurately identify species-specific promoter sequences using computational approaches. To advance computational support for promoter prediction, in this study, we curated 58 comprehensive, up-to-date, benchmark datasets for 7 different species (i.e. Escherichia coli, Bacillus subtilis, Homo sapiens, Mus musculus, Arabidopsis thaliana, Zea mays and Drosophila melanogaster) to assist the research community to assess the relative functionality of alternative approaches and support future research on both prokaryotic and eukaryotic promoters. We revisited 106 predictors published since 2000 for promoter identification (40 for prokaryotic promoter, 61 for eukaryotic promoter, and 5 for both). We systematically evaluated their training datasets, computational methodologies, calculated features, performance and software usability. On the basis of these benchmark datasets, we benchmarked 19 predictors with functioning webservers/local tools and assessed their prediction performance. We found that deep learning and traditional machine learning-based approaches generally outperformed scoring function-based approaches. Taken together, the curated benchmark dataset repository and the benchmarking analysis in this study serve to inform the design and implementation of computational approaches for promoter prediction and facilitate more rigorous comparison of new techniques in the future.
Collapse
Affiliation(s)
| | - Cangzhi Jia
- Corresponding authors: Jiangning Song, Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Melbourne, VIC 3800, Australia. E-mail: ; Lachlan J.M. Coin, Department of Microbiology and Immunology, The Peter Doherty Institute for Infection and Immunity, The University of Melbourne, 792 Elizabeth Street, Melbourne, Victoria 3000, Australia. E-mail: ; Quan Zou, Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, China. E-mail: ; Cangzhi Jia, School of Science, Dalian Maritime University, Dalian 116026, China. E-mail:
| | | | | | | | | | - Geoffrey I Webb
- Department of Data Science and Artificial Intelligence, Monash University, Melbourne, VIC 3800, Australia,Monash Data Futures Institute, Monash University, Melbourne, VIC 3800, Australia
| | - Quan Zou
- Corresponding authors: Jiangning Song, Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Melbourne, VIC 3800, Australia. E-mail: ; Lachlan J.M. Coin, Department of Microbiology and Immunology, The Peter Doherty Institute for Infection and Immunity, The University of Melbourne, 792 Elizabeth Street, Melbourne, Victoria 3000, Australia. E-mail: ; Quan Zou, Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, China. E-mail: ; Cangzhi Jia, School of Science, Dalian Maritime University, Dalian 116026, China. E-mail:
| | - Lachlan J M Coin
- Corresponding authors: Jiangning Song, Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Melbourne, VIC 3800, Australia. E-mail: ; Lachlan J.M. Coin, Department of Microbiology and Immunology, The Peter Doherty Institute for Infection and Immunity, The University of Melbourne, 792 Elizabeth Street, Melbourne, Victoria 3000, Australia. E-mail: ; Quan Zou, Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, China. E-mail: ; Cangzhi Jia, School of Science, Dalian Maritime University, Dalian 116026, China. E-mail:
| | - Jiangning Song
- Corresponding authors: Jiangning Song, Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Melbourne, VIC 3800, Australia. E-mail: ; Lachlan J.M. Coin, Department of Microbiology and Immunology, The Peter Doherty Institute for Infection and Immunity, The University of Melbourne, 792 Elizabeth Street, Melbourne, Victoria 3000, Australia. E-mail: ; Quan Zou, Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, China. E-mail: ; Cangzhi Jia, School of Science, Dalian Maritime University, Dalian 116026, China. E-mail:
| |
Collapse
|
37
|
Li H, Shi L, Gao W, Zhang Z, Zhang L, Wang G. dPromoter-XGBoost: Detecting promoters and strength by combining multiple descriptors and feature selection using XGBoost. Methods 2022; 204:215-222. [PMID: 34998983 DOI: 10.1016/j.ymeth.2022.01.001] [Citation(s) in RCA: 15] [Impact Index Per Article: 5.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/19/2021] [Revised: 12/13/2021] [Accepted: 01/02/2022] [Indexed: 12/12/2022] Open
Abstract
Promoters play an irreplaceable role in biological processes and genetics, which are responsible for stimulating the transcription and expression of specific genes. Promoter abnormalities have been found in some diseases, and the level of promoter-binding transcription factors can be used as a marker before a disease occurs. Hence, detecting promoters from DNA sequences has important biological significance, particular, distinguishing strong promoters can help to elucidate differences in gene expression and the mechanisms of specific diseases. With the introduction of third-generation sequencing, it is difficult to match the speed of sequencing to the speed of labeling promoters experimentally. Many computing models have been designed to fill this gap and identify unlabeled DNA. However, their feature representation methods are very singular, which cannot reflect the information contained in the original samples. With the aim of avoiding information loss, we propose a computational model based on multiple descriptors and feature selection to jointly express samples. It is worth mentioning that a new feature descriptor called K-mer word vector is defined. The promoter model of multiple feature descriptors dominated by K-mer word vector achieves similar performance to existing methods, the sensitivity of 85.72% can distinguish the promoter more effectively than other methods. Furthermore, the performance of the promoter strength has surpassed published methods, and accuracy of 77.00% greatly improves the ability to distinguish between strong and weak promoters.
Collapse
Affiliation(s)
- Hongfei Li
- College of Information and Computer Engineering, Northeast Forestry University, Harbin, China; Yangtze Delta Region Institute, University of Electronic Science and Technology, Quzhou,China
| | - Lei Shi
- Department of Spine Surgery, Changzheng Hospital, Naval Medical University, Shanghai, China
| | - Wentao Gao
- College of Information and Computer Engineering, Northeast Forestry University, Harbin, China
| | - Zixiao Zhang
- College of Information and Computer Engineering, Northeast Forestry University, Harbin, China
| | - Lichao Zhang
- School of Intelligent Manufacturing and Equipment, Shenzhen Institute of Information Technology, Shenzhen, China
| | - Guohua Wang
- College of Information and Computer Engineering, Northeast Forestry University, Harbin, China.
| |
Collapse
|
38
|
Zou H. Identifying blood‐brain barrier peptides by using amino acids physicochemical properties and features fusion method. Pept Sci (Hoboken) 2021. [DOI: 10.1002/pep2.24247] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/22/2022]
Affiliation(s)
- Hongliang Zou
- School of Communications and Electronics Jiangxi Science and Technology Normal University Nanchang China
| |
Collapse
|
39
|
Liang Y, Zhang S, Qiao H, Yao Y. iPromoter-ET: Identifying promoters and their strength by extremely randomized trees-based feature selection. Anal Biochem 2021; 630:114335. [PMID: 34389299 DOI: 10.1016/j.ab.2021.114335] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Key Words] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/17/2021] [Revised: 07/24/2021] [Accepted: 08/09/2021] [Indexed: 10/20/2022]
Abstract
Promoter is a region of DNA that determines the transcription of a particular gene. There are several σ factors in the RNA polymerase, which has the function of identifying the promoter and facilitating the binding of the RNA polymerase to the promoter. Owing to the importance of promoter in genome research, it is an urgent task to develop computational tool for effectively identifying promoters and their strength facing the avalanche of DNA sequences discovered in the post-genomic age. In this paper, we develop a model named iPromoter-ET using the k-mer nucleotide composition, binary encoding and dinucleotide property matrix-based distance transformation for features extraction, and extremely randomized trees (extra trees) for feature selection. Its 1st layer is used to identify whether a DNA sequence is of promoter or not, while its 2nd layer is to identify promoter samples as being strong or weak promoter. Support vector machine and the five cross-validation are used to perform identification and assess performance, respectively. The results indicate that our model remarkably outperforms the existing models in both the 1st and 2nd layers for accuracy and stability. We anticipate that our proposed model will become a very effective intelligent tool, or at the least, a complementary tool to the existing modes of identifying promoters and their strength. Moreover, the datasets and codes for iPromoter-ET are freely available at https://github.com/shengli0201/iPromoter-ET.
Collapse
Affiliation(s)
- Yunyun Liang
- School of Science, Xi'an Polytechnic University, Xi'an, 710048, PR China.
| | - Shengli Zhang
- School of Mathematics and Statistics, Xidian University, Xi'an, 710071, PR China
| | - Huijuan Qiao
- School of Mathematics and Statistics, Xidian University, Xi'an, 710071, PR China
| | - Yingying Yao
- School of Mathematics and Statistics, Xidian University, Xi'an, 710071, PR China
| |
Collapse
|
40
|
Lyu Y, He W, Li S, Zou Q, Guo F. iPro2L-PSTKNC: A Two-Layer Predictor for Discovering Various Types of Promoters by Position Specific of Nucleotide Composition. IEEE J Biomed Health Inform 2021; 25:2329-2337. [PMID: 32976109 DOI: 10.1109/jbhi.2020.3026735] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/08/2022]
Abstract
Promoters are DNA regulatory elements located proximal to the transcription start site, which are in charge of the initiation of specific gene transcription. In Escherichia coli, promoters can be recognized by σ factors that have multiple families based on distinct function and structure, such as σ24, σ28, σ32, σ38, σ54 and σ70. At present, biological methods are mainly used to identify these promoters. However, because it is time-consuming and material-consuming to do biological experiments, computational biology algorithm has emerged as a more effective way to predict the classification. In this study, we develop a novel two-layer seamless predictor called iPro2L-PSTKNC to identify the promoters of the E. coli genome, which based on the feature extraction model we newly proposed that is named as the position specific tendencies of k-mer nucleotide composition (PSTKNC). On the first layer, it is a binary classification predicting whether a sequence is promoter or not. And the second layer is a multiple classification identifying which type the identified promoter belongs to. The ensemble classification SVM performsbest comparing with other algorithms, which gets a promising accuracy and the Matthews correlation coefficient (MCC) at [Formula: see text] and [Formula: see text]. Our data and code are available at https://github.com/lyuyinuo/iPro2L-PSTKNC.
Collapse
|
41
|
Naseer S, Hussain W, Khan YD, Rasool N. NPalmitoylDeep-PseAAC: A Predictor of N-Palmitoylation Sites in Proteins Using Deep Representations of Proteins and PseAAC via Modified 5-Steps Rule. Curr Bioinform 2021. [DOI: 10.2174/1574893615999200605142828] [Citation(s) in RCA: 28] [Impact Index Per Article: 7.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/22/2022]
Abstract
Background:
Among all the major Post-translational modification, lipid modifications
possess special significance due to their widespread functional importance in eukaryotic cells. There
exist multiple types of lipid modifications and Palmitoylation, among them, is one of the broader
types of modification, having three different types. The N-Palmitoylation is carried out by
attachment of palmitic acid to an N-terminal cysteine. Due to the association of N-Palmitoylation
with various biological functions and diseases such as Alzheimer’s and other neurodegenerative
diseases, its identification is very important.
Objective:
The in vitro, ex vivo and in vivo identification of Palmitoylation is laborious, time-taking
and costly. There is a dire need for an efficient and accurate computational model to help researchers
and biologists identify these sites, in an easy manner. Herein, we propose a novel prediction model
for the identification of N-Palmitoylation sites in proteins.
Method:
The proposed prediction model is developed by combining the Chou’s Pseudo Amino
Acid Composition (PseAAC) with deep neural networks. We used well-known deep neural
networks (DNNs) for both the tasks of learning a feature representation of peptide sequences and
developing a prediction model to perform classification.
Results:
Among different DNNs, Gated Recurrent Unit (GRU) based RNN model showed the
highest scores in terms of accuracy, and all other computed measures, and outperforms all the
previously reported predictors.
Conclusion:
The proposed GRU based RNN model can help to identify N-Palmitoylation in a very
efficient and accurate manner which can help scientists understand the mechanism of this
modification in proteins.
Collapse
Affiliation(s)
- Sheraz Naseer
- Department of Computer Science, School of Systems and Technology, University of Management and Technology, P.O. Box 10033, C-II, Johar Town, Lahore 54770, Pakistan
| | - Waqar Hussain
- National Center of Artificial Intelligence, Punjab University College of Information Technology, University of the Punjab, Lahore, Pakistan
| | - Yaser Daanial Khan
- Department of Computer Science, School of Systems and Technology, University of Management and Technology, P.O. Box 10033, C-II, Johar Town, Lahore 54770, Pakistan
| | - Nouman Rasool
- Dr Panjwani Center for Molecular Medicine and Drug Research, International Center for Chemical and Biological Sciences, University of Karachi, Karachi, 75270, Pakistan
| |
Collapse
|
42
|
iPTT(2 L)-CNN: A Two-Layer Predictor for Identifying Promoters and Their Types in Plant Genomes by Convolutional Neural Network. COMPUTATIONAL AND MATHEMATICAL METHODS IN MEDICINE 2021; 2021:6636350. [PMID: 33488763 PMCID: PMC7803414 DOI: 10.1155/2021/6636350] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 11/24/2020] [Revised: 12/13/2020] [Accepted: 12/16/2020] [Indexed: 11/18/2022]
Abstract
A promoter is a short DNA sequence near to the start codon, responsible for initiating transcription of a specific gene in genome. The accurate recognition of promoters has great significance for a better understanding of the transcriptional regulation. Because of their importance in the process of biological transcriptional regulation, there is an urgent need to develop in silico tools to identify promoters and their types timely and accurately. A number of prediction methods had been developed in this regard; however, almost all of them were merely used for identifying promoters and their strength or sigma types. Owing to that TATA box region in TATA promoter that influences posttranscriptional processes, in the current study, we developed a two-layer predictor called iPTT(2L)-CNN by using the convolutional neural network (CNN) for identifying TATA and TATA-less promoters. The first layer can be used to identify a given DNA sequence as a promoter or nonpromoter. The second layer is used to identify whether the recognized promoter is TATA promoter or not. The 5-fold crossvalidation and independent testing results demonstrate that the constructed predictor is promising for identifying promoter and classifying TATA and TATA-less promoter. Furthermore, to make it easier for most experimental scientists get the results they need, a user-friendly web server has been established at http://www.jci-bioinfo.cn/iPPT(2L)-CNN.
Collapse
|
43
|
Zhou K, Ng W, Cortés-Peña Y, Wang X. Increasing metabolic pathway flux by using machine learning models. Curr Opin Biotechnol 2020; 66:179-185. [PMID: 32896771 DOI: 10.1016/j.copbio.2020.08.004] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/11/2020] [Revised: 08/03/2020] [Accepted: 08/11/2020] [Indexed: 01/19/2023]
Abstract
Machine learning is transforming many industries through self-improving models that are fueled by big data and high computing power. The field of metabolic engineering, which uses cellular biochemical network to manufacture useful small molecules, has also witnessed the first wave of machine learning applications in the past five years, covering reaction route design, enzyme selection, pathway engineering and process optimization. This review focuses on pathway engineering, and uses a few recent studies to illustrate (1) how machine learning models can be useful in overcoming an evident rate-limiting step, and (2) how the models may be used to exhaustively search - or guide optimization algorithms to search - a large design space when the cellular regulation of the reaction network is more convoluted.
Collapse
Affiliation(s)
- Kang Zhou
- Department of Chemical and Biomolecular Engineering, National University of Singapore, 117585, Singapore.
| | - Wenfa Ng
- Department of Chemical and Biomolecular Engineering, National University of Singapore, 117585, Singapore
| | - Yoel Cortés-Peña
- Department of Chemical and Biomolecular Engineering, National University of Singapore, 117585, Singapore; Department of Civil and Environmental Engineering, University of Illinois at Urbana-Champaign, Urbana, IL 61801, USA; DOE Center for Advanced Bioenergy and Bioproducts Innovation (CABBI), University of Illinois at Urbana-Champaign, Urbana, IL 61801, USA
| | - Xiaonan Wang
- Department of Chemical and Biomolecular Engineering, National University of Singapore, 117585, Singapore
| |
Collapse
|
44
|
|
45
|
Do DT, Le NQK. Using extreme gradient boosting to identify origin of replication in Saccharomyces cerevisiae via hybrid features. Genomics 2020; 112:2445-2451. [PMID: 31987913 DOI: 10.1016/j.ygeno.2020.01.017] [Citation(s) in RCA: 28] [Impact Index Per Article: 5.6] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/14/2019] [Revised: 01/12/2020] [Accepted: 01/23/2020] [Indexed: 12/11/2022]
Abstract
DNA replication is a fundamental task that plays a crucial role in the propagation of all living things on earth. Hence, the accurate identification of its origin could be the key to giving an insightful understanding of the regulatory mechanism of gene expression. Indeed, with the robust development of computational techniques and the abundant biological sequencing data, it has become possible for scientists to identify the origin of replication accurately and promptly. This growing concern has drawn a lot of attention among experts in this field. However, to gain better outcomes, more work is required. Therefore, this study is designed to explore the combination of state-of-the-art features and extreme gradient boosting learning system in classifying DNA sequences. Our hybrid approach is able to identify the origin of DNA replication with achieved sensitivity of 85.19%, specificity of 93.83%, accuracy of 89.51%, and MCC of 0.7931. Evidence is presented to show that our proposed method is superior to the state-of-the-art methods on the same benchmark dataset. Moreover, the research results represent a further step towards developing the prediction models for DNA replication in particular and DNA sequences in general.
Collapse
Affiliation(s)
- Duyen Thi Do
- Toxicology and Biomedicine Research Group, Faculty of Applied Sciences, Ton Duc Thang University, Ho Chi Minh City, Viet Nam.
| | - Nguyen Quoc Khanh Le
- Professional Master Program in Artificial Intelligence in Medicine, College of Medicine, Taipei Medical University, Taipei City 106, Taiwan; Research Center of Artificial Intelligence in Medicine, Taipei Medical University, Taipei City 106, Taiwan.
| |
Collapse
|
46
|
Some illuminating remarks on molecular genetics and genomics as well as drug development. Mol Genet Genomics 2020; 295:261-274. [PMID: 31894399 DOI: 10.1007/s00438-019-01634-z] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/21/2019] [Accepted: 12/05/2019] [Indexed: 02/07/2023]
Abstract
Facing the explosive growth of biological sequences unearthed in the post-genomic age, one of the most important but also most difficult problems in computational biology is how to express a biological sequence with a discrete model or a vector, but still keep it with considerable sequence-order information or its special pattern. To deal with such a challenging problem, the ideas of "pseudo amino acid components" and "pseudo K-tuple nucleotide composition" have been proposed. The ideas and their approaches have further stimulated the birth for "distorted key theory", "wenxing diagram", and substantially strengthening the power in treating the multi-label systems, as well as the establishment of the famous "5-steps rule". All these logic developments are quite natural that are very useful not only for theoretical scientists but also for experimental scientists in conducting genetics/genomics analysis and drug development. Presented in this review paper are also their future perspectives; i.e., their impacts will become even more significant and propounding.
Collapse
|
47
|
Ren J, Lee J, Na D. Recent advances in genetic engineering tools based on synthetic biology. J Microbiol 2020; 58:1-10. [PMID: 31898252 DOI: 10.1007/s12275-020-9334-x] [Citation(s) in RCA: 23] [Impact Index Per Article: 4.6] [Reference Citation Analysis] [Abstract] [Key Words] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/03/2019] [Revised: 08/19/2019] [Accepted: 11/05/2019] [Indexed: 12/26/2022]
Abstract
Genome-scale engineering is a crucial methodology to rationally regulate microbiological system operations, leading to expected biological behaviors or enhanced bioproduct yields. Over the past decade, innovative genome modification technologies have been developed for effectively regulating and manipulating genes at the genome level. Here, we discuss the current genome-scale engineering technologies used for microbial engineering. Recently developed strategies, such as clustered regularly interspaced short palindromic repeats (CRISPR)-Cas9, multiplex automated genome engineering (MAGE), promoter engineering, CRISPR-based regulations, and synthetic small regulatory RNA (sRNA)-based knockdown, are considered as powerful tools for genome-scale engineering in microbiological systems. MAGE, which modifies specific nucleotides of the genome sequence, is utilized as a genome-editing tool. Contrastingly, synthetic sRNA, CRISPRi, and CRISPRa are mainly used to regulate gene expression without modifying the genome sequence. This review introduces the recent genome-scale editing and regulating technologies and their applications in metabolic engineering.
Collapse
Affiliation(s)
- Jun Ren
- School of Integrative Engineering, Chung-Ang University, Seoul, 06974, Republic of Korea
| | - Jingyu Lee
- School of Integrative Engineering, Chung-Ang University, Seoul, 06974, Republic of Korea
| | - Dokyun Na
- School of Integrative Engineering, Chung-Ang University, Seoul, 06974, Republic of Korea.
| |
Collapse
|
48
|
Le NQK, Yapp EKY, Nagasundaram N, Yeh HY. Classifying Promoters by Interpreting the Hidden Information of DNA Sequences via Deep Learning and Combination of Continuous FastText N-Grams. Front Bioeng Biotechnol 2019; 7:305. [PMID: 31750297 PMCID: PMC6848157 DOI: 10.3389/fbioe.2019.00305] [Citation(s) in RCA: 62] [Impact Index Per Article: 10.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/13/2019] [Accepted: 10/17/2019] [Indexed: 01/16/2023] Open
Abstract
A promoter is a short region of DNA (100-1,000 bp) where transcription of a gene by RNA polymerase begins. It is typically located directly upstream or at the 5' end of the transcription initiation site. DNA promoter has been proven to be the primary cause of many human diseases, especially diabetes, cancer, or Huntington's disease. Therefore, classifying promoters has become an interesting problem and it has attracted the attention of a lot of researchers in the bioinformatics field. There were a variety of studies conducted to resolve this problem, however, their performance results still require further improvement. In this study, we will present an innovative approach by interpreting DNA sequences as a combination of continuous FastText N-grams, which are then fed into a deep neural network in order to classify them. Our approach is able to attain a cross-validation accuracy of 85.41 and 73.1% in the two layers, respectively. Our results outperformed the state-of-the-art methods on the same dataset, especially in the second layer (strength classification). Throughout this study, promoter regions could be identified with high accuracy and it provides analysis for further biological research as well as precision medicine. In addition, this study opens new paths for the natural language processing application in omics data in general and DNA sequences in particular.
Collapse
Affiliation(s)
- Nguyen Quoc Khanh Le
- Professional Master Program in Artificial Intelligence in Medicine, Taipei Medical University, Taipei, Taiwan
| | | | - N. Nagasundaram
- Medical Humanities Research Cluster, School of Humanities, Nanyang Technological University, Singapore, Singapore
| | - Hui-Yuan Yeh
- Medical Humanities Research Cluster, School of Humanities, Nanyang Technological University, Singapore, Singapore
| |
Collapse
|
49
|
Chou KC. Proposing Pseudo Amino Acid Components is an Important Milestone for Proteome and Genome Analyses. Int J Pept Res Ther 2019. [DOI: 10.1007/s10989-019-09910-7] [Citation(s) in RCA: 17] [Impact Index Per Article: 2.8] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/28/2022]
|
50
|
|