1
Yu Q, Zhang X, Hu Y, Chen S, Yang L. A Method for Predicting DNA Motif Length Based On Deep Learning. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2023; 20:61-73. [PMID: 35275822 DOI: 10.1109/tcbb.2022.3158471] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/04/2023]
Abstract
A DNA motif is a sequence pattern shared by the DNA sequence segments that bind to a specific protein. Discovering motifs in a given DNA sequence dataset plays a vital role in studying gene expression regulation. As an important attribute of the DNA motif, the motif length directly affects the quality of the discovered motifs. Determining the motif length accurately remains a difficult open problem. We propose MotifLen, a new motif length prediction scheme based on supervised machine learning. First, a method of constructing sample data for predicting the motif length is proposed. Second, a deep learning model for motif length prediction is constructed based on a convolutional neural network. Third, methods are given for applying the proposed prediction model to a motif found by an existing motif discovery algorithm. The experimental results show that i) the prediction accuracy of MotifLen is more than 90% on the validation set and is significantly higher than that of the compared methods on real datasets, ii) MotifLen can successfully optimize the motifs found by existing motif discovery algorithms, and iii) it can effectively improve the time performance of some existing motif discovery algorithms.
2
Status Analysis and Future Development Planning of Fitness APP Based on Intelligent Word Frequency Analysis. JOURNAL OF ELECTRICAL AND COMPUTER ENGINEERING 2022. [DOI: 10.1155/2022/5190979] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/18/2022]
Abstract
To analyze the current state and future development of fitness APPs, this paper applies an intelligent word frequency analysis algorithm. To address the problem that the existing qPMS algorithm is very time-consuming when discovering motifs in a large DNA sequence dataset D, a structural motif discovery algorithm for large DNA sequence datasets, SMS, is proposed. Moreover, this paper performs motif discovery by mining high-frequency substrings in the input sequence. The study indicates that the fitness APP analysis based on intelligent word frequency analysis proposed in this paper gives good results. On this basis, this paper analyzes the problems of existing fitness APPs and provides suggestions for their future development.
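The idea of mining high-frequency substrings mentioned in this abstract can be illustrated with a plain k-mer count. This is only a minimal sketch under stated assumptions, not the SMS algorithm itself: the function name `top_kmers` and its parameters are hypothetical, and the fixed substring length `k` is an assumption made for illustration.

```python
from collections import Counter


def top_kmers(sequences, k, n=5):
    """Return the n most frequent length-k substrings across all sequences.

    Hypothetical helper for illustration; it counts every overlapping
    k-mer in every input sequence.
    """
    counts = Counter()
    for seq in sequences:
        # Slide a window of width k over the sequence.
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts.most_common(n)
```

Frequent k-mers found this way are only candidates; a real motif finder would still need to allow mismatches and to score candidates against a background model.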
3
Theepalakshmi P, Reddy US. Freezing firefly algorithm for efficient planted (ℓ, d) motif search. Med Biol Eng Comput 2022; 60:511-530. [PMID: 35020123 DOI: 10.1007/s11517-021-02468-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/28/2020] [Accepted: 11/06/2021] [Indexed: 10/19/2022]
Abstract
The detection of distinctive patterns (motifs) occurring in a set of biological sequences could enable new biological discoveries. Its application to the recognition of transcription factors and their binding sites has demonstrated its importance for understanding gene function, human diseases, and drug design. The literature identifies (ℓ, d) motif search as a widely studied problem in PMS (Planted Motif Search). This paper proposes an efficient optimization algorithm named "Freezing FireFly (FFF)" to solve the (ℓ, d) motif search problem. A new freezing strategy, with local and global variants, is added to improve the performance of the basic Firefly algorithm. It freezes the best possible resulting positions even in the less bright firefly. The performance of the proposed algorithm is evaluated on simulated and real datasets. The experimental results show that the proposed algorithm resolves the instance (50, 21) within 1.47 min in the simulated dataset. For real (such as ChIP-seq (Chromatin Immunoprecipitation)) and synthetic datasets, the proposed algorithm runs much faster than existing state-of-the-art optimization algorithms, including Samselect, TraverStringRef, PMS8, qPMS9, AlignACE, FMGA, and GSGA.
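For readers unfamiliar with the (ℓ, d) planted motif search problem named in this abstract, the following brute-force sketch states it in code: a candidate of length ℓ is a motif if every input sequence contains a substring within Hamming distance d of it. This is an exhaustive illustration only, not the Freezing FireFly algorithm; the function names are hypothetical, and enumerating all 4^ℓ candidates is feasible only for small ℓ.

```python
from itertools import product


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def occurs_within(candidate, seq, d):
    """True if some substring of seq is within Hamming distance d of candidate."""
    l = len(candidate)
    return any(
        hamming(candidate, seq[i:i + l]) <= d
        for i in range(len(seq) - l + 1)
    )


def planted_motifs(sequences, l, d, alphabet="ACGT"):
    """Exhaustively list every (l, d) motif shared by all sequences."""
    motifs = []
    for letters in product(alphabet, repeat=l):
        candidate = "".join(letters)
        if all(occurs_within(candidate, s, d) for s in sequences):
            motifs.append(candidate)
    return motifs
```

The exponential candidate space is what motivates heuristic and optimization-based solvers for challenging instances such as (50, 21).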
Affiliation(s)
- P Theepalakshmi
  - Department of Computer Applications, National Institute of Technology, Tiruchirappalli, Tamilnadu, India.
- U Srinivasulu Reddy
  - Machine Learning and Data Analytics Lab, Center of Excellence in Artificial Intelligence, Department of Computer Applications, National Institute of Technology, Tiruchirappalli, Tamilnadu, India
4
Chen Q, Li Y, Lin C, Chen L, Luo H, Xia S, Liu C, Cheng X, Liu C, Li J, Dou D. OUP accepted manuscript. Nucleic Acids Res 2022; 50:e67. [PMID: 35288754 PMCID: PMC9262588 DOI: 10.1093/nar/gkac173] [Citation(s) in RCA: 9] [Impact Index Per Article: 3.0] [Received: 09/21/2021] [Revised: 02/02/2022] [Accepted: 03/04/2022] [Indexed: 11/21/2022]
Abstract
DNA-encoded library (DEL) technology is a powerful tool for small molecule identification in drug discovery, yet the reported DEL selection strategies have been applied primarily to protein targets in either purified form or in a cellular context. To expand the application of this technology, we employed DEL selection on an RNA target, HIV-1 TAR (trans-acting responsive region), but found that the majority of signals resulted from false positive DNA–RNA binding. We thus developed an optimized selection strategy utilizing RNA patches and competitive elution to minimize unwanted DNA binding, followed by k-mer analysis and motif search to differentiate false positive signals. This optimized strategy resulted in a very clean background in a DEL selection against Escherichia coli FMN Riboswitch, and the enriched compounds were determined to have double-digit nanomolar binding affinity, as well as similar potency in a functional FMN competition assay. These results demonstrated the feasibility of small molecule identification against RNA targets using DEL selection. The developed experimental and computational strategy provided a promising opportunity for RNA ligand screening and expanded the application of DEL selection to a much wider context in drug discovery.
Affiliation(s)
- Liu Chen
  - HitGen Inc., Shuangliu District, Chengdu, China
- Hao Luo
  - HitGen Inc., Shuangliu District, Chengdu, China
- Shuai Xia
  - HitGen Inc., Shuangliu District, Chengdu, China
- Chuan Liu
  - HitGen Inc., Shuangliu District, Chengdu, China
- Jin Li
  - HitGen Inc., Shuangliu District, Chengdu, China
- Dengfeng Dou
  - To whom correspondence should be addressed. Tel: +86 28 85197385 8700;
5
Yuan X, Li J, Bai J, Xi J. A Local Outlier Factor-Based Detection of Copy Number Variations From NGS Data. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2021; 18:1811-1820. [PMID: 31880558 DOI: 10.1109/tcbb.2019.2961886] [Citation(s) in RCA: 17] [Impact Index Per Article: 4.3] [Indexed: 06/10/2023]
Abstract
Copy number variation (CNV) is a major type of genomic structural variation that plays an important role in human disorders. Next generation sequencing (NGS) has fueled the advancement in algorithm design to detect CNVs at base-pair resolution. However, accurate detection of CNVs of low amplitudes remains a challenging task. This paper proposes a new computational method, CNV-LOF, to identify CNVs of full-range amplitudes from NGS data. CNV-LOF is distinctly different from traditional methods, which mainly consider aberrations from a global perspective and rely on some assumed distribution of NGS read depths. In contrast, CNV-LOF takes a local view of the read depths and assigns an outlier factor to each genome segment. With the outlier factor profile, CNV-LOF uses a boxplot procedure to declare CNVs without relying on any distributional assumptions. Simulation experiments indicate that CNV-LOF outperforms five existing methods with respect to F1-measure, sensitivity, and precision. CNV-LOF is further validated on real sequencing samples, yielding results highly consistent with peer methods. CNV-LOF is able to detect CNVs of low and moderate amplitudes where the other existing methods fail, and it is expected to become a routine approach for the discovery of novel CNVs in whole-genome sequencing data.
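The boxplot step described in this abstract can be sketched as follows. This is a generic illustration under stated assumptions rather than the CNV-LOF implementation: the per-segment outlier factors are taken as given inputs, the function name `boxplot_outliers` is hypothetical, and flagging only the upper side with the conventional 1.5 × IQR whisker is an assumption.

```python
from statistics import quantiles


def boxplot_outliers(scores, whisker=1.5):
    """Return indices of scores above the upper boxplot fence.

    scores: one outlier factor per genome segment (assumed precomputed).
    The fence is Q3 + whisker * IQR; no distribution is assumed.
    """
    q1, _, q3 = quantiles(scores, n=4)  # quartile cut points
    upper_fence = q3 + whisker * (q3 - q1)
    return [i for i, s in enumerate(scores) if s > upper_fence]
```

Because the fence is derived from the quartiles of the scores themselves, the rule adapts to each sample without fitting a parametric model of read depth.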
6
Xiao P, Cai X, Rajasekaran S. EMS3: An Improved Algorithm for Finding Edit-Distance Based Motifs. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2021; 18:27-37. [PMID: 32931433 DOI: 10.1109/tcbb.2020.3024222] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/11/2023]
Abstract
Discovering patterns in biological sequences is a crucial step to extract useful information from them. Motifs can be viewed as patterns that occur exactly or with minor changes across some or all of the biological sequences. Motif search has numerous applications including the identification of transcription factors and their binding sites, composite regulatory patterns, similarity among families of proteins, etc. The general problem of motif search is intractable. One of the most studied models of motif search proposed in the literature is Edit-distance based Motif Search (EMS). In EMS, the goal is to find all the patterns of length l that occur with an edit-distance of at most d in each of the input sequences. EMS algorithms existing in the literature do not scale well on challenging instances and large datasets. In this paper, the current state-of-the-art EMS solver is advanced by exploiting the idea of dimension reduction. A novel idea to reduce the cardinality of the alphabet is proposed. The algorithm we propose, EMS3, is an exact algorithm; that is, it finds all the motifs present in the input sequences. EMS3 can also be viewed as a divide and conquer algorithm. In this paper, we provide theoretical analyses to establish the efficiency of EMS3. Extensive experiments on standard benchmark datasets (synthetic and real-world) show that the proposed algorithm outperforms the existing state-of-the-art algorithm (EMS2).
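The EMS problem statement in this abstract can be made concrete with a naive reference solver. The sketch below is not EMS3 and uses none of its dimension-reduction ideas; it simply enumerates every pattern of length l and checks, with a standard approximate-matching dynamic program, whether each sequence contains a substring within edit distance d. All function names are hypothetical, and the approach is practical only for small l.

```python
from itertools import product


def min_edit_distance_to_substring(pattern, text):
    """Smallest edit distance between pattern and any substring of text.

    Standard approximate-matching DP: row 0 is all zeros so a match
    may start at any position in text, and the answer is the minimum
    of the final row so it may end at any position.
    """
    prev = [0] * (len(text) + 1)
    for i, p in enumerate(pattern, start=1):
        cur = [i]  # matching pattern[:i] against the empty string
        for j, t in enumerate(text, start=1):
            cur.append(min(
                prev[j] + 1,             # pattern char has no counterpart
                cur[j - 1] + 1,          # extra char in text
                prev[j - 1] + (p != t),  # match or substitution
            ))
        prev = cur
    return min(prev)


def edit_motifs(sequences, l, d, alphabet="ACGT"):
    """Exhaustively list patterns of length l within edit distance d of every sequence."""
    motifs = []
    for letters in product(alphabet, repeat=l):
        candidate = "".join(letters)
        if all(min_edit_distance_to_substring(candidate, s) <= d
               for s in sequences):
            motifs.append(candidate)
    return motifs
```

An exact solver must return the same set as this enumeration, which makes a brute-force version like this useful as a correctness check on tiny inputs.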
7
Yu Q, Zhao X, Huo H. A new algorithm for DNA motif discovery using multiple sample sequence sets. J Bioinform Comput Biol 2019; 17:1950021. [PMID: 31617465 DOI: 10.1142/s0219720019500215] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/18/2022]
Abstract
DNA motif discovery plays an important role in understanding the mechanisms of gene regulation. Most existing motif discovery algorithms can identify motifs efficiently and effectively when dealing with small datasets. However, large datasets generated by high-throughput sequencing technologies pose a huge challenge: it is too time-consuming to process the entire dataset, but if only a small sample sequence set is processed, it is difficult to identify infrequent motifs. In this paper, we propose a new DNA motif discovery algorithm that first divides the input dataset into multiple sample sequence sets, then refines the initial motifs of each sample sequence set with the expectation maximization method, and finally combines the results from all sample sequence sets. In addition, we design a new initial motif generation method that uses the entire dataset, which helps to identify infrequent motifs. The experimental results on simulated data show that the proposed algorithm has better time performance on large datasets and better accuracy in identifying infrequent motifs than the compared algorithms. We have also verified the validity of the proposed algorithm on real data.
Affiliation(s)
- Qiang Yu
  - School of Computer Science and Technology, Xidian University, Xi'an, 710071, P. R. China
- Xiang Zhao
  - School of Computer Science and Technology, Xidian University, Xi'an, 710071, P. R. China
- Hongwei Huo
  - School of Computer Science and Technology, Xidian University, Xi'an, 710071, P. R. China
8
Sun CX, Yang Y, Wang H, Wang WH. A Clustering Approach for Motif Discovery in ChIP-Seq Dataset. ENTROPY (BASEL, SWITZERLAND) 2019; 21:E802. [PMID: 33267515 PMCID: PMC7515331 DOI: 10.3390/e21080802] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Received: 06/06/2019] [Revised: 08/04/2019] [Accepted: 08/15/2019] [Indexed: 12/25/2022]
Abstract
Chromatin immunoprecipitation combined with next-generation sequencing (ChIP-Seq) technology has enabled the identification of transcription factor binding sites (TFBSs) on a genome-wide scale. To effectively and efficiently discover TFBSs in the thousand or more DNA sequences of a ChIP-Seq data set, we propose a new algorithm named AP-ChIP. First, we set two thresholds based on probabilistic analysis to construct and further filter the cluster subsets. Then, we use Affinity Propagation (AP) clustering on the candidate cluster subsets to find the potential motifs. Experimental results on simulated data show that the AP-ChIP algorithm is able to predict TFBSs with near-perfect accuracy in a reasonable time. The validity of the AP-ChIP algorithm is also tested on a real ChIP-Seq data set.
Affiliation(s)
- Chun-xiao Sun
  - College of Science, Northwest A&F University, Yangling 712100, China
- Yu Yang
  - School of Computer Science, Pingdingshan University, Pingdingshan 467000, China
  - School of Mathematical Sciences, Shanghai Jiao Tong University, Shanghai 200240, China
- Hua Wang
  - College of Software, Nankai University, Tianjin 300071, China
  - Department of Mathematical Sciences, Georgia Southern University, Statesboro, GA 30460, USA
- Wen-hu Wang
  - School of Computer Science, Pingdingshan University, Pingdingshan 467000, China