1
Zhou J, Wu H, Du K, Zhou W, Zhou CZ, Li H. PCVR: a pre-trained contextualized visual representation for DNA sequence classification. BMC Bioinformatics 2025; 26:125. [PMID: 40346458 PMCID: PMC12065381 DOI: 10.1186/s12859-025-06136-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/13/2024] [Accepted: 04/07/2025] [Indexed: 05/11/2025]
Abstract
BACKGROUND The classification of DNA sequences is pivotal in bioinformatics and essential for genetic information analysis. Traditional alignment-based tools tend to be slow and have low recall. Machine learning methods learn implicit patterns from data using encoding techniques such as k-mer counting and ordinal encoding, which either fail to handle long sequences or sacrifice structural and sequential information. Frequency chaos game representation (FCGR) converts DNA sequences of arbitrary length into fixed-size images, removing the constraint of sequence length while preserving more sequential information than other representations. However, existing works consider only local information, ignoring long-range dependencies and global contextual information within FCGR images. RESULTS We propose PCVR, a Pre-trained Contextualized Visual Representation for DNA sequence classification. PCVR encodes FCGR with a vision transformer into contextualized features containing more global information. To meet the substantial data requirements of training a vision transformer and to learn more robust features, we pre-train the encoder with a masked autoencoder. Pre-trained PCVR exhibits impressive performance on three datasets even with only unsupervised learning. After fine-tuning, PCVR outperforms existing methods at the superkingdom and phylum levels. Additionally, our ablation studies confirm the contribution of the vision transformer encoder and masked autoencoder pre-training to performance improvement. CONCLUSIONS PCVR significantly improves DNA sequence classification accuracy and shows strong potential for new species discovery due to its effective capture of global information and robustness. Code for PCVR is available at https://github.com/jiaruizhou/PCVR.
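To make the fixed-size property of FCGR concrete, here is a minimal sketch that counts k-mers into a 2^k x 2^k grid. It is not taken from the PCVR code base: the function name `fcgr`, the corner assignment in `CORNER_BITS`, and the default `k=3` are illustrative assumptions.

```python
# Corner assignment is a common convention, assumed here for illustration.
CORNER_BITS = {"A": (0, 0), "C": (0, 1), "G": (1, 1), "T": (1, 0)}


def fcgr(sequence, k=3):
    """Return a 2^k x 2^k matrix of normalized k-mer frequencies."""
    size = 2 ** k
    grid = [[0.0] * size for _ in range(size)]
    seq = sequence.upper()
    total = 0
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if any(base not in CORNER_BITS for base in kmer):
            continue  # skip k-mers containing ambiguous bases such as N
        x = y = 0
        # The last base of a k-mer selects the coarsest quadrant in a CGR,
        # so it supplies the most significant bit of each coordinate.
        for base in reversed(kmer):
            bx, by = CORNER_BITS[base]
            x = (x << 1) | bx
            y = (y << 1) | by
        grid[y][x] += 1
        total += 1
    if total:
        grid = [[count / total for count in row] for row in grid]
    return grid
```

Whatever the input length, the output shape depends only on `k`, which is why a sequence of any length can be fed to an image model.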
Affiliation(s)
- Jiarui Zhou
  - School of Artificial Intelligence and Data Science, University of Science and Technology of China, Hefei, 230026, Anhui Province, China
- Hui Wu
  - Department of Electronic Engineering and Information Science, University of Science and Technology of China, Hefei, 230026, Anhui Province, China
- Kang Du
  - Division of Life Sciences and Medicine, University of Science and Technology of China, Hefei, 230026, Anhui Province, China
- Wengang Zhou
  - Department of Electronic Engineering and Information Science, University of Science and Technology of China, Hefei, 230026, Anhui Province, China
- Cong-Zhao Zhou
  - Division of Life Sciences and Medicine, University of Science and Technology of China, Hefei, 230026, Anhui Province, China
- Houqiang Li
  - Department of Electronic Engineering and Information Science, University of Science and Technology of China, Hefei, 230026, Anhui Province, China

2
Correia JP, Silva LRD, Silva R. Multifractal analysis and support vector machine for the classification of coronaviruses and SARS-CoV-2 variants. Sci Rep 2025; 15:15041. [PMID: 40301538 PMCID: PMC12041560 DOI: 10.1038/s41598-025-98366-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/02/2024] [Accepted: 04/10/2025] [Indexed: 05/01/2025]
Abstract
This study presents a novel approach for the classification of coronavirus species and variants of SARS-CoV-2 using Chaos Game Representation (CGR) and 2D Multifractal Detrended Fluctuation Analysis (2D MF-DFA). By extracting fractal parameters from CGR images, we constructed a state space that effectively distinguishes different species and variants. Our method achieved [Formula: see text] accuracy in species classification, with a notable [Formula: see text] accuracy for SARS-CoV-2 variants despite their genetic similarities. Using a Support Vector Machine (SVM) as a classifier further enhanced the performance. This approach, which requires fewer steps than most existing methods, offers an efficient and effective tool for viral classification, with implications for bioinformatics, public health, and vaccine development.
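The CGR images that the fractal parameters are extracted from come from a simple midpoint iteration. The sketch below shows only that iteration; it does not implement 2D MF-DFA or the SVM step. The name `cgr_points` and the corner layout are assumptions made for illustration.

```python
# Corner layout is a common convention, assumed here for illustration.
CORNERS = {"A": (0.0, 0.0), "C": (0.0, 1.0), "G": (1.0, 1.0), "T": (1.0, 0.0)}


def cgr_points(sequence):
    """Return the CGR trajectory of a DNA sequence in the unit square."""
    x, y = 0.5, 0.5  # start at the centre of the square
    points = []
    for base in sequence.upper():
        if base not in CORNERS:
            continue  # ignore ambiguous bases
        cx, cy = CORNERS[base]
        # Move halfway from the current point towards the base's corner.
        x, y = (x + cx) / 2, (y + cy) / 2
        points.append((x, y))
    return points
```

Binning these points into a pixel grid would give the kind of image on which a 2D fluctuation analysis could then be run.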
Affiliation(s)
- J P Correia
  - Department of Theoretical and Experimental Physics, Federal University of Rio Grande do Norte, 59072-970, Natal-RN, Brazil
  - Department of Technology and Data Science, Getúlio Vargas Foundation, 01313-902, São Paulo, Brazil
- L R da Silva
  - Department of Theoretical and Experimental Physics, Federal University of Rio Grande do Norte, 59072-970, Natal-RN, Brazil
  - National Institute of Science and Technology of Complex Systems, Brazilian Center for Physics Research, 22290-180, Rio de Janeiro-RJ, Brazil
- R Silva
  - Department of Theoretical and Experimental Physics, Federal University of Rio Grande do Norte, 59072-970, Natal-RN, Brazil
  - Department of Physics, Rio Grande do Norte State University, 59610-210, Mossoró-RN, Brazil

3
Xie P, Guan J, He X, Zhao Z, Guo Y, Sun Z, Yao L, Lee TY, Chiang YC. CAP-m7G: A capsule network-based framework for specific RNA N7-methylguanosine site identification using image encoding and reconstruction layers. Comput Struct Biotechnol J 2025; 27:804-812. [PMID: 40109445 PMCID: PMC11919597 DOI: 10.1016/j.csbj.2025.02.029] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/16/2024] [Revised: 02/24/2025] [Accepted: 02/25/2025] [Indexed: 03/22/2025]
Abstract
N7-methylguanosine (m7G) modifications play a pivotal role in RNA stability, mRNA export, and protein translation. They are closely associated with ribosome function and the regulation of gene expression. Dysregulation of m7G has been implicated in various diseases, including cancers and neurodegenerative disorders, where the loss of m7G can lead to genomic instability and uncontrolled cell proliferation. Accurate identification of m7G sites is thus essential for elucidating these mechanisms. Due to the high cost of experimentally validating m7G sites, several artificial intelligence models have been developed to predict these sites. However, the performance of these models is not yet optimal, and a user-friendly web server is still needed. To address these issues, we developed CAP-m7G, an innovative model that integrates Chaos Game Representation, Capsule Networks, and reconstruction layers. CAP-m7G achieved an accuracy of 96.63%, a specificity of 95.07%, and a Matthews correlation coefficient (MCC) of 0.933 on independent test data. Our results demonstrate that the integration of Chaos Game Representation with Capsule Network can effectively capture the crucial sequence information associated with m7G sites. The web server can be accessed at https://awi.cuhk.edu.cn/~biosequence/CAP-m7G/index.php.
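For readers unfamiliar with the reported metrics, the following sketch computes accuracy, specificity, and MCC from a binary confusion matrix using their standard definitions. The function name `binary_metrics` is hypothetical, and no counts from the CAP-m7G evaluation are reproduced here.

```python
import math


def binary_metrics(tp, tn, fp, fn):
    """Return (accuracy, specificity, MCC) from confusion-matrix counts."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total
    # Specificity is the fraction of true negatives among all negatives.
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention MCC is set to 0 when any marginal total is zero.
    mcc = (tp * tn - fp * fn) / denominator if denominator else 0.0
    return accuracy, specificity, mcc
```

MCC ranges from -1 to 1 and stays informative on imbalanced data, which is why it is often reported alongside accuracy for site-prediction tasks.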
Affiliation(s)
- Peilin Xie
  - Kobilka Institute of Innovative Drug Discovery, School of Medicine, The Chinese University of Hong Kong, Shenzhen, 2001 Longxiang Blvd, Longgang District, 518172, Shenzhen, China
  - School of Science and Engineering, The Chinese University of Hong Kong, Shenzhen, 2001 Longxiang Blvd, Longgang District, 518172, Shenzhen, China
- Jiahui Guan
  - Kobilka Institute of Innovative Drug Discovery, School of Medicine, The Chinese University of Hong Kong, Shenzhen, 2001 Longxiang Blvd, Longgang District, 518172, Shenzhen, China
  - School of Medicine, The Chinese University of Hong Kong, Shenzhen, 2001 Longxiang Blvd, Longgang District, 518172, Shenzhen, China
- Xuxin He
  - School of Medicine, The Chinese University of Hong Kong, Shenzhen, 2001 Longxiang Blvd, Longgang District, 518172, Shenzhen, China
- Zhihao Zhao
  - Kobilka Institute of Innovative Drug Discovery, School of Medicine, The Chinese University of Hong Kong, Shenzhen, 2001 Longxiang Blvd, Longgang District, 518172, Shenzhen, China
- Yilin Guo
  - Kobilka Institute of Innovative Drug Discovery, School of Medicine, The Chinese University of Hong Kong, Shenzhen, 2001 Longxiang Blvd, Longgang District, 518172, Shenzhen, China
  - School of Medicine, The Chinese University of Hong Kong, Shenzhen, 2001 Longxiang Blvd, Longgang District, 518172, Shenzhen, China
- Zhenglong Sun
  - School of Science and Engineering, The Chinese University of Hong Kong, Shenzhen, 2001 Longxiang Blvd, Longgang District, 518172, Shenzhen, China
- Lantian Yao
  - Kobilka Institute of Innovative Drug Discovery, School of Medicine, The Chinese University of Hong Kong, Shenzhen, 2001 Longxiang Blvd, Longgang District, 518172, Shenzhen, China
  - School of Science and Engineering, The Chinese University of Hong Kong, Shenzhen, 2001 Longxiang Blvd, Longgang District, 518172, Shenzhen, China
- Tzong-Yi Lee
  - Institute of Bioinformatics and Systems Biology, National Yang Ming Chiao Tung University, Hsinchu, Taiwan
  - Center for Intelligent Drug Systems and Smart Bio-devices (IDS2B), National Yang Ming Chiao Tung University, Hsinchu, 300, Taiwan
- Ying-Chih Chiang
  - Kobilka Institute of Innovative Drug Discovery, School of Medicine, The Chinese University of Hong Kong, Shenzhen, 2001 Longxiang Blvd, Longgang District, 518172, Shenzhen, China
  - School of Medicine, The Chinese University of Hong Kong, Shenzhen, 2001 Longxiang Blvd, Longgang District, 518172, Shenzhen, China
  - School of Science and Engineering, The Chinese University of Hong Kong, Shenzhen, 2001 Longxiang Blvd, Longgang District, 518172, Shenzhen, China

4
Wang S, Yu ZG, Han GS. MVSLLnc: LncRNA subcellular localization prediction based on multi-source features and two-stage voting strategy. Methods 2025; 234:324-332. [PMID: 39837434 DOI: 10.1016/j.ymeth.2025.01.013] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/15/2024] [Revised: 12/28/2024] [Accepted: 01/16/2025] [Indexed: 01/23/2025]
Abstract
The subcellular localization of long non-coding RNAs (lncRNAs) is crucial for understanding their function. Because traditional biological experiments are time-consuming and some existing computational methods require high computing power, we sought a simple, easy-to-implement method for more efficient prediction of lncRNA subcellular localization. In this work, we propose MVSLLnc, a model based on multi-source features and a two-stage voting strategy for predicting the subcellular localization of lncRNAs. The multi-source features include k-mer frequency, features based on the coordinate values of Chaos Game Representation (CGR), and features based on physicochemical properties (PhyChe). We feed the multi-source features into the traditional machine learning classifiers RF, SVM, and XGBoost, respectively, and perform the final prediction with a two-stage voting strategy. Experimental results on three benchmark datasets show accuracies of 0.829, 0.793, and 0.968, respectively. The accuracies on three independent test sets are 0.642, 0.737, and 0.518, respectively, which are competitive with existing methods. Our ablation analyses show that the two-stage voting strategy makes full use of the advantages of multi-source features and multiple classifiers and yields more robust results.
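The abstract does not spell out how the two voting stages are arranged. The sketch below shows one plausible reading, stated as an assumption: each classifier first takes a majority vote over its predictions on the three feature sources, and a second majority vote is then taken across classifiers. The names `majority` and `two_stage_vote` and the input layout are hypothetical.

```python
from collections import Counter


def majority(labels):
    """Return the most frequent label; ties go to the label seen first."""
    return Counter(labels).most_common(1)[0][0]


def two_stage_vote(predictions):
    """predictions maps classifier name -> {feature source: predicted label}.

    Stage 1 (assumed): vote across feature sources within each classifier.
    Stage 2 (assumed): vote across the per-classifier winners.
    """
    stage_one = [majority(list(by_feature.values()))
                 for by_feature in predictions.values()]
    return majority(stage_one)
```

For example, with per-classifier winners of nucleus, cytoplasm, and nucleus, the final call would be nucleus.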
Affiliation(s)
- Sheng Wang
  - National Center for Applied Mathematics in Hunan, Xiangtan University, Hunan 411105, China; Key Laboratory of Intelligent Computing and Information Processing of Ministry of Education, Xiangtan University, Hunan 411105, China
- Zu-Guo Yu
  - National Center for Applied Mathematics in Hunan, Xiangtan University, Hunan 411105, China; Key Laboratory of Intelligent Computing and Information Processing of Ministry of Education, Xiangtan University, Hunan 411105, China
- Guo-Sheng Han
  - National Center for Applied Mathematics in Hunan, Xiangtan University, Hunan 411105, China; Key Laboratory of Intelligent Computing and Information Processing of Ministry of Education, Xiangtan University, Hunan 411105, China

5
Wu Y, Xie X, Zhu J, Guan L, Li M. Overview and Prospects of DNA Sequence Visualization. Int J Mol Sci 2025; 26:477. [PMID: 39859192 PMCID: PMC11764684 DOI: 10.3390/ijms26020477] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/19/2024] [Revised: 12/30/2024] [Accepted: 01/04/2025] [Indexed: 01/27/2025]
Abstract
Due to advances in big data technology, deep learning, and knowledge engineering, biological sequence visualization has been extensively explored. In the post-genome era, biological sequence visualization enables the visual representation of both structured and unstructured biological sequence data. However, a universal visualization method for all types of sequences has not been reported. Biological sequence data are expanding exponentially, and the acquisition, extraction, fusion, and inference of knowledge from biological sequences are critical supporting technologies for visualization research. These areas are important and require in-depth exploration. This paper presents a comprehensive overview of visualization methods for DNA sequences from four perspectives (two-dimensional, three-dimensional, four-dimensional, and dynamic visualization approaches) and discusses the strengths and limitations of each method in detail. Furthermore, this paper proposes two potential future research directions for biological sequence visualization in response to the challenges of inefficient graphical feature extraction and knowledge association network generation in existing methods. The first direction is the construction of knowledge graphs for biological sequence big data, and the second is the cross-modal visualization of biological sequences using machine learning methods. This review is anticipated to provide valuable insights and contributions to computational biology, bioinformatics, genomic computing, genetic breeding, evolutionary analysis, and other related disciplines in the fields of biology, medicine, chemistry, statistics, and computing. It is also an important reference for biological sequence recommendation systems and knowledge question answering systems.
Affiliation(s)
- Mengshan Li
  - School of Mathematics and Computer Science, Gannan Normal University, Ganzhou 341000, China; (Y.W.); (X.X.); (J.Z.); (L.G.)

6
Alipour F, Hill KA, Kari L. CGRclust: Chaos Game Representation for twin contrastive clustering of unlabelled DNA sequences. BMC Genomics 2024; 25:1214. [PMID: 39695938 DOI: 10.1186/s12864-024-11135-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/01/2024] [Accepted: 12/06/2024] [Indexed: 12/20/2024]
Abstract
BACKGROUND Traditional supervised learning methods applied to DNA sequence taxonomic classification rely on the labor-intensive and time-consuming step of labelling the primary DNA sequences. Additionally, standard DNA classification/clustering methods involve time-intensive multiple sequence alignments, which limits their applicability to large genomic datasets or distantly related organisms. These limitations indicate a need for robust, efficient, and scalable unsupervised DNA sequence clustering methods that do not depend on sequence labels or alignment. RESULTS This study proposes CGRclust, a novel combination of unsupervised twin contrastive clustering of Chaos Game Representations (CGR) of DNA sequences with convolutional neural networks (CNNs). To the best of our knowledge, CGRclust is the first method to use unsupervised learning for image classification (herein applied to two-dimensional CGR images) for clustering datasets of DNA sequences. CGRclust overcomes the limitations of traditional sequence classification methods by leveraging unsupervised twin contrastive learning to detect distinctive sequence patterns, without requiring DNA sequence alignment or biological/taxonomic labels. CGRclust accurately clustered twenty-five diverse datasets, with sequence lengths ranging from 664 bp to 100 kbp, including mitochondrial genomes of fish, fungi, and protists, as well as viral whole genome assemblies and synthetic DNA sequences. Compared with three recent clustering methods for DNA sequences (DeLUCS, iDeLUCS, and MeShClust v3.0), CGRclust is the only method that surpasses 81.70% accuracy across all four taxonomic levels tested for mitochondrial DNA genomes of fish. Moreover, CGRclust consistently demonstrates superior performance across all the viral genomic datasets. The high clustering accuracy of CGRclust on these twenty-five datasets, which vary significantly in sequence length, number of genomes, number of clusters, and level of taxonomy, demonstrates its robustness, scalability, and versatility. CONCLUSION CGRclust is a novel, scalable, alignment-free DNA sequence clustering method that uses CGR images of DNA sequences and CNNs for twin contrastive clustering of unlabelled primary DNA sequences, achieving superior or comparable accuracy and performance relative to current approaches. CGRclust demonstrated enhanced reliability by consistently achieving over 80% accuracy in more than 90% of the datasets analyzed. CGRclust performed especially well in clustering viral DNA datasets, where it consistently outperformed all competing methods.
Affiliation(s)
- Fatemeh Alipour
  - School of Computer Science, University of Waterloo, Waterloo, Canada
- Kathleen A Hill
  - Department of Biology, University of Western Ontario, London, Canada
- Lila Kari
  - School of Computer Science, University of Waterloo, Waterloo, Canada

7
Sarumi OA, Hahn M, Heider D. NeuralBeds: Neural embeddings for efficient DNA data compression and optimized similarity search. Comput Struct Biotechnol J 2024; 23:732-741. [PMID: 38298179 PMCID: PMC10828564 DOI: 10.1016/j.csbj.2023.12.046] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/27/2023] [Revised: 12/28/2023] [Accepted: 12/28/2023] [Indexed: 02/02/2024]
Abstract
The availability of high-throughput sequencing tools, coupled with the declining cost of producing DNA sequences, has led to the generation of enormous amounts of omics data curated in databases such as NCBI and EMBL. Identifying similar DNA sequences in these databases is one of the fundamental tasks in bioinformatics. It is essential for discovering homologous sequences in organisms, for phylogenetic studies of evolutionary relationships among biological entities, and for the detection of pathogens. Improving DNA similarity search is of utmost importance because of the increasing complexity of the ever-growing sequence repositories. Therefore, instead of using the conventional approach of comparing raw sequences, e.g., in FASTA format, a numerical representation of the sequences can be used to calculate their similarities and optimize the search process. In this study, we analyzed different approaches for numerical embeddings, including Chaos Game Representation, hashing, and neural networks, and compared them with classical approaches such as principal component analysis. Neural networks generate embeddings that capture the similarity between DNA sequences as a distance measure and significantly outperform the other approaches on DNA similarity search.
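Once sequences are embedded as vectors, similarity search reduces to a nearest-neighbour query under a distance. The sketch below uses Euclidean distance over a small in-memory dictionary; the function name `nearest`, the toy vectors, and the choice of Euclidean distance are assumptions, not details taken from NeuralBeds.

```python
import math


def nearest(query, database, top_k=1):
    """Return the top_k (name, distance) pairs closest to the query embedding.

    database maps a sequence identifier to its embedding vector.
    """
    ranked = sorted(
        ((name, math.dist(query, vector)) for name, vector in database.items()),
        key=lambda item: item[1],
    )
    return ranked[:top_k]
```

A linear scan like this is fine for illustration; a large repository would normally use an approximate nearest-neighbour index instead.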
Affiliation(s)
- Oluwafemi A. Sarumi
  - Department of Mathematics and Computer Science, University of Marburg, Hans-Meerwein-Str. 6, Marburg, D-35043, Germany
  - Institute of Computer Science, Heinrich-Heine-University Duesseldorf, Graf-Adolf-Str. 63, Duesseldorf, D-40215, Germany
- Maximilian Hahn
  - Department of Mathematics and Computer Science, University of Marburg, Hans-Meerwein-Str. 6, Marburg, D-35043, Germany
- Dominik Heider
  - Department of Mathematics and Computer Science, University of Marburg, Hans-Meerwein-Str. 6, Marburg, D-35043, Germany
  - Institute of Computer Science, Heinrich-Heine-University Duesseldorf, Graf-Adolf-Str. 63, Duesseldorf, D-40215, Germany

8
Boumajdi N, Bendani H, Belyamani L, Ibrahimi A. TreeWave: command line tool for alignment-free phylogeny reconstruction based on graphical representation of DNA sequences and genomic signal processing. BMC Bioinformatics 2024; 25:367. [PMID: 39604838 PMCID: PMC11600722 DOI: 10.1186/s12859-024-05992-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/10/2024] [Accepted: 11/18/2024] [Indexed: 11/29/2024]
Abstract
BACKGROUND Genomic sequence similarity comparison is a crucial research area in bioinformatics. Multiple Sequence Alignment (MSA) is the basic technique used to identify regions of similarity between sequences. Although MSA tools are widely used and highly accurate, they are often limited by computational complexity and by inaccuracies when handling highly divergent sequences, which has led to the development of alignment-free (AF) algorithms. RESULTS This paper presents TreeWave, a novel AF approach for phylogeny inference based on frequency chaos game representation and the discrete wavelet transform of sequences. We validate our method on various genomic datasets, including complete virus genome sequences, bacterial genome sequences, human mitochondrial genome sequences, and rRNA gene sequences. Compared to classical methods, our tool demonstrates a significant reduction in running time, especially when analyzing large datasets. Based on the normalized Robinson-Foulds distances and Baker's Gamma coefficients of the resulting phylogenetic trees, TreeWave has classification accuracy similar to that of the classical MSA methods. CONCLUSIONS TreeWave is an open-source, user-friendly command line tool for phylogeny reconstruction. It is a faster and more scalable tool that prioritizes computational efficiency while maintaining accuracy. TreeWave is freely available at https://github.com/nasmaB/TreeWave.
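As an illustration of the signal-processing step, the sketch below applies a single-level Haar discrete wavelet transform to a numeric signal such as a flattened FCGR matrix. The abstract does not name the wavelet family or decomposition level, so the Haar choice and the name `haar_dwt` are assumptions rather than a description of TreeWave.

```python
import math


def haar_dwt(signal):
    """Single-level Haar DWT; returns (approximation, detail) coefficients."""
    if len(signal) % 2:
        raise ValueError("signal length must be even")
    root2 = math.sqrt(2)
    # Pairwise scaled sums give the coarse trend; scaled differences give
    # the local fluctuation around it.
    approx = [(signal[i] + signal[i + 1]) / root2
              for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) / root2
              for i in range(0, len(signal), 2)]
    return approx, detail
```

The approximation coefficients halve the signal length while preserving its energy together with the detail coefficients, so a distance computed on them could then feed a tree-building method.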
Affiliation(s)
- Nasma Boumajdi
  - Laboratory of Biotechnology (MedBiotech), Rabat Medical & Pharmacy School, Bioinova Research Center, Mohammed V University in Rabat, Rabat, Morocco
- Houda Bendani
  - Laboratory of Biotechnology (MedBiotech), Rabat Medical & Pharmacy School, Bioinova Research Center, Mohammed V University in Rabat, Rabat, Morocco
- Lahcen Belyamani
  - Mohammed VI Center for Research and Innovation (CM6), Rabat, Morocco
  - Mohammed VI University of Sciences and Health (UM6SS), Casablanca, Morocco
  - Emergency Department, Military Hospital Mohammed V, Rabat Medical and Pharmacy School, Mohammed V University, Rabat, Morocco
- Azeddine Ibrahimi
  - Laboratory of Biotechnology (MedBiotech), Rabat Medical & Pharmacy School, Bioinova Research Center, Mohammed V University in Rabat, Rabat, Morocco

9
Wu X, Zhang L, Tong X, Wang Y, Zhang Z, Kong X, Ni S, Luo X, Zheng M, Tang Y, Li X. miCGR: interpretable deep neural network for predicting both site-level and gene-level functional targets of microRNA. Brief Bioinform 2024; 26:bbae616. [PMID: 39592153 PMCID: PMC11596087 DOI: 10.1093/bib/bbae616] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/08/2024] [Revised: 10/29/2024] [Accepted: 11/12/2024] [Indexed: 11/28/2024]
Abstract
MicroRNAs (miRNAs) are critical regulators of various biological processes that act by cleaving messenger RNAs (mRNAs) or repressing their translation. Accurately predicting miRNA targets is essential for developing miRNA-based therapies for diseases such as cancer and cardiovascular disease. Traditional miRNA target prediction methods often struggle due to incomplete knowledge of miRNA-target interactions and lack interpretability. To address these limitations, we propose miCGR, an end-to-end deep learning framework for predicting functional miRNA targets. miCGR employs 2D convolutional neural networks alongside an enhanced Chaos Game Representation (CGR) of both miRNA sequences and their candidate target sites (CTS) on mRNA. This enhanced CGR transforms genetic sequences into informative 2D graphical representations based on sequence composition and subsequence frequencies, and explicitly incorporates important prior knowledge of seed regions and subsequence positions. Unlike one-dimensional methods based solely on sequence characters, this approach identifies functional motifs within sequences, even if they are distant in the original sequences. Our model outperforms existing methods in predicting functional targets at both the site and gene levels. To enhance interpretability, we incorporate Shapley value analysis for each subsequence within both miRNA sequences and their target sites, allowing miCGR to achieve improved accuracy, particularly with more lenient CTS selection criteria. Finally, two case studies demonstrate the practical applicability of miCGR, highlighting its potential to provide insights for optimizing artificial miRNA analogs that surpass endogenous counterparts.
Affiliation(s)
- Xiaolong Wu
  - School of Pharmacy, East China University of Science and Technology, 130 Meilong Road, Shanghai 200237, China
  - Drug Discovery and Design Center, State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica Chinese Academy of Sciences, 555 Zuchongzhi Road, Shanghai 201203, China
- Lehan Zhang
  - Drug Discovery and Design Center, State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica Chinese Academy of Sciences, 555 Zuchongzhi Road, Shanghai 201203, China
  - School of Pharmacy, University of Chinese Academy of Sciences, No. 19A Yuquan Road, Beijing 100049, China
- Xiaochu Tong
  - Drug Discovery and Design Center, State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica Chinese Academy of Sciences, 555 Zuchongzhi Road, Shanghai 201203, China
  - School of Pharmacy, University of Chinese Academy of Sciences, No. 19A Yuquan Road, Beijing 100049, China
- Yitian Wang
  - Drug Discovery and Design Center, State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica Chinese Academy of Sciences, 555 Zuchongzhi Road, Shanghai 201203, China
  - School of Pharmacy, University of Chinese Academy of Sciences, No. 19A Yuquan Road, Beijing 100049, China
- Zimei Zhang
  - Drug Discovery and Design Center, State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica Chinese Academy of Sciences, 555 Zuchongzhi Road, Shanghai 201203, China
- Xiangtai Kong
  - Drug Discovery and Design Center, State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica Chinese Academy of Sciences, 555 Zuchongzhi Road, Shanghai 201203, China
  - School of Pharmacy, University of Chinese Academy of Sciences, No. 19A Yuquan Road, Beijing 100049, China
- Shengkun Ni
  - Drug Discovery and Design Center, State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica Chinese Academy of Sciences, 555 Zuchongzhi Road, Shanghai 201203, China
  - School of Pharmacy, University of Chinese Academy of Sciences, No. 19A Yuquan Road, Beijing 100049, China
- Xiaomin Luo
  - Drug Discovery and Design Center, State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica Chinese Academy of Sciences, 555 Zuchongzhi Road, Shanghai 201203, China
  - School of Pharmacy, University of Chinese Academy of Sciences, No. 19A Yuquan Road, Beijing 100049, China
- Mingyue Zheng
  - Drug Discovery and Design Center, State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica Chinese Academy of Sciences, 555 Zuchongzhi Road, Shanghai 201203, China
  - School of Pharmacy, University of Chinese Academy of Sciences, No. 19A Yuquan Road, Beijing 100049, China
- Yun Tang
  - School of Pharmacy, East China University of Science and Technology, 130 Meilong Road, Shanghai 200237, China
- Xutong Li
  - Drug Discovery and Design Center, State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica Chinese Academy of Sciences, 555 Zuchongzhi Road, Shanghai 201203, China
  - School of Pharmacy, University of Chinese Academy of Sciences, No. 19A Yuquan Road, Beijing 100049, China

10
Liu F, Zhao Z, Liu Y. PHPGAT: predicting phage hosts based on multimodal heterogeneous knowledge graph with graph attention network. Brief Bioinform 2024; 26:bbaf017. [PMID: 39833104 PMCID: PMC11745545 DOI: 10.1093/bib/bbaf017] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/25/2024] [Revised: 12/18/2024] [Accepted: 01/07/2025] [Indexed: 01/22/2025]
Abstract
Antibiotic resistance poses a significant threat to global health, making the development of alternative strategies to combat bacterial pathogens increasingly urgent. One such promising approach is the strategic use of bacteriophages (or phages) to specifically target and eradicate antibiotic-resistant bacteria. Phages, being among the most prevalent life forms on Earth, play a critical role in maintaining ecological balance by regulating bacterial communities and driving genetic diversity. Accurate prediction of phage hosts is essential for successfully applying phage therapy. However, existing prediction models may not fully encapsulate the complex dynamics of phage-host interactions in diverse microbial environments, indicating a need for improved accuracy through more sophisticated modeling techniques. In response to this challenge, this study introduces a novel phage-host prediction model, PHPGAT, which leverages a multimodal heterogeneous knowledge graph with the advanced GATv2 (Graph Attention Network v2) framework. The model first constructs a multimodal heterogeneous knowledge graph by integrating phage-phage, host-host, and phage-host interactions to capture the intricate connections between biological entities. GATv2 is then employed to extract deep node features and learn dynamic interdependencies, generating context-aware embeddings. Finally, an inner product decoder is designed to compute the likelihood of interaction between a phage and host pair based on the embedding vectors produced by GATv2. Evaluation results using two datasets demonstrate that PHPGAT achieves precise phage host predictions and outperforms other models. PHPGAT is available at https://github.com/ZhaoZMer/PHPGAT.
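The final decoding step can be sketched in a few lines: the dot product of a phage embedding and a host embedding is squashed to a value between 0 and 1. The sigmoid squashing and the name `interaction_probability` are assumptions; the abstract states only that an inner product decoder computes the interaction likelihood from the GATv2 embeddings.

```python
import math


def interaction_probability(phage_embedding, host_embedding):
    """Score a phage-host pair from its two embedding vectors."""
    if len(phage_embedding) != len(host_embedding):
        raise ValueError("embeddings must have the same dimension")
    # Inner product: large when the two embeddings point the same way.
    score = sum(p * h for p, h in zip(phage_embedding, host_embedding))
    # Sigmoid maps the raw score to (0, 1); assumed, not stated in the source.
    return 1.0 / (1.0 + math.exp(-score))
```

Ranking all candidate hosts by this value for a given phage would yield a predicted host list.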
Collapse
Affiliation(s)
- Fu Liu
- College of Communication Engineering, Jilin University, No. 2699 Qianjin Street, Chaoyang District, Changchun 130012, China
| | - Zhimiao Zhao
- School of Artificial Intelligence, Jilin University, No. 5988 Renmin Street, Nanguan District, Changchun 130022, China
| | - Yun Liu
- College of Communication Engineering, Jilin University, No. 2699 Qianjin Street, Chaoyang District, Changchun 130012, China
| |
|
11
|
Li T, Li M, Wu Y, Li Y. Visualization Methods for DNA Sequences: A Review and Prospects. Biomolecules 2024; 14:1447. [PMID: 39595624 PMCID: PMC11592258 DOI: 10.3390/biom14111447] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/17/2024] [Revised: 11/08/2024] [Accepted: 11/12/2024] [Indexed: 11/28/2024] Open
Abstract
The efficient analysis and interpretation of biological sequence data remain major challenges in bioinformatics. Graphical representation, as an emerging and effective visualization technique, offers a more intuitive method for analyzing DNA sequences. However, many visualization approaches are dispersed across research databases, requiring urgent organization, integration, and analysis. Additionally, no single visualization method excels in all aspects. To advance these methods, knowledge graphs and advanced machine learning techniques have become key areas of exploration. This paper reviews the current 2D and 3D DNA sequence visualization methods and proposes a new research direction focused on constructing knowledge graphs for biological sequence visualization, explaining the relevant theories, techniques, and models involved. Additionally, we summarize machine learning techniques applicable to sequence visualization, such as graph embedding methods and the use of convolutional neural networks (CNNs) for processing graphical representations. These machine learning techniques and knowledge graphs aim to provide valuable insights into computational biology, bioinformatics, genomic computing, and evolutionary analysis. The study serves as an important reference for improving intelligent search systems, enriching knowledge bases, and enhancing query systems related to biological sequence visualization, offering a comprehensive framework for future research.
Affiliation(s)
- Tan Li
- School of Physics and Electronic Information, Gannan Normal University, Ganzhou 341000, China; (T.L.); (Y.L.)
| | - Mengshan Li
- School of Physics and Electronic Information, Gannan Normal University, Ganzhou 341000, China; (T.L.); (Y.L.)
| | - Yan Wu
- School of Mathematics and Computer Science, Gannan Normal University, Ganzhou 341000, China;
| | - Yelin Li
- School of Physics and Electronic Information, Gannan Normal University, Ganzhou 341000, China; (T.L.); (Y.L.)
| |
|
12
|
Li X, Zhou T, Feng X, Yau ST, Yau SST. Exploring geometry of genome space via Grassmann manifolds. Innovation (N Y) 2024; 5:100677. [PMID: 39206218 PMCID: PMC11350263 DOI: 10.1016/j.xinn.2024.100677] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/06/2024] [Accepted: 07/18/2024] [Indexed: 09/04/2024] Open
Abstract
It is important to understand the geometry of genome space in biology. After transforming genome sequences into frequency matrices of the chaos game representation (FCGR), we regard a genome sequence as a point in a suitable Grassmann manifold by analyzing the column space of the corresponding FCGR. To assess the sequence similarity, we employ the generalized Grassmannian distance, an intrinsic geometric distance that differs from the traditional Euclidean distance used in the classical k-mer frequency-based methods. With this method, we constructed phylogenetic trees for various genome datasets, including influenza A virus hemagglutinin gene, Orthocoronavirinae genome, and SARS-CoV-2 complete genome sequences. Our comparative analysis with multiple sequence alignment and alignment-free methods for large-scale sequences revealed that our method, which employs the subspace distance between the column spaces of different FCGRs (FCGR-SD), outperformed its competitors in terms of both speed and accuracy. In addition, we used low-dimensional visualization of the SARS-CoV-2 genome sequences and spike protein nucleotide sequences with our methods, resulting in some intriguing findings. We not only propose a novel and efficient algorithm for comparing genome sequences but also demonstrate that genome data have some intrinsic manifold structures, providing a new geometric perspective for molecular biology studies.
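For orientation, the standard geodesic distance between two subspaces of equal dimension on a Grassmann manifold can be written as follows. This is the textbook definition via principal angles, given here as background; the "generalized" distance used in the paper presumably extends it (for example to column spaces of different dimensions) and is not reproduced here. The symbols $Q_A$, $Q_B$, and $p$ are notation assumed for illustration.

```latex
% Q_A, Q_B: n x p matrices with orthonormal columns spanning the column
% spaces of two FCGR matrices (notation assumed for illustration).
\[
Q_A^{\top} Q_B = U \,\operatorname{diag}(\cos\theta_1,\dots,\cos\theta_p)\, V^{\top},
\qquad 0 \le \theta_1 \le \dots \le \theta_p \le \tfrac{\pi}{2},
\]
\[
d(A,B) = \Bigl(\sum_{i=1}^{p} \theta_i^{2}\Bigr)^{1/2}.
\]
```

The principal angles $\theta_i$ depend only on the two subspaces, not on the bases chosen for them, which is what makes the distance intrinsic.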
Affiliation(s)
- Xiaoguang Li
- School of Statistics and Management, Shanghai University of Finance and Economics, Shanghai 200433, China
| | - Tao Zhou
- Department of Mathematical Sciences, Tsinghua University, Beijing 100084, China
| | - Xingdong Feng
- School of Statistics and Management, Shanghai University of Finance and Economics, Shanghai 200433, China
| | - Shing-Tung Yau
- Department of Mathematical Sciences, Tsinghua University, Beijing 100084, China
- Yanqi Lake Beijing Institute of Mathematical Sciences and Applications, Beijing 101408, China
| | - Stephen S.-T. Yau
- Department of Mathematical Sciences, Tsinghua University, Beijing 100084, China
- Yanqi Lake Beijing Institute of Mathematical Sciences and Applications, Beijing 101408, China
| |
|
13
|
Mallikarjuna T, Thummadi NB, Vindal V, Manimaran P. Prioritizing cervical cancer candidate genes using chaos game and fractal-based time series approach. Theory Biosci 2024; 143:183-193. [PMID: 38807013 DOI: 10.1007/s12064-024-00418-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/17/2023] [Accepted: 05/14/2024] [Indexed: 05/30/2024]
Abstract
Cervical cancer is one of the most severe threats to women worldwide and ranks fourth in lethality. An estimated 604,127 cervical cancer cases were reported globally in 2020. With advancements in high-throughput technologies and bioinformatics, several cervical cancer candidate genes have been proposed for better therapeutic strategies. In this paper, we prioritize the candidate genes involved in cervical cancer progression through a fractal time series-based cross-correlation approach. We apply chaos game representation theory combined with a two-dimensional multifractal detrended cross-correlation approach to the known and candidate genes involved in cervical cancer progression to prioritize the candidate genes. We obtained 16 candidate genes that showed cross-correlation with known cancer genes. Functional enrichment analysis of the candidate genes shows that they involve GO terms: biological processes, cell-cell junction assembly, cell-cell junction organization, regulation of cell shape, cortical actin cytoskeleton organization, and actomyosin structure organization. KEGG pathway analysis revealed the genes' roles in the Rap1 signaling pathway, ErbB signaling pathway, MAPK signaling pathway, PI3K-Akt signaling pathway, mTOR signaling pathway, acute myeloid leukemia, chronic myeloid leukemia, breast cancer, thyroid cancer, bladder cancer, and gastric cancer. Further, we performed survival analysis and prioritized six genes, CDH2, PAIP1, BRAF, EPB41L3, OSMR, and RUNX1, as potential candidate genes for cervical cancer that have a crucial role in tumor progression. We found this integrative approach to be an efficient tool that paves a new way to prioritize candidate genes, and these genes could be evaluated experimentally for potential validation. We suggest that this approach may also be useful in analyzing nucleotide and protein sequences for clustering, classification, class affiliation, etc.
Affiliation(s)
- T Mallikarjuna
- Department of Biotechnology and Bioinformatics, School of Life Sciences, University of Hyderabad, Gachibowli, Hyderabad, 500046, India
| | - N B Thummadi
- Department of Animal Biology, School of Life Sciences, University of Hyderabad, Gachibowli, Hyderabad, 500046, India
| | - Vaibhav Vindal
- Department of Biotechnology and Bioinformatics, School of Life Sciences, University of Hyderabad, Gachibowli, Hyderabad, 500046, India
| | - P Manimaran
- School of Physics, University of Hyderabad, Gachibowli, Hyderabad, Telangana, 500046, India.
| |
|
14
|
Yao L, Xie P, Guan J, Chung CR, Huang Y, Pang Y, Wu H, Chiang YC, Lee TY. CapsEnhancer: An Effective Computational Framework for Identifying Enhancers Based on Chaos Game Representation and Capsule Network. J Chem Inf Model 2024; 64:5725-5736. [PMID: 38946113 PMCID: PMC11267569 DOI: 10.1021/acs.jcim.4c00546] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/29/2024] [Revised: 06/21/2024] [Accepted: 06/21/2024] [Indexed: 07/02/2024]
Abstract
Enhancers are a class of noncoding DNA, serving as crucial regulatory elements in governing gene expression by binding to transcription factors. The identification of enhancers holds paramount importance in the field of biology. However, traditional experimental methods for enhancer identification demand substantial human and material resources. Consequently, there is a growing interest in employing computational methods for enhancer prediction. In this study, we propose a two-stage framework based on deep learning, termed CapsEnhancer, for the identification of enhancers and their strengths. CapsEnhancer utilizes chaos game representation to encode DNA sequences into unique images and employs a capsule network to extract local and global features from sequence "images". Experimental results demonstrate that CapsEnhancer achieves state-of-the-art performance in both stages. In the first and second stages, the accuracy surpasses the previous best methods by 8 and 3.5%, reaching accuracies of 94.5 and 95%, respectively. Notably, this study represents the pioneering application of computer vision methods to enhancer identification tasks. Our work not only contributes novel insights to enhancer identification but also provides a fresh perspective for other biological sequence analysis tasks.
Affiliation(s)
- Lantian Yao
- Kobilka Institute of Innovative Drug Discovery, School of Medicine, The Chinese University of Hong Kong, Shenzhen 518172, China
- School of Science and Engineering, The Chinese University of Hong Kong, Shenzhen 518172, China
| | - Peilin Xie
- Kobilka Institute of Innovative Drug Discovery, School of Medicine, The Chinese University of Hong Kong, Shenzhen 518172, China
| | - Jiahui Guan
- School of Medicine, The Chinese University of Hong Kong, Shenzhen 518172, China
| | - Chia-Ru Chung
- Department of Computer Science and Information Engineering, National Central University, Taoyuan 320317, Taiwan
| | - Yixian Huang
- School of Medicine, The Chinese University of Hong Kong, Shenzhen 518172, China
| | - Yuxuan Pang
- Division of Health Medical Intelligence, Human Genome Center, The Institute of Medical Science, The University of Tokyo, Tokyo 108-8639, Japan
| | - Huacong Wu
- School of Medicine, The Chinese University of Hong Kong, Shenzhen 518172, China
| | - Ying-Chih Chiang
- Kobilka Institute of Innovative Drug Discovery, School of Medicine, The Chinese University of Hong Kong, Shenzhen 518172, China
- School of Medicine, The Chinese University of Hong Kong, Shenzhen 518172, China
| | - Tzong-Yi Lee
- Institute of Bioinformatics and Systems Biology, National Yang Ming Chiao Tung University, Hsinchu 300093, Taiwan
- Center for Intelligent Drug Systems and Smart Bio-devices (IDS2B), National Yang Ming Chiao Tung University, Hsinchu 300093, Taiwan
| |
|
15
|
Asif S, Zhao M, Li Y, Tang F, Zhu Y. CGO-ensemble: Chaos game optimization algorithm-based fusion of deep neural networks for accurate Mpox detection. Neural Netw 2024; 173:106183. [PMID: 38382397 DOI: 10.1016/j.neunet.2024.106183] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/15/2023] [Revised: 12/19/2023] [Accepted: 02/15/2024] [Indexed: 02/23/2024]
Abstract
The rising global incidence of human Mpox cases necessitates prompt and accurate identification for effective disease control. Whereas previous studies have predominantly relied on traditional ensemble methods for detection, we introduce a novel approach that leverages a metaheuristic-based ensemble framework. In this research, we present an innovative CGO-Ensemble framework designed to elevate the accuracy of detecting Mpox infection in patients. Initially, we employ five transfer learning base models that integrate feature integration layers and residual blocks. These components play a crucial role in capturing significant features from the skin images, thereby enhancing the models' efficacy. In the next step, we employ a weighted averaging scheme to consolidate predictions generated by distinct models. To achieve the optimal allocation of weights for each base model in the ensemble process, we leverage the Chaos Game Optimization (CGO) algorithm. This strategic weight assignment enhances classification outcomes considerably, surpassing the performance of randomly assigned weights. Implementing this approach yields notably enhanced prediction accuracy compared to using individual models. We evaluate the effectiveness of our proposed approach through comprehensive experiments conducted on two widely recognized benchmark datasets: the Mpox Skin Lesion Dataset (MSLD) and the Mpox Skin Image Dataset (MSID). To gain insights into the decision-making process of the base models, we have performed Gradient-weighted Class Activation Mapping (Grad-CAM) analysis. The experimental results showcase the outstanding performance of the CGO-ensemble, achieving an impressive accuracy of 100% on MSLD and 94.16% on MSID. Our approach significantly outperforms other state-of-the-art optimization algorithms, traditional ensemble methods, and existing techniques in the context of Mpox detection on these datasets. These findings underscore the effectiveness and superiority of the CGO-Ensemble in accurately identifying Mpox cases, highlighting its potential in disease detection and classification.
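The weighted averaging step mentioned in this abstract can be illustrated with the sketch below. It covers only the fusion of per-model class probabilities once the weights are known; the Chaos Game Optimization search that selects the weights is not implemented. Function names and the example numbers are hypothetical.

```python
def weighted_average(predictions, weights):
    """Fuse per-model class-probability vectors with one weight per model."""
    if len(predictions) != len(weights):
        raise ValueError("need exactly one weight per model")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    fused = [0.0] * len(predictions[0])
    for probs, weight in zip(predictions, weights):
        for cls, prob in enumerate(probs):
            fused[cls] += (weight / total) * prob  # normalise weights on the fly
    return fused


def predict_class(predictions, weights):
    """Return the index of the class with the highest fused probability."""
    fused = weighted_average(predictions, weights)
    return max(range(len(fused)), key=fused.__getitem__)


# Three hypothetical base models scoring one image over two classes.
model_outputs = [[0.9, 0.1], [0.4, 0.6], [0.3, 0.7]]
label_a = predict_class(model_outputs, [0.6, 0.2, 0.2])
label_b = predict_class(model_outputs, [0.1, 0.45, 0.45])
```

The two calls return different labels for the same model outputs, which is why the choice of weights is treated as an optimization problem.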
Affiliation(s)
- Sohaib Asif
- School of Computer Science and Engineering, Central South University, Changsha, China.
| | - Ming Zhao
- School of Computer Science and Engineering, Central South University, Changsha, China.
| | - Yangfan Li
- School of Computer Science and Engineering, Central South University, Changsha, China.
| | - Fengxiao Tang
- School of Computer Science and Engineering, Central South University, Changsha, China.
| | - Yusen Zhu
- School of Mathematics, Hunan University, Changsha, China
| |
|
16
|
Wang T, Yu ZG, Li J. CGRWDL: alignment-free phylogeny reconstruction method for viruses based on chaos game representation weighted by dynamical language model. Front Microbiol 2024; 15:1339156. [PMID: 38572227 PMCID: PMC10987876 DOI: 10.3389/fmicb.2024.1339156] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/15/2023] [Accepted: 02/23/2024] [Indexed: 04/05/2024] Open
Abstract
Traditional alignment-based methods face serious challenges in genome sequence comparison and phylogeny reconstruction due to their high computational complexity. Here, we propose a new alignment-free method to analyze the phylogenetic relationships (classification) among species. In our method, the dynamical language (DL) model and the chaos game representation (CGR) method are used to characterize the frequency information and the context information of k-mers in a sequence, respectively. Then, for each DNA or protein sequence in a dataset, our method converts the sequence into a feature vector, based on CGR weighted by the DL model, that represents the sequence information and is used to infer phylogenetic relationships. We name our method CGRWDL. Its performance was tested on both DNA and protein sequences from 8 virus datasets by constructing phylogenetic trees. For each dataset, we compared the Robinson-Foulds (RF) distance between the reference tree and the tree constructed by CGRWDL with the corresponding distances for other advanced methods. The results show that the phylogenetic trees constructed by CGRWDL can accurately classify the viruses, and that the RF distances between these trees and the reference trees are smaller than those obtained with other methods.
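The Robinson-Foulds distance used for evaluation in this abstract counts the bipartitions of the leaf set that appear in one tree but not the other. The sketch below is a minimal illustration under assumptions: trees are written as nested tuples of leaf names (a hypothetical encoding, not the format used by the authors), and the unnormalised, unrooted variant of the distance is computed.

```python
def leaf_set(tree):
    """Collect the leaf names of a tree given as nested tuples of strings."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaf_set(child) for child in tree))


def splits(tree):
    """Return the non-trivial bipartitions induced by internal edges."""
    all_leaves = leaf_set(tree)
    anchor = min(all_leaves)  # each split is stored as the side without this leaf
    found = set()

    def visit(node):
        if isinstance(node, str):
            return frozenset([node])
        below = frozenset().union(*(visit(child) for child in node))
        side = below if anchor not in below else all_leaves - below
        # Keep only splits with at least two leaves on each side.
        if 1 < len(side) < len(all_leaves) - 1:
            found.add(side)
        return below

    visit(tree)
    return found


def robinson_foulds(tree_a, tree_b):
    """Number of bipartitions present in exactly one of the two trees."""
    if leaf_set(tree_a) != leaf_set(tree_b):
        raise ValueError("trees must share the same leaf set")
    return len(splits(tree_a) ^ splits(tree_b))


reference = ((("A", "B"), "C"), ("D", "E"))
candidate = ((("A", "C"), "B"), ("D", "E"))
distance = robinson_foulds(reference, candidate)
```

A distance of zero means the two trees have the same topology; larger values mean more disagreeing groupings.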
Affiliation(s)
- Ting Wang
- National Center for Applied Mathematics in Hunan, Xiangtan University, Xiangtan, Hunan, China
- Key Laboratory of Intelligent Computing and Information Processing of Ministry of Education, Xiangtan University, Xiangtan, Hunan, China
| | - Zu-Guo Yu
- National Center for Applied Mathematics in Hunan, Xiangtan University, Xiangtan, Hunan, China
- Key Laboratory of Intelligent Computing and Information Processing of Ministry of Education, Xiangtan University, Xiangtan, Hunan, China
| | - Jinyan Li
- School of Computer Science and Control Engineering, Shenzhen Institute of Advanced Technology, Shenzhen, Guangdong, China
| |
|
17
|
Zhou DD, Li HZ, Wang W, Kuang L. Changes in oscillatory patterns of microstate sequence in patients with first-episode psychosis. Sci Data 2024; 11:38. [PMID: 38182586 PMCID: PMC10770397 DOI: 10.1038/s41597-023-02892-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/29/2023] [Accepted: 12/27/2023] [Indexed: 01/07/2024] Open
Abstract
We aimed to use chaos game representation (CGR) to investigate microstate sequences and to explore its potential as a source of neurobiomarkers for psychiatric disorders. We applied our proposed method to a public dataset including 82 patients with first-episode psychosis (FEP) and 61 control subjects. Two time series were constructed: one using the microstate spacing distance in CGR and the other using complex numbers representing the microstate coordinates in CGR. Power spectral features of both time series and the frequency matrix of CGR (FCGR) were compared between groups and employed in a machine learning application. The four canonical microstates (A, B, C, and D) were identified using both shared and separate templates. Our results showed that the microstate oscillatory pattern exhibited alterations in the FEP group. Using oscillatory features improved machine learning performance compared with classical features and FCGR. This study opens up new avenues for exploring the use of CGR in analyzing EEG microstate sequences. Features derived from microstate sequence CGR offer fine-grained neurobiomarkers for psychiatric disorders.
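The complex-valued time series described in this abstract can be sketched as follows: each microstate label pulls the current point halfway toward its corner of the unit square, and each point is stored as a complex number. The corner assignment, the starting point, and the function name are assumptions made for illustration; the paper may use a different layout.

```python
# Assumed corner layout for the four canonical microstates.
CORNERS = {"A": 0 + 0j, "B": 0 + 1j, "C": 1 + 1j, "D": 1 + 0j}


def cgr_trajectory(sequence, start=0.5 + 0.5j):
    """Return the CGR point after each symbol, as complex numbers."""
    points = []
    z = start
    for label in sequence:
        z = (z + CORNERS[label]) / 2  # move halfway toward the label's corner
        points.append(z)
    return points


trajectory = cgr_trajectory("ACDB")
real_part = [z.real for z in trajectory]
imag_part = [z.imag for z in trajectory]
```

The resulting series can then be passed to any spectral estimator to obtain power spectral features.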
Affiliation(s)
- Dong-Dong Zhou
- Mental Health Center, University-Town Hospital of Chongqing Medical University, Chongqing, China.
| | - Hong-Zhi Li
- Mental Health Center, University-Town Hospital of Chongqing Medical University, Chongqing, China
| | - Wo Wang
- Mental Health Center, University-Town Hospital of Chongqing Medical University, Chongqing, China
| | - Li Kuang
- Mental Health Center, University-Town Hospital of Chongqing Medical University, Chongqing, China.
- Department of Psychiatry, The First Affiliated Hospital of Chongqing Medical University, Chongqing, China.
| |
|
18
|
Arsiccio A, Stratta L, Menzen T. Evaluating the chaos game representation of proteins for applications in machine learning models: prediction of antibody affinity and specificity as a case study. J Mol Model 2023; 29:377. [PMID: 37968495 DOI: 10.1007/s00894-023-05777-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/21/2023] [Accepted: 10/31/2023] [Indexed: 11/17/2023]
Abstract
CONTEXT Machine learning techniques are becoming increasingly important in the selection and optimization of therapeutic molecules, as well as for the selection of formulation components and the prediction of long-term stability. Compared to first-principle models, machine learning techniques are easier to implement and can identify correlations that would be hard to describe at a mechanistic level, but they rely strongly on high-quality input training data. Here, we evaluate the potential of the "chaos game" representation to provide input data for machine learning models. The chaos game is an algorithm originally developed for the production of fractal structures and later also applied to the representation of biological sequences, such as genes and proteins. Our results show that the combination of the chaos game representation with convolutional neural networks results in comparable accuracy to other machine learning approaches, thus indicating that chaos game representations could be a valid alternative to existing featurization strategies for machine learning models of biological sequences. METHODS We implement the chaos game in Python 3.8.10, and use it to produce fractal as well as novel expanding representations of protein sequences. We then feed the resulting images to a convolutional neural network, built in Python 3.8.10, using TensorFlow 2.9.1, Keras 2.9.0, and the scikit-learn 1.1.1 packages. We select as a case study a recently published dataset for the antibody emibetuzumab, with the objective of co-optimizing antibody variants with both high affinity and low non-specific binding.
Affiliation(s)
- Andrea Arsiccio
- Coriolis Pharma, Fraunhoferstrasse 18 b, 82152, Martinsried, Germany.
| | - Lorenzo Stratta
- Molecular Engineering Laboratory (molE), Department of Applied Science and Technology, Politecnico di Torino, 24 corso Duca degli Abruzzi, IT-10129, Torino, Italy
| | - Tim Menzen
- Coriolis Pharma, Fraunhoferstrasse 18 b, 82152, Martinsried, Germany
| |
|
19
|
Yan W, Tan L, Meng-Shan L, Sheng S, Jun W, Fu-an W. SaPt-CNN-LSTM-AR-EA: a hybrid ensemble learning framework for time series-based multivariate DNA sequence prediction. PeerJ 2023; 11:e16192. [PMID: 37810796 PMCID: PMC10559882 DOI: 10.7717/peerj.16192] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/31/2023] [Accepted: 09/06/2023] [Indexed: 10/10/2023] Open
Abstract
Biological sequence data mining is a hot spot in bioinformatics. A biological sequence can be regarded as a set of characters. Time series are similar to biological sequences in terms of both representation and mechanism. Therefore, in this article, biological sequences are represented as time series to obtain biological time sequences (BTS). A hybrid ensemble learning framework (SaPt-CNN-LSTM-AR-EA) for BTS is proposed. Single-sequence and multi-sequence models are constructed, respectively, with a self-adaption pre-training one-dimensional convolutional recurrent neural network and an autoregressive fractional integrated moving average model fused with an evolutionary algorithm. In DNA sequence experiments with six viruses, SaPt-CNN-LSTM-AR-EA achieved good overall prediction performance, with prediction accuracy and correlation reaching 1.7073 and 0.9186, respectively. SaPt-CNN-LSTM-AR-EA was compared with five other benchmark models to verify its effectiveness and stability. SaPt-CNN-LSTM-AR-EA increased the average accuracy by about 30%. The framework proposed in this article is significant in biology, biomedicine, and computer science, and can be widely applied in sequence splicing, computational biology, bioinformation, and other fields.
Affiliation(s)
- Wu Yan
- School of Biotechnology, Jiangsu University of Science & Technology, Zhenjiang, China
- School of Mathematics and Computer Science, Gannan Normal University, Ganzhou, Jiangxi, China
- Sericultural Research Institute, Chinese Academy of Agricultural Sciences, Zhenjiang, Jiangsu, China
| | - Li Tan
- College of Physics and Electronic Information, Gannan Normal University, Ganzhou, China
| | - Li Meng-Shan
- College of Physics and Electronic Information, Gannan Normal University, Ganzhou, China
| | - Sheng Sheng
- School of Biotechnology, Jiangsu University of Science & Technology, Zhenjiang, China
- Sericultural Research Institute, Chinese Academy of Agricultural Sciences, Zhenjiang, Jiangsu, China
| | - Wang Jun
- School of Biotechnology, Jiangsu University of Science & Technology, Zhenjiang, China
- Sericultural Research Institute, Chinese Academy of Agricultural Sciences, Zhenjiang, Jiangsu, China
| | - Wu Fu-an
- School of Biotechnology, Jiangsu University of Science & Technology, Zhenjiang, China
- Sericultural Research Institute, Chinese Academy of Agricultural Sciences, Zhenjiang, Jiangsu, China
| |
|
20
|
Orlov YL, Orlova NG. Bioinformatics tools for the sequence complexity estimates. Biophys Rev 2023; 15:1367-1378. [PMID: 37974990 PMCID: PMC10643780 DOI: 10.1007/s12551-023-01140-y] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 08/15/2023] [Accepted: 09/01/2023] [Indexed: 11/19/2023] Open
Abstract
We review current methods and bioinformatics tools for text complexity estimates (information and entropy measures). The search for DNA regions with extreme statistical characteristics, such as low-complexity regions, is important for biophysical models of chromosome function and gene transcription regulation at genome scale. We discuss complexity profiling for segmentation and delineation of genome sequences, the search for genome repeats and transposable elements, and applications to next-generation sequencing reads. We review the complexity methods and new application fields: analysis of mutation hotspot loci, analysis of short sequencing reads with quality control, and alignment-free genome comparisons. Algorithms implementing various numerical measures of text complexity, including combinatorial and linguistic measures, were developed before the genome sequencing era. A series of tools for estimating sequence complexity use compression approaches, mainly modifications of Lempel-Ziv compression. Most of the tools are available online, providing large-scale service for whole-genome analysis. Novel machine learning applications for classification of complete genome sequences also include sequence compression and complexity algorithms. We present a comparison of the complexity methods on different sequence sets and their applications to the analysis of gene transcription regulatory regions. Furthermore, we discuss approaches to and applications of sequence complexity for proteins. The complexity measures for amino acid sequences can be calculated by the same entropy and compression-based algorithms, but the functional and evolutionary roles of low-complexity regions in proteins have specific features that differ from those in DNA. The tools for protein sequence complexity are aimed at protein structural constraints. It has been shown that low-complexity regions in protein sequences are conserved in evolution and have important biological and structural functions. Finally, we summarize recent findings in large-scale genome complexity comparison and applications to coronavirus genome analysis.
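As a small illustration of the entropy-based complexity profiling discussed in this abstract, the sketch below computes Shannon entropy over sliding windows of a sequence. It is one simple measure among those the review covers, not a reproduction of any specific tool; the function names, window size, and step are hypothetical choices.

```python
import math
from collections import Counter


def shannon_entropy(text):
    """Shannon entropy (bits per symbol) of the symbol distribution in text."""
    if not text:
        return 0.0
    total = len(text)
    counts = Counter(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def entropy_profile(sequence, window, step=1):
    """Entropy of each window of the given length along the sequence."""
    return [
        shannon_entropy(sequence[i:i + window])
        for i in range(0, len(sequence) - window + 1, step)
    ]


# A homopolymer run followed by a mixed region.
profile = entropy_profile("AAAAAAAAACGTACGT", window=8, step=4)
```

Windows with entropy near zero flag candidate low-complexity regions, while values near 2 bits indicate roughly uniform use of the four nucleotides.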
Affiliation(s)
- Yuriy L. Orlov
- The Digital Health Institute, I.M. Sechenov First Moscow State Medical University of the Russian Ministry of Health (Sechenov University), Moscow, 119991 Russia
- Institute of Cytology and Genetics SB RAS, 630090 Novosibirsk, Russia
- Agrarian and Technological Institute, Peoples’ Friendship University of Russia, 117198 Moscow, Russia
| | - Nina G. Orlova
- Department of Mathematics, Financial University under the Government of the Russian Federation, Moscow, 125167 Russia
| |
|
21
|
Zhang YZ, Liu Y, Bai Z, Fujimoto K, Uematsu S, Imoto S. Zero-shot-capable identification of phage-host relationships with whole-genome sequence representation by contrastive learning. Brief Bioinform 2023; 24:bbad239. [PMID: 37466138 PMCID: PMC10516345 DOI: 10.1093/bib/bbad239] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/15/2023] [Revised: 05/17/2023] [Accepted: 06/08/2023] [Indexed: 07/20/2023] Open
Abstract
Accurately identifying phage-host relationships from their genome sequences is still challenging, especially for those phages and hosts with less homologous sequences. In this work, focusing on identifying the phage-host relationships at the species and genus level, we propose a contrastive learning based approach to learn whole-genome sequence embeddings that can take account of phage-host interactions (PHIs). Contrastive learning is used to make phages infecting the same hosts close to each other in the new representation space. Specifically, we rephrase whole-genome sequences with frequency chaos game representation (FCGR) and learn latent embeddings that 'encapsulate' phage-host relationships through contrastive learning. The contrastive learning method works well on the imbalanced dataset. Based on the learned embeddings, a proposed pipeline named CL4PHI can predict both known hosts and hosts unseen in training. We compare our method with two recently proposed state-of-the-art learning-based methods on their benchmark datasets. The experimental results demonstrate that the proposed method using contrastive learning improves the prediction accuracy on known hosts and demonstrates a zero-shot prediction capability on unseen hosts. In terms of potential applications, the rapid pace of genome sequencing across different species has resulted in a vast amount of whole-genome sequencing data that require efficient computational methods for identifying phage-host interactions. The proposed approach is expected to address this need by efficiently processing whole-genome sequences of phages and prokaryotic hosts and capturing features related to phage-host relationships for genome sequence representation. This approach can be used to accelerate the discovery of phage-host interactions and aid in the development of phage-based therapies for infectious diseases.
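The FCGR step mentioned in this abstract turns a sequence of any length into a fixed-size matrix of k-mer counts arranged by chaos game coordinates. The sketch below is a minimal version under assumptions: the corner assignment (A, C, G, T), the choice of k, and the function name are illustrative and may differ from the CL4PHI pipeline.

```python
# Assumed corner layout: each base contributes one (x, y) bit per position.
BITS = {"A": (0, 0), "C": (0, 1), "G": (1, 1), "T": (1, 0)}


def fcgr(sequence, k):
    """Count k-mers into a 2^k x 2^k grid indexed by CGR position."""
    size = 2 ** k
    matrix = [[0] * size for _ in range(size)]
    seq = sequence.upper()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if any(base not in BITS for base in kmer):
            continue  # skip k-mers containing ambiguous bases such as N
        x = y = 0
        for pos, base in enumerate(kmer):
            bx, by = BITS[base]
            # Later bases occupy higher bits: in the chaos game the most
            # recent base decides the largest quadrant.
            x |= bx << pos
            y |= by << pos
        matrix[y][x] += 1
    return matrix


grid = fcgr("ACGTACGTNACGT", k=2)
```

Because the grid size depends only on k, genomes of very different lengths map to inputs of the same shape, which is what lets an image-style encoder consume them.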
Affiliation(s)
- Yao-zhong Zhang
- Division of Health Medical Intelligence, Human Genome Center, The Institute of Medical Science, The University of Tokyo, Shirokanedai 4-6-1, Minato-ku, 108-8639 Tokyo, Japan
| | - Yunjie Liu
- Division of Health Medical Intelligence, Human Genome Center, The Institute of Medical Science, The University of Tokyo, Shirokanedai 4-6-1, Minato-ku, 108-8639 Tokyo, Japan
| | - Zeheng Bai
- Division of Health Medical Intelligence, Human Genome Center, The Institute of Medical Science, The University of Tokyo, Shirokanedai 4-6-1, Minato-ku, 108-8639 Tokyo, Japan
| | - Kosuke Fujimoto
- Department of Immunology and Genomics, Graduate School of Medicine, Osaka Metropolitan University, Asahi-machi 1-4-3, Abeno-ku, 545-8585 Osaka, Japan
- Division of Metagenome Medicine, Human Genome Center, The Institute of Medical Science, The University of Tokyo, Shirokanedai 4-6-1, Minato-ku, 108-8639 Tokyo, Japan
| | - Satoshi Uematsu
- Department of Immunology and Genomics, Graduate School of Medicine, Osaka Metropolitan University, Asahi-machi 1-4-3, Abeno-ku, 545-8585 Osaka, Japan
- Division of Metagenome Medicine, Human Genome Center, The Institute of Medical Science, The University of Tokyo, Shirokanedai 4-6-1, Minato-ku, 108-8639 Tokyo, Japan
| | - Seiya Imoto
- Division of Health Medical Intelligence, Human Genome Center, The Institute of Medical Science, The University of Tokyo, Shirokanedai 4-6-1, Minato-ku, 108-8639 Tokyo, Japan
| |
|
22
|
Shang J, Peng C, Tang X, Sun Y. PhaVIP: Phage VIrion Protein classification based on chaos game representation and Vision Transformer. Bioinformatics 2023; 39:i30-i39. [PMID: 37387136 DOI: 10.1093/bioinformatics/btad229] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 07/01/2023] Open
Abstract
MOTIVATION As viruses that mainly infect bacteria, phages are key players across a wide range of ecosystems. Analyzing phage proteins is indispensable for understanding phages' functions and roles in microbiomes. High-throughput sequencing enables us to obtain phages from different microbiomes at low cost. However, compared to the fast accumulation of newly identified phages, phage protein classification remains difficult. In particular, a fundamental need is to annotate virion proteins, i.e., structural proteins such as the major tail and baseplate. Although there are experimental methods for virion protein identification, they are too expensive or time-consuming, leaving a large number of proteins unclassified. Thus, there is great demand for a computational method for fast and accurate phage virion protein (PVP) classification. RESULTS In this work, we adapted the state-of-the-art image classification model, Vision Transformer, to conduct virion protein classification. By encoding protein sequences into unique images using chaos game representation, we can leverage Vision Transformer to learn both local and global features from sequence "images". Our method, PhaVIP, has two main functions: classifying PVP and non-PVP sequences and annotating the types of PVP, such as capsid and tail. We tested PhaVIP on several datasets of increasing difficulty and benchmarked it against alternative tools. The experimental results show that PhaVIP has superior performance. After validating the performance of PhaVIP, we investigated two applications that can use the output of PhaVIP: phage taxonomy classification and phage host prediction. The results showed the benefit of using classified proteins over all proteins. AVAILABILITY AND IMPLEMENTATION The web server of PhaVIP is available via: https://phage.ee.cityu.edu.hk/phavip. The source code of PhaVIP is available via: https://github.com/KennthShang/PhaVIP.
Affiliation(s)
- Jiayu Shang
- Department of Electrical Engineering, City University of Hong Kong, Hong Kong (SAR), China
| | - Cheng Peng
- Department of Electrical Engineering, City University of Hong Kong, Hong Kong (SAR), China
| | - Xubo Tang
- Department of Electrical Engineering, City University of Hong Kong, Hong Kong (SAR), China
| | - Yanni Sun
- Department of Electrical Engineering, City University of Hong Kong, Hong Kong (SAR), China
| |
|
23
|
Zimnyakov D, Alonova M, Skripal A, Dobdin S, Feodorova V. Quantification of the Diversity in Gene Structures Using the Principles of Polarization Mapping. Curr Issues Mol Biol 2023; 45:1720-1740. [PMID: 36826056 PMCID: PMC9955201 DOI: 10.3390/cimb45020111] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 10/17/2022] [Revised: 02/05/2023] [Accepted: 02/16/2023] [Indexed: 02/22/2023] Open
Abstract
Results of computational analysis and visualization of differences in gene structures using polarization coding are presented. A two-dimensional phase screen, in which each element corresponds to a specific nucleotide (adenine, cytosine, guanine, or thymine), displays the analyzed nucleotide sequence. Readout of the screen with a coherent beam in a given polarization state forms a diffracted light field with a local polarization structure that is unique to the analyzed nucleotide sequence. This unique structure is described by spatial distributions of local values of the Stokes vector components. Analysis of these distributions allows the comparison of nucleotide sequences for different strains of pathogenic microorganisms and frequency analysis of the sequences. The possibilities of this polarization-based technique are illustrated by model data obtained from a comparative analysis of the spike protein gene sequences for three different model variants (Wuhan, Delta, and Omicron) of the SARS-CoV-2 virus. Various modifications of polarization encoding and analysis of gene structures, and the possibility of an instrumental implementation of the proposed method, are discussed.
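For readers unfamiliar with the Stokes vector mentioned in this abstract, the standard textbook definition in terms of the transverse field components $E_x$ and $E_y$ is reproduced here as background. It is not taken from the paper, and the sign of $S_3$ depends on the handedness convention in use.

$$
\begin{aligned}
S_0 &= \langle |E_x|^2 \rangle + \langle |E_y|^2 \rangle, \\
S_1 &= \langle |E_x|^2 \rangle - \langle |E_y|^2 \rangle, \\
S_2 &= 2\,\mathrm{Re}\,\langle E_x E_y^{*} \rangle, \\
S_3 &= -2\,\mathrm{Im}\,\langle E_x E_y^{*} \rangle,
\end{aligned}
$$

where $\langle \cdot \rangle$ denotes a time average and $S_0^2 \ge S_1^2 + S_2^2 + S_3^2$, with equality for fully polarized light.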
Affiliation(s)
- Dmitry Zimnyakov
- Physics Department, Yury Gagarin State Technical University of Saratov, 77 Polytechnicheskaya St., 410054 Saratov, Russia
- Precision Mechanics and Control Institute of Russian Academy of Sciences, 24 Rabochaya St., 410024 Saratov, Russia
- Institute of Physics, Saratov State University, 83 Astrakhanskaya St., 410012 Saratov, Russia
| | - Marina Alonova
- Physics Department, Yury Gagarin State Technical University of Saratov, 77 Polytechnicheskaya St., 410054 Saratov, Russia
| | - Anatoly Skripal
- Institute of Physics, Saratov State University, 83 Astrakhanskaya St., 410012 Saratov, Russia
| | - Sergey Dobdin
- Institute of Physics, Saratov State University, 83 Astrakhanskaya St., 410012 Saratov, Russia
| | - Valentina Feodorova
- Institute of Physics, Saratov State University, 83 Astrakhanskaya St., 410012 Saratov, Russia
| |
|
24
|
Welzel M, Schwarz PM, Löchel HF, Kabdullayeva T, Clemens S, Becker A, Freisleben B, Heider D. DNA-Aeon provides flexible arithmetic coding for constraint adherence and error correction in DNA storage. Nat Commun 2023; 14:628. [PMID: 36746948 PMCID: PMC9902613 DOI: 10.1038/s41467-023-36297-3] [Citation(s) in RCA: 22] [Impact Index Per Article: 11.0] [Received: 06/27/2022] [Accepted: 01/25/2023] [Indexed: 02/08/2023] Open
Abstract
The extensive information capacity of DNA, coupled with decreasing costs for DNA synthesis and sequencing, makes DNA an attractive alternative to traditional data storage. The processes of writing, storing, and reading DNA exhibit specific error profiles and impose constraints that DNA sequences have to adhere to. We present DNA-Aeon, a concatenated coding scheme for DNA data storage. It supports the generation of variable-sized encoded sequences with a user-defined guanine-cytosine (GC) content, homopolymer length limitation, and the avoidance of undesired motifs. It further enables users to provide custom codebooks adhering to further constraints. DNA-Aeon can correct substitution errors, insertions, deletions, and the loss of whole DNA strands. Comparisons with other codes show better error-correction capabilities of DNA-Aeon at similar redundancy levels with decreased DNA synthesis costs. In vitro tests indicate high reliability of DNA-Aeon even in the case of skewed sequencing read distributions and high read dropout.
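The three constraint types named in this abstract (GC content, homopolymer length, undesired motifs) can be checked on a candidate strand with a few lines of code. The sketch below is illustrative only and is unrelated to the DNA-Aeon code base; the function names and the default thresholds (40 to 60% GC, runs of at most 3) are assumptions chosen for the example.

```python
def gc_content(seq):
    """Fraction of G and C bases in the sequence (0.0 for an empty string)."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0


def longest_homopolymer(seq):
    """Length of the longest run of one repeated base."""
    longest = run = 0
    prev = None
    for base in seq.upper():
        run = run + 1 if base == prev else 1
        longest = max(longest, run)
        prev = base
    return longest


def satisfies_constraints(seq, gc_range=(0.4, 0.6), max_homopolymer=3,
                          forbidden_motifs=()):
    """Check GC content, homopolymer length and motif avoidance together."""
    seq = seq.upper()
    low, high = gc_range  # hypothetical default bounds, not from the paper
    if not low <= gc_content(seq) <= high:
        return False
    if longest_homopolymer(seq) > max_homopolymer:
        return False
    return not any(motif.upper() in seq for motif in forbidden_motifs)
```

A checker like this only filters candidates after the fact; a constrained code such as the one described above has to avoid generating violating strands in the first place.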
Affiliation(s)
- Marius Welzel
- Department of Mathematics and Computer Science, University of Marburg, Marburg, Germany
- Center for Synthetic Microbiology (SYNMIKRO), University of Marburg, Marburg, Germany
| | - Peter Michael Schwarz
- Department of Mathematics and Computer Science, University of Marburg, Marburg, Germany
- Center for Synthetic Microbiology (SYNMIKRO), University of Marburg, Marburg, Germany
| | - Hannah F Löchel
- Department of Mathematics and Computer Science, University of Marburg, Marburg, Germany
- Center for Synthetic Microbiology (SYNMIKRO), University of Marburg, Marburg, Germany
| | - Tolganay Kabdullayeva
- Center for Synthetic Microbiology (SYNMIKRO), University of Marburg, Marburg, Germany
| | - Sandra Clemens
- Department of Mathematics and Computer Science, University of Marburg, Marburg, Germany
- Center for Synthetic Microbiology (SYNMIKRO), University of Marburg, Marburg, Germany
| | - Anke Becker
- Center for Synthetic Microbiology (SYNMIKRO), University of Marburg, Marburg, Germany
| | - Bernd Freisleben
- Department of Mathematics and Computer Science, University of Marburg, Marburg, Germany
- Center for Synthetic Microbiology (SYNMIKRO), University of Marburg, Marburg, Germany
| | - Dominik Heider
- Department of Mathematics and Computer Science, University of Marburg, Marburg, Germany.
- Center for Synthetic Microbiology (SYNMIKRO), University of Marburg, Marburg, Germany.
| |
|
25
|
Tang R, Yu Z, Li J. KINN: An alignment-free accurate phylogeny reconstruction method based on inner distance distributions of k-mer pairs in biological sequences. Mol Phylogenet Evol 2023; 179:107662. [PMID: 36375789 DOI: 10.1016/j.ympev.2022.107662] [Citation(s) in RCA: 7] [Impact Index Per Article: 3.5] [Received: 08/30/2022] [Revised: 10/10/2022] [Accepted: 11/02/2022] [Indexed: 11/13/2022]
Abstract
Alignment-based methods have faced disadvantages in sequence comparison and phylogeny reconstruction due to their high computational complexity. Alignment-free methods for sequence comparison and phylogeny inference have attracted a great deal of attention in recent years. Here, we explore an alignment-free approach that uses inner distance distributions of k-mer pairs in biological sequences for phylogeny inference. For every sequence in a dataset, our method transforms the sequence into a numeric feature vector consisting of features each representing a specific k-mer pair's contribution to the characterization of the sequentiality uniqueness of the sequence. This newly defined k-mer pair's contribution is an integration of the reverse Kullback-Leibler divergence, pseudo mode and the classic entropy of an inner distance distribution of the k-mer pair in the sequence. Our method has been tested on datasets of complete genome sequences, complete protein sequences, and gene sequences of rRNA of various lengths. Our method achieves the best performance in comparison with state-of-the-art alignment-free methods as measured by the Robinson-Foulds distance between the reference and the constructed phylogeny trees.
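The two information-theoretic quantities named in this abstract have standard definitions, given here as background. For a discrete distribution $P$ over inner distances $d$ and a second distribution $Q$ on the same support:

$$
H(P) = -\sum_{d} P(d)\,\log P(d), \qquad
D_{\mathrm{KL}}(P \,\|\, Q) = \sum_{d} P(d)\,\log\frac{P(d)}{Q(d)}.
$$

The "reverse" divergence swaps the arguments, $D_{\mathrm{KL}}(Q \,\|\, P)$, and in general differs from the forward one because the divergence is not symmetric. The abstract does not say which distribution plays the role of the reference $Q$, nor how the pseudo mode is defined, so those details are left open here.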
Affiliation(s)
- Runbin Tang
- Hunan Key Laboratory for Computation and Simulation in Science and Engineering and Key Laboratory of Intelligent Computing and Information Processing of Ministry of Education, Xiangtan University, Hunan 411105, China; School of Mathematical Sciences, Chongqing Normal University, Chongqing 401331, China
| | - Zuguo Yu
- Hunan Key Laboratory for Computation and Simulation in Science and Engineering and Key Laboratory of Intelligent Computing and Information Processing of Ministry of Education, Xiangtan University, Hunan 411105, China.
| | - Jinyan Li
- Data Science Institute, University of Technology Sydney, Ultimo, NSW 2007, Australia.
| |
|
26
|
Avila Cartes J, Anand S, Ciccolella S, Bonizzoni P, Della Vedova G. Accurate and fast clade assignment via deep learning and frequency chaos game representation. Gigascience 2022; 12:giac119. [PMID: 36576129 PMCID: PMC9795481 DOI: 10.1093/gigascience/giac119] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 05/04/2022] [Revised: 10/17/2022] [Accepted: 11/14/2022] [Indexed: 12/29/2022] Open
Abstract
BACKGROUND Since the beginning of the coronavirus disease 2019 pandemic, there has been an explosion of sequencing of the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) virus, making it the most widely sequenced virus in history. Several databases and tools have been created to keep track of genome sequences and variants of the virus; most notably, the GISAID platform hosts millions of complete genome sequences and is expanding every day. A challenging task is the development of fast and accurate tools that are able to distinguish between the different SARS-CoV-2 variants and assign them to a clade. RESULTS In this article, we leverage the frequency chaos game representation (FCGR) and convolutional neural networks (CNNs) to develop an original method that learns to classify genome sequences, which we implement in CouGaR-g, a tool for the clade assignment problem on SARS-CoV-2 sequences. On a testing subset of GISAID, CouGaR-g achieved a 96.29% overall accuracy, while a similar tool, Covidex, obtained a 77.12% overall accuracy. As far as we know, our method is the first to use deep learning and FCGR for intraspecies classification. Furthermore, by using feature importance methods, CouGaR-g makes it possible to identify k-mers that match SARS-CoV-2 marker variants. CONCLUSIONS By combining FCGR and CNNs, we develop a method that achieves better accuracy than Covidex (which is based on random forest) for clade assignment of SARS-CoV-2 genome sequences, thanks in part to training on a much larger dataset, with comparable running times. Our method, implemented in CouGaR-g, is able to detect k-mers that capture relevant biological information distinguishing the clades, known as marker variants. AVAILABILITY The trained models can be tested online by providing a FASTA file (with 1 or multiple sequences) at https://huggingface.co/spaces/BIASLab/sars-cov-2-classification-fcgr. CouGaR-g is also available at https://github.com/AlgoLab/CouGaR-g under the GPL.
Affiliation(s)
- Jorge Avila Cartes
- Department of Computer Science, Systems and Communications, University of Milano–Bicocca, Milan 20125, Italy
| | - Santosh Anand
- Department of Computer Science, Systems and Communications, University of Milano–Bicocca, Milan 20125, Italy
| | - Simone Ciccolella
- Department of Computer Science, Systems and Communications, University of Milano–Bicocca, Milan 20125, Italy
| | - Paola Bonizzoni
- Department of Computer Science, Systems and Communications, University of Milano–Bicocca, Milan 20125, Italy
| | - Gianluca Della Vedova
- Department of Computer Science, Systems and Communications, University of Milano–Bicocca, Milan 20125, Italy
| |
|
27
|
Harrison TMR, Rudar J, Ogden N, Steeves R, Lapen DR, Baird D, Gagné N, Lung O. In silico identification of multiple conserved motifs within the control region of Culicidae mitogenomes. Sci Rep 2022; 12:21920. [PMID: 36536037 PMCID: PMC9763401 DOI: 10.1038/s41598-022-26236-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/11/2022] [Accepted: 12/12/2022] [Indexed: 12/23/2022] Open
Abstract
Mosquitoes are important vectors for human and animal diseases. Genetic markers, like the mitochondrial COI gene, can facilitate the taxonomic classification of disease vectors, vector-borne disease surveillance, and prevention. Within the control region (CR) of the mitochondrial genome, there exists a highly variable and poorly studied non-coding AT-rich area that contains the origin of replication. Although the CR hypervariable region has been used for species differentiation in some animals, few studies have investigated the mosquito CR. In this study, we analyze the mosquito mitogenome CR sequences from 125 species and 17 genera. We discovered four conserved motifs located 80 to 230 bp upstream of the 12S rRNA gene. Two of these motifs were found within all 392 Anopheles (An.) CR sequences, while the other two motifs were identified in all 37 Culex (Cx.) CR sequences. However, only 3 of the 304 non-Culicidae Dipteran mitogenome CR sequences contained these motifs. Interestingly, the short motif found in all 37 Culex sequences had poly-A and poly-T stretches of similar length that are predicted to form a stable hairpin. We show that supervised learning using the frequency chaos game representation of the CR can be used to differentiate mosquito genera from their dipteran relatives.
Affiliation(s)
- Thomas M R Harrison
- Canadian Food Inspection Agency, National Centre for Foreign Animal Disease, 1015 Arlington St. Winnipeg, Manitoba, R3M 3E4, Canada
| | - Josip Rudar
- Canadian Food Inspection Agency, National Centre for Foreign Animal Disease, 1015 Arlington St. Winnipeg, Manitoba, R3M 3E4, Canada
| | - Nicholas Ogden
- Public Health Risk Sciences Division, National Microbiology Laboratory, Public Health Agency of Canada, Saint-Hyacinthe, QC, Canada
| | - Royce Steeves
- Gulf Fisheries Centre, Fisheries & Oceans Canada, Moncton, New Brunswick, Canada
| | - David R Lapen
- Ottawa Research Development Centre, Agriculture & Agri-Food Canada, Ottawa, ON, K1A 0C6, Canada
| | - Donald Baird
- Environment and Climate Change Canada, Canadian Rivers Institute, Department of Biology, University of New Brunswick, Fredericton, NB, Canada
| | - Nellie Gagné
- Gulf Fisheries Centre, Fisheries & Oceans Canada, Moncton, New Brunswick, Canada
| | - Oliver Lung
- Canadian Food Inspection Agency, National Centre for Foreign Animal Disease, 1015 Arlington St. Winnipeg, Manitoba, R3M 3E4, Canada.
- Department of Biological Sciences, University of Manitoba, Winnipeg, MB, Canada.
| |
|
28
|
FMG: An observable DNA storage coding method based on frequency matrix game graphs. Comput Biol Med 2022; 151:106269. [PMID: 36356390 DOI: 10.1016/j.compbiomed.2022.106269] [Citation(s) in RCA: 17] [Impact Index Per Article: 5.7] [Received: 08/11/2022] [Revised: 10/20/2022] [Accepted: 10/30/2022] [Indexed: 11/06/2022]
Abstract
Using complex biomolecules for storage is a new carbon-based storage method. For example, DNA has the potential to be a good medium for long-term archival data storage. Reasonable and efficient coding is the first and most important step in DNA storage. However, current coding methods, such as the altruism algorithm, suffer from low coding efficiency and high complexity, and their coding constraints and sets make it difficult to inspect the coding results visually. In this study, a new DNA storage coding method based on the frequency matrix game graph (FMG) is proposed to generate DNA storage codes satisfying combinatorial constraints. In contrast to the randomness of heuristic algorithms that satisfy the constraints, the coding method based on the FMG is deterministic and can clearly explain the coding process. In addition, the constraints and coding results have observable characteristics, and the sizes of the coding sets exceed previously published results. For example, when the code length n = 10 and the Hamming distance d = 4, the results obtained by the proposed approach combining chaos game and graph are 24% better than the previous results. The proposed coding scheme successfully constructs high-quality coding sets with less complexity, which effectively promotes the development of carbon-based storage coding.
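The Hamming distance constraint referred to in this abstract (code length n, minimum pairwise distance d) can be verified for a candidate code set as follows. This is a minimal sketch with hypothetical function names; it only checks a given set and does not implement the FMG construction.

```python
from itertools import combinations


def hamming(a, b):
    """Number of positions at which two equal-length words differ."""
    if len(a) != len(b):
        raise ValueError("words must have the same length")
    return sum(x != y for x, y in zip(a, b))


def min_distance(codewords):
    """Smallest pairwise Hamming distance in a set of at least two codewords."""
    return min(hamming(a, b) for a, b in combinations(codewords, 2))


def is_valid_code(codewords, n, d):
    """True if every word has length n and all pairs differ in at least d positions."""
    return all(len(w) == n for w in codewords) and min_distance(codewords) >= d
```

The size of the largest set passing `is_valid_code` for given n and d is the figure of merit the abstract compares against earlier work.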
|
29
|
Kania A, Sarapata K. Multifarious aspects of the chaos game representation and its applications in biological sequence analysis. Comput Biol Med 2022; 151:106243. [PMID: 36335814 DOI: 10.1016/j.compbiomed.2022.106243] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 06/26/2022] [Revised: 10/18/2022] [Accepted: 10/22/2022] [Indexed: 12/27/2022]
Abstract
Chaos game representation (CGR) has been successfully applied in bioinformatics for over 30 years, and many extensions have been proposed during that time. Numerical encoding of biological sequences is especially convenient for visualisation, alignment-free methods and input preparation for machine learning techniques. The development and applications of CGR have mainly concerned linear nucleotide sequences. However, there have also been some attempts to create a representation of proteins. The latter need to be more sophisticated, as arbitrary coordinates for amino acids do not reflect their properties, which is crucial during the encoding process. In this paper, we summarised the variants of CGR and their limitations. We began by studying the PROSITE motifs and showed the immense number of amino acid properties employed by different proteins. To this end, we used Principal Component Analysis (PCA) and studied the relation between explained variance and the number of features that describe them. It appeared that even after many reductions, about 50 features are non-redundant. This was the reason we introduced an embedding concept from natural language processing, which enables adjusting features for a given list of sequences. We presented a simple neural network architecture with one hidden layer containing one neuron and showed that it provides satisfactory results in phylogenetic tree construction for the ND5 and SPARC proteins. To this end, we transformed the CGR representations of all considered sequences using the Discrete Fourier Transform (DFT) and applied the Unweighted Pair Group Method with Arithmetic Mean (UPGMA) algorithm. Moreover, we indicated some similarities between CGR and Recurrent Neural Networks (RNN). Finally, we attempted to include information about the RNA secondary structure and defined some measures to validate biological significance. We studied their properties and demonstrated their usefulness on the ALMV-3 example.
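For orientation, the classic nucleotide CGR that this review builds on can be written as an iterated midpoint rule: starting from the centre of the unit square, each base moves the current point halfway towards that base's corner. The sketch below assumes one common corner assignment and a hypothetical function name; it does not cover the protein or embedding variants discussed in the abstract.

```python
# Corner assignment is a convention; this particular layout is an assumption.
CORNERS = {"A": (0.0, 0.0), "C": (0.0, 1.0), "G": (1.0, 1.0), "T": (1.0, 0.0)}


def cgr_points(sequence, start=(0.5, 0.5)):
    """Return the CGR trajectory, one (x, y) point per base.

    Raises KeyError for characters other than A, C, G and T.
    """
    x, y = start
    points = []
    for base in sequence.upper():
        cx, cy = CORNERS[base]
        # Move halfway from the current point to the corner of this base.
        x, y = (x + cx) / 2, (y + cy) / 2
        points.append((x, y))
    return points
```

Each point depends on the whole prefix read so far, with older bases contributing geometrically less, which is one way to see the similarity to a recurrent state update noted in the abstract.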
Affiliation(s)
- Adrian Kania
- Department of Computational Biophysics and Bioinformatics, Faculty of Biochemistry, Biophysics and Biotechnology, Jagiellonian University, Gronostajowa 7, 30-387 Cracow, Poland.
| | - Krzysztof Sarapata
- Department of Computational Biophysics and Bioinformatics, Faculty of Biochemistry, Biophysics and Biotechnology, Jagiellonian University, Gronostajowa 7, 30-387 Cracow, Poland
| |
|
30
|
Löchel HF, Welzel M, Hattab G, Hauschild AC, Heider D. Fractal construction of constrained code words for DNA storage systems. Nucleic Acids Res 2021; 50:e30. [PMID: 34908135 PMCID: PMC8934655 DOI: 10.1093/nar/gkab1209] [Citation(s) in RCA: 15] [Impact Index Per Article: 3.8] [Received: 06/22/2021] [Revised: 11/16/2021] [Accepted: 11/24/2021] [Indexed: 12/29/2022] Open
Abstract
The use of complex biological molecules to solve computational problems is an emerging field at the interface between biology and computer science. There are two main categories in which biological molecules, especially DNA, are investigated as alternatives to silicon-based computer technologies. One is to use DNA as a storage medium, and the other is to use DNA for computing. Both strategies come with certain constraints. In the current study, we present a novel approach derived from chaos game representation for DNA to generate DNA code words that fulfill user-defined constraints, namely GC content, homopolymers, and undesired motifs, and thus, can be used to build codes for reliable DNA storage systems.
Affiliation(s)
- Hannah F Löchel
- Department of Mathematics and Computer Science, University of Marburg, Germany
| | - Marius Welzel
- Department of Mathematics and Computer Science, University of Marburg, Germany
| | - Georges Hattab
- Department of Mathematics and Computer Science, University of Marburg, Germany
| | | | - Dominik Heider
- Department of Mathematics and Computer Science, University of Marburg, Germany
| |
|