1
Wu Y, Xie X, Zhu J, Guan L, Li M. Overview and Prospects of DNA Sequence Visualization. Int J Mol Sci 2025; 26:477. [PMID: 39859192 PMCID: PMC11764684 DOI: 10.3390/ijms26020477] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/19/2024] [Revised: 12/30/2024] [Accepted: 01/04/2025] [Indexed: 01/27/2025] Open
Abstract
Due to advances in big data technology, deep learning, and knowledge engineering, biological sequence visualization has been extensively explored. In the post-genome era, biological sequence visualization enables the visual representation of both structured and unstructured biological sequence data. However, a universal visualization method for all types of sequences has not been reported. Biological sequence data are expanding exponentially, and the acquisition, extraction, fusion, and inference of knowledge from biological sequences are critical supporting technologies for visualization research. These areas are important and require in-depth exploration. This paper presents a comprehensive overview of visualization methods for DNA sequences from four perspectives (two-dimensional, three-dimensional, four-dimensional, and dynamic visualization approaches) and discusses the strengths and limitations of each method in detail. Furthermore, this paper proposes two potential future research directions for biological sequence visualization in response to the challenges of inefficient graphical feature extraction and knowledge association network generation in existing methods. The first direction is the construction of knowledge graphs for biological sequence big data, and the second direction is the cross-modal visualization of biological sequences using machine learning methods. This review is anticipated to provide valuable insights and contributions to computational biology, bioinformatics, genomic computing, genetic breeding, evolutionary analysis, and other related disciplines in the fields of biology, medicine, chemistry, statistics, and computing. It has important reference value for biological sequence recommendation systems and knowledge question answering systems.
Affiliation(s)
- Mengshan Li
  - School of Mathematics and Computer Science, Gannan Normal University, Ganzhou 341000, China; (Y.W.); (X.X.); (J.Z.); (L.G.)
2
Alipour F, Hill KA, Kari L. CGRclust: Chaos Game Representation for twin contrastive clustering of unlabelled DNA sequences. BMC Genomics 2024; 25:1214. [PMID: 39695938 DOI: 10.1186/s12864-024-11135-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/01/2024] [Accepted: 12/06/2024] [Indexed: 12/20/2024] Open
Abstract
BACKGROUND Traditional supervised learning methods applied to DNA sequence taxonomic classification rely on the labor-intensive and time-consuming step of labelling the primary DNA sequences. Additionally, standard DNA classification/clustering methods involve time-intensive multiple sequence alignments, which impacts their applicability to large genomic datasets or distantly related organisms. These limitations indicate a need for robust, efficient, and scalable unsupervised DNA sequence clustering methods that do not depend on sequence labels or alignment. RESULTS This study proposes CGRclust, a novel combination of unsupervised twin contrastive clustering of Chaos Game Representations (CGR) of DNA sequences, with convolutional neural networks (CNNs). To the best of our knowledge, CGRclust is the first method to use unsupervised learning for image classification (herein applied to two-dimensional CGR images) for clustering datasets of DNA sequences. CGRclust overcomes the limitations of traditional sequence classification methods by leveraging unsupervised twin contrastive learning to detect distinctive sequence patterns, without requiring DNA sequence alignment or biological/taxonomic labels. CGRclust accurately clustered twenty-five diverse datasets, with sequence lengths ranging from 664 bp to 100 kbp, including mitochondrial genomes of fish, fungi, and protists, as well as viral whole genome assemblies and synthetic DNA sequences. Compared with three recent clustering methods for DNA sequences (DeLUCS, iDeLUCS, and MeShClust v3.0), CGRclust is the only method that surpasses 81.70% accuracy across all four taxonomic levels tested for mitochondrial DNA genomes of fish. Moreover, CGRclust also consistently demonstrates superior performance across all the viral genomic datasets.
The high clustering accuracy of CGRclust on these twenty-five datasets, which vary significantly in terms of sequence length, number of genomes, number of clusters, and level of taxonomy, demonstrates its robustness, scalability, and versatility. CONCLUSION CGRclust is a novel, scalable, alignment-free DNA sequence clustering method that uses CGR images of DNA sequences and CNNs for twin contrastive clustering of unlabelled primary DNA sequences, achieving superior or comparable accuracy and performance over current approaches. CGRclust demonstrated enhanced reliability, by consistently achieving over 80% accuracy in more than 90% of the datasets analyzed. In particular, CGRclust performed especially well in clustering viral DNA datasets, where it consistently outperformed all competing methods.
Affiliation(s)
- Fatemeh Alipour
  - School of Computer Science, University of Waterloo, Waterloo, Canada
- Kathleen A Hill
  - Department of Biology, University of Western Ontario, London, Canada
- Lila Kari
  - School of Computer Science, University of Waterloo, Waterloo, Canada
3
Zhang YZ, Imoto S. Genome analysis through image processing with deep learning models. J Hum Genet 2024; 69:519-525. [PMID: 39085457 PMCID: PMC11422167 DOI: 10.1038/s10038-024-01275-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/08/2024] [Revised: 07/08/2024] [Accepted: 07/08/2024] [Indexed: 08/02/2024]
Abstract
Genomic sequences are traditionally represented as strings of characters: A (adenine), C (cytosine), G (guanine), and T (thymine). However, an alternative approach involves depicting sequence-related information through image representations, such as Chaos Game Representation (CGR) and read pileup images. With rapid advancements in deep learning (DL) methods within computer vision and natural language processing, there is growing interest in applying image-based DL methods to genomic sequence analysis. These methods involve encoding genomic information as images or integrating spatial information from images into the analytical process. In this review, we summarize three typical applications that use image processing with DL models for genome analysis. We examine the utilization and advantages of these image-based approaches.
Affiliation(s)
- Yao-Zhong Zhang
  - Division of Health Medical Intelligence, Human Genome Center, the Institute of Medical Science, the University of Tokyo, 4-6-1 Shirokanedai, Minato-ku, Tokyo, 108-8639, Japan
- Seiya Imoto
  - Division of Health Medical Intelligence, Human Genome Center, the Institute of Medical Science, the University of Tokyo, 4-6-1 Shirokanedai, Minato-ku, Tokyo, 108-8639, Japan
4
Li X, Zhou T, Feng X, Yau ST, Yau SST. Exploring geometry of genome space via Grassmann manifolds. Innovation (N Y) 2024; 5:100677. [PMID: 39206218 PMCID: PMC11350263 DOI: 10.1016/j.xinn.2024.100677] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/06/2024] [Accepted: 07/18/2024] [Indexed: 09/04/2024] Open
Abstract
It is important to understand the geometry of genome space in biology. After transforming genome sequences into frequency matrices of the chaos game representation (FCGR), we regard a genome sequence as a point in a suitable Grassmann manifold by analyzing the column space of the corresponding FCGR. To assess the sequence similarity, we employ the generalized Grassmannian distance, an intrinsic geometric distance that differs from the traditional Euclidean distance used in the classical k-mer frequency-based methods. With this method, we constructed phylogenetic trees for various genome datasets, including influenza A virus hemagglutinin gene, Orthocoronavirinae genome, and SARS-CoV-2 complete genome sequences. Our comparative analysis with multiple sequence alignment and alignment-free methods for large-scale sequences revealed that our method, which employs the subspace distance between the column spaces of different FCGRs (FCGR-SD), outperformed its competitors in terms of both speed and accuracy. In addition, we used low-dimensional visualization of the SARS-CoV-2 genome sequences and spike protein nucleotide sequences with our methods, resulting in some intriguing findings. We not only propose a novel and efficient algorithm for comparing genome sequences but also demonstrate that genome data have some intrinsic manifold structures, providing a new geometric perspective for molecular biology studies.
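The FCGR step mentioned in this abstract can be illustrated with a minimal sketch. It assumes a common corner convention (A bottom-left, C top-left, G top-right, T bottom-right) and places row 0 at the bottom of the unit square; neither choice is stated in the abstract, and the names `BITS` and `fcgr` are hypothetical. The column-space analysis and the generalized Grassmannian distance are not reproduced here.

```python
# Assumed corner convention (not specified in the abstract):
# each base contributes one (x_bit, y_bit) pair.
BITS = {"A": (0, 0), "C": (0, 1), "G": (1, 1), "T": (1, 0)}


def fcgr(seq, k):
    """Return a 2^k x 2^k grid of k-mer counts laid out as in a CGR plot.

    grid[row][col] counts the k-mers whose CGR point falls in that cell,
    with row 0 at the bottom (y in [0, 1/2^k)). K-mers containing
    symbols other than A, C, G, T are skipped.
    """
    n = 2 ** k
    grid = [[0] * n for _ in range(n)]
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if any(base not in BITS for base in kmer):
            continue
        col = row = 0
        # In a CGR walk the most recent base halves the distance last,
        # so it fixes the coarsest quadrant (the most significant bit).
        for j, base in enumerate(kmer):
            x_bit, y_bit = BITS[base]
            col |= x_bit << j
            row |= y_bit << j
        grid[row][col] += 1
    return grid
```

For example, `fcgr("ACGT", 1)` places one count in each of the four quadrants, and the counts in any grid sum to the number of valid k-mers in the sequence.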
Affiliation(s)
- Xiaoguang Li
  - School of Statistics and Management, Shanghai University of Finance and Economics, Shanghai 200433, China
- Tao Zhou
  - Department of Mathematical Sciences, Tsinghua University, Beijing 100084, China
- Xingdong Feng
  - School of Statistics and Management, Shanghai University of Finance and Economics, Shanghai 200433, China
- Shing-Tung Yau
  - Department of Mathematical Sciences, Tsinghua University, Beijing 100084, China
  - Yanqi Lake Beijing Institute of Mathematical Sciences and Applications, Beijing 101408, China
- Stephen S.-T. Yau
  - Department of Mathematical Sciences, Tsinghua University, Beijing 100084, China
  - Yanqi Lake Beijing Institute of Mathematical Sciences and Applications, Beijing 101408, China
5
Akbari Rokn Abadi S, Mohammadi A, Koohi S. PC-mer: An Ultra-fast memory-efficient tool for metagenomics profiling and classification. PLoS One 2024; 19:e0307279. [PMID: 39088438 PMCID: PMC11293629 DOI: 10.1371/journal.pone.0307279] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/17/2023] [Accepted: 07/02/2024] [Indexed: 08/03/2024] Open
Abstract
Feature extraction methods, such as k-mer-based methods, have recently taken on a significant role in classification and analysis approaches for metagenomics data. However, they are challenged by various bottlenecks, such as performance limitations, high memory consumption, and computational overhead. To deal with these challenges, we developed an innovative feature extraction and sequence profiling method for DNA/RNA sequences, called PC-mer, taking advantage of the physicochemical properties of nucleotides. Compared with k-mer profiling methods, PC-mer reduces memory usage by a factor of 2^k while improving metagenomics classification performance at various levels, for both machine learning-based and computational-based methods, and also achieves a speedup of more than 1000x in the training phase. Examining ML-based PC-mer on various datasets confirms that it can achieve 100% accuracy in classifying samples at the class, order, and family levels. Compared with k-mer-based classification methods, it also improves genus-level classification accuracy by more than 14% for the shotgun dataset (i.e., it achieves an accuracy of 97.5%) and by more than 5% for the amplicon dataset (i.e., it achieves an accuracy of 98.6%). Given these improvements, we provide two PC-mer-based tools that can replace the popular k-mer-based tools: one for classifying and another for comparing metagenomics data.
Affiliation(s)
- Somayyeh Koohi
  - Department of Computer Engineering, Sharif University of Technology, Tehran, Iran
6
Zimnyakov DA, Alonova MV, Lavrukhin MS, Lyapina AM, Feodorova VA. Polarization- and Chaos-Game-Based Fingerprinting of Molecular Targets of Listeria Monocytogenes Vaccine and Fully Virulent Strains. Curr Issues Mol Biol 2023; 45:10056-10078. [PMID: 38132474 PMCID: PMC10742786 DOI: 10.3390/cimb45120628] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/26/2023] [Revised: 12/07/2023] [Accepted: 12/11/2023] [Indexed: 12/23/2023] Open
Abstract
Two approaches to the synthesis of 2D binary identifiers ("fingerprints") of DNA-associated symbol sequences are considered in this paper. One of these approaches is based on the simulation of polarization-dependent diffraction patterns formed by reading the modeled DNA-associated 2D phase-modulating structures with a coherent light beam. In this case, 2D binarized distributions of close-to-circular extreme polarization states are applied as fingerprints of analyzed nucleotide sequences. The second approach is based on the transformation of the DNA-associated chaos game representation (CGR) maps into finite-dimensional binary matrices. In both cases, the differences between the structures of the analyzed and reference symbol sequences are quantified by calculating the correlation coefficient of the synthesized binary matrices. A comparison of the approaches under consideration is carried out using symbol sequences corresponding to nucleotide sequences of the hly gene from the vaccine and wild-type strains of Listeria monocytogenes as the analyzed objects. These strains differ in terms of the number of substituted nucleotides in relation to the vaccine strain selected as a reference. The results of the performed analysis allow us to conclude that the identification of structural differences in the DNA-associated symbolic sequences is significantly more efficient when using the binary distributions of close-to-circular extreme polarization states. The given approach may be applicable to the genetic differentiation of infected from vaccinated animals (DIVA).
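The comparison step described in this abstract (a correlation coefficient between two binary matrices) can be sketched as follows. The abstract does not say which coefficient is used, so the Pearson coefficient here is an assumption, and the function name `matrix_correlation` is hypothetical.

```python
from math import sqrt


def matrix_correlation(a, b):
    """Pearson correlation between two equally shaped 0/1 matrices.

    The matrices are flattened row by row; the choice of the Pearson
    coefficient is an assumption, not taken from the abstract.
    """
    xs = [v for row in a for v in row]
    ys = [v for row in b for v in row]
    if len(xs) != len(ys) or not xs:
        raise ValueError("matrices must be non-empty and the same size")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant matrix")
    return cov / sqrt(var_x * var_y)
```

Identical fingerprints give a coefficient of 1, and a fingerprint compared with its bitwise complement gives -1, so values below 1 indicate structural differences from the reference.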
Affiliation(s)
- Dmitry A. Zimnyakov
  - Physics Department, Yury Gagarin State Technical University of Saratov, 77 Polytechnicheskaya Str., 410054 Saratov, Russia
  - Laboratory for Fundamental and Applied Research, Saratov State University of Genetics, Biotechnology and Engineering Named after N.I. Vavilov, 335 Sokolovaya Str., 410005 Saratov, Russia; (M.S.L.); (A.M.L.); (V.A.F.)
- Marina V. Alonova
  - Physics Department, Yury Gagarin State Technical University of Saratov, 77 Polytechnicheskaya Str., 410054 Saratov, Russia
- Maxim S. Lavrukhin
  - Laboratory for Fundamental and Applied Research, Saratov State University of Genetics, Biotechnology and Engineering Named after N.I. Vavilov, 335 Sokolovaya Str., 410005 Saratov, Russia; (M.S.L.); (A.M.L.); (V.A.F.)
- Anna M. Lyapina
  - Laboratory for Fundamental and Applied Research, Saratov State University of Genetics, Biotechnology and Engineering Named after N.I. Vavilov, 335 Sokolovaya Str., 410005 Saratov, Russia; (M.S.L.); (A.M.L.); (V.A.F.)
- Valentina A. Feodorova
  - Laboratory for Fundamental and Applied Research, Saratov State University of Genetics, Biotechnology and Engineering Named after N.I. Vavilov, 335 Sokolovaya Str., 410005 Saratov, Russia; (M.S.L.); (A.M.L.); (V.A.F.)
  - Department for Microbiology and Biotechnology, Saratov State University of Genetics, Biotechnology and Engineering Named after N.I. Vavilov, 335 Sokolovaya Str., 410005 Saratov, Russia
7
Akbari Rokn Abadi S, Mohammadi A, Koohi S. A new profiling approach for DNA sequences based on the nucleotides' physicochemical features for accurate analysis of SARS-CoV-2 genomes. BMC Genomics 2023; 24:266. [PMID: 37202721 PMCID: PMC10193333 DOI: 10.1186/s12864-023-09373-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/20/2022] [Accepted: 05/11/2023] [Indexed: 05/20/2023] Open
Abstract
BACKGROUND The prevalence of the COVID-19 disease in recent years and its widespread impact on mortality, as well as various aspects of life around the world, has made it important to study this disease and its viral cause. However, very long sequences of this virus increase the processing time, complexity of calculation, and memory consumption required by the available tools to compare and analyze the sequences. RESULTS We present a new encoding method, named PC-mer, based on the k-mer and physicochemical properties of nucleotides. This method reduces the size of encoded data by a factor of around 2^k compared to the classical k-mer based profiling method. Moreover, using PC-mer, we designed two tools: 1) a machine-learning-based classification tool for coronavirus family members with the ability to receive input sequences from the NCBI database, and 2) an alignment-free computational comparison tool for calculating dissimilarity scores between coronaviruses at the genus and species levels. CONCLUSIONS PC-mer achieves 100% accuracy despite the use of very simple classification algorithms based on Machine Learning. Assuming dynamic programming-based pairwise alignment as the ground truth approach, we achieved a degree of convergence of more than 98% for coronavirus genus-level sequences and 93% for SARS-CoV-2 sequences using PC-mer in the alignment-free classification method. This outperformance of PC-mer suggests that it can serve as a replacement for alignment-based approaches in certain sequence analysis applications that rely on similarity/dissimilarity scores, such as searching sequences, comparing sequences, and certain types of phylogenetic analysis methods that are based on sequence comparison.
Affiliation(s)
- Somayyeh Koohi
  - Department of Computer Engineering, Sharif University of Technology, Tehran, Iran
8
Lichtblau D, Stoean C. Chaos game representation for authorship attribution. Artif Intell 2023. [DOI: 10.1016/j.artint.2023.103858] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/19/2023]
9
Abadi SAR, Mohammadi A, Koohi S. An automated ultra-fast, memory-efficient, and accurate method for viral genome classification. J Biomed Inform 2023; 139:104316. [PMID: 36781036 DOI: 10.1016/j.jbi.2023.104316] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/03/2022] [Revised: 01/30/2023] [Accepted: 02/08/2023] [Indexed: 02/13/2023]
Abstract
The classification of different organisms into subtypes is one of the most important tools of organism studies, and the classification of viruses in particular has been the focus of many studies due to its use in virology and epidemiology. Many methods have been proposed to classify viruses, some of which are designed for a specific family of organisms and some of which are more general. Still, especially for certain categories such as Influenza and HIV, classification faces performance challenges as well as processing and memory bottlenecks. To address this, we designed an automated classifier, called PC-mer, that is based on k-mer and physicochemical characteristics of nucleotides. It reduces the number of features by a factor of about 2^k compared to the alternative k-mer-based methods, and, unlike integer and one-hot encoding methods, it keeps the number of features constant as the sequence length grows. It also increases the training speed by an average of 17.93 times. This improvement in processing complexity is provided while PC-mer also improves classification performance for a variety of virus families.
Affiliation(s)
- Amirhossein Mohammadi
  - No 717, Department of Computer Engineering, Sharif University of Technology, Tehran, Iran
- Somayyeh Koohi
  - No 717, Department of Computer Engineering, Sharif University of Technology, Tehran, Iran
10
Harrison TMR, Rudar J, Ogden N, Steeves R, Lapen DR, Baird D, Gagné N, Lung O. In silico identification of multiple conserved motifs within the control region of Culicidae mitogenomes. Sci Rep 2022; 12:21920. [PMID: 36536037 PMCID: PMC9763401 DOI: 10.1038/s41598-022-26236-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/11/2022] [Accepted: 12/12/2022] [Indexed: 12/23/2022] Open
Abstract
Mosquitoes are important vectors for human and animal diseases. Genetic markers, like the mitochondrial COI gene, can facilitate the taxonomic classification of disease vectors, vector-borne disease surveillance, and prevention. Within the control region (CR) of the mitochondrial genome, there exists a highly variable and poorly studied non-coding AT-rich area that contains the origin of replication. Although the CR hypervariable region has been used for species differentiation of some animals, few studies have investigated the mosquito CR. In this study, we analyze the mosquito mitogenome CR sequences from 125 species and 17 genera. We discovered four conserved motifs located 80 to 230 bp upstream of the 12S rRNA gene. Two of these motifs were found within all 392 Anopheles (An.) CR sequences while the other two motifs were identified in all 37 Culex (Cx.) CR sequences. However, only 3 of the 304 non-Culicidae Dipteran mitogenome CR sequences contained these motifs. Interestingly, the short motif found in all 37 Culex sequences had poly-A and poly-T stretch of similar length that is predicted to form a stable hairpin. We show that supervised learning using the frequency chaos game representation of the CR can be used to differentiate mosquito genera from their dipteran relatives.
Affiliation(s)
- Thomas M R Harrison
  - Canadian Food Inspection Agency, National Centre for Foreign Animal Disease, 1015 Arlington St. Winnipeg, Manitoba, R3M 3E4, Canada
- Josip Rudar
  - Canadian Food Inspection Agency, National Centre for Foreign Animal Disease, 1015 Arlington St. Winnipeg, Manitoba, R3M 3E4, Canada
- Nicholas Ogden
  - Public Health Risk Sciences Division, National Microbiology Laboratory, Public Health Agency of Canada, Saint-Hyacinthe, QC, Canada
- Royce Steeves
  - Gulf Fisheries Centre, Fisheries & Oceans Canada, Moncton, New Brunswick, Canada
- David R Lapen
  - Ottawa Research Development Centre, Agriculture & Agri-Food Canada, Ottawa, ON, K1A 0C6, Canada
- Donald Baird
  - Environment and Climate Change Canada, Canadian Rivers Institute, Department of Biology, University of New Brunswick, Fredericton, NB, Canada
- Nellie Gagné
  - Gulf Fisheries Centre, Fisheries & Oceans Canada, Moncton, New Brunswick, Canada
- Oliver Lung
  - Canadian Food Inspection Agency, National Centre for Foreign Animal Disease, 1015 Arlington St. Winnipeg, Manitoba, R3M 3E4, Canada
  - Department of Biological Sciences, University of Manitoba, Winnipeg, MB, Canada
11
Uddin M, Islam MK, Hassan MR, Jahan F, Baek JH. A fast and efficient algorithm for DNA sequence similarity identification. Complex Intell Syst 2022; 9:1265-1280. [PMID: 36035628 PMCID: PMC9395857 DOI: 10.1007/s40747-022-00846-y] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 09/02/2021] [Accepted: 08/05/2022] [Indexed: 11/22/2022]
Abstract
DNA sequence similarity analysis is necessary for numerous purposes, including genome analysis, extracting biological information, and finding the evolutionary relationships of species. There are two types of sequence analysis: alignment-based (AB) and alignment-free (AF). AB is effective for small homologous sequences but becomes an NP-hard problem for long sequences. AF algorithms can overcome the major limitations of AB, but most existing AF methods show high time complexity and memory consumption, lower precision, and lower performance on benchmark datasets. To minimize these limitations, we develop an AF algorithm using a 2D k-mer count matrix inspired by the CGR approach. We then shrink the matrix by analyzing the neighbors and measure similarities using the best combinations of pairwise distance (PD) and phylogenetic tree methods. We also dynamically choose the value of k for k-mer. We develop an efficient system for finding the positions of k-mer in the count matrix. We apply our system to six different datasets. We achieve the top rank for two benchmark datasets from AFproject, 100% accuracy for two datasets (16 S Ribosomal, 18 Eutherian), and achieve a milestone for time complexity and memory consumption in comparison to the existing study datasets (HEV, HIV-1). Therefore, the comparative results of the benchmark datasets and existing studies demonstrate that our method is highly effective, efficient, and accurate. Thus, our method can be used with the top level of authenticity for DNA sequence similarity measurement.
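The general alignment-free idea in this abstract (count k-mers, then compare sequences by a pairwise distance) can be sketched as below. This is not the authors' algorithm: the matrix shrinking, the dynamic choice of k, and the position lookup are omitted, the Euclidean distance is an assumed choice, and the function names are hypothetical.

```python
from collections import Counter
from math import sqrt


def kmer_profile(seq, k):
    """Relative frequency of each k-mer in seq (overlapping windows)."""
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence is shorter than k")
    return {kmer: c / total for kmer, c in counts.items()}


def euclidean(p, q):
    """Euclidean distance between two sparse k-mer profiles."""
    keys = set(p) | set(q)
    return sqrt(sum((p.get(x, 0.0) - q.get(x, 0.0)) ** 2 for x in keys))


def distance_matrix(seqs, k):
    """Symmetric matrix of pairwise profile distances."""
    profiles = [kmer_profile(s, k) for s in seqs]
    return [[euclidean(p, q) for q in profiles] for p in profiles]
```

A matrix produced this way could then be passed to a tree-building method such as neighbor joining or UPGMA; which combination works best is the kind of question the abstract says the authors evaluated.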
12
Li W, Yang L, Qiu Y, Yuan Y, Li X, Meng Z. FFP: joint Fast Fourier transform and fractal dimension in amino acid property-aware phylogenetic analysis. BMC Bioinformatics 2022; 23:347. [PMID: 35986255 PMCID: PMC9392226 DOI: 10.1186/s12859-022-04889-3] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 05/30/2022] [Accepted: 08/11/2022] [Indexed: 11/10/2022] Open
Abstract
Background
Amino acid property-aware phylogenetic analysis (APPA) refers to the phylogenetic analysis method based on amino acid property encoding, which is used for understanding and inferring evolutionary relationships between species from the molecular perspective. Fast Fourier transform (FFT) and Higuchi’s fractal dimension (HFD) have excellent performance in describing sequences’ structural and complexity information for APPA. However, with the exponential growth of protein sequence data, it is very important to develop a reliable APPA method for protein sequence analysis.
Results
Consequently, we propose a new method named FFP, which combines FFT and HFD. First, FFP encodes protein sequences on the basis of an important physicochemical property of amino acids, the dissociation constant, which determines the acidity and basicity of protein molecules. Second, FFT and HFD are used to generate the feature vectors of the encoded sequences, after which the distance matrix is calculated from the cosine function, which describes the degree of similarity between species. The smaller the distance between them, the more similar they are. Finally, the phylogenetic tree is constructed. When FFP is tested for phylogenetic analysis on four groups of protein sequences, the results are clearly better than those of the compared methods, with the highest accuracy exceeding 97%.
Conclusion
FFP has higher accuracy in APPA and multi-sequence alignment. It can also measure protein sequence similarity effectively, and it is hoped to play a role in APPA-related research.
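The feature pipeline outlined in this abstract (spectral features plus Higuchi's fractal dimension, compared by a cosine distance) can be sketched roughly as follows. The input is assumed to be a sequence already encoded as numbers; the dissociation-constant encoding itself is not reproduced. A naive discrete Fourier transform stands in for the FFT, and the function names and the parameters `n_coeffs` and `k_max` are hypothetical.

```python
import cmath
import math


def dft_magnitudes(signal, n_coeffs):
    """Magnitudes of the first n_coeffs DFT coefficients (naive O(N^2))."""
    n = len(signal)
    return [abs(sum(x * cmath.exp(-2j * cmath.pi * f * t / n)
                    for t, x in enumerate(signal)))
            for f in range(n_coeffs)]


def higuchi_fd(signal, k_max):
    """Higuchi fractal dimension: slope of ln L(k) against ln(1/k)."""
    n = len(signal)
    if k_max < 2 or n < 2 * k_max:
        raise ValueError("need k_max >= 2 and len(signal) >= 2 * k_max")
    xs, ys = [], []
    for k in range(1, k_max + 1):
        lengths = []
        for m in range(k):  # m is the 0-based start offset
            steps = (n - 1 - m) // k
            total = sum(abs(signal[m + i * k] - signal[m + (i - 1) * k])
                        for i in range(1, steps + 1))
            lengths.append(total * (n - 1) / (steps * k) / k)
        mean_len = sum(lengths) / k
        if mean_len <= 0:
            raise ValueError("undefined for a constant signal")
        xs.append(math.log(1.0 / k))
        ys.append(math.log(mean_len))
    # Ordinary least-squares slope.
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    return (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
            / sum((x - mx) ** 2 for x in xs))


def cosine_distance(u, v):
    """1 minus the cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm


def features(signal, n_coeffs, k_max):
    """Concatenate spectral magnitudes with the fractal dimension."""
    return dft_magnitudes(signal, n_coeffs) + [higuchi_fd(signal, k_max)]
```

As a sanity check on the fractal dimension, a straight line has dimension 1. How the two feature types are weighted or truncated in FFP is not stated in the abstract, so the simple concatenation in `features` is only one possible reading.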
13
Swain MT, Vickers M. Interpreting alignment-free sequence comparison: what makes a score a good score? NAR Genom Bioinform 2022; 4:lqac062. [PMID: 36071721 PMCID: PMC9442500 DOI: 10.1093/nargab/lqac062] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/01/2022] [Revised: 07/01/2022] [Accepted: 08/16/2022] [Indexed: 11/13/2022] Open
Abstract
Alignment-free methods are alternatives to alignment-based methods when searching sequence data sets. The output from an alignment-free sequence comparison is a similarity score, the interpretation of which is not straightforward. We propose objective functions to interpret and calibrate outputs from alignment-free searches, noting that different objective functions are necessary for different biological contexts. This leads to advantages: visualising and comparing score distributions, including those from true positives, may be a relatively simple method to gain insight into the performance of different metrics. Using an empirical approach with both DNA and protein sequences, we characterise different similarity score distributions generated under different parameters. In particular, we demonstrate how sequence length can affect the scores. We show that scores of true positive sequence pairs may correlate significantly with their mean length; and even if the correlation is weak, the relative difference in length of the sequence pair may significantly reduce the effectiveness of alignment-free metrics. Importantly, we show how objective functions can be used with test data to accurately estimate the probability of true positives. This can significantly increase the utility of alignment-free approaches. Finally, we have developed a general-purpose software tool called KAST for use in high-throughput workflows on Linux clusters.
Affiliation(s)
- Martin T Swain
  - Department of Life Sciences, Aberystwyth University, Penglais, Aberystwyth, Ceredigion, SY23 3DA, UK
- Martin Vickers
  - The John Innes Centre, Norwich Research Park, Norwich NR4 7UH, UK
14
4D-Dynamic Representation of DNA/RNA Sequences: Studies on Genetic Diversity of Echinococcus multilocularis in Red Foxes in Poland. Life (Basel, Switzerland) 2022; 12:life12060877. [PMID: 35743908 PMCID: PMC9227292 DOI: 10.3390/life12060877] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/25/2022] [Revised: 05/20/2022] [Accepted: 06/08/2022] [Indexed: 11/17/2022]
Abstract
The 4D-Dynamic Representation of DNA/RNA Sequences, an alignment-free bioinformatics method recently developed by us, has been used to study the genetic diversity of Echinococcus multilocularis in red foxes in Poland. Sequences of three mitochondrial genes, i.e., NADH dehydrogenase subunit 2 (nad2), cytochrome b (cob), and cytochrome c oxidase subunit 1 (cox1), are analyzed. The sequences are represented by sets of material points in a 4D space, i.e., 4D-dynamic graphs. As a visualization of the sequences, projections of the graphs into 3D space are shown. The differences between 3D graphs corresponding to European, Asian, and American haplotypes are small. Numerical characteristics (sequence descriptors) applied in the studies can recognize the differences. The concept of creating descriptors of 4D-dynamic graphs has been borrowed from classical dynamics; these are coordinates of the centers of mass and moments of inertia of 4D-dynamic graphs. Based on these descriptors, classification maps are constructed. The concentrations of points in the maps indicate one Polish haplotype (EmPL9) of Asian origin.
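The descriptors named in this abstract (centers of mass and moments of inertia of a point set in 4D) can be illustrated with a toy sketch. The base-to-axis assignment, the unit masses, and the use of the diagonal of the inertia tensor instead of principal moments are all assumptions made for illustration; the names `STEPS`, `walk_4d`, `center_of_mass`, and `inertia_diagonal` are hypothetical.

```python
# Assumed mapping: each base is a unit step along its own axis.
STEPS = {
    "A": (1, 0, 0, 0),
    "C": (0, 1, 0, 0),
    "G": (0, 0, 1, 0),
    "T": (0, 0, 0, 1),
    "U": (0, 0, 0, 1),  # RNA sequences reuse the T axis
}


def walk_4d(seq):
    """Turn a sequence into a list of 4D points by accumulating steps."""
    pos = [0, 0, 0, 0]
    points = []
    for base in seq.upper():
        step = STEPS.get(base)
        if step is None:
            continue  # skip ambiguous symbols
        pos = [p + s for p, s in zip(pos, step)]
        points.append(tuple(pos))
    return points


def center_of_mass(points):
    """Mean position of the points, each with unit mass."""
    if not points:
        raise ValueError("no points")
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(4))


def inertia_diagonal(points):
    """Diagonal of the inertia tensor about the center of mass.

    Entry i is the sum over points of (r^2 - d_i^2), where d is the
    offset from the center of mass and r its length.
    """
    mu = center_of_mass(points)
    diag = [0.0] * 4
    for p in points:
        d = [p[i] - mu[i] for i in range(4)]
        r2 = sum(v * v for v in d)
        for i in range(4):
            diag[i] += r2 - d[i] * d[i]
    return tuple(diag)
```

Two sequences can then be compared through these eight numbers, which is the general idea behind plotting descriptors against each other in a classification map.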
Collapse
|
15
|
Paul T, Vainio S, Roning J. Detection of intra-family coronavirus genome sequences through graphical representation and artificial neural network. EXPERT SYSTEMS WITH APPLICATIONS 2022; 194:116559. [PMID: 35095217 PMCID: PMC8779865 DOI: 10.1016/j.eswa.2022.116559] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 05/20/2021] [Revised: 12/29/2021] [Accepted: 01/16/2022] [Indexed: 05/06/2023]
Abstract
In this study, chaos game representation (CGR) is introduced for investigating the pattern of genome sequences. It is an image representation of the genome for the overall visualization of the sequence. CGR is a mapping technique that assigns each base of the sequence to a position in the two-dimensional plane to portray the DNA sequence. Importantly, CGR provides a one-to-one mapping to individual nucleotides as well as to the sequence. A coordinate in the CGR plane identifies the corresponding base and its location in the original genome. Therefore, the whole nucleotide sequence (up to the current nucleotide) can be restored from a single point of the CGR. In this study, CGR coupled with an artificial neural network (ANN) is introduced as a new way to represent the genome and to classify intra-coronavirus sequences. A hierarchical clustering study was performed to validate the approach, which was found to be more than 90% accurate when its result was compared with the phylogenetic tree of the corresponding genomes. Interestingly, the method makes the genome sequence significantly shorter (more than 99% compressed), saving data space while preserving the genome features.
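The mapping and the "restore the sequence from one point" property described above can be sketched as follows. This is a minimal illustration, not the authors' code: the corner assignment (A, C, G, T on the unit square) is a commonly used layout assumed here, and the function names are hypothetical. Exact recovery with ordinary floating-point numbers only holds for short sequences (roughly 50 bases); longer inputs would need exact arithmetic.

```python
# Assumed corner layout on the unit square; other layouts are possible.
CORNERS = {"A": (0.0, 0.0), "C": (0.0, 1.0), "G": (1.0, 1.0), "T": (1.0, 0.0)}


def cgr_points(seq):
    """Return the CGR trajectory: each point is halfway to the base's corner."""
    x, y = 0.5, 0.5  # start at the centre of the square
    points = []
    for base in seq.upper():
        cx, cy = CORNERS[base]
        x, y = (x + cx) / 2, (y + cy) / 2
        points.append((x, y))
    return points


def decode_last_bases(point, n):
    """Recover the last n bases from a single CGR point by undoing the halving."""
    x, y = point
    bases = []
    for _ in range(n):
        # The quadrant containing the point identifies the most recent base.
        cx = 1.0 if x > 0.5 else 0.0
        cy = 1.0 if y > 0.5 else 0.0
        bases.append(next(b for b, c in CORNERS.items() if c == (cx, cy)))
        x, y = 2 * x - cx, 2 * y - cy  # step back to the previous point
    return "".join(reversed(bases))


seq = "ACGTTGCA"
last_point = cgr_points(seq)[-1]
print(decode_last_bases(last_point, len(seq)))
```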
Affiliation(s)
- Tirthankar Paul: InfoTech Oulu, Faculty of Information Technology and Electrical Engineering, Biomimetics and Intelligent Systems Group (BISG), University of Oulu, Oulu, Finland
- Seppo Vainio: Infotech Oulu and Kvantum Institute, Faculty of Biochemistry and Molecular Medicine, Disease Networks, University of Oulu, Oulu, Finland
- Juha Roning: InfoTech Oulu, Faculty of Information Technology and Electrical Engineering, Biomimetics and Intelligent Systems Group (BISG), University of Oulu, Oulu, Finland
|
16
|
Akbari Rokn Abadi S, Mohammadi A, Koohi S. WalkIm: Compact image-based encoding for high-performance classification of biological sequences using simple tuning-free CNNs. PLoS One 2022; 17:e0267106. [PMID: 35427371 PMCID: PMC9012348 DOI: 10.1371/journal.pone.0267106] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 10/06/2021] [Accepted: 04/01/2022] [Indexed: 11/28/2022] Open
Abstract
The classification of biological sequences is an open issue for a variety of data sets, such as viral and metagenomics sequences. Many studies therefore use neural networks, the best-known tools in this field, and focus on designing customized network structures. However, few works focus on more effective factors, such as the input encoding method or the implementation technology, to address accuracy and efficiency issues in this area. In this work, we therefore propose an image-based encoding method called WalkIm, whose adoption, even in a simple neural network, provides competitive accuracy and superior efficiency compared to existing classification methods (e.g. VGDC, CASTOR, and DLM-CNN) for a variety of biological sequences. When used to classify various data sets (i.e. virus whole-genome data, metagenomics read data, and metabarcoding data), WalkIm achieves the same performance as the existing methods, without requiring parameter initialization or network architecture adjustment for each data set. It is worth noting that even when classifying high-mutant data sets, such as coronaviruses, it achieves almost 100% accuracy in classifying their various types. In addition, WalkIm achieves fast convergence during network training, as well as a reduction in network complexity. The WalkIm method therefore makes it possible to run the classifying neural networks on a normal desktop system in a short time. Moreover, we addressed the compatibility of the WalkIm encoding method with free-space optical processing technology. Taking advantage of an optical implementation of the convolutional layers, we illustrated that the training time can be reduced by up to 500 times. In addition to all the aforementioned advantages, this encoding method preserves the structure of the generated images under various sequence transformations, such as reverse complement, complement, and reverse.
Affiliation(s)
- Somayyeh Koohi: Department of Computer Engineering, Sharif University of Technology, Tehran, Iran
|
17
|
Löchel HF, Heider D. Chaos game representation and its applications in bioinformatics. Comput Struct Biotechnol J 2021; 19:6263-6271. [PMID: 34900136 PMCID: PMC8636998 DOI: 10.1016/j.csbj.2021.11.008] [Citation(s) in RCA: 28] [Impact Index Per Article: 7.0] [Received: 08/26/2021] [Revised: 11/04/2021] [Accepted: 11/05/2021] [Indexed: 11/18/2022] Open
Abstract
Chaos game representation (CGR), a milestone in graphical bioinformatics, has become a powerful tool regarding alignment-free sequence comparison and feature encoding for machine learning. The algorithm maps a sequence to 2-dimensional space, while an extension of the CGR, the so-called frequency matrix representation (FCGR), transforms sequences of different lengths into equal-sized images or matrices. The CGR is a generalized Markov chain and includes various properties, which allow a unique representation of a sequence. Therefore, it has a broad spectrum of applications in bioinformatics, such as sequence comparison and phylogenetic analysis and as an encoding of sequences for machine learning. This review introduces the construction of CGRs and FCGRs, their applications on DNA and proteins, and gives an overview of recent applications and progress in bioinformatics.
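To make the "equal-sized matrices" property of the FCGR concrete, here is a minimal sketch that counts k-mers into a 2^k by 2^k grid whose cells follow the CGR quadrant structure. The corner layout, the row/column orientation, and the decision to skip k-mers containing non-ACGT symbols are assumptions of this example rather than details taken from the review.

```python
# Assumed corner layout (bit pairs); matches a common CGR convention.
CORNERS = {"A": (0, 0), "C": (0, 1), "G": (1, 1), "T": (1, 0)}


def fcgr(seq, k=2):
    """Return a 2**k x 2**k matrix of k-mer counts arranged as in a CGR."""
    size = 2 ** k
    grid = [[0] * size for _ in range(size)]
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if any(base not in CORNERS for base in kmer):
            continue  # skip ambiguous symbols such as N
        x = y = 0
        # The last base of the k-mer selects the coarsest quadrant,
        # so it contributes the most significant bit of each index.
        for base in reversed(kmer):
            bx, by = CORNERS[base]
            x = 2 * x + bx
            y = 2 * y + by
        grid[y][x] += 1
    return grid


# Sequences of different lengths yield matrices of the same shape.
small = fcgr("ACGTACGT", k=2)
large = fcgr("ACGTTGCAACGT" * 5, k=2)
print(len(small), len(large))
```

The fixed output shape is what lets FCGR matrices be fed directly to image-oriented models or compared cell by cell.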
Affiliation(s)
- Hannah Franziska Löchel: Department of Mathematics and Computer Science, University of Marburg, Hans-Meerwein-Str. 6, D-35032 Marburg, Germany
- Dominik Heider: Department of Mathematics and Computer Science, University of Marburg, Hans-Meerwein-Str. 6, D-35032 Marburg, Germany
|
18
|
Ren Y, Chakraborty T, Doijad S, Falgenhauer L, Falgenhauer J, Goesmann A, Hauschild AC, Schwengers O, Heider D. Prediction of antimicrobial resistance based on whole-genome sequencing and machine learning. Bioinformatics 2021; 38:325-334. [PMID: 34613360 PMCID: PMC8722762 DOI: 10.1093/bioinformatics/btab681] [Citation(s) in RCA: 69] [Impact Index Per Article: 17.3] [Received: 06/29/2021] [Revised: 08/27/2021] [Accepted: 09/24/2021] [Indexed: 02/03/2023] Open
Abstract
Motivation: Antimicrobial resistance (AMR) is one of the biggest global problems threatening human and animal health. Rapid and accurate AMR diagnostic methods are thus urgently needed. However, traditional antimicrobial susceptibility testing (AST) is time-consuming, low throughput and viable only for cultivable bacteria. Machine learning methods may pave the way for automated AMR prediction based on genomic data of the bacteria. However, a comparison of different machine learning methods for the prediction of AMR based on different encodings and whole-genome sequencing data, without prior knowledge, remains to be done. Results: In this study, we evaluated logistic regression (LR), support vector machine (SVM), random forest (RF) and convolutional neural network (CNN) for the prediction of AMR for the antibiotics ciprofloxacin, cefotaxime, ceftazidime and gentamicin. We could demonstrate that these models can effectively predict AMR with label encoding, one-hot encoding and frequency matrix chaos game representation (FCGR encoding) on whole-genome sequencing data. We trained these models on a large AMR dataset and evaluated them on an independent public dataset. Generally, RFs and CNNs perform better than LR and SVM, with AUCs up to 0.96. Furthermore, we were able to identify mutations that are associated with AMR for each antibiotic. Availability and implementation: Source code for data preparation and model training is provided on GitHub (https://github.com/YunxiaoRen/ML-iAMR). Supplementary information: Supplementary data are available at Bioinformatics online.
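Two of the encodings named above, label encoding and one-hot encoding, can be sketched in a few lines. The specific integer codes (A=1 through T=4, 0 for anything else) and the all-zero vector for unknown symbols are assumptions of this example; the abstract does not specify the scheme used in the study.

```python
ALPHABET = "ACGT"


def label_encode(seq):
    """Map each base to an integer code; unknown symbols map to 0 (assumed)."""
    return [ALPHABET.index(b) + 1 if b in ALPHABET else 0 for b in seq.upper()]


def one_hot_encode(seq):
    """Map each base to a 4-element indicator vector; unknowns are all zeros."""
    return [[1 if b == a else 0 for a in ALPHABET] for b in seq.upper()]


print(label_encode("ACGTN"))
print(one_hot_encode("AG"))
```

Label encoding imposes an artificial ordering on the bases, which one-hot encoding avoids at the cost of four times as many features.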
Affiliation(s)
- Yunxiao Ren: Department of Data Science in Biomedicine, Faculty of Mathematics and Computer Science, Philipps-University of Marburg, Marburg 35032, Germany
- Trinad Chakraborty: Institute of Medical Microbiology, Justus Liebig University Giessen, Giessen 35392, Germany; German Center for Infection Research, Partner site Giessen-Marburg-Langen, Giessen 35392, Germany
- Swapnil Doijad: Institute of Medical Microbiology, Justus Liebig University Giessen, Giessen 35392, Germany; German Center for Infection Research, Partner site Giessen-Marburg-Langen, Giessen 35392, Germany
- Linda Falgenhauer: German Center for Infection Research, Partner site Giessen-Marburg-Langen, Giessen 35392, Germany; Institute of Hygiene and Environmental Medicine, Justus Liebig University Giessen, Giessen 35392, Germany; Hessisches universitäres Kompetenzzentrum Krankenhaushygiene, Giessen 35392, Germany
- Jane Falgenhauer: Institute of Medical Microbiology, Justus Liebig University Giessen, Giessen 35392, Germany; German Center for Infection Research, Partner site Giessen-Marburg-Langen, Giessen 35392, Germany
- Alexander Goesmann: German Center for Infection Research, Partner site Giessen-Marburg-Langen, Giessen 35392, Germany; Department of Bioinformatics and Systems Biology, Justus Liebig University Giessen, Giessen 35392, Germany
- Anne-Christin Hauschild: Department of Data Science in Biomedicine, Faculty of Mathematics and Computer Science, Philipps-University of Marburg, Marburg 35032, Germany
- Oliver Schwengers: German Center for Infection Research, Partner site Giessen-Marburg-Langen, Giessen 35392, Germany; Department of Bioinformatics and Systems Biology, Justus Liebig University Giessen, Giessen 35392, Germany
|
19
|
Ni H, Mu H, Qi D. Applying frequency chaos game representation with perceptual image hashing to gene sequence phylogenetic analyses. J Mol Graph Model 2021; 107:107942. [PMID: 34058640 DOI: 10.1016/j.jmgm.2021.107942] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Received: 01/14/2021] [Revised: 04/16/2021] [Accepted: 05/10/2021] [Indexed: 11/28/2022]
Abstract
As a very important research direction in the field of bioinformatics, sequence alignment plays a vital role in the research and development of biology. Converting a genome sequence to a graph using frequency chaos game representation (FCGR) is an excellent gene sequence mapping technology, which can store rich genetic information in FCGR graphics. For each FCGR image, we construct its perceptual image hashing (PIH) matrix using bicubic interpolation zooming. The difference between the perceptual hash matrices of each pair of images is calculated, and the clustering distance between the two corresponding gene sequences is represented by this difference. In this paper, we aligned and analyzed several typical genome sequence datasets, including mammalian mitochondrial genes, human immunodeficiency virus 1 (HIV-1) and hepatitis E virus (HEV), to build their evolutionary trees. Experimental results showed that our method combining PIH with FCGR (FCGR-PIH) has classification accuracy similar to that of the classical Clustal W sequence alignment method. Furthermore, 25 complete mitochondrial DNA sequences of cichlid fishes and 27 Escherichia coli/Shigella full genome sequences were selected from the AFproject test platform for tests. The performance benchmark rankings demonstrate the effectiveness of the FCGR-PIH algorithm and its potential for large-scale genome sequence analysis.
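The idea of hashing an FCGR image and using the hash difference as a distance can be sketched as below. This is a simplified stand-in, not the FCGR-PIH algorithm: it uses an average-hash variant in which block-mean downscaling replaces the bicubic interpolation zooming mentioned in the abstract, and the Hamming distance between bit lists stands in for the hash-matrix difference. The function names and the requirement that the grid side be divisible by `size` are assumptions of this example.

```python
def average_hash(grid, size=2):
    """Downscale a square count matrix to size x size block means,
    then threshold each cell at the overall mean to get a bit list."""
    n = len(grid)
    block = n // size  # assumes n is divisible by size
    cells = []
    for r in range(size):
        for c in range(size):
            total = sum(
                grid[i][j]
                for i in range(r * block, (r + 1) * block)
                for j in range(c * block, (c + 1) * block)
            )
            cells.append(total / (block * block))
    mean = sum(cells) / len(cells)
    return [1 if value > mean else 0 for value in cells]


def hamming(h1, h2):
    """Number of positions at which two equal-length hashes differ."""
    return sum(a != b for a, b in zip(h1, h2))


# Two toy 4x4 "FCGR" matrices with mass in opposite corners.
a = [[4, 4, 0, 0], [4, 4, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]]
b = [[0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 4, 4], [0, 0, 4, 4]]
print(hamming(average_hash(a), average_hash(b)))
```

A pairwise matrix of such distances could then be passed to any standard tree-building or clustering routine.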
Affiliation(s)
- Haiming Ni: College of Science, Northeast Forestry University, Hexing Road 26, Harbin, Heilongjiang Province, 150040, PR China
- Hongbo Mu: College of Science, Northeast Forestry University, Hexing Road 26, Harbin, Heilongjiang Province, 150040, PR China
- Dawei Qi: College of Science, Northeast Forestry University, Hexing Road 26, Harbin, Heilongjiang Province, 150040, PR China
|
20
|
Lee B, Smith DK, Guan Y. Alignment free sequence comparison methods and reservoir host prediction. Bioinformatics 2021; 37:3337-3342. [PMID: 33964132 PMCID: PMC8135978 DOI: 10.1093/bioinformatics/btab338] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/15/2021] [Revised: 03/29/2021] [Accepted: 04/30/2021] [Indexed: 11/19/2022] Open
Abstract
Motivation: The emergence and subsequent pandemic of the SARS-CoV-2 virus raised urgent questions about its origin and, particularly, its reservoir host. These types of questions are long-standing problems in the management of emerging infectious diseases and are linked to virus discovery programs and the prediction of viruses that are likely to become zoonotic. Conventional means to identify reservoir hosts have relied on surveillance, experimental studies and phylogenetics. More recently, machine learning approaches have been applied to generate tools to swiftly predict reservoir hosts from sequence data. Results: Here, we extend a recent work that combined sequence alignment and a mixture of alignment-free approaches using a gradient boosting machines (GBMs) machine learning model, which integrates genomic traits (GT) and phylogenetic neighbourhood (PN) signatures to predict reservoir hosts. We add a more uniform approach by applying Machine Learning with Digital Signal Processing (MLDSP)-based structural patterns (M-SP). The extended model was applied to an existing virus/reservoir host dataset and to the SARS-CoV-2 and related viruses and generated an improvement in prediction accuracy. Availability and implementation: The source code used in this work is freely available at https://github.com/bill1167/hostgbms. Supplementary information: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Bill Lee: State Key Laboratory of Emerging Infectious Diseases, School of Public Health, Li Ka Shing Faculty of Medicine, The University of Hong Kong, 21 Sassoon Rd., Pok Fu Lam, Hong Kong
- David K Smith: State Key Laboratory of Emerging Infectious Diseases, School of Public Health, Li Ka Shing Faculty of Medicine, The University of Hong Kong, 21 Sassoon Rd., Pok Fu Lam, Hong Kong
- Yi Guan: State Key Laboratory of Emerging Infectious Diseases, School of Public Health, Li Ka Shing Faculty of Medicine, The University of Hong Kong, 21 Sassoon Rd., Pok Fu Lam, Hong Kong; Joint Institute of Virology (Shantou University and The University of Hong Kong), Guangdong-Hongkong Joint Laboratory of Emerging Infectious Diseases, Shantou University, Shantou, P. R. China
|
21
|
Bielińska-Wąż D, Wąż P. Non-standard bioinformatics characterization of SARS-CoV-2. Comput Biol Med 2021; 131:104247. [PMID: 33611129 PMCID: PMC7966820 DOI: 10.1016/j.compbiomed.2021.104247] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Received: 12/20/2020] [Revised: 01/22/2021] [Accepted: 01/26/2021] [Indexed: 12/16/2022]
Abstract
A non-standard bioinformatics method, 4D-Dynamic Representation of DNA/RNA Sequences, aiming at an analysis of the information available in nucleotide databases, has been formulated. The sequences are represented by sets of "material points" in a 4D space - 4D-dynamic graphs. The graphs representing the sequences are treated as "rigid bodies" and characterized by values analogous to the ones used in classical dynamics. As the graphical representations of the sequences, the projections of the graphs into 2D and 3D spaces are used. The method has been applied to an analysis of the complete genome sequences of the 2019 novel coronavirus. As a result, 2D and 3D classification maps are obtained. The coordinate axes in the maps correspond to the values derived from the exact formulas characterizing the graphs: the coordinates of the centers of mass and the 4D moments of inertia. The points in the maps represent sequences and their coordinates are used as the classifiers. The main result of this work has been derived from the 3D classification maps. The distribution of clusters of points that emerged in these maps supports the hypothesis that SARS-CoV-2 may have originated in bats and in pangolins. Pilot calculations for Zika virus sequence data prove that the proposed approach is also applicable to a description of the time evolution of genome sequences of viruses.
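The descriptors named in the abstract (centers of mass and moments of inertia of a point set in 4D) can be illustrated with a small sketch. Several choices here are assumptions made for the example and not taken from the paper: each base is assigned a unit step along its own axis, every point carries unit mass, and only the diagonal elements of the inertia tensor about the center of mass are computed rather than principal moments.

```python
# Hypothetical base-to-axis assignment; the published method may differ.
STEPS = {"A": (1, 0, 0, 0), "C": (0, 1, 0, 0), "G": (0, 0, 1, 0), "T": (0, 0, 0, 1)}


def graph_points(seq):
    """Walk through 4D space, adding one unit-mass point per base."""
    pos = [0, 0, 0, 0]
    points = []
    for base in seq.upper().replace("U", "T"):  # treat RNA like DNA
        pos = [p + s for p, s in zip(pos, STEPS[base])]
        points.append(tuple(pos))
    return points


def center_of_mass(points):
    """Mean position of the unit-mass points along each of the 4 axes."""
    n = len(points)
    return tuple(sum(p[j] for p in points) / n for j in range(4))


def diagonal_inertia(points):
    """Diagonal inertia-tensor elements about the center of mass:
    I_jj = sum over points of (r^2 - x_j^2)."""
    com = center_of_mass(points)
    rel = [tuple(p[j] - com[j] for j in range(4)) for p in points]
    return tuple(
        sum(sum(c * c for c in r) - r[j] ** 2 for r in rel) for j in range(4)
    )


pts = graph_points("ACGT")
print(center_of_mass(pts), diagonal_inertia(pts))
```

Each sequence is thereby reduced to a handful of numbers, which is what allows sequences to be plotted as single points on a classification map.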
Affiliation(s)
- Dorota Bielińska-Wąż: Department of Radiological Informatics and Statistics, Medical University of Gdańsk, 80-210, Gdańsk, Poland
- Piotr Wąż: Department of Nuclear Medicine, Medical University of Gdańsk, 80-210, Gdańsk, Poland
|
22
|
Kania A, Sarapata K. The robustness of the chaos game representation to mutations and its application in free-alignment methods. Genomics 2021; 113:1428-1437. [PMID: 33713823 DOI: 10.1016/j.ygeno.2021.03.015] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Received: 10/22/2020] [Revised: 01/22/2021] [Accepted: 03/05/2021] [Indexed: 02/06/2023]
Abstract
Numerical representation of biological sequences plays an important role in bioinformatics and has many practical applications. One of the most popular approaches is the chaos game representation. In this paper, the authors propose a novel look at chaos game construction: an analytical description of this procedure. This description makes it possible to build more general number sequences using different weight functions. The authors suggest three conditions that these functions should satisfy. Additionally, they present some criteria to compare them and check whether they provide a unique representation. One of the most important advantages of our approach is the possibility of constructing a description that is less sensitive to mutations and, as a result, gives more reliable values for alignment-free phylogenetic tree construction. Finally, the authors applied the DFT method using four types of functions and compared the obtained results using the BLAST tool.
Affiliation(s)
- Adrian Kania: Department of Computational Biophysics and Bioinformatics, Faculty of Biochemistry, Biophysics and Biotechnology, Jagiellonian University, Gronostajowa 7, 30-387 Cracow, Poland
- Krzysztof Sarapata: Department of Computational Biophysics and Bioinformatics, Faculty of Biochemistry, Biophysics and Biotechnology, Jagiellonian University, Gronostajowa 7, 30-387 Cracow, Poland
|
23
|
Abstract
An author unconsciously encodes in the written text a certain style that is often difficult to recognize. Still, there are many computational means developed for this purpose that take into account various features, from lexical and character-based attributes to syntactic or semantic ones. We propose an approach that starts from the character level and uses chaos game representation to illustrate documents like images which are subsequently classified by a deep learning algorithm. The experiments are made on three data sets and the outputs are comparable to the results from the literature. The study also verifies the suitability of the method for small data sets and whether image augmentation can improve the classification efficiency.
|
24
|
Wekesa JS, Meng J, Luan Y. A deep learning model for plant lncRNA-protein interaction prediction with graph attention. Mol Genet Genomics 2020; 295:1091-1102. [DOI: 10.1007/s00438-020-01682-w] [Citation(s) in RCA: 16] [Impact Index Per Article: 3.2] [Received: 04/04/2020] [Accepted: 05/01/2020] [Indexed: 02/06/2023]
|