1
|
van Zyl DJ, Dunaiski M, Tegally H, Baxter C, de Oliveira T, Xavier JS. Alignment-free viral sequence classification at scale. BMC Genomics 2025; 26:389. [PMID: 40251515 PMCID: PMC12007369 DOI: 10.1186/s12864-025-11554-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/24/2024] [Accepted: 04/01/2025] [Indexed: 04/20/2025] Open
Abstract
BACKGROUND The rapid increase in nucleotide sequence data generated by next-generation sequencing (NGS) technologies demands efficient computational tools for sequence comparison. Alignment-free (AF) methods offer a scalable alternative to traditional alignment-based approaches such as BLAST. This study evaluates alignment-free methods as scalable and rapid alternatives for viral sequence classification, focusing on identifying techniques that maintain high accuracy and efficiency when applied to extremely large datasets. RESULTS We employed six established AF techniques to extract feature vectors from viral genomes, which were subsequently used to train Random Forest classifiers. Our primary dataset comprises 297,186 SARS-CoV- 2 nucleotide sequences, categorized into 3502 distinct lineages. Furthermore, we validated our models using dengue and HIV sequences to demonstrate robustness across different viral datasets. Our AF classifiers achieved 97.8% accuracy on the SARS-CoV- 2 test set, and 99.8% and 89.1% accuracy on dengue and HIV test sets, respectively. CONCLUSION Despite the high-class dimensionality, we show that word-based AF methods effectively represent viral sequences. Our study highlights the practical advantages of AF techniques, including significantly faster processing compared to alignment-based methods and the ability to classify sequences using modest computational resources.
Collapse
Affiliation(s)
- Daniel J van Zyl
- Centre for Epidemic Response and Innovation (CERI), School of Data Science and Computational Thinking, Stellenbosch University, Stellenbosch, South Africa.
- Computer Science Division, Department of Mathematical Sciences, Faculty of Science, Stellenbosch University, Stellenbosch, South Africa.
| | - Marcel Dunaiski
- Computer Science Division, Department of Mathematical Sciences, Faculty of Science, Stellenbosch University, Stellenbosch, South Africa
| | - Houriiyah Tegally
- Centre for Epidemic Response and Innovation (CERI), School of Data Science and Computational Thinking, Stellenbosch University, Stellenbosch, South Africa
| | - Cheryl Baxter
- Centre for Epidemic Response and Innovation (CERI), School of Data Science and Computational Thinking, Stellenbosch University, Stellenbosch, South Africa
- Centre for the AIDS Programme of Research in South Africa (CAPRISA), Durban, South Africa
| | - Tulio de Oliveira
- Centre for Epidemic Response and Innovation (CERI), School of Data Science and Computational Thinking, Stellenbosch University, Stellenbosch, South Africa
- Centre for the AIDS Programme of Research in South Africa (CAPRISA), Durban, South Africa
- Kwazulu-Natal Research Innovation and Sequencing Platform (KRISP), Nelson R Mandela School of Medicine, University of Kwazulu-Natal, Durban, South Africa
- Department of Global Health, University of Washington, Seattle, USA
| | - Joicymara S Xavier
- Centre for Epidemic Response and Innovation (CERI), School of Data Science and Computational Thinking, Stellenbosch University, Stellenbosch, South Africa
- Institute of Agricultural Sciences, Universidade Federal dos Vales do Jequitinhonha e Mucuri (UFVJM), Unaí, Brazil
- Institute of Biological Sciences, Universidade Federal de Minas Gerais (UFMG), Belo Horizonte, Brazil
| |
Collapse
|
2
|
van Zyl DJ, Dunaiski M, Tegally H, Baxter C, The INFORM Africa research study group, de Oliveira T, Xavier JS. Alignment-Free Viral Sequence Classification at Scale. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2024:2024.12.10.627186. [PMID: 39713356 PMCID: PMC11661207 DOI: 10.1101/2024.12.10.627186] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Download PDF] [Subscribe] [Scholar Register] [Indexed: 12/24/2024]
Abstract
Background The rapid increase in nucleotide sequence data generated by next-generation sequencing (NGS) technologies demands efficient computational tools for sequence comparison. Alignment-based methods, such as BLAST, are increasingly overwhelmed by the scale of contemporary datasets due to their high computational demands for classification. This study evaluates alignment-free (AF) methods as scalable and rapid alternatives for viral sequence classification, focusing on identifying techniques that maintain high accuracy and efficiency when applied to extremely large datasets. Results We employed six established AF techniques to extract feature vectors from viral genomes, which were subsequently used to train Random Forest classifiers. Our primary dataset comprises 297,186 SARS-CoV-2 nucleotide sequences, categorized into 3502 distinct lineages. Furthermore, we validated our models using dengue and HIV sequences to demonstrate robustness across different viral datasets. Our AF classifiers achieved 97.8% accuracy on the SARS-CoV-2 test set, and 99.8% and 89.1% accuracy on dengue and HIV test sets, respectively. Conclusion Despite the high-class dimensionality, we show that word-based AF methods effectively represent viral sequences. Our study highlights the practical advantages of AF techniques, including significantly faster processing compared to alignment-based methods and the ability to classify sequences using modest computational resources.
Collapse
Affiliation(s)
- Daniel J. van Zyl
- Centre for Epidemic Response and Innovation (CERI), School of Data Science and Computational Thinking, Stellenbosch University, Stellenbosch, South Africa
- Computer Science Division, Department of Mathematical Sciences, Faculty of Science, Stellenbosch University, Stellenbosch, South Africa
| | - Marcel Dunaiski
- Computer Science Division, Department of Mathematical Sciences, Faculty of Science, Stellenbosch University, Stellenbosch, South Africa
| | - Houriiyah Tegally
- Centre for Epidemic Response and Innovation (CERI), School of Data Science and Computational Thinking, Stellenbosch University, Stellenbosch, South Africa
| | - Cheryl Baxter
- Centre for Epidemic Response and Innovation (CERI), School of Data Science and Computational Thinking, Stellenbosch University, Stellenbosch, South Africa
- Centre for the AIDS Programme of Research in South Africa (CAPRISA), Durban, South Africa
| | | | - Tulio de Oliveira
- Centre for Epidemic Response and Innovation (CERI), School of Data Science and Computational Thinking, Stellenbosch University, Stellenbosch, South Africa
- Centre for the AIDS Programme of Research in South Africa (CAPRISA), Durban, South Africa
- KwaZulu-Natal Research Innovation and Sequencing Platform (KRISP), Nelson R Mandela School of Medicine, University of KwaZulu-Natal, Durban, South Africa
- Department of Global Health, University of Washington, Seattle, USA
| | - Joicymara S. Xavier
- Centre for Epidemic Response and Innovation (CERI), School of Data Science and Computational Thinking, Stellenbosch University, Stellenbosch, South Africa
- Institute of Agricultural Sciences, Universidade Federal dos Vales do Jequitinhonha e Mucuri (UFVJM), Unaí, Brazil
- Institute of Biological Sciences, Universidade Federal de Minas Gerais (UFMG), Belo Horizonte, Brazil
| |
Collapse
|
3
|
Wang T, Yu ZG, Li J. CGRWDL: alignment-free phylogeny reconstruction method for viruses based on chaos game representation weighted by dynamical language model. Front Microbiol 2024; 15:1339156. [PMID: 38572227 PMCID: PMC10987876 DOI: 10.3389/fmicb.2024.1339156] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/15/2023] [Accepted: 02/23/2024] [Indexed: 04/05/2024] Open
Abstract
Traditional alignment-based methods meet serious challenges in genome sequence comparison and phylogeny reconstruction due to their high computational complexity. Here, we propose a new alignment-free method to analyze the phylogenetic relationships (classification) among species. In our method, the dynamical language (DL) model and the chaos game representation (CGR) method are used to characterize the frequency information and the context information of k-mers in a sequence, respectively. Then for each DNA sequence or protein sequence in a dataset, our method converts the sequence into a feature vector that represents the sequence information based on CGR weighted by the DL model to infer phylogenetic relationships. We name our method CGRWDL. Its performance was tested on both DNA and protein sequences of 8 datasets of viruses to construct the phylogenetic trees. We compared the Robinson-Foulds (RF) distance between the phylogenetic tree constructed by CGRWDL and the reference tree by other advanced methods for each dataset. The results show that the phylogenetic trees constructed by CGRWDL can accurately classify the viruses, and the RF scores between the trees and the reference trees are smaller than that with other methods.
Collapse
Affiliation(s)
- Ting Wang
- National Center for Applied Mathematics in Hunan, Xiangtan University, Xiangtan, Hunan, China
- Key Laboratory of Intelligent Computing and Information Processing of Ministry of Education, Xiangtan University, Xiangtan, Hunan, China
| | - Zu-Guo Yu
- National Center for Applied Mathematics in Hunan, Xiangtan University, Xiangtan, Hunan, China
- Key Laboratory of Intelligent Computing and Information Processing of Ministry of Education, Xiangtan University, Xiangtan, Hunan, China
| | - Jinyan Li
- School of Computer Science and Control Engineering, Shenzhen Institute of Advanced Technology, Shenzhen, Guangdong, China
| |
Collapse
|
4
|
Zhou DD, Li HZ, Wang W, Kuang L. Changes in oscillatory patterns of microstate sequence in patients with first-episode psychosis. Sci Data 2024; 11:38. [PMID: 38182586 PMCID: PMC10770397 DOI: 10.1038/s41597-023-02892-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/29/2023] [Accepted: 12/27/2023] [Indexed: 01/07/2024] Open
Abstract
We aimed to utilize chaos game representation (CGR) for the investigation of microstate sequences and explore its potential as neurobiomarkers for psychiatric disorders. We applied our proposed method to a public dataset including 82 patients with first-episode psychosis (FEP) and 61 control subjects. Two time series were constructed: one using the microstate spacing distance in CGR and the other using complex numbers representing the microstate coordinates in CGR. Power spectral features of both time series and frequency matrix CGR (FCGR) were compared between groups and employed in a machine learning application. The four canonical microstates (A, B, C, and D) were identified using both shared and separate templates. Our results showed the microstate oscillatory pattern exhibited alterations in the FEP group. Using oscillatory features improved machine learning performance compared with classical features and FCGR. This study opens up new avenues for exploring the use of CGR in analyzing EEG microstate sequences. Features derived from microstate sequence CGR offer fine-grained neurobiomarkers for psychiatric disorders.
Collapse
Affiliation(s)
- Dong-Dong Zhou
- Mental Health Center, University-Town Hospital of Chongqing Medical University, Chongqing, China.
| | - Hong-Zhi Li
- Mental Health Center, University-Town Hospital of Chongqing Medical University, Chongqing, China
| | - Wo Wang
- Mental Health Center, University-Town Hospital of Chongqing Medical University, Chongqing, China
| | - Li Kuang
- Mental Health Center, University-Town Hospital of Chongqing Medical University, Chongqing, China.
- Department of Psychiatry, The First Affiliated Hospital of Chongqing Medical University, Chongqing, China.
| |
Collapse
|
5
|
Dey S, Ghosh P, Das S. Positional difference and Frequency (PdF) based alignment-free technique for genome sequence comparison. J Biomol Struct Dyn 2023; 42:12660-12688. [PMID: 37885236 DOI: 10.1080/07391102.2023.2272748] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/07/2023] [Accepted: 09/19/2023] [Indexed: 10/28/2023]
Abstract
In the field of computational biology, genome sequence comparison among different species is essential and has applications in both the research and scientific fields. Owing to the lengthy processing time and large number of data sets, the alignment-based approaches are unsuitable and ineffective. Therefore, alignment-free techniques have obtained popularity for acquiring proper sequence clustering and evolutionary relationship among species. In this paper, a complete bipartite graph based Positional difference and Frequency (PdF) vector descriptor is introduced. Positional difference and Frequency, two parameters, are applied to the genome sequence to create a 16- dimensional vector descriptor using the di-nucleotide representation of genome sequence. Subsequently, a distance matrix is calculated to construct the phylogenetic trees for different data sets of mammals and viruses. The achieved outcomes are compared with the phylogenetic trees of the earlier methods viz. the FFP method, the ClustalW method, the MEV method, the PCNV method and the FIS method. In most instances, the proposed method produces more precise outcomes than the preceding techniques and has potential for genome sequence comparison on both the equal and unequal length of data-sets.Communicated by Ramaswamy H. Sarma.
Collapse
Affiliation(s)
- Sudeshna Dey
- Computer Science and Engineering, Narula Institute of Technology, Kolkata, India
| | - Papri Ghosh
- Computer Science and Engineering, Narula Institute of Technology, Kolkata, India
| | - Subhram Das
- Computer Science and Engineering, Narula Institute of Technology, Kolkata, India
| |
Collapse
|
6
|
Dey S, Das S, Bhattacharya DK. Biochemical Property Based Positional Matrix: A New Approach Towards Genome Sequence Comparison. J Mol Evol 2023; 91:93-131. [PMID: 36587178 PMCID: PMC9805373 DOI: 10.1007/s00239-022-10082-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/08/2022] [Accepted: 12/01/2022] [Indexed: 01/01/2023]
Abstract
The growth of the genome sequence has become one of the emerging areas in the study of bioinformatics. It has led to an excessive demand for researchers to develop advanced methodologies for evolutionary relationships among species. The alignment-free methods have been proved to be more efficient and appropriate related to time and space than existing alignment-based methods for sequence analysis. In this study, a new alignment-free genome sequence comparison technique is proposed based on the biochemical properties of nucleotides. Each genome sequence can be distributed in four parameters to represent a 21-dimensional numerical descriptor using the Positional Matrix. To substantiate the proposed method, phylogenetic trees are constructed on the viral and mammalian datasets by applying the UPGMA/NJ clustering method. Further, the results of this method are compared with the results of the Feature Frequency Profiles method, the Positional Correlation Natural Vector method, the Graph-theoretic method, the Multiple Encoding Vector method, and the Fuzzy Integral Similarity method. In most cases, it is found that the present method produces more accurate results than the prior methods. Also, in the present method, the execution time for computation is comparatively small.
Collapse
Affiliation(s)
- Sudeshna Dey
- grid.440742.10000 0004 1799 6713Computer Science and Engineering, Narula Institute of Technology, Kolkata, 700109 India
| | - Subhram Das
- grid.440742.10000 0004 1799 6713Computer Science and Engineering, Narula Institute of Technology, Kolkata, 700109 India
| | - D. K. Bhattacharya
- grid.59056.3f0000 0001 0664 9773Pure Mathematics, Calcutta University, Kolkata, 700019 India
| |
Collapse
|
7
|
Swain MT, Vickers M. Interpreting alignment-free sequence comparison: what makes a score a good score? NAR Genom Bioinform 2022; 4:lqac062. [PMID: 36071721 PMCID: PMC9442500 DOI: 10.1093/nargab/lqac062] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/01/2022] [Revised: 07/01/2022] [Accepted: 08/16/2022] [Indexed: 11/13/2022] Open
Abstract
Alignment-free methods are alternatives to alignment-based methods when searching sequence data sets. The output from an alignment-free sequence comparison is a similarity score, the interpretation of which is not straightforward. We propose objective functions to interpret and calibrate outputs from alignment-free searches, noting that different objective functions are necessary for different biological contexts. This leads to advantages: visualising and comparing score distributions, including those from true positives, may be a relatively simple method to gain insight into the performance of different metrics. Using an empirical approach with both DNA and protein sequences, we characterise different similarity score distributions generated under different parameters. In particular, we demonstrate how sequence length can affect the scores. We show that scores of true positive sequence pairs may correlate significantly with their mean length; and even if the correlation is weak, the relative difference in length of the sequence pair may significantly reduce the effectiveness of alignment-free metrics. Importantly, we show how objective functions can be used with test data to accurately estimate the probability of true positives. This can significantly increase the utility of alignment-free approaches. Finally, we have developed a general-purpose software tool called KAST for use in high-throughput workflows on Linux clusters.
Collapse
Affiliation(s)
- Martin T Swain
- Department of Life Sciences, Aberystwyth University , Penglais, Aberystwyth, Ceredigion, SY23 3DA, UK
| | - Martin Vickers
- The John Innes Centre, Norwich Research Park , Norwich NR4 7UH, UK
| |
Collapse
|
8
|
Sun N, Zhao X, Yau SST. An efficient numerical representation of genome sequence: natural vector with covariance component. PeerJ 2022; 10:e13544. [PMID: 35729905 PMCID: PMC9206847 DOI: 10.7717/peerj.13544] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/16/2021] [Accepted: 05/16/2022] [Indexed: 01/17/2023] Open
Abstract
Background The characterization and comparison of microbial sequences, including archaea, bacteria, viruses and fungi, are very important to understand their evolutionary origin and the population relationship. Most methods are limited by the sequence length and lack of generality. The purpose of this study is to propose a general characterization method, and to study the classification and phylogeny of the existing datasets. Methods We present a new alignment-free method to represent and compare biological sequences. By adding the covariance between each two nucleotides, the new 18-dimensional natural vector successfully describes 24,250 genomic sequences and 95,542 DNA barcode sequences. The new numerical representation is used to study the classification and phylogenetic relationship of microbial sequences. Results First, the classification results validate that the six-dimensional covariance vector is necessary to characterize sequences. Then, the 18-dimensional natural vector is further used to conduct the similarity relationship between giant virus and archaea, bacteria, other viruses. The nearest distance calculation results reflect that the giant viruses are closer to bacteria in distribution of four nucleotides. The phylogenetic relationships of the three representative families, Mimiviridae, Pandoraviridae and Marsellieviridae from giant viruses are analyzed. The trees show that ten sequences of Mimiviridae are clustered with Pandoraviridae, and Mimiviridae is closer to the root of the tree than Marsellieviridae. The new developed alignment-free method can be computed very fast, which provides an effective numerical representation for the sequence of microorganisms.
Collapse
Affiliation(s)
- Nan Sun
- Department of Mathematical Sciences, Tsinghua University, Beijing, China
| | - Xin Zhao
- Beijing Electronic Science and Technology Institute, Beijing, China
| | - Stephen S.-T. Yau
- Department of Mathematical Sciences, Tsinghua University, Beijing, China,Yanqi Lake Beijing Institute of Mathematical Sciences and Applications, Beijing, China
| |
Collapse
|
9
|
Löchel HF, Heider D. Chaos game representation and its applications in bioinformatics. Comput Struct Biotechnol J 2021; 19:6263-6271. [PMID: 34900136 PMCID: PMC8636998 DOI: 10.1016/j.csbj.2021.11.008] [Citation(s) in RCA: 28] [Impact Index Per Article: 7.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/26/2021] [Revised: 11/04/2021] [Accepted: 11/05/2021] [Indexed: 11/18/2022] Open
Abstract
Chaos game representation (CGR), a milestone in graphical bioinformatics, has become a powerful tool regarding alignment-free sequence comparison and feature encoding for machine learning. The algorithm maps a sequence to 2-dimensional space, while an extension of the CGR, the so-called frequency matrix representation (FCGR), transforms sequences of different lengths into equal-sized images or matrices. The CGR is a generalized Markov chain and includes various properties, which allow a unique representation of a sequence. Therefore, it has a broad spectrum of applications in bioinformatics, such as sequence comparison and phylogenetic analysis and as an encoding of sequences for machine learning. This review introduces the construction of CGRs and FCGRs, their applications on DNA and proteins, and gives an overview of recent applications and progress in bioinformatics.
Collapse
Affiliation(s)
- Hannah Franziska Löchel
- Department of Mathematics and Computer Science, University of Marburg, Hans-Meerwein-Str. 6, D-35032 Marburg, Germany
| | - Dominik Heider
- Department of Mathematics and Computer Science, University of Marburg, Hans-Meerwein-Str. 6, D-35032 Marburg, Germany
| |
Collapse
|
10
|
VanWallendael A, Alvarez M. Alignment-free methods for polyploid genomes: Quick and reliable genetic distance estimation. Mol Ecol Resour 2021; 22:612-622. [PMID: 34478242 DOI: 10.1111/1755-0998.13499] [Citation(s) in RCA: 9] [Impact Index Per Article: 2.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/04/2020] [Accepted: 08/20/2021] [Indexed: 01/10/2023]
Abstract
Polyploid genomes pose several inherent challenges to population genetic analyses. While alignment-based methods are fundamentally limited in their applicability to polyploids, alignment-free methods bypass most of these limits. We investigated the use of Mash, a k-mer analysis tool that uses the MinHash method to reduce complexity in large genomic data sets, for basic population genetic analyses of polyploid sequences. We measured the degree to which Mash correctly estimated pairwise genetic distance in simulated haploid and polyploid short-read sequences with various levels of missing data. Mash-based estimates of genetic distance were comparable to alignment-based estimates, and were less impacted by missing data. We also used Mash to analyse publicly available short-read data for three polyploid and one diploid species, then compared Mash results to published results. For both simulated and real data, Mash accurately estimated pairwise genetic differences for polyploids as well as diploids as much as 476 times faster than alignment-based methods, though we found that Mash genetic distance estimates could be biased by per-sample read depth. Mash may be a particularly useful addition to the toolkit of polyploid geneticists for rapid confirmation of alignment-based results and for basic population genetics in reference-free systems or those with only poor-quality sequence data available.
Collapse
Affiliation(s)
- Acer VanWallendael
- Department of Plant Biology, Michigan State University, East Lansing, MI, USA
| | - Mariano Alvarez
- Biology Department, Wesleyan University, Middletown, CT, USA
| |
Collapse
|
11
|
Symbolic Sequence Classification in the Fractal Space. IEEE TRANSACTIONS ON EMERGING TOPICS IN COMPUTATIONAL INTELLIGENCE 2021. [DOI: 10.1109/tetci.2018.2876528] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/06/2022]
|
12
|
Mahmood Aamir K, Bilal M, Ramzan M, Attique Khan M, Nam Y, Kadry S. Classification of Retroviruses Based on Genomic Data Using RVGC. COMPUTERS, MATERIALS & CONTINUA 2021; 69:3829-3844. [DOI: 10.32604/cmc.2021.017835] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Subscribe] [Scholar Register] [Received: 02/13/2021] [Accepted: 04/17/2021] [Indexed: 08/25/2024]
|
13
|
Huang J, Dai Q, Yao Y, He PA. A Generalized Iterative Map for Analysis of Protein Sequences. Comb Chem High Throughput Screen 2020; 25:381-391. [PMID: 33045963 DOI: 10.2174/1386207323666201012142318] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/17/2020] [Revised: 07/30/2020] [Accepted: 08/09/2020] [Indexed: 11/22/2022]
Abstract
AIM AND OBJECTIVE The similarities comparison of biological sequences is an important task in bioinformatics. The methods of the similarities comparison for biological sequences are divided into two classes: sequence alignment method and alignment-free method. The graphical representation of biological sequences is a kind of alignment-free method, which constitutes a tool for analyzing and visualizing the biological sequences. In this article, a generalized iterative map of protein sequences was suggested to analyze the similarities of biological sequences. MATERIALS AND METHODS Based on the normalized physicochemical indexes of 20 amino acids, each amino acid can be mapped into a point in 5D space. A generalized iterative function system was introduced to outline a generalized iterative map of protein sequences, which can not only reflect various physicochemical properties of amino acids but also incorporate with different compression ratios of the component of a generalized iterative map. Several properties were proved to illustrate the advantage of the generalized iterative map. The mathematical description of the generalized iterative map was suggested to compare the similarities and dissimilarities of protein sequences. Based on this method, similarities/dissimilarities were compared among ND5 protein sequences, as well as ND6 protein sequences of ten different species. RESULTS By correlation analysis, the ClustalW results were compared with our similarity/dissimilarity results and other graphical representation results to show the utility of our approach. The comparison results show that our approach has better correlations with ClustalW for all species than other approaches and illustrate the effectiveness of our approach. CONCLUSION Two examples show that our method not only has good performances and effects in the similarity/dissimilarity analysis of protein sequences but also does not require complex computation.
Collapse
Affiliation(s)
- Jiahe Huang
- School of Science, Zhejiang Sci-Tech University, Hangzhou,China
| | - Qi Dai
- College of Life Science, Zhejiang Sci-Tech University, Hangzhou,China
| | - Yuhua Yao
- School of Mathematics and Statistics, Hainan Normal University, Haikou,China
| | - Ping-An He
- School of Science, Zhejiang Sci-Tech University, Hangzhou,China
| |
Collapse
|
14
|
Olyaee MH, Pirgazi J, Khalifeh K, Khanteymoori A. RCOVID19: Recurrence-based SARS-CoV-2 features using chaos game representation. Data Brief 2020; 32:106144. [PMID: 32835040 PMCID: PMC7411429 DOI: 10.1016/j.dib.2020.106144] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/30/2020] [Revised: 07/28/2020] [Accepted: 08/04/2020] [Indexed: 11/28/2022] Open
Abstract
Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) is responsible for the COVID-19 pandemic. It was first detected in China and was rapidly spread to other countries. Several thousands of whole genome sequences of SARS-CoV-2 have been reported and it is important to compare them and identify distinctive evolutionary/mutant markers. Utilizing chaos game representation (CGR) as well as recurrence quantification analysis (RQA) as a powerful nonlinear analysis technique, we proposed an effective process to extract several valuable features from genomic sequences of SARS-CoV-2. The represented features enable us to compare genomic sequences with different lengths. The provided dataset involves totally 18 RQA-based features for 4496 instances of SARS-CoV-2.
Collapse
Affiliation(s)
- Mohammad Hossein Olyaee
- Faculty of Engineering, Department of Computer Engineering, University of Gonabad, Gonabad, Iran
| | - Jamshid Pirgazi
- Department of Electrical and Computer Engineering, University of Science and Technology of Mazandaran, Behshahr, Iran
| | - Khosrow Khalifeh
- Department of Biology, Faculty of Sciences, University of Zanjan, Zanjan, Iran
| | | |
Collapse
|
15
|
Sun Z, Pei S, He RL, Yau SST. A novel numerical representation for proteins: Three-dimensional Chaos Game Representation and its Extended Natural Vector. Comput Struct Biotechnol J 2020; 18:1904-1913. [PMID: 32774785 PMCID: PMC7390779 DOI: 10.1016/j.csbj.2020.07.004] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.2] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/28/2020] [Revised: 07/04/2020] [Accepted: 07/05/2020] [Indexed: 12/16/2022] Open
Abstract
Chaos Game Representation (CGR) was first proposed to be an image representation method of DNA and have been extended to the case of other biological macromolecules. Compared with the CGR images of DNA, where DNA sequences are converted into a series of points in the unit square, the existing CGR images of protein are not so elegant in geometry and the implications of the distribution of points in the CGR image are not so obvious. In this study, by naturally distributing the twenty amino acids on the vertices of a regular dodecahedron, we introduce a novel three-dimensional image representation of protein sequences with CGR method. We also associate each CGR image with a vector in high dimensional Euclidean space, called the extended natural vector (ENV), in order to analyze the information contained in the CGR images. Based on the results of protein classification and phylogenetic analysis, our method could serve as a precise method to discover biological relationships between proteins.
Collapse
Affiliation(s)
- Zeju Sun
- Department of Mathematical Sciences, Tsinghua University, Beijing, PR China
| | - Shaojun Pei
- Department of Mathematical Sciences, Tsinghua University, Beijing, PR China
| | - Rong Lucy He
- Department of Biological Sciences, Chicago State University, Chicago, IL 60628, USA
| | - Stephen S-T Yau
- Department of Mathematical Sciences, Tsinghua University, Beijing, PR China
| |
Collapse
|
16
|
Chanda P, Costa E, Hu J, Sukumar S, Van Hemert J, Walia R. Information Theory in Computational Biology: Where We Stand Today. ENTROPY (BASEL, SWITZERLAND) 2020; 22:E627. [PMID: 33286399 PMCID: PMC7517167 DOI: 10.3390/e22060627] [Citation(s) in RCA: 22] [Impact Index Per Article: 4.4] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 04/30/2020] [Revised: 05/31/2020] [Accepted: 06/03/2020] [Indexed: 12/30/2022]
Abstract
"A Mathematical Theory of Communication" was published in 1948 by Claude Shannon to address the problems in the field of data compression and communication over (noisy) communication channels. Since then, the concepts and ideas developed in Shannon's work have formed the basis of information theory, a cornerstone of statistical learning and inference, and has been playing a key role in disciplines such as physics and thermodynamics, probability and statistics, computational sciences and biological sciences. In this article we review the basic information theory based concepts and describe their key applications in multiple major areas of research in computational biology-gene expression and transcriptomics, alignment-free sequence comparison, sequencing and error correction, genome-wide disease-gene association mapping, metabolic networks and metabolomics, and protein sequence, structure and interaction analysis.
Collapse
Affiliation(s)
- Pritam Chanda
- Corteva Agriscience™, Indianapolis, IN 46268, USA
- Computer and Information Science, Indiana University-Purdue University, Indianapolis, IN 46202, USA
| | - Eduardo Costa
- Corteva Agriscience™, Mogi Mirim, Sao Paulo 13801-540, Brazil
| | - Jie Hu
- Corteva Agriscience™, Indianapolis, IN 46268, USA
| | | | | | - Rasna Walia
- Corteva Agriscience™, Johnston, IA 50131, USA
| |
Collapse
|
17
|
Positional Correlation Natural Vector: A Novel Method for Genome Comparison. Int J Mol Sci 2020; 21:ijms21113859. [PMID: 32485813 PMCID: PMC7312176 DOI: 10.3390/ijms21113859] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/13/2020] [Revised: 05/17/2020] [Accepted: 05/26/2020] [Indexed: 12/17/2022] Open
Abstract
Advances in sequencing technology have made large amounts of biological data available. Evolutionary analysis of data such as DNA sequences is highly important in biological studies. As alignment methods are ineffective for analyzing large-scale data due to their inherently high costs, alignment-free methods have recently attracted attention in the field of bioinformatics. In this paper, we introduce a new positional correlation natural vector (PCNV) method that involves converting a DNA sequence into an 18-dimensional numerical feature vector. Using frequency and position correlation to represent the nucleotide distribution, it is possible to obtain a PCNV for a DNA sequence. This new numerical vector design uses six suitable features to characterize the correlation among nucleotide positions in sequences. PCNV is also very easy to compute and can be used for rapid genome comparison. To test our novel method, we performed phylogenetic analysis with several viral and bacterial genome datasets with PCNV. For comparison, an alignment-based method, Bayesian inference, and two alignment-free methods, feature frequency profile and natural vector, were performed using the same datasets. We found that the PCNV technique is fast and accurate when used for phylogenetic analysis and classification of viruses and bacteria.
Collapse
|
18
|
Complex Network Characterization Using Graph Theory and Fractal Geometry: The Case Study of Lung Cancer DNA Sequences. APPLIED SCIENCES-BASEL 2020. [DOI: 10.3390/app10093037] [Citation(s) in RCA: 18] [Impact Index Per Article: 3.6] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 12/22/2022]
Abstract
This paper discusses an approach developed for exploiting the local elementary movements of evolution to study complex networks in terms of shared common embedding and, consequently, shared fractal properties. This approach can be useful for the analysis of lung cancer DNA sequences and their properties by using the concepts of graph theory and fractal geometry. The proposed method advances a renewed consideration of network complexity both on local and global scales. Several researchers have illustrated the advantages of fractal mathematics, as well as its applicability to lung cancer research. Nevertheless, many researchers and clinicians continue to be unaware of its potential. Therefore, this paper aims to examine the underlying assumptions of fractals and analyze the fractal dimension and related measurements for possible application to complex networks and, especially, to the lung cancer network. The strict relationship between the lung cancer network properties and the fractal dimension is proved. Results show that the fractal dimension decreases in the lung cancer network while the topological properties of the network increase in the lung cancer network. Finally, statistical and topological significance between the complexity of the network and lung cancer network is shown.
Collapse
|
19
|
Zielezinski A, Girgis HZ, Bernard G, Leimeister CA, Tang K, Dencker T, Lau AK, Röhling S, Choi JJ, Waterman MS, Comin M, Kim SH, Vinga S, Almeida JS, Chan CX, James BT, Sun F, Morgenstern B, Karlowski WM. Benchmarking of alignment-free sequence comparison methods. Genome Biol 2019; 20:144. [PMID: 31345254 PMCID: PMC6659240 DOI: 10.1186/s13059-019-1755-7] [Citation(s) in RCA: 113] [Impact Index Per Article: 18.8] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/16/2019] [Accepted: 07/03/2019] [Indexed: 11/22/2022] Open
Abstract
BACKGROUND Alignment-free (AF) sequence comparison is attracting persistent interest driven by data-intensive applications. Hence, many AF procedures have been proposed in recent years, but a lack of a clearly defined benchmarking consensus hampers their performance assessment. RESULTS Here, we present a community resource (http://afproject.org) to establish standards for comparing alignment-free approaches across different areas of sequence-based research. We characterize 74 AF methods available in 24 software tools for five research applications, namely, protein sequence classification, gene tree inference, regulatory element detection, genome-based phylogenetic inference, and reconstruction of species trees under horizontal gene transfer and recombination events. CONCLUSION The interactive web service allows researchers to explore the performance of alignment-free tools relevant to their data types and analytical goals. It also allows method developers to assess their own algorithms and compare them with current state-of-the-art tools, accelerating the development of new, more accurate AF solutions.
Collapse
Affiliation(s)
- Andrzej Zielezinski
- Department of Computational Biology, Faculty of Biology, Adam Mickiewicz University Poznan, Uniwersytetu Poznańskiego 6, 61-614, Poznan, Poland
| | - Hani Z Girgis
- Tandy School of Computer Science, The University of Tulsa, 800 South Tucker Drive, Tulsa, OK, 74104, USA
| | | | - Chris-Andre Leimeister
- Department of Bioinformatics, Institute of Microbiology and Genetics, University of Göttingen, Goldschmidtstr. 1, 37077, Göttingen, Germany
| | - Kujin Tang
- Department of Biological Sciences, Quantitative and Computational Biology Program, University of Southern California, Los Angeles, CA, 90089, USA
| | - Thomas Dencker
- Department of Bioinformatics, Institute of Microbiology and Genetics, University of Göttingen, Goldschmidtstr. 1, 37077, Göttingen, Germany
| | - Anna Katharina Lau
- Department of Bioinformatics, Institute of Microbiology and Genetics, University of Göttingen, Goldschmidtstr. 1, 37077, Göttingen, Germany
| | - Sophie Röhling
- Department of Bioinformatics, Institute of Microbiology and Genetics, University of Göttingen, Goldschmidtstr. 1, 37077, Göttingen, Germany
| | - Jae Jin Choi
- Department of Chemistry, University of California, Berkeley, CA, 94720, USA
- Molecular Biophysics & Integrated Bioimaging Division, Lawrence Berkeley National Laboratory, Berkeley, CA, 94720, USA
| | - Michael S Waterman
- Department of Biological Sciences, Quantitative and Computational Biology Program, University of Southern California, Los Angeles, CA, 90089, USA
- Centre for Computational Systems Biology, School of Mathematical Sciences, Fudan University, Shanghai, 200433, China
| | - Matteo Comin
- Department of Information Engineering, University of Padova, Padova, Italy
| | - Sung-Hou Kim
- Department of Chemistry, University of California, Berkeley, CA, 94720, USA
- Molecular Biophysics & Integrated Bioimaging Division, Lawrence Berkeley National Laboratory, Berkeley, CA, 94720, USA
| | - Susana Vinga
- INESC-ID, Instituto Superior Técnico, Universidade de Lisboa, Av. Rovisco Pais 1, 1049-001, Lisbon, Portugal
- IDMEC, Instituto Superior Técnico, Universidade de Lisboa, Av. Rovisco Pais 1, 1049-001, Lisbon, Portugal
| | - Jonas S Almeida
- Division of Cancer Epidemiology and Genetics (DCEG), National Cancer Institute (NIH/NCI), Bethesda, USA
| | - Cheong Xin Chan
- Institute for Molecular Bioscience, and School of Chemistry and Molecular Biosciences, The University of Queensland, Brisbane, QLD, 4072, Australia
| | - Benjamin T James
- Tandy School of Computer Science, The University of Tulsa, 800 South Tucker Drive, Tulsa, OK, 74104, USA
| | - Fengzhu Sun
- Department of Biological Sciences, Quantitative and Computational Biology Program, University of Southern California, Los Angeles, CA, 90089, USA
- Centre for Computational Systems Biology, School of Mathematical Sciences, Fudan University, Shanghai, 200433, China
| | - Burkhard Morgenstern
- Department of Bioinformatics, Institute of Microbiology and Genetics, University of Göttingen, Goldschmidtstr. 1, 37077, Göttingen, Germany
| | - Wojciech M Karlowski
- Department of Computational Biology, Faculty of Biology, Adam Mickiewicz University Poznan, Uniwersytetu Poznańskiego 6, 61-614, Poznan, Poland.
| |
Collapse
|
20
|
Olyaee MH, Khanteymoori A, Khalifeh K. Application of Chaotic Laws to Improve Haplotype Assembly Using Chaos Game Representation. Sci Rep 2019; 9:10361. [PMID: 31316124 PMCID: PMC6637069 DOI: 10.1038/s41598-019-46844-y] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.2] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/04/2018] [Accepted: 07/01/2019] [Indexed: 02/06/2023] Open
Abstract
Sequence data are deposited in the form of unphased genotypes and it is not possible to directly identify the location of a particular allele on a specific parental chromosome or haplotype. This study employed nonlinear time series modeling approaches to analyze the haplotype sequences obtained from the NGS sequencing method. To evaluate the chaotic behavior of haplotypes, we analyzed their whole sequences, as well as several subsequences from distinct haplotypes, in terms of the SNP distribution on their chromosomes. This analysis utilized chaos game representation (CGR) followed by the application of two different scaling methods. It was found that chaotic behavior clearly exists in most haplotype subsequences. For testing the applicability of the proposed model, the present research determined the alleles in gap positions and positions with low coverage by using chromosome subsequences in which 10% of each subsequence's alleles are replaced by gaps. After conversion of the subsequences' CGR into the coordinate series, a Local Projection (LP) method predicted the measure of ambiguous positions in the coordinate series. It was discovered that the average reconstruction rate for all input data is more than 97%, demonstrating that applying this knowledge can effectively improve the reconstruction rate of given haplotypes.
Collapse
Affiliation(s)
| | | | - Khosrow Khalifeh
- Department of Biology, Faculty of Sciences, University of Zanjan, Zanjan, Iran
| |
Collapse
|
21
|
Randhawa GS, Hill KA, Kari L. ML-DSP: Machine Learning with Digital Signal Processing for ultrafast, accurate, and scalable genome classification at all taxonomic levels. BMC Genomics 2019; 20:267. [PMID: 30943897 PMCID: PMC6448311 DOI: 10.1186/s12864-019-5571-y] [Citation(s) in RCA: 20] [Impact Index Per Article: 3.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/04/2018] [Accepted: 02/27/2019] [Indexed: 11/11/2022] Open
Abstract
Background Although software tools abound for the comparison, analysis, identification, and classification of genomic sequences, taxonomic classification remains challenging due to the magnitude of the datasets and the intrinsic problems associated with classification. The need exists for an approach and software tool that addresses the limitations of existing alignment-based methods, as well as the challenges of recently proposed alignment-free methods. Results We propose a novel combination of supervised Machine Learning with Digital Signal Processing, resulting in ML-DSP: an alignment-free software tool for ultrafast, accurate, and scalable genome classification at all taxonomic levels. We test ML-DSP by classifying 7396 full mitochondrial genomes at various taxonomic levels, from kingdom to genus, with an average classification accuracy of >97%. A quantitative comparison with state-of-the-art classification software tools is performed, on two small benchmark datasets and one large 4322 vertebrate mtDNA genomes dataset. Our results show that ML-DSP overwhelmingly outperforms the alignment-based software MEGA7 (alignment with MUSCLE or CLUSTALW) in terms of processing time, while having comparable classification accuracies for small datasets and superior accuracies for the large dataset. Compared with the alignment-free software FFP (Feature Frequency Profile), ML-DSP has significantly better classification accuracy, and is overall faster. We also provide preliminary experiments indicating the potential of ML-DSP to be used for other datasets, by classifying 4271 complete dengue virus genomes into subtypes with 100% accuracy, and 4,710 bacterial genomes into phyla with 95.5% accuracy. Lastly, our analysis shows that the “Purine/Pyrimidine”, “Just-A” and “Real” numerical representations of DNA sequences outperform ten other such numerical representations used in the Digital Signal Processing literature for DNA classification purposes. Conclusions Due to its superior classification accuracy, speed, and scalability to large datasets, ML-DSP is highly relevant in the classification of newly discovered organisms, in distinguishing genomic signatures and identifying their mechanistic determinants, and in evaluating genome integrity.
Collapse
Affiliation(s)
- Gurjit S Randhawa
- Department of Computer Science, University of Western Ontario, London, ON, Canada.
| | - Kathleen A Hill
- Department of Biology, University of Western Ontario, London, ON, Canada
| | - Lila Kari
- School of Computer Science, University of Waterloo, Waterloo, ON, Canada
| |
Collapse
|
22
|
Ren J, Bai X, Lu YY, Tang K, Wang Y, Reinert G, Sun F. Alignment-Free Sequence Analysis and Applications. Annu Rev Biomed Data Sci 2018; 1:93-114. [PMID: 31828235 PMCID: PMC6905628 DOI: 10.1146/annurev-biodatasci-080917-013431] [Citation(s) in RCA: 58] [Impact Index Per Article: 8.3] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/06/2023]
Abstract
Genome and metagenome comparisons based on large amounts of next generation sequencing (NGS) data pose significant challenges for alignment-based approaches due to the huge data size and the relatively short length of the reads. Alignment-free approaches based on the counts of word patterns in NGS data do not depend on the complete genome and are generally computationally efficient. Thus, they contribute significantly to genome and metagenome comparison. Recently, novel statistical approaches have been developed for the comparison of both long and shotgun sequences. These approaches have been applied to many problems including the comparison of gene regulatory regions, genome sequences, metagenomes, binning contigs in metagenomic data, identification of virus-host interactions, and detection of horizontal gene transfers. We provide an updated review of these applications and other related developments of word-count based approaches for alignment-free sequence analysis.
Collapse
Affiliation(s)
- Jie Ren
- Molecular and Computational Biology Program, University of Southern California, Los Angeles, California, USA
| | - Xin Bai
- Molecular and Computational Biology Program, University of Southern California, Los Angeles, California, USA
- Centre for Computational Systems Biology, School of Mathematical Sciences, Fudan University, Shanghai, China
| | - Yang Young Lu
- Molecular and Computational Biology Program, University of Southern California, Los Angeles, California, USA
| | - Kujin Tang
- Molecular and Computational Biology Program, University of Southern California, Los Angeles, California, USA
| | - Ying Wang
- Department of Automation, Xiamen University, Xiamen, Fujian, China
| | - Gesine Reinert
- Department of Statistics, University of Oxford, Oxford, United Kingdom
| | - Fengzhu Sun
- Molecular and Computational Biology Program, University of Southern California, Los Angeles, California, USA
- Centre for Computational Systems Biology, School of Mathematical Sciences, Fudan University, Shanghai, China
| |
Collapse
|
23
|
ALUminating the Path of Atherosclerosis Progression: Chaos Theory Suggests a Role for Alu Repeats in the Development of Atherosclerotic Vascular Disease. Int J Mol Sci 2018; 19:ijms19061734. [PMID: 29895733 PMCID: PMC6032270 DOI: 10.3390/ijms19061734] [Citation(s) in RCA: 15] [Impact Index Per Article: 2.1] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/20/2018] [Revised: 06/04/2018] [Accepted: 06/09/2018] [Indexed: 12/12/2022] Open
Abstract
Atherosclerosis (ATH) and coronary artery disease (CAD) are chronic inflammatory diseases with an important genetic background; they derive from the cumulative effect of multiple common risk alleles, most of which are located in genomic noncoding regions. These complex diseases behave as nonlinear dynamical systems that show a high dependence on their initial conditions; thus, long-term predictions of disease progression are unreliable. One likely possibility is that the nonlinear nature of ATH could be dependent on nonlinear correlations in the structure of the human genome. In this review, we show how chaos theory analysis has highlighted genomic regions that have shared specific structural constraints, which could have a role in ATH progression. These regions were shown to be enriched with repetitive sequences of the Alu family, genomic parasites that have colonized the human genome, which show a particular secondary structure and are involved in the regulation of gene expression. Here, we show the impact of Alu elements on the mechanisms that regulate gene expression, especially highlighting the molecular mechanisms via which the Alu elements alter the inflammatory response. We devote special attention to their relationship with the long noncoding RNA (lncRNA); antisense noncoding RNA in the INK4 locus (ANRIL), a risk factor for ATH; their role as microRNA (miRNA) sponges; and their ability to interfere with the regulatory circuitry of the (nuclear factor kappa B) NF-κB response. We aim to characterize ATH as a nonlinear dynamic system, in which small initial alterations in the expression of a number of repetitive elements are somehow amplified to reach phenotypic significance.
Collapse
|
24
|
Zielezinski A, Vinga S, Almeida J, Karlowski WM. Alignment-free sequence comparison: benefits, applications, and tools. Genome Biol 2017; 18:186. [PMID: 28974235 PMCID: PMC5627421 DOI: 10.1186/s13059-017-1319-7] [Citation(s) in RCA: 285] [Impact Index Per Article: 35.6] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/30/2023] Open
Abstract
Alignment-free sequence analyses have been applied to problems ranging from whole-genome phylogeny to the classification of protein families, identification of horizontally transferred genes, and detection of recombined sequences. The strength of these methods makes them particularly useful for next-generation sequencing data processing and analysis. However, many researchers are unclear about how these methods work, how they compare to alignment-based methods, and what their potential is for use for their research. We address these questions and provide a guide to the currently available alignment-free sequence analysis tools.
Collapse
Affiliation(s)
- Andrzej Zielezinski
- Department of Computational Biology, Faculty of Biology, Adam Mickiewicz University in Poznan, Umultowska 89, 61-614, Poznan, Poland
| | - Susana Vinga
- IDMEC, Instituto Superior Técnico, Universidade de Lisboa, Av. Rovisco Pais 1, 1049-001, Lisbon, Portugal
| | - Jonas Almeida
- Stony Brook University (SUNY), 101 Nicolls Road, Stony Brook, NY, 11794, USA
| | - Wojciech M Karlowski
- Department of Computational Biology, Faculty of Biology, Adam Mickiewicz University in Poznan, Umultowska 89, 61-614, Poznan, Poland.
| |
Collapse
|
25
|
Li Y, Shi X, Liang Y, Xie J, Zhang Y, Ma Q. RNA-TVcurve: a Web server for RNA secondary structure comparison based on a multi-scale similarity of its triple vector curve representation. BMC Bioinformatics 2017; 18:51. [PMID: 28109252 PMCID: PMC5251234 DOI: 10.1186/s12859-017-1481-7] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/25/2016] [Accepted: 01/10/2017] [Indexed: 01/10/2023] Open
Abstract
Background RNAs have been found to carry diverse functionalities in nature. Inferring the similarity between two given RNAs is a fundamental step to understand and interpret their functional relationship. The majority of functional RNAs show conserved secondary structures, rather than sequence conservation. Those algorithms relying on sequence-based features usually have limitations in their prediction performance. Hence, integrating RNA structure features is very critical for RNA analysis. Existing algorithms mainly fall into two categories: alignment-based and alignment-free. The alignment-free algorithms of RNA comparison usually have lower time complexity than alignment-based algorithms. Results An alignment-free RNA comparison algorithm was proposed, in which novel numerical representations RNA-TVcurve (triple vector curve representation) of RNA sequence and corresponding secondary structure features are provided. Then a multi-scale similarity score of two given RNAs was designed based on wavelet decomposition of their numerical representation. In support of RNA mutation and phylogenetic analysis, a web server (RNA-TVcurve) was designed based on this alignment-free RNA comparison algorithm. It provides three functional modules: 1) visualization of numerical representation of RNA secondary structure; 2) detection of single-point mutation based on secondary structure; and 3) comparison of pairwise and multiple RNA secondary structures. The inputs of the web server require RNA primary sequences, while corresponding secondary structures are optional. For the primary sequences alone, the web server can compute the secondary structures using free energy minimization algorithm in terms of RNAfold tool from Vienna RNA package. Conclusion RNA-TVcurve is the first integrated web server, based on an alignment-free method, to deliver a suite of RNA analysis functions, including visualization, mutation analysis and multiple RNAs structure comparison. The comparison results with two popular RNA comparison tools, RNApdist and RNAdistance, showcased that RNA-TVcurve can efficiently capture subtle relationships among RNAs for mutation detection and non-coding RNA classification. All the relevant results were shown in an intuitive graphical manner, and can be freely downloaded from this server. RNA-TVcurve, along with test examples and detailed documents, are available at: http://ml.jlu.edu.cn/tvcurve/. Electronic supplementary material The online version of this article (doi:10.1186/s12859-017-1481-7) contains supplementary material, which is available to authorized users.
Collapse
Affiliation(s)
- Ying Li
- College of Computer Science and Technology, Jilin University, Changchun, 130012, China.,Key Laboratory of Symbolic Computation and Knowledge Engineering (Jilin University), Ministry of Education, Changchun, 130012, China
| | - Xiaohu Shi
- College of Computer Science and Technology, Jilin University, Changchun, 130012, China.,Key Laboratory of Symbolic Computation and Knowledge Engineering (Jilin University), Ministry of Education, Changchun, 130012, China
| | - Yanchun Liang
- College of Computer Science and Technology, Jilin University, Changchun, 130012, China.,Key Laboratory of Symbolic Computation and Knowledge Engineering (Jilin University), Ministry of Education, Changchun, 130012, China.,Zhuhai Laboratory of Key Laboratory of Symbol Computation and Knowledge Engineering of Ministry of Education, Zhuhai College of Jilin University, Zhuhai, 519041, China
| | - Juan Xie
- Department of Mathematics and Statistics, South Dakota State University, Brookings, SD, 57007, USA.,Bioinformatics and Mathematical Biosciences Lab, Department of Agronomy, Horticulture and Plant Science, South Dakota State University, Brookings, SD, 57007, USA.,BioSNTR, Brookings, SD, USA
| | - Yu Zhang
- College of Computer Science and Technology, Jilin University, Changchun, 130012, China. .,Key Laboratory of Symbolic Computation and Knowledge Engineering (Jilin University), Ministry of Education, Changchun, 130012, China.
| | - Qin Ma
- Department of Mathematics and Statistics, South Dakota State University, Brookings, SD, 57007, USA. .,Bioinformatics and Mathematical Biosciences Lab, Department of Agronomy, Horticulture and Plant Science, South Dakota State University, Brookings, SD, 57007, USA. .,BioSNTR, Brookings, SD, USA.
| |
Collapse
|
26
|
Yu Y, Yang L, Liu Z, Zhu C. Gene essentiality prediction based on fractal features and machine learning. MOLECULAR BIOSYSTEMS 2017; 13:577-584. [PMID: 28145541 DOI: 10.1039/c6mb00806b] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 01/06/2023]
Abstract
Predicting bacterial essential genes using only fractal features.
Collapse
Affiliation(s)
- Yongming Yu
- Department of Biomedical Engineering
- Shandong University
- Jinan
- China
| | - Licai Yang
- Department of Biomedical Engineering
- Shandong University
- Jinan
- China
| | - Zhiping Liu
- Department of Biomedical Engineering
- Shandong University
- Jinan
- China
| | - Chuansheng Zhu
- Department of Hematology
- Shandong University Affiliated Qianfoshan Hospital
- Jinan
- China
| |
Collapse
|
27
|
Karamichalis R, Kari L, Konstantinidis S, Kopecki S, Solis-Reyes S. Additive methods for genomic signatures. BMC Bioinformatics 2016; 17:313. [PMID: 27549194 PMCID: PMC4994249 DOI: 10.1186/s12859-016-1157-8] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/13/2016] [Accepted: 07/19/2016] [Indexed: 01/09/2023] Open
Abstract
Background Studies exploring the potential of Chaos Game Representations (CGR) of genomic sequences to act as “genomic signatures” (to be species- and genome-specific) showed that CGR patterns of nuclear and organellar DNA sequences of the same organism can be very different. While the hypothesis that CGRs of mitochondrial DNA sequences can act as genomic signatures was validated for a snapshot of all sequenced mitochondrial genomes available in the NCBI GenBank sequence database, to our knowledge no such extensive analysis of CGRs of nuclear DNA sequences exists to date. Results We analyzed an extensive dataset, totalling 1.45 gigabase pairs, of nuclear/nucleoid genomic sequences (nDNA) from 42 different organisms, spanning all major kingdoms of life. Our computational experiments indicate that CGR signatures of nDNA of two different origins cannot always be differentiated, especially if they originate from closely-related species such as H. sapiens and P. troglodytes or E. coli and E. fergusonii. To address this issue, we propose the general concept of additive DNA signature of a set (collection) of DNA sequences. One particular instance, the composite DNA signature, combines information from nDNA fragments and organellar (mitochondrial, chloroplast, or plasmid) genomes. We demonstrate that, in this dataset, composite DNA signatures originating from two different organisms can be differentiated in all cases, including those where the use of CGR signatures of nDNA failed or was inconclusive. Another instance, the assembled DNA signature, combines information from many short DNA subfragments (e.g., 100 basepairs) of a given DNA fragment, to produce its signature. We show that an assembled DNA signature has the same distinguishing power as a conventionally computed CGR signature, while using shorter contiguous sequences and potentially less sequence information. Conclusions Our results suggest that, while CGR signatures of nDNA cannot always play the role of genomic signatures, composite and assembled DNA signatures (separately or in combination) could potentially be used instead. Such additive signatures could be used, e.g., with raw unassembled next-generation sequencing (NGS) read data, when high-quality sequencing data is not available, or to complement information obtained by other methods of species identification or classification. Electronic supplementary material The online version of this article (doi:10.1186/s12859-016-1157-8) contains supplementary material, which is available to authorized users.
Collapse
Affiliation(s)
- Rallis Karamichalis
- Department of Computer Science, University of Western Ontario, London ON, N6A 5B7, Canada
| | - Lila Kari
- School of Computing Science, University of Waterloo, Waterloo, ON, N2L 3G1, Canada. .,Department of Computer Science, University of Western Ontario, London ON, N6A 5B7, Canada.
| | - Stavros Konstantinidis
- Department of Mathematics and Computing Science, Saint Mary's University, Halifax NS, Canada
| | - Steffen Kopecki
- Department of Computer Science, University of Western Ontario, London ON, N6A 5B7, Canada.,Department of Mathematics and Computing Science, Saint Mary's University, Halifax NS, Canada
| | - Stephen Solis-Reyes
- Department of Computer Science, University of Western Ontario, London ON, N6A 5B7, Canada
| |
Collapse
|
28
|
Dinov ID. Methodological challenges and analytic opportunities for modeling and interpreting Big Healthcare Data. Gigascience 2016; 5:12. [PMID: 26918190 PMCID: PMC4766610 DOI: 10.1186/s13742-016-0117-6] [Citation(s) in RCA: 59] [Impact Index Per Article: 6.6] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/08/2015] [Accepted: 02/09/2016] [Indexed: 11/25/2022] Open
Abstract
Managing, processing and understanding big healthcare data is challenging, costly and demanding. Without a robust fundamental theory for representation, analysis and inference, a roadmap for uniform handling and analyzing of such complex data remains elusive. In this article, we outline various big data challenges, opportunities, modeling methods and software techniques for blending complex healthcare data, advanced analytic tools, and distributed scientific computing. Using imaging, genetic and healthcare data we provide examples of processing heterogeneous datasets using distributed cloud services, automated and semi-automated classification techniques, and open-science protocols. Despite substantial advances, new innovative technologies need to be developed that enhance, scale and optimize the management and processing of large, complex and heterogeneous data. Stakeholder investments in data acquisition, research and development, computational infrastructure and education will be critical to realize the huge potential of big data, to reap the expected information benefits and to build lasting knowledge assets. Multi-faceted proprietary, open-source, and community developments will be essential to enable broad, reliable, sustainable and efficient data-driven discovery and analytics. Big data will affect every sector of the economy and their hallmark will be 'team science'.
Collapse
Affiliation(s)
- Ivo D. Dinov
- Statistics Online Computational Resource (SOCR), Health Behavior and Biological Sciences, Michigan Institute for Data Science, University of Michigan, 426 N. Ingalls, Ann Arbor, MI 49109 USA
| |
Collapse
|
29
|
Karamichalis R, Kari L, Konstantinidis S, Kopecki S. An investigation into inter- and intragenomic variations of graphic genomic signatures. BMC Bioinformatics 2015; 16:246. [PMID: 26249837 PMCID: PMC4527362 DOI: 10.1186/s12859-015-0655-4] [Citation(s) in RCA: 18] [Impact Index Per Article: 1.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/19/2014] [Accepted: 06/30/2015] [Indexed: 11/30/2022] Open
Abstract
Background Motivated by the general need to identify and classify species based on molecular evidence, genome comparisons have been proposed that are based on measuring mostly Euclidean distances between Chaos Game Representation (CGR) patterns of genomic DNA sequences. Results We provide, on an extensive dataset and using several different distances, confirmation of the hypothesis that CGR patterns are preserved along a genomic DNA sequence, and are different for DNA sequences originating from genomes of different species. This finding lends support to the theory that CGRs of genomic sequences can act as graphic genomic signatures. In particular, we compare the CGR patterns of over five hundred different 150,000 bp genomic sequences spanning one complete chromosome from each of six organisms, representing all kingdoms of life: H. sapiens (Animalia; chromosome 21), S. cerevisiae (Fungi; chromosome 4), A. thaliana (Plantae; chromosome 1), P. falciparum (Protista; chromosome 14), E. coli (Bacteria - full genome), and P. furiosus (Archaea - full genome). To maximize the diversity within each species, we also analyze the interrelationships within a set of over five hundred 150,000 bp genomic sequences sampled from the entire aforementioned genomes. Lastly, we provide some preliminary evidence of this method’s ability to classify genomic DNA sequences at lower taxonomic levels by comparing sequences sampled from the entire genome of H. sapiens (class Mammalia, order Primates) and of M. musculus (class Mammalia, order Rodentia), for a total length of approximately 174 million basepairs analyzed. We compute pairwise distances between CGRs of these genomic sequences using six different distances, and construct Molecular Distance Maps, which visualize all sequences as points in a two-dimensional or three-dimensional space, to simultaneously display their interrelationships. Conclusion Our analysis confirms, for this dataset, that CGR patterns of DNA sequences from the same genome are in general quantitatively similar, while being different for DNA sequences from genomes of different species. Our assessment of the performance of the six distances analyzed uses three different quality measures and suggests that several distances outperform the Euclidean distance, which has so far been almost exclusively used for such studies.
Collapse
Affiliation(s)
- Rallis Karamichalis
- Department of Computer Science, University of Western Ontario, London, ON, Canada.
| | - Lila Kari
- Department of Computer Science, University of Western Ontario, London, ON, Canada.
| | - Stavros Konstantinidis
- Department of Mathematics and Computing Science, Saint Mary's University, Halifax, NS, Canada.
| | - Steffen Kopecki
- Department of Computer Science, University of Western Ontario, London, ON, Canada. .,Department of Mathematics and Computing Science, Saint Mary's University, Halifax, NS, Canada.
| |
Collapse
|
30
|
Lennon FE, Cianci GC, Cipriani NA, Hensing TA, Zhang HJ, Chen CT, Murgu SD, Vokes EE, Vannier MW, Salgia R. Lung cancer-a fractal viewpoint. Nat Rev Clin Oncol 2015; 12:664-75. [PMID: 26169924 DOI: 10.1038/nrclinonc.2015.108] [Citation(s) in RCA: 109] [Impact Index Per Article: 10.9] [Reference Citation Analysis] [Abstract] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 02/07/2023]
Abstract
Fractals are mathematical constructs that show self-similarity over a range of scales and non-integer (fractal) dimensions. Owing to these properties, fractal geometry can be used to efficiently estimate the geometrical complexity, and the irregularity of shapes and patterns observed in lung tumour growth (over space or time), whereas the use of traditional Euclidean geometry in such calculations is more challenging. The application of fractal analysis in biomedical imaging and time series has shown considerable promise for measuring processes as varied as heart and respiratory rates, neuronal cell characterization, and vascular development. Despite the advantages of fractal mathematics and numerous studies demonstrating its applicability to lung cancer research, many researchers and clinicians remain unaware of its potential. Therefore, this Review aims to introduce the fundamental basis of fractals and to illustrate how analysis of fractal dimension (FD) and associated measurements, such as lacunarity (texture) can be performed. We describe the fractal nature of the lung and explain why this organ is particularly suited to fractal analysis. Studies that have used fractal analyses to quantify changes in nuclear and chromatin FD in primary and metastatic tumour cells, and clinical imaging studies that correlated changes in the FD of tumours on CT and/or PET images with tumour growth and treatment responses are reviewed. Moreover, the potential use of these techniques in the diagnosis and therapeutic management of lung cancer are discussed.
Collapse
Affiliation(s)
- Frances E Lennon
- Section of Hematology/Oncology, University of Chicago, 5841 South Maryland Avenue, MC 2115 Chicago, IL 60637, USA
| | - Gianguido C Cianci
- Department of Cell and Molecular Biology, Feinberg School of Medicine, Northwestern University, 303 East Chicago Avenue, Chicago, IL 60611, USA
| | - Nicole A Cipriani
- Department of Pathology, University of Chicago, 5841 South Maryland Avenue, MC 2115 Chicago, IL 60637, USA
| | - Thomas A Hensing
- NorthShore University Health System, 2650 Ridge Avenue, Evanston, IL 60201, USA
| | - Hannah J Zhang
- Department of Radiology, University of Chicago, 5841 South Maryland Avenue, MC 2115 Chicago, IL 60637, USA
| | - Chin-Tu Chen
- Department of Radiology, University of Chicago, 5841 South Maryland Avenue, MC 2115 Chicago, IL 60637, USA
| | - Septimiu D Murgu
- Department of Medicine, University of Chicago, 5841 South Maryland Avenue, MC 2115 Chicago, IL 60637, USA
| | - Everett E Vokes
- Section of Hematology/Oncology, University of Chicago, 5841 South Maryland Avenue, MC 2115 Chicago, IL 60637, USA
| | - Michael W Vannier
- Department of Radiology, University of Chicago, 5841 South Maryland Avenue, MC 2115 Chicago, IL 60637, USA
| | - Ravi Salgia
- Section of Hematology/Oncology, University of Chicago, 5841 South Maryland Avenue, MC 2115 Chicago, IL 60637, USA
| |
Collapse
|