1
Prusokiene A, Boonham N, Fox A, Howard TP. Mottle: Accurate pairwise substitution distance at high divergence through the exploitation of short-read mappers and gradient descent. PLoS One 2024; 19:e0298834. [PMID: 38512939 PMCID: PMC10956839 DOI: 10.1371/journal.pone.0298834] [Received: 09/07/2023] [Accepted: 01/30/2024] [Indexed: 03/23/2024]
Abstract
Current tools for estimating the substitution distance between two related sequences struggle to remain accurate at a high divergence. Difficulties at distant homologies, such as false seeding and over-alignment, create a high barrier for the development of a stable estimator. This is especially true for viral genomes, which carry a high rate of mutation, small size, and sparse taxonomy. Developing an accurate substitution distance measure would help to elucidate the relationship between highly divergent sequences, interrogate their evolutionary history, and better facilitate the discovery of new viral genomes. To tackle these problems, we propose an approach that uses short-read mappers to create whole-genome maps, and gradient descent to isolate the homologous fraction and calculate the final distance value. We implement this approach as Mottle. With the use of simulated and biological sequences, Mottle was able to remain stable to 0.66-0.96 substitutions per base pair and identify viral outgroup genomes with 95% accuracy at the family-order level. Our results indicate that Mottle performs as well as existing programs in identifying taxonomic relationships, with more accurate numerical estimation of genomic distance over greater divergences. By contrast, one limitation is a reduced numerical accuracy at low divergences, and on genomes where insertions and deletions are uncommon, when compared to alternative approaches. We propose that Mottle may therefore be of particular interest in the study of viruses, viral relationships, and notably for viral discovery platforms, helping in benchmarking of homology search tools and defining the limits of taxonomic classification methods. The code for Mottle is available at https://github.com/tphoward/Mottle_Repo.
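To illustrate why numerical distance estimates become unstable at high divergence, the sketch below uses the classical Jukes-Cantor correction. This is not Mottle's own method, which the abstract describes only at a high level. The function names are hypothetical, and the inputs are assumed to be already aligned, gap-free sequences of equal length.

```python
import math


def p_distance(a: str, b: str) -> float:
    """Observed fraction of mismatching sites between two aligned sequences.

    Assumption: both sequences are gap-free and of equal length.
    """
    if len(a) != len(b) or not a:
        raise ValueError("sequences must be non-empty and of equal length")
    mismatches = sum(x != y for x, y in zip(a, b))
    return mismatches / len(a)


def jukes_cantor(p: float) -> float:
    """Jukes-Cantor distance (substitutions per site) from a mismatch fraction p.

    d = -3/4 * ln(1 - 4p/3), defined only for 0 <= p < 0.75.
    """
    if not 0.0 <= p < 0.75:
        raise ValueError("p must lie in [0, 0.75) for the Jukes-Cantor model")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


if __name__ == "__main__":
    p = p_distance("ACGT", "ACGA")
    print(p, jukes_cantor(p))
```

Under this model the estimate grows without bound as the observed mismatch fraction approaches 0.75, so small errors in the observed fraction translate into large errors in the estimated distance.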
Affiliation(s)
- Alisa Prusokiene
  - Faculty of Science, Agriculture and Engineering, School of Natural and Environmental Sciences, Newcastle University, United Kingdom
- Neil Boonham
  - Faculty of Science, Agriculture and Engineering, School of Natural and Environmental Sciences, Newcastle University, United Kingdom
- Adrian Fox
  - Faculty of Science, Agriculture and Engineering, School of Natural and Environmental Sciences, Newcastle University, United Kingdom
  - Fera Ltd., Biotech Campus, York, United Kingdom
- Thomas P. Howard
  - Faculty of Science, Agriculture and Engineering, School of Natural and Environmental Sciences, Newcastle University, United Kingdom
2
Wang T, Yu ZG, Li J. CGRWDL: alignment-free phylogeny reconstruction method for viruses based on chaos game representation weighted by dynamical language model. Front Microbiol 2024; 15:1339156. [PMID: 38572227 PMCID: PMC10987876 DOI: 10.3389/fmicb.2024.1339156] [Received: 11/15/2023] [Accepted: 02/23/2024] [Indexed: 04/05/2024]
Abstract
Traditional alignment-based methods meet serious challenges in genome sequence comparison and phylogeny reconstruction due to their high computational complexity. Here, we propose a new alignment-free method to analyze the phylogenetic relationships (classification) among species. In our method, the dynamical language (DL) model and the chaos game representation (CGR) method are used to characterize the frequency information and the context information of k-mers in a sequence, respectively. Then for each DNA sequence or protein sequence in a dataset, our method converts the sequence into a feature vector that represents the sequence information based on CGR weighted by the DL model to infer phylogenetic relationships. We name our method CGRWDL. Its performance was tested on both DNA and protein sequences of 8 datasets of viruses to construct the phylogenetic trees. We compared the Robinson-Foulds (RF) distance between the phylogenetic tree constructed by CGRWDL and the reference tree by other advanced methods for each dataset. The results show that the phylogenetic trees constructed by CGRWDL can accurately classify the viruses, and that the RF distances between these trees and the reference trees are smaller than those obtained with other methods.
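As a rough illustration of the chaos game representation mentioned in this abstract, the sketch below computes plain CGR coordinates for a DNA sequence. It does not reproduce the DL-model weighting or the feature vector of CGRWDL, which the abstract does not specify. The corner assignment, the starting point, and the choice to skip ambiguous bases are assumptions, and `cgr_points` is a hypothetical name.

```python
# Assumed corner layout on the unit square (a common convention, not taken from the paper).
CORNERS = {"A": (0.0, 0.0), "C": (0.0, 1.0), "G": (1.0, 1.0), "T": (1.0, 0.0)}


def cgr_points(seq: str) -> list[tuple[float, float]]:
    """Return one CGR point per unambiguous base.

    Each point is the midpoint between the previous point and the corner
    of the current base, starting from the centre of the unit square.
    """
    x, y = 0.5, 0.5
    points = []
    for base in seq.upper():
        if base not in CORNERS:
            continue  # assumption: ambiguous symbols such as N are skipped
        cx, cy = CORNERS[base]
        x, y = (x + cx) / 2.0, (y + cy) / 2.0
        points.append((x, y))
    return points


if __name__ == "__main__":
    for point in cgr_points("ACGT"):
        print(point)
```

Because each step halves the distance to a corner, the position of a point encodes the preceding bases, which is one way to read the abstract's statement that CGR captures context information of k-mers.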
Affiliation(s)
- Ting Wang
  - National Center for Applied Mathematics in Hunan, Xiangtan University, Xiangtan, Hunan, China
  - Key Laboratory of Intelligent Computing and Information Processing of Ministry of Education, Xiangtan University, Xiangtan, Hunan, China
- Zu-Guo Yu
  - National Center for Applied Mathematics in Hunan, Xiangtan University, Xiangtan, Hunan, China
  - Key Laboratory of Intelligent Computing and Information Processing of Ministry of Education, Xiangtan University, Xiangtan, Hunan, China
- Jinyan Li
  - School of Computer Science and Control Engineering, Shenzhen Institute of Advanced Technology, Shenzhen, Guangdong, China
3
Li B, Tian Y, Tian Y, Zhang S, Zhang X. Predicting Cancer Lymph-Node Metastasis From LncRNA Expression Profiles Using Local Linear Reconstruction Guided Distance Metric Learning. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2022; 19:3179-3189. [PMID: 35139024 DOI: 10.1109/tcbb.2022.3149791] [Indexed: 06/14/2023]
Abstract
Lymph-node metastasis is the most perilous cancer progressive state, where long non-coding RNA (lncRNA) has been confirmed to be an important genetic indicator in cancer prediction. However, lncRNA expression profiles are often characterized by many features and few samples, so an efficient method for handling such high-dimensional lncRNA data is urgently needed to aid clinical targeted treatment. Thus, in this study, a local linear reconstruction guided distance metric learning method is put forward to handle lncRNA data for the determination of cancer lymph-node metastasis. In the original locally linear embedding (LLE) approach, any point can be approximately linearly reconstructed from its nearest neighbors, from which a novel distance metric can be learned by imposing both nonnegativity and sum-to-one constraints on the reconstruction weights. Taking the defined distance metric and the supervised information of the lncRNA data into account, a local margin model is derived to find a low-dimensional subspace for lncRNA signature extraction. Finally, a classifier that also adopts the learned distance metric is constructed to predict cancer lymph-node metastasis. Several experiments on lncRNA data sets have been carried out, and the results demonstrate the performance of the proposed method in comparison with related dimensionality reduction methods and classical classifier models.
4
Kille B, Balaji A, Sedlazeck FJ, Nute M, Treangen TJ. Multiple genome alignment in the telomere-to-telomere assembly era. Genome Biol 2022; 23:182. [PMID: 36038949 PMCID: PMC9421119 DOI: 10.1186/s13059-022-02735-6] [Received: 10/20/2021] [Accepted: 07/21/2022] [Indexed: 01/22/2023]
Abstract
With the arrival of telomere-to-telomere (T2T) assemblies of the human genome comes the computational challenge of efficiently and accurately constructing multiple genome alignments at an unprecedented scale. By identifying nucleotides across genomes which share a common ancestor, multiple genome alignments commonly serve as the bedrock for comparative genomics studies. In this review, we provide an overview of the algorithmic template that most multiple genome alignment methods follow. We also discuss prospective areas of improvement of multiple genome alignment for keeping up with continuously arriving high-quality T2T assembled genomes and for unlocking clinically-relevant insights.
Affiliation(s)
- Bryce Kille
  - Department of Computer Science, Rice University, Houston, TX, USA
- Advait Balaji
  - Department of Computer Science, Rice University, Houston, TX, USA
- Fritz J Sedlazeck
  - Human Genome Sequencing Center, Baylor College of Medicine, Houston, TX, USA
- Michael Nute
  - Department of Computer Science, Rice University, Houston, TX, USA
- Todd J Treangen
  - Department of Computer Science, Rice University, Houston, TX, USA
5
Ebel M, Migliorelli G, Stanke M. Global, highly specific and fast filtering of alignment seeds. BMC Bioinformatics 2022; 23:225. [PMID: 35689182 PMCID: PMC9188137 DOI: 10.1186/s12859-022-04745-4] [Received: 05/01/2020] [Accepted: 05/23/2022] [Indexed: 11/25/2022]
Abstract
Background: An important initial phase of arguably most homology search and alignment methods, such as those required for genome alignments, is seed finding. The seed-finding step is crucial for curbing the runtime, as potential alignments are restricted to and anchored at the sequence position pairs that constitute the seed. To identify seeds, it is good practice to use sets of spaced seed patterns, a method that locally compares two sequences and requires exact matches at certain positions only. Results: We introduce a new method for filtering alignment seeds that we call geometric hashing. Geometric hashing achieves a high specificity by combining non-local information from different seeds using a simple hash function that only requires a constant and small amount of additional time per spaced seed. Geometric hashing was tested on the task of finding homologous positions in the coding regions of human and mouse genome sequences. In this test, the number of false positives was decreased about a million-fold relative to sets of spaced seeds while maintaining a very high sensitivity. Conclusions: An additional geometric hashing filtering phase could improve the run-time, accuracy or both of programs for various homology-search-and-align tasks. Supplementary Information: The online version contains supplementary material available at 10.1186/s12859-022-04745-4.
Affiliation(s)
- Matthis Ebel
  - Institute for Mathematics and Computer Science, University of Greifswald, Walther-Rathenau-Str. 47, 17489, Greifswald, Germany
  - Center for Functional Genomics of Microbes, University of Greifswald, Felix-Hausdorff-Str. 8, 17489, Greifswald, Germany
- Giovanna Migliorelli
  - Institute for Mathematics and Computer Science, University of Greifswald, Walther-Rathenau-Str. 47, 17489, Greifswald, Germany
  - Center for Functional Genomics of Microbes, University of Greifswald, Felix-Hausdorff-Str. 8, 17489, Greifswald, Germany
- Mario Stanke
  - Institute for Mathematics and Computer Science, University of Greifswald, Walther-Rathenau-Str. 47, 17489, Greifswald, Germany
  - Center for Functional Genomics of Microbes, University of Greifswald, Felix-Hausdorff-Str. 8, 17489, Greifswald, Germany
6
Sequence Comparison Without Alignment: The SpaM Approaches. Methods in Molecular Biology (Clifton, N.J.) 2021; 2231:121-134. [PMID: 33289890 DOI: 10.1007/978-1-0716-1036-7_8] [Indexed: 10/22/2022]
Abstract
Sequence alignment is at the heart of DNA and protein sequence analysis. For the data volumes that are nowadays produced by massively parallel sequencing technologies, however, pairwise and multiple alignment methods are often too slow. Therefore, fast alignment-free approaches to sequence comparison have become popular in recent years. Most of these approaches are based on word frequencies, for words of a fixed length, or on word-matching statistics. Other approaches use the length of maximal word matches. While these methods are very fast, most of them rely on ad hoc measures of sequence similarity or dissimilarity that are hard to interpret. In this chapter, I describe a number of alignment-free methods that we developed in recent years. Our approaches are based on spaced-word matches ("SpaM"), i.e. on inexact word matches that are allowed to contain mismatches at certain pre-defined positions. Unlike most previous alignment-free approaches, our approaches are able to accurately estimate phylogenetic distances between DNA or protein sequences using a stochastic model of molecular evolution.
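The idea of a spaced-word match can be sketched as follows. This is a minimal illustration and not the SpaM software: the binary pattern encoding (`1` for a position that must match, `0` for a position where mismatches are allowed) and the function names are assumptions, and the distance estimation step described in the abstract is not implemented.

```python
from collections import defaultdict


def spaced_word(seq: str, start: int, pattern: str) -> str:
    """Extract the characters at the match positions ('1') of the pattern."""
    return "".join(seq[start + i] for i, flag in enumerate(pattern) if flag == "1")


def spaced_word_matches(a: str, b: str, pattern: str) -> list[tuple[int, int]]:
    """Return all (i, j) where a[i:] and b[j:] agree at every match position.

    Characters at the '0' positions of the pattern are ignored, so they
    may differ between the two sequences.
    """
    width = len(pattern)
    # Index every spaced word of the first sequence by its start position.
    index = defaultdict(list)
    for i in range(len(a) - width + 1):
        index[spaced_word(a, i, pattern)].append(i)
    # Look up every spaced word of the second sequence in that index.
    matches = []
    for j in range(len(b) - width + 1):
        for i in index.get(spaced_word(b, j, pattern), []):
            matches.append((i, j))
    return matches


if __name__ == "__main__":
    print(spaced_word_matches("ACGTAC", "ACCTAC", "1101"))
```

In this example the two sequences differ only at their third base. The pattern `1101` ignores that position and therefore still yields a match at the start of both sequences, whereas an exact 4-mer comparison (pattern `1111`) finds none.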
7
Garrett Vieira F, Samaniego Castruita JA, Gilbert MTP. Using in silico predicted ancestral genomes to improve the efficiency of paleogenome reconstruction. Ecol Evol 2020; 10:12700-12709. [PMID: 33304488 PMCID: PMC7713980 DOI: 10.1002/ece3.6925] [Received: 06/11/2020] [Revised: 09/23/2020] [Accepted: 09/28/2020] [Indexed: 01/20/2023]
Abstract
Paleogenomics is the nascent discipline concerned with sequencing and analysis of genome-scale information from historic, ancient, and even extinct samples. While once inconceivable due to the challenges of DNA damage, contamination, and the technical limitations of PCR-based Sanger sequencing, following the dawn of the second-generation sequencing revolution, it has rapidly become a reality. However, a significant challenge facing ancient DNA studies on extinct species is the lack of closely related reference genomes against which to map the sequencing reads from ancient samples. Although bioinformatic efforts to improve the assemblies have focused mainly on mapping algorithms, in this article we explore the potential of an alternative approach, namely using a reconstructed ancestral genome as the reference for mapping DNA sequences of ancient samples. Specifically, we present a preliminary proof of concept for a general framework and demonstrate how, under certain evolutionary divergence thresholds, considerable mapping improvements can be easily obtained.
Affiliation(s)
- Filipe Garrett Vieira
  - Section for Evolutionary Genomics, The GLOBE Institute, Faculty of Health and Medical Sciences, University of Copenhagen, Copenhagen, Denmark
- José Alfredo Samaniego Castruita
  - Section for Evolutionary Genomics, The GLOBE Institute, Faculty of Health and Medical Sciences, University of Copenhagen, Copenhagen, Denmark
- M. Thomas P. Gilbert
  - Section for Evolutionary Genomics, The GLOBE Institute, Faculty of Health and Medical Sciences, University of Copenhagen, Copenhagen, Denmark
  - University Museum, Norwegian University of Science and Technology, Trondheim, Norway
8
Bohmann K, Mirarab S, Bafna V, Gilbert MTP. Beyond DNA barcoding: The unrealized potential of genome skim data in sample identification. Mol Ecol 2020; 29:2521-2534. [PMID: 32542933 PMCID: PMC7496323 DOI: 10.1111/mec.15507] [Received: 02/26/2020] [Revised: 06/03/2020] [Accepted: 06/05/2020] [Indexed: 02/06/2023]
Abstract
Genetic tools are increasingly used to identify and discriminate between species. One key transition in this process was the recognition of the potential of the ca. 658 bp fragment of the organelle cytochrome c oxidase I (COI) as a barcode region, which revolutionized animal bioidentification and led, among other things, to the instigation of the Barcode of Life Database (BOLD), which currently contains barcodes from >7.9 million specimens. Following this discovery, suggestions for other organellar regions and markers, and the primers with which to amplify them, have been continuously proposed. Most recently, the field has taken the leap from PCR-based generation of DNA references into shotgun sequencing-based "genome skimming" alternatives, with the ultimate goal of assembling organellar reference genomes. Unfortunately, in genome skimming approaches, much of the nuclear genome (as much as 99% of the sequence data) is discarded, which is not only wasteful, but can also limit the power of discrimination at, or below, the species level. Here, we advocate that the full shotgun sequence data can be used to assign an identity (that we term for convenience its "DNA-mark") for both voucher and query samples, without requiring any computationally intensive pretreatment (e.g. assembly) of reads. We argue that if reference databases are populated with such "DNA-marks," it will enable future DNA-based taxonomic identification to complement, or even replace, PCR of barcodes with genome skimming, and we discuss how such methodology ultimately could enable identification to population, or even individual, level.
Affiliation(s)
- Kristine Bohmann
  - Section for Evolutionary Genomics, The GLOBE Institute, University of Copenhagen, Copenhagen, Denmark
- Siavash Mirarab
  - Department of Electrical and Computer Engineering, University of California, San Diego, CA, USA
- Vineet Bafna
  - Department of Computer Science and Engineering, University of California, San Diego, CA, USA
- M. Thomas P. Gilbert
  - Section for Evolutionary Genomics, The GLOBE Institute, University of Copenhagen, Copenhagen, Denmark
  - Center for Evolutionary Hologenomics, The GLOBE Institute, University of Copenhagen, Copenhagen, Denmark
  - NTNU University Museum, Trondheim, Norway