1
|
Chen G, Yin L, Zhang H. Isolation and characterization of goose astrovirus genotype 1 causing enteritis in goslings from Sichuan Province, China. BMC Vet Res 2025; 21:259. [PMID: 40205381 PMCID: PMC11983725 DOI: 10.1186/s12917-025-04482-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/25/2024] [Accepted: 01/07/2025] [Indexed: 04/11/2025] Open
Abstract
Since 2017, goose astrovirus (GoAstV) has been widely prevalent in various provinces of China, causing economic losses in the goose industry, with outbreak mortality rates ranging from 10 to 60%. Notably, a goose farm in Sichuan Province has faced an outbreak of infectious disease in 1-3 weeks old goslings, with a mortality rate of approximately 30%. Viral metagenomic analysis of fecal samples identified Goose astrovirus genotype 1 (GoAstV-1), and PCR analysis confirmed the presence of GoAstV-1. Furthermore, we successfully isolated a GoAstV-C1 strain using goose embryos named AAstV/Goose/CHN/2023/C1 (GenBank No. PP108251), and its viral titer was calculated as 10^4.834 ELD50/0.5 mL using the Reed-Muench method. The genome size of GoAstV-C1 was about 7,261 nucleotides through amplifying with Sanger sequencing and assembling with SeqMan software. Phylogenetic analysis revealed that GoAstV-1 strains are classified into three major subtypes: A, B, and C, with the GoAstV-C1 strain identified as a unique variant within subtype B, characterized by distinct genetic divergence features. Experimental inoculation of one-day-old goslings with the virus resulted in a mortality rate of 5 out of 15 (p-value = 0.0421) and a significant reduction in weight gain compared to controls (p-value = 0.005). Pathological examination revealed that GoAstV-C1 infection caused severe damage to the liver, spleen, and kidneys. Interestingly, unlike most GoAstV, which leads to characteristic gout symptoms, our isolates GoAstV-C1 caused obvious intestinal damage characterized by necrosis, inflammatory infiltration, and crypt architectural disruption. We indicated that GoAstV-C1 displays a unique intestinal tropism rather than characteristic gout symptoms and elucidated genomic features and evolutionary relationships of GoAstV strains. These findings help advance our knowledge of the epidemiology and pathogenicity of GoAstV-1, and the predicted structure of capsid protein could serve as a potential target for designing novel antiviral drugs or vaccines against GoAstV-1.
Collapse
Affiliation(s)
- Guo Chen
- College of Animal and Veterinary Sciences, Southwest Minzu University, Chengdu, 610041, PR China
- Key Laboratory of Veterinary Medicine of Universities in Sichuan, Chengdu, 610041, PR China
| | - Lingdan Yin
- College of Animal and Veterinary Sciences, Southwest Minzu University, Chengdu, 610041, PR China
- Key Laboratory of Veterinary Medicine of Universities in Sichuan, Chengdu, 610041, PR China
| | - Huanrong Zhang
- College of Animal and Veterinary Sciences, Southwest Minzu University, Chengdu, 610041, PR China.
- Key Laboratory of Veterinary Medicine of Universities in Sichuan, Chengdu, 610041, PR China.
| |
Collapse
|
2
|
Roberts MD, Davis O, Josephs EB, Williamson RJ. K-mer-based Approaches to Bridging Pangenomics and Population Genetics. Mol Biol Evol 2025; 42:msaf047. [PMID: 40111256 PMCID: PMC11925024 DOI: 10.1093/molbev/msaf047] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/18/2024] [Revised: 01/10/2025] [Accepted: 02/04/2025] [Indexed: 03/12/2025] Open
Abstract
Many commonly studied species now have more than one chromosome-scale genome assembly, revealing a large amount of genetic diversity previously missed by approaches that map short reads to a single reference. However, many species still lack multiple reference genomes and correctly aligning references to build pangenomes can be challenging for many species, limiting our ability to study this missing genomic variation in population genetics. Here, we argue that k-mers are a very useful but underutilized tool for bridging the reference-focused paradigms of population genetics with the reference-free paradigms of pangenomics. We review current literature on the uses of k-mers for performing three core components of most population genetics analyses: identifying, measuring, and explaining patterns of genetic variation. We also demonstrate how different k-mer-based measures of genetic variation behave in population genetic simulations according to the choice of k, depth of sequencing coverage, and degree of data compression. Overall, we find that k-mer-based measures of genetic diversity scale consistently with pairwise nucleotide diversity (π) up to values of about π=0.025 (R2=0.97) for neutrally evolving populations. For populations with even more variation, using shorter k-mers will maintain the scalability up to at least π=0.1. Furthermore, in our simulated populations, k-mer dissimilarity values can be reliably approximated from counting bloom filters, highlighting a potential avenue to decreasing the memory burden of k-mer-based genomic dissimilarity analyses. For future studies, there is a great opportunity to further develop methods to identifying selected loci using k-mers.
Collapse
Affiliation(s)
- Miles D Roberts
- Genetics and Genome Sciences Program, Michigan State University, East Lansing, MI 48824, USA
| | - Olivia Davis
- Department of Computer Science and Software Engineering, Rose-Hulman Institute of Technology, Terre Haute, IN 47803, USA
| | - Emily B Josephs
- Department of Plant Biology, Michigan State University, East Lansing, MI 48824, USA
- Ecology, Evolution, and Behavior Program, Michigan State University, East Lansing, MI 48824, USA
- Plant Resilience Institute, Michigan State University, East Lansing, MI 48824, USA
| | - Robert J Williamson
- Department of Computer Science and Software Engineering, Rose-Hulman Institute of Technology, Terre Haute, IN 47803, USA
- Department of Biology and Biomedical Engineering, Rose-Hulman Institute of Technology, Terre Haute, IN 47803, USA
| |
Collapse
|
3
|
Boumajdi N, Bendani H, Belyamani L, Ibrahimi A. TreeWave: command line tool for alignment-free phylogeny reconstruction based on graphical representation of DNA sequences and genomic signal processing. BMC Bioinformatics 2024; 25:367. [PMID: 39604838 PMCID: PMC11600722 DOI: 10.1186/s12859-024-05992-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/10/2024] [Accepted: 11/18/2024] [Indexed: 11/29/2024] Open
Abstract
BACKGROUND Genomic sequence similarity comparison is a crucial research area in bioinformatics. Multiple Sequence Alignment (MSA) is the basic technique used to identify regions of similarity between sequences, although MSA tools are widely used and highly accurate, they are often limited by computational complexity, and inaccuracies when handling highly divergent sequences, which leads to the development of alignment-free (AF) algorithms. RESULTS This paper presents TreeWave, a novel AF approach based on frequency chaos game representation and discrete wavelet transform of sequences for phylogeny inference. We validate our method on various genomic datasets such as complete virus genome sequences, bacteria genome sequences, human mitochondrial genome sequences, and rRNA gene sequences. Compared to classical methods, our tool demonstrates a significant reduction in running time, especially when analyzing large datasets. The resulting phylogenetic trees show that TreeWave has similar classification accuracy to the classical MSA methods based on the normalized Robinson-Foulds distances and Baker's Gamma coefficients. CONCLUSIONS TreeWave is an open source and user-friendly command line tool for phylogeny reconstruction. It is a faster and more scalable tool that prioritizes computational efficiency while maintaining accuracy. TreeWave is freely available at https://github.com/nasmaB/TreeWave .
Collapse
Affiliation(s)
- Nasma Boumajdi
- Laboratory of Biotechnology (MedBiotech), Rabat Medical & Pharmacy School, Bioinova Research Center, Mohammed V University in Rabat, Rabat, Morocco
| | - Houda Bendani
- Laboratory of Biotechnology (MedBiotech), Rabat Medical & Pharmacy School, Bioinova Research Center, Mohammed V University in Rabat, Rabat, Morocco
| | - Lahcen Belyamani
- Mohammed VI Center for Research and Innovation (CM6), Rabat, Morocco
- Mohammed VI University of Sciences and Health (UM6SS), Casablanca, Morocco
- Emergency Department, Military Hospital Mohammed V, Rabat Medical and Pharmacy School, Mohammed V University, Rabat, Morocco
| | - Azeddine Ibrahimi
- Laboratory of Biotechnology (MedBiotech), Rabat Medical & Pharmacy School, Bioinova Research Center, Mohammed V University in Rabat, Rabat, Morocco.
| |
Collapse
|
4
|
Tharanga S, Ünlü ES, Hu Y, Sjaugi MF, Çelik MA, Hekimoğlu H, Miotto O, Öncel MM, Khan AM. DiMA: sequence diversity dynamics analyser for viruses. Brief Bioinform 2024; 26:bbae607. [PMID: 39592151 PMCID: PMC11596295 DOI: 10.1093/bib/bbae607] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/21/2024] [Revised: 09/22/2024] [Accepted: 11/13/2024] [Indexed: 11/28/2024] Open
Abstract
Sequence diversity is one of the major challenges in the design of diagnostic, prophylactic, and therapeutic interventions against viruses. DiMA is a novel tool that is big data-ready and designed to facilitate the dissection of sequence diversity dynamics for viruses. DiMA stands out from other diversity analysis tools by offering various unique features. DiMA provides a quantitative overview of sequence (DNA/RNA/protein) diversity by use of Shannon's entropy corrected for size bias, applied via a user-defined k-mer sliding window to an input alignment file, and each k-mer position is dissected to various diversity motifs. The motifs are defined based on the probability of distinct sequences at a given k-mer alignment position, whereby an index is the predominant sequence, while all the others are (total) variants to the index. The total variants are sub-classified into the major (most common) variant, minor variants (occurring more than once and of incidence lower than the major), and the unique (singleton) variants. DiMA allows user-defined, sequence metadata enrichment for analyses of the motifs. The application of DiMA was demonstrated for the alignment data of the relatively conserved Spike protein (2,106,985 sequences) of the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) and the relatively highly diverse pol gene (2637) of the human immunodeficiency virus-1 (HIV-1). The tool is publicly available as a web server (https://dima.bezmialem.edu.tr), as a Python library (via PyPi) and as a command line client (via GitHub).
Collapse
Affiliation(s)
- Shan Tharanga
- Centre for Bioinformatics, School of Data Sciences, Perdana University, MAEPS Building, Jalan MAEPS Perdana, Serdang, Kuala Lumpur 50490, Malaysia
| | - Eyyüb Selim Ünlü
- Istanbul Faculty of Medicine, Istanbul University, Turgut Özal Millet St, Topkapi, Istanbul 34093, Türkiye
- Genome Surveillance Unit, Wellcome Sanger Institute, Mill Ln, Hinxton, Saffron Walden CB10 1SA, United Kingdom
| | - Yongli Hu
- Centre for Bioinformatics, School of Data Sciences, Perdana University, MAEPS Building, Jalan MAEPS Perdana, Serdang, Kuala Lumpur 50490, Malaysia
- Department of Biochemistry, Yong Loo Lin School of Medicine, National University of Singapore, 8 Medical Drive, Singapore 117597, Singapore
| | - Muhammad Farhan Sjaugi
- Centre for Bioinformatics, School of Data Sciences, Perdana University, MAEPS Building, Jalan MAEPS Perdana, Serdang, Kuala Lumpur 50490, Malaysia
| | - Muhammet A Çelik
- Celik Sarayı, Yeni Elektrik Santral St. No:29/2, Meram, Konya 42090, Türkiye
| | - Hilal Hekimoğlu
- Beykoz Institute of Life Sciences and Biotechnology, Bezmialem Vakif University, Ali Ihsan Kalmaz St., No.10 Beykoz, Istanbul 34820, Türkiye
| | - Olivo Miotto
- Nuffield Department of Clinical Medicine, University of Oxford, Old Road, Old Road Campus, Oxford OX3 7LF, United Kingdom
- Mahidol-Oxford Tropical Medicine Research Unit, Faculty of Tropical Medicine, Mahidol University, 420/6 Ratchawithi Rd., Ratchathewi District, Bangkok 10400, Thailand
| | - Muhammed Miran Öncel
- Beykoz Institute of Life Sciences and Biotechnology, Bezmialem Vakif University, Ali Ihsan Kalmaz St., No.10 Beykoz, Istanbul 34820, Türkiye
| | - Asif M Khan
- Centre for Bioinformatics, School of Data Sciences, Perdana University, MAEPS Building, Jalan MAEPS Perdana, Serdang, Kuala Lumpur 50490, Malaysia
- Beykoz Institute of Life Sciences and Biotechnology, Bezmialem Vakif University, Ali Ihsan Kalmaz St., No.10 Beykoz, Istanbul 34820, Türkiye
- College of Computing and Information Technology, University of Doha for Science and Technology, Jelaiah Street, Duhail North, Doha, Qatar
| |
Collapse
|
5
|
Islam R, Rahman A. An alignment-free method for detection of missing regions for phylogenetic analysis. Heliyon 2024; 10:e32227. [PMID: 38933968 PMCID: PMC11200290 DOI: 10.1016/j.heliyon.2024.e32227] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/17/2024] [Revised: 05/17/2024] [Accepted: 05/29/2024] [Indexed: 06/28/2024] Open
Abstract
Phylogenetic tree estimation using conventional approaches usually requires pairwise or multiple sequence alignment. However, sequence alignment has difficulties related to scalability and accuracy in case of long sequences such as whole genomes, low sequence identity, and in presence of genomic rearrangements. To address these issues, alignment-free approaches have been proposed. While these methods have demonstrated promising results, many of these lead to errors when regions are missing from the sequences of one or more species that are trivially detected in alignment-based methods. Here, we present an alignment-free method for detecting missing regions in sequences of species for which phylogeny is to be estimated. It is based on counts of k-mers and can be used to filter out k-mers belonging to regions in one species that are missing in one or more of the other species. We perform experiments with real and simulated datasets containing missing regions and find that it can successfully detect a large fraction of such k-mers and can lead to improvements in the estimated phylogenies. Our method can be used in k-mer based alignment-free phylogeny estimation methods to filter out k-mers corresponding to missing regions.
Collapse
Affiliation(s)
- Rubyeat Islam
- Department of Computer Science and Engineering, Military Institute of Science and Technology, Dhaka, Bangladesh
| | - Atif Rahman
- Department of Computer Science and Engineering, Bangladesh University of Engineering and Technology, Dhaka, Bangladesh
| |
Collapse
|
6
|
Achouri E, Felt SA, Hackbart M, Rivera-Espinal NS, López CB. VODKA2: a fast and accurate method to detect non-standard viral genomes from large RNA-seq data sets. RNA (NEW YORK, N.Y.) 2023; 30:16-25. [PMID: 37891004 PMCID: PMC10726161 DOI: 10.1261/rna.079747.123] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Received: 06/15/2023] [Accepted: 10/12/2023] [Indexed: 10/29/2023]
Abstract
During viral replication, viruses carrying an RNA genome produce non-standard viral genomes (nsVGs), including copy-back viral genomes (cbVGs) and deletion viral genomes (delVGs), that play a crucial role in regulating viral replication and pathogenesis. Because of their critical roles in determining the outcome of RNA virus infections, the study of nsVGs has flourished in recent years, exposing a need for bioinformatic tools that can accurately identify them within next-generation sequencing data obtained from infected samples. Here, we present our data analysis pipeline, Viral Opensource DVG Key Algorithm 2 (VODKA2), that is optimized to run on a parallel computing environment for fast and accurate detection of nsVGs from large data sets.
Collapse
Affiliation(s)
- Emna Achouri
- Department of Molecular Microbiology and Center for Women's Infectious Disease Research, Washington University School of Medicine, St. Louis, Missouri 63110, USA
| | - Sébastien A Felt
- Department of Molecular Microbiology and Center for Women's Infectious Disease Research, Washington University School of Medicine, St. Louis, Missouri 63110, USA
| | - Matthew Hackbart
- Department of Molecular Microbiology and Center for Women's Infectious Disease Research, Washington University School of Medicine, St. Louis, Missouri 63110, USA
| | - Nicole S Rivera-Espinal
- Department of Molecular Microbiology and Center for Women's Infectious Disease Research, Washington University School of Medicine, St. Louis, Missouri 63110, USA
| | - Carolina B López
- Department of Molecular Microbiology and Center for Women's Infectious Disease Research, Washington University School of Medicine, St. Louis, Missouri 63110, USA
| |
Collapse
|
7
|
Perico CP, De Pierri CR, Neto GP, Fernandes DR, Pedrosa FO, de Souza EM, Raittz RT. Genomic landscape of the SARS-CoV-2 pandemic in Brazil suggests an external P.1 variant origin. Front Microbiol 2022; 13:1037455. [PMID: 36620039 PMCID: PMC9814972 DOI: 10.3389/fmicb.2022.1037455] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/05/2022] [Accepted: 12/01/2022] [Indexed: 12/24/2022] Open
Abstract
Brazil was the epicenter of worldwide pandemics at the peak of its second wave. The genomic/proteomic perspective of the COVID-19 pandemic in Brazil could provide insights to understand the global pandemics behavior. In this study, we track SARS-CoV-2 molecular information in Brazil using real-time bioinformatics and data science strategies to provide a comparative and evolutive panorama of the lineages in the country. SWeeP vectors represented the Brazilian and worldwide genomic/proteomic data from Global Initiative on Sharing Avian Influenza Data (GISAID) between February 2020 and August 2021. Clusters were analyzed and compared with PANGO lineages. Hierarchical clustering provided phylogenetic and evolutionary analyses of the lineages, and we tracked the P.1 (Gamma) variant origin. The genomic diversity based on Chao's estimation allowed us to compare richness and coverage among Brazilian states and other representative countries. We found that epidemics in Brazil occurred in two moments with different genetic profiles. The P.1 lineages emerged in the second wave, which was more aggressive. We could not trace the origin of P.1 from the variants present in Brazil. Instead, we found evidence pointing to its external source and a possible recombinant event that may relate P.1 to a B.1.1.28 variant subset. We discussed the potential application of the pipeline for emerging variants detection and the PANGO terminology stability over time. The diversity analysis showed that the low coverage and unbalanced sequencing among states in Brazil could have allowed the silent entry and dissemination of P.1 and other dangerous variants. This study may help to understand the development and consequences of variants of concern (VOC) entry.
Collapse
Affiliation(s)
- Camila P Perico
- Laboratory of Artificial Intelligence Applied to Bioinformatics, Professional and Technological Education Sector (SEPT), Federal University of Paraná, Curitiba, Brazil
- Graduate Program in Bioinformatics, Professional and Technological Education Sector (SEPT), Federal University of Paraná, Curitiba, Brazil
| | - Camilla R De Pierri
- Laboratory of Artificial Intelligence Applied to Bioinformatics, Professional and Technological Education Sector (SEPT), Federal University of Paraná, Curitiba, Brazil
- Department of Biochemistry and Molecular Biology, Federal University of Paraná, Curitiba, Brazil
| | - Giuseppe Pasqualato Neto
- Laboratory of Artificial Intelligence Applied to Bioinformatics, Professional and Technological Education Sector (SEPT), Federal University of Paraná, Curitiba, Brazil
| | - Danrley R Fernandes
- Laboratory of Artificial Intelligence Applied to Bioinformatics, Professional and Technological Education Sector (SEPT), Federal University of Paraná, Curitiba, Brazil
- Graduate Program in Bioinformatics, Professional and Technological Education Sector (SEPT), Federal University of Paraná, Curitiba, Brazil
| | - Fabio O Pedrosa
- Graduate Program in Bioinformatics, Professional and Technological Education Sector (SEPT), Federal University of Paraná, Curitiba, Brazil
- Department of Biochemistry and Molecular Biology, Federal University of Paraná, Curitiba, Brazil
| | - Emanuel M de Souza
- Graduate Program in Bioinformatics, Professional and Technological Education Sector (SEPT), Federal University of Paraná, Curitiba, Brazil
- Department of Biochemistry and Molecular Biology, Federal University of Paraná, Curitiba, Brazil
| | - Roberto T Raittz
- Laboratory of Artificial Intelligence Applied to Bioinformatics, Professional and Technological Education Sector (SEPT), Federal University of Paraná, Curitiba, Brazil
- Graduate Program in Bioinformatics, Professional and Technological Education Sector (SEPT), Federal University of Paraná, Curitiba, Brazil
| |
Collapse
|
8
|
Coombes CE, Coombes KR, Fareed N. Sequences of Events from the Electronic Medical Record and the Onset of Infection. Chem Biodivers 2022; 19:e202200657. [PMID: 36216587 DOI: 10.1002/cbdv.202200657] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/13/2022] [Accepted: 09/15/2022] [Indexed: 11/06/2022]
Abstract
We present a novel model of time-series analysis to learn from electronic health record (EHR) data when infection occurred in the intensive care unit (ICU) by translating methods from proteomics and Bayesian statistics. Using 48,536 patients hospitalized in an ICU, we describe each hospital course as an 'alphabet' of 23 physician actions ('events') in temporal order. We analyze these as k-mers of length 3-12 events and apply a Bayesian model of (cumulative) relative risk (RR). The log2-transformed RR (median=0.248, mean=0.226) supported the conclusion that the events selected were individually associated with increased risk of infection. Selecting from all possible cutoffs of maximum gain (MG), MG>0.0244 predicts administration of antibiotics with PPV 82.0 %, NPV 44.4 %, and AUC 0.706. Our approach holds value for retrospective analysis of other clinical syndromes for which time-of-onset is critical to analysis but poorly marked in EHRs, including delirium and decompensation.
Collapse
Affiliation(s)
- Caitlin E Coombes
- Department of Anesthesiology, Stanford University, 300 Pasteur Dr., Palo Alto, CA 94305, USA
| | - Kevin R Coombes
- Department of Population Health Sciences, Medical College of Georgia, 1420 Laney Walker Blvd, Augusta, GA 30912, USA
| | - Naleef Fareed
- Department of Biomedical Informatics, The Ohio State University College of Medicine, 370 W 9th Ave, Columbus, OH 43210, USA
| |
Collapse
|
9
|
Silva JM, Pratas D, Caetano T, Matos S. The complexity landscape of viral genomes. Gigascience 2022; 11:giac079. [PMID: 35950839 PMCID: PMC9366995 DOI: 10.1093/gigascience/giac079] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/03/2022] [Revised: 05/25/2022] [Accepted: 07/26/2022] [Indexed: 12/11/2022] Open
Abstract
BACKGROUND Viruses are among the shortest yet highly abundant species that harbor minimal instructions to infect cells, adapt, multiply, and exist. However, with the current substantial availability of viral genome sequences, the scientific repertory lacks a complexity landscape that automatically enlights viral genomes' organization, relation, and fundamental characteristics. RESULTS This work provides a comprehensive landscape of the viral genome's complexity (or quantity of information), identifying the most redundant and complex groups regarding their genome sequence while providing their distribution and characteristics at a large and local scale. Moreover, we identify and quantify inverted repeats abundance in viral genomes. For this purpose, we measure the sequence complexity of each available viral genome using data compression, demonstrating that adequate data compressors can efficiently quantify the complexity of viral genome sequences, including subsequences better represented by algorithmic sources (e.g., inverted repeats). Using a state-of-the-art genomic compressor on an extensive viral genomes database, we show that double-stranded DNA viruses are, on average, the most redundant viruses while single-stranded DNA viruses are the least. Contrarily, double-stranded RNA viruses show a lower redundancy relative to single-stranded RNA. Furthermore, we extend the ability of data compressors to quantify local complexity (or information content) in viral genomes using complexity profiles, unprecedently providing a direct complexity analysis of human herpesviruses. We also conceive a features-based classification methodology that can accurately distinguish viral genomes at different taxonomic levels without direct comparisons between sequences. This methodology combines data compression with simple measures such as GC-content percentage and sequence length, followed by machine learning classifiers. CONCLUSIONS This article presents methodologies and findings that are highly relevant for understanding the patterns of similarity and singularity between viral groups, opening new frontiers for studying viral genomes' organization while depicting the complexity trends and classification components of these genomes at different taxonomic levels. The whole study is supported by an extensive website (https://asilab.github.io/canvas/) for comprehending the viral genome characterization using dynamic and interactive approaches.
Collapse
Affiliation(s)
- Jorge Miguel Silva
- Institute of Electronics and Informatics Engineering of Aveiro, University of Aveiro, Campus Universitário de Santiago, 3810-193 Aveiro, Portugal
| | - Diogo Pratas
- Institute of Electronics and Informatics Engineering of Aveiro, University of Aveiro, Campus Universitário de Santiago, 3810-193 Aveiro, Portugal
- Department of Electronics Telecommunications and Informatics, University of Aveiro, Campus Universitario de Santiago, 3810-193 Aveiro, Portugal
- Department of Virology, University of Helsinki, Haartmaninkatu 3, 00014 Helsinki, Finland
| | - Tânia Caetano
- Department of Biology, University of Aveiro, Campus Universitario de Santiago, 3810-193 Aveiro, Portugal
| | - Sérgio Matos
- Institute of Electronics and Informatics Engineering of Aveiro, University of Aveiro, Campus Universitário de Santiago, 3810-193 Aveiro, Portugal
- Department of Electronics Telecommunications and Informatics, University of Aveiro, Campus Universitario de Santiago, 3810-193 Aveiro, Portugal
| |
Collapse
|
10
|
Liu S, Koslicki D. CMash: fast, multi-resolution estimation of k-mer-based Jaccard and containment indices. Bioinformatics 2022; 38:i28-i35. [PMID: 35758788 PMCID: PMC9235470 DOI: 10.1093/bioinformatics/btac237] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/29/2022] Open
Abstract
Motivation K-mer-based methods are used ubiquitously in the field of computational biology. However, determining the optimal value of k for a specific application often remains heuristic. Simply reconstructing a new k-mer set with another k-mer size is computationally expensive, especially in metagenomic analysis where datasets are large. Here, we introduce a hashing-based technique that leverages a kind of bottom-m sketch as well as a k-mer ternary search tree (KTST) to obtain k-mer-based similarity estimates for a range of k values. By truncating k-mers stored in a pre-built KTST with a large k=kmax value, we can simultaneously obtain k-mer-based estimates for all k values up to kmax. This truncation approach circumvents the reconstruction of new k-mer sets when changing k values, making analysis more time and space-efficient. Results We derived the theoretical expression of the bias factor due to truncation. And we showed that the biases are negligible in practice: when using a KTST to estimate the containment index between a RefSeq-based microbial reference database and simulated metagenome data for 10 values of k, the running time was close to 10× faster compared to a classic MinHash approach while using less than one-fifth the space to store the data structure. Availability and implementation A python implementation of this method, CMash, is available at https://github.com/dkoslicki/CMash. The reproduction of all experiments presented herein can be accessed via https://github.com/KoslickiLab/CMASH-reproducibles. Supplementary information Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
- Shaopeng Liu
- Huck Institutes of Life Sciences, Pennsylvania State University, State College, PA 16801, USA
| | - David Koslicki
- Huck Institutes of Life Sciences, Pennsylvania State University, State College, PA 16801, USA.,Department of Computer Science and Engineering, Pennsylvania State University, State College, PA 16801, USA.,Department of Biology, Pennsylvania State University, State College, PA 16801, USA
| |
Collapse
|
11
|
Moore MP, Wilcox MH, Walker AS, Eyre DW. K-mer based prediction of Clostridioides difficile relatedness and ribotypes. Microb Genom 2022; 8. [PMID: 35384833 PMCID: PMC9453075 DOI: 10.1099/mgen.0.000804] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/24/2022] Open
Abstract
Comparative analysis of Clostridioides difficile whole-genome sequencing (WGS) data enables fine scaled investigation of transmission and is increasingly becoming part of routine surveillance. However, these analyses are constrained by the computational requirements of the large volumes of data involved. By decomposing WGS reads or assemblies into k-mers and using the dimensionality reduction technique MinHash, it is possible to rapidly approximate genomic distances without alignment. Here we assessed the performance of MinHash, as implemented by sourmash, in predicting single nucleotide differences between genomes (SNPs) and C. difficile ribotypes (RTs). For a set of 1905 diverse C. difficile genomes (differing by 0–168 519 SNPs), using sourmash to screen for closely related genomes, at a sensitivity of 100 % for pairs ≤10 SNPs, sourmash reduced the number of pairs from 1 813 560 overall to 161 934, i.e. by 91 %, with a positive predictive value of 32 % to correctly identify pairs ≤10 SNPs (maximum SNP distance 4144). At a sensitivity of 95 %, pairs were reduced by 94 % to 108 266 and PPV increased to 45 % (maximum SNP distance 1009). Increasing the MinHash sketch size above 2000 produced minimal performance improvement. We also explored a MinHash similarity-based ribotype prediction method. Genomes with known ribotypes (n=3937) were split into a training set (2937) and test set (1000) randomly. The training set was used to construct a sourmash index against which genomes from the test set were compared. If the closest five genomes in the index had the same ribotype this was taken to predict the searched genome’s ribotype. Using our MinHash ribotype index, predicted ribotypes were correct in 780/1000 (78 %) genomes, incorrect in 20 (2 %), and indeterminant in 200 (20 %). Relaxing the classifier to 4/5 closest matches with the same RT improved the correct predictions to 87 %. Using MinHash it is possible to subsample C. difficile genome k-mer hashes and use them to approximate small genomic differences within minutes, significantly reducing the search space for further analysis.
Collapse
Affiliation(s)
- Matthew Phillip Moore
- Big Data Institute, Nuffield Department of Population Health, University of Oxford, Oxford, UK.,Nuffield Department of Medicine, University of Oxford, Oxford, UK.,NIHR Oxford Biomedical Research Centre, University of Oxford, Oxford, UK
| | - Mark H Wilcox
- Healthcare Associated Infection Research Group, Leeds Teaching Hospitals NHS Trust and University of Leeds, Leeds, UK
| | - A Sarah Walker
- Nuffield Department of Medicine, University of Oxford, Oxford, UK.,NIHR Oxford Biomedical Research Centre, University of Oxford, Oxford, UK.,NIHR Health Protection Research Unit in Healthcare Associated Infections and Antimicrobial Resistance at University of Oxford in partnership with Public Health England, Oxford, UK
| | - David W Eyre
- Big Data Institute, Nuffield Department of Population Health, University of Oxford, Oxford, UK.,NIHR Oxford Biomedical Research Centre, University of Oxford, Oxford, UK.,NIHR Health Protection Research Unit in Healthcare Associated Infections and Antimicrobial Resistance at University of Oxford in partnership with Public Health England, Oxford, UK
| |
Collapse
|
12
|
Dougan KE, González-Pech RA, Stephens TG, Shah S, Chen Y, Ragan MA, Bhattacharya D, Chan CX. Genome-powered classification of microbial eukaryotes: focus on coral algal symbionts. Trends Microbiol 2022; 30:831-840. [DOI: 10.1016/j.tim.2022.02.001] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/10/2021] [Revised: 01/20/2022] [Accepted: 02/01/2022] [Indexed: 12/20/2022]
|
13
|
Aguilar-Valdez S, Morales JA, Paredes O. Unraveling the hCoV-19 Informational Architecture. ANNUAL INTERNATIONAL CONFERENCE OF THE IEEE ENGINEERING IN MEDICINE AND BIOLOGY SOCIETY. IEEE ENGINEERING IN MEDICINE AND BIOLOGY SOCIETY. ANNUAL INTERNATIONAL CONFERENCE 2021; 2021:2392-2395. [PMID: 34891763 DOI: 10.1109/embc46164.2021.9630954] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/14/2023]
Abstract
The hCoV-19 virus is continuously evolving to highly infectious and lethal variants. There is a latent risk that current vaccines will not be effective over these novel variants. This entails comprehending the genome-wide viral information to unveil mutagenic mechanisms of hCoV-19. To date, this virus is studied as a collection of non-related variants, making it challenging to forecast hotspots and their upcoming effects. In this work, we explore genome-wide information to disentangle informational mechanisms that lead to insights into viral mutagenicity. Towards this aim, we modeled informational compartments based on a topic-free-alignment workflow. These compartments illustrate that hCoV-19 has a complex informational architecture that addresses high-level virus phenomena, i.e., mutagenicity. This new framework represents the first step towards identifying the virus mutagenicity leading to the development of all-variants-effective vaccines.
Collapse
|
14
|
Chong LC, Lim WL, Ban KHK, Khan AM. An Alignment-Independent Approach for the Study of Viral Sequence Diversity at Any Given Rank of Taxonomy Lineage. BIOLOGY 2021; 10:biology10090853. [PMID: 34571730 PMCID: PMC8466476 DOI: 10.3390/biology10090853] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 06/11/2021] [Revised: 08/13/2021] [Accepted: 08/19/2021] [Indexed: 11/16/2022]
Abstract
The study of viral diversity is imperative in understanding sequence change and its implications for intervention strategies. The widely used alignment-dependent approaches to study viral diversity are limited in their utility as sequence dissimilarity increases, particularly when expanded to the genus or higher ranks of viral species lineage. Herein, we present an alignment-independent algorithm, implemented as a tool, UNIQmin, to determine the effective viral sequence diversity at any rank of the viral taxonomy lineage. This is done by performing an exhaustive search to generate the minimal set of sequences for a given viral non-redundant sequence dataset. The minimal set is comprised of the smallest possible number of unique sequences required to capture the diversity inherent in the complete set of overlapping k-mers encoded by all the unique sequences in the given dataset. Such dataset compression is possible through the removal of unique sequences, whose entire repertoire of overlapping k-mers can be represented by other sequences, thus rendering them redundant to the collective pool of sequence diversity. A significant reduction, namely ~44%, ~45%, and ~53%, was observed for all reported unique sequences of species Dengue virus, genus Flavivirus, and family Flaviviridae, respectively, while still capturing the entire repertoire of nonamer (9-mer) viral peptidome diversity present in the initial input dataset. The algorithm is scalable for big data as it was applied to ~2.2 million non-redundant sequences of all reported viruses. UNIQmin is open source and publicly available on GitHub. The concept of a minimal set is generic and, thus, potentially applicable to other pathogenic microorganisms of non-viral origin, such as bacteria.
Collapse
Affiliation(s)
- Li Chuin Chong
- Centre for Bioinformatics, School of Data Sciences, Perdana University, Kuala Lumpur 50490, Malaysia;
| | - Wei Lun Lim
- Faculty of Computing and Informatics, Multimedia University, Cyberjaya 63100, Malaysia;
| | - Kenneth Hon Kim Ban
- Department of Biochemistry, Yong Loo Lin School of Medicine, National University of Singapore, Singapore 117596, Singapore;
| | - Asif M. Khan
- Centre for Bioinformatics, School of Data Sciences, Perdana University, Kuala Lumpur 50490, Malaysia;
- Beykoz Institute of Life Sciences and Biotechnology, Bezmialem Vakif University, Beykoz, 34820 Istanbul, Turkey
- Correspondence: or
| |
Collapse
|
15
|
Bakhtiari M, Park J, Ding YC, Shleizer-Burko S, Neuhausen SL, Halldórsson BV, Stefánsson K, Gymrek M, Bafna V. Variable number tandem repeats mediate the expression of proximal genes. Nat Commun 2021; 12:2075. [PMID: 33824302 PMCID: PMC8024321 DOI: 10.1038/s41467-021-22206-z] [Citation(s) in RCA: 53] [Impact Index Per Article: 13.3] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/13/2020] [Accepted: 02/17/2021] [Indexed: 12/12/2022] Open
Abstract
Variable number tandem repeats (VNTRs) account for significant genetic variation in many organisms. In humans, VNTRs have been implicated in both Mendelian and complex disorders, but are largely ignored by genomic pipelines due to the complexity of genotyping and the computational expense. We describe adVNTR-NN, a method that uses shallow neural networks to genotype a VNTR in 18 seconds on 55X whole genome data, while maintaining high accuracy. We use adVNTR-NN to genotype 10,264 VNTRs in 652 GTEx individuals. Associating VNTR length with gene expression in 46 tissues, we identify 163 "eVNTRs". Of the 22 eVNTRs in blood where independent data is available, 21 (95%) are replicated in terms of significance and direction of association. 49% of the eVNTR loci show a strong and likely causal impact on the expression of genes and 80% have maximum effect size at least 0.3. The impacted genes are involved in diseases including Alzheimer's, obesity and familial cancers, highlighting the importance of VNTRs for understanding the genetic basis of complex diseases.
Collapse
Affiliation(s)
- Mehrdad Bakhtiari
- Department of Computer Science & Engineering, University of California, San Diego, La Jolla, CA, USA
| | - Jonghun Park
- Department of Computer Science & Engineering, University of California, San Diego, La Jolla, CA, USA
| | - Yuan-Chun Ding
- Department of Population Sciences, Beckman Research Institute of City of Hope, Duarte, CA, USA
| | | | - Susan L Neuhausen
- Department of Population Sciences, Beckman Research Institute of City of Hope, Duarte, CA, USA
| | | | | | - Melissa Gymrek
- Department of Computer Science & Engineering, University of California, San Diego, La Jolla, CA, USA
- Department of Medicine, University of California, San Diego, La Jolla, CA, USA
| | - Vineet Bafna
- Department of Computer Science & Engineering, University of California, San Diego, La Jolla, CA, USA.
| |
Collapse
|
16
|
Acera Mateos P, Balboa RF, Easteal S, Eyras E, Patel HR. PACIFIC: a lightweight deep-learning classifier of SARS-CoV-2 and co-infecting RNA viruses. Sci Rep 2021; 11:3209. [PMID: 33547380 PMCID: PMC7864945 DOI: 10.1038/s41598-021-82043-4] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.5] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/26/2020] [Accepted: 01/12/2021] [Indexed: 01/30/2023] Open
Abstract
Viral co-infections occur in COVID-19 patients, potentially impacting disease progression and severity. However, there is currently no dedicated method to identify viral co-infections in patient RNA-seq data. We developed PACIFIC, a deep-learning algorithm that accurately detects SARS-CoV-2 and other common RNA respiratory viruses from RNA-seq data. Using in silico data, PACIFIC recovers the presence and relative concentrations of viruses with > 99% precision and recall. PACIFIC accurately detects SARS-CoV-2 and other viral infections in 63 independent in vitro cell culture and patient datasets. PACIFIC is an end-to-end tool that enables the systematic monitoring of viral infections in the current global pandemic.
Collapse
Affiliation(s)
- Pablo Acera Mateos
- John Curtin School of Medical Research, Australian National University, Canberra, ACT 2600 Australia
- EMBL Australia Partner Laboratory Network at the Australian National University, Canberra, ACT 2600 Australia
| | - Renzo F. Balboa
- John Curtin School of Medical Research, Australian National University, Canberra, ACT 2600 Australia
- National Centre for Indigenous Genomics, Australian National University, Canberra, ACT 2600 Australia
| | - Simon Easteal
- John Curtin School of Medical Research, Australian National University, Canberra, ACT 2600 Australia
- National Centre for Indigenous Genomics, Australian National University, Canberra, ACT 2600 Australia
| | - Eduardo Eyras
- John Curtin School of Medical Research, Australian National University, Canberra, ACT 2600 Australia
- EMBL Australia Partner Laboratory Network at the Australian National University, Canberra, ACT 2600 Australia
- IMIM - Hospital del Mar Medical Research Institute, 08003 Barcelona, Spain
- Catalan Institution for Research and Advanced Studies, 08010 Barcelona, Spain
| | - Hardip R. Patel
- John Curtin School of Medical Research, Australian National University, Canberra, ACT 2600 Australia
- National Centre for Indigenous Genomics, Australian National University, Canberra, ACT 2600 Australia
| |
Collapse
|
17
|
Yang Z, Li H, Jia Y, Zheng Y, Meng H, Bao T, Li X, Luo L. Intrinsic laws of k-mer spectra of genome sequences and evolution mechanism of genomes. BMC Evol Biol 2020; 20:157. [PMID: 33228538 PMCID: PMC7684957 DOI: 10.1186/s12862-020-01723-3] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.4] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/04/2020] [Accepted: 11/10/2020] [Indexed: 11/17/2022] Open
Abstract
Background K-mer spectra of DNA sequences contain important information about sequence composition and sequence evolution. We want to reveal the evolution rules of genome sequences by studying the k-mer spectra of genome sequences. Results The intrinsic laws of k-mer spectra of 920 genome sequences from primate to prokaryote were analyzed. We found that there are two types of evolution selection modes in genome sequences, named as CG Independent Selection and TA Independent Selection. There is a mutual inhibition relationship between CG and TA independent selections. We found that the intensity of CG and TA independent selections correlates closely with genome evolution and G + C content of genome sequences. The living habits of species are related closely to the independent selection modes adopted by species genomes. Consequently, we proposed an evolution mechanism of genomes in which the genome evolution is determined by the intensities of the CG and TA independent selections and the mutual inhibition relationship. Besides, by the evolution mechanism of genomes, we speculated the evolution modes of prokaryotes in mild and extreme environments in the anaerobic age and the evolving process of prokaryotes from anaerobic to aerobic environment on earth as well as the originations of different eukaryotes. Conclusion We found that there are two independent selection modes in genome sequences. The evolution of genome sequence is determined by the two independent selection modes and the mutual inhibition relationship between them.
Collapse
Affiliation(s)
- Zhenhua Yang
- Laboratory of Theoretical Biophysics, School of Physical Science & Technology, Inner Mongolia University, Hohhot, 010021, China.,School of Economics and Management, Inner Mongolia University of Science & Technology, Baotou, 014010, China
| | - Hong Li
- Laboratory of Theoretical Biophysics, School of Physical Science & Technology, Inner Mongolia University, Hohhot, 010021, China.
| | - Yun Jia
- College of Science, Inner Mongolia University of Technology, Hohhot, 010051, China
| | - Yan Zheng
- Baotou Medical College, Inner Mongolia University of Science & Technology, Baotou, 014040, China
| | - Hu Meng
- School of Life Science & Technology, Inner Mongolia University of Science & Technology, Baotou, 014010, China
| | - Tonglaga Bao
- Laboratory of Theoretical Biophysics, School of Physical Science & Technology, Inner Mongolia University, Hohhot, 010021, China
| | - Xiaolong Li
- Laboratory of Theoretical Biophysics, School of Physical Science & Technology, Inner Mongolia University, Hohhot, 010021, China
| | - Liaofu Luo
- Laboratory of Theoretical Biophysics, School of Physical Science & Technology, Inner Mongolia University, Hohhot, 010021, China
| |
Collapse
|
18
|
Pornputtapong N, Acheampong DA, Patumcharoenpol P, Jenjaroenpun P, Wongsurawat T, Jun SR, Yongkiettrakul S, Chokesajjawatee N, Nookaew I. KITSUNE: A Tool for Identifying Empirically Optimal K-mer Length for Alignment-Free Phylogenomic Analysis. Front Bioeng Biotechnol 2020; 8:556413. [PMID: 33072720 PMCID: PMC7538862 DOI: 10.3389/fbioe.2020.556413] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/28/2020] [Accepted: 08/24/2020] [Indexed: 12/22/2022] Open
Abstract
Genomic DNA is the best “unique identifier” for organisms. Alignment-free phylogenomic analysis, simple, fast, and efficient method to compare genome sequences, relies on looking at the distribution of small DNA sequence of a particular length, referred to as k-mer. The k-mer approach has been explored as a basis for sequence analysis applications, including assembly, phylogenetic tree inference, and classification. Although this approach is not novel, selecting the appropriate k-mer length to obtain the optimal resolution is rather arbitrary. However, it is a very important parameter for achieving the appropriate resolution for genome/sequence distances to infer biologically meaningful phylogenetic relationships. Thus, there is a need for a systematic approach to identify the appropriate k-mer from whole-genome sequences. We present K-mer–length Iterative Selection for UNbiased Ecophylogenomics (KITSUNE), a tool for assessing the empirically optimal k-mer length of any given set of genomes of interest for phylogenomic analysis via a three-step approach based on (1) cumulative relative entropy (CRE), (2) average number of common features (ACF), and (3) observed common features (OCF). Using KITSUNE, we demonstrated the feasibility and reliability of these measurements to obtain empirically optimal k-mer lengths of 11, 17, and ∼34 from large genome datasets of viruses, bacteria, and fungi, respectively. Moreover, we demonstrated a feature of KITSUNE for accurate species identification for the two de novo assembled bacterial genomes derived from error-prone long-reads sequences, and for a published yeast genome. In addition, KITSUNE was used to identify the shortest species-specific k-mer accurately identifying viruses. KITSUNE is freely available at https://github.com/natapol/kitsune.
Collapse
Affiliation(s)
- Natapol Pornputtapong
- Department of Biochemistry and Microbiology, Faculty of Pharmaceutical Sciences, and Research Unit of DNA Barcoding of Thai Medicinal Plants, Chulalongkorn University, Bangkok, Thailand
| | - Daniel A Acheampong
- Department of Biomedical Informatics, University of Arkansas for Medical Sciences, Little Rock, AR, United States.,Joint Graduate Program in Bioinformatics, University of Arkansas at Little Rock and University of Arkansas for Medical Sciences, Little Rock, AR, United States
| | - Preecha Patumcharoenpol
- Department of Biomedical Informatics, University of Arkansas for Medical Sciences, Little Rock, AR, United States
| | - Piroon Jenjaroenpun
- Department of Biomedical Informatics, University of Arkansas for Medical Sciences, Little Rock, AR, United States
| | - Thidathip Wongsurawat
- Department of Biomedical Informatics, University of Arkansas for Medical Sciences, Little Rock, AR, United States
| | - Se-Ran Jun
- Department of Biomedical Informatics, University of Arkansas for Medical Sciences, Little Rock, AR, United States
| | - Suganya Yongkiettrakul
- National Center for Genetic Engineering and Biotechnology, National Science and Technology Development Agency, Pathum Thani, Thailand
| | - Nipa Chokesajjawatee
- National Center for Genetic Engineering and Biotechnology, National Science and Technology Development Agency, Pathum Thani, Thailand
| | - Intawat Nookaew
- Department of Biomedical Informatics, University of Arkansas for Medical Sciences, Little Rock, AR, United States
| |
Collapse
|
19
|
Wang W, Ren J, Tang K, Dart E, Ignacio-Espinoza JC, Fuhrman JA, Braun J, Sun F, Ahlgren NA. A network-based integrated framework for predicting virus-prokaryote interactions. NAR Genom Bioinform 2020; 2:lqaa044. [PMID: 32626849 PMCID: PMC7324143 DOI: 10.1093/nargab/lqaa044] [Citation(s) in RCA: 63] [Impact Index Per Article: 12.6] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/27/2019] [Revised: 03/12/2020] [Accepted: 06/05/2020] [Indexed: 12/12/2022] Open
Abstract
Metagenomic sequencing has greatly enhanced the discovery of viral genomic sequences; however, it remains challenging to identify the host(s) of these new viruses. We developed VirHostMatcher-Net, a flexible, network-based, Markov random field framework for predicting virus–prokaryote interactions using multiple, integrated features: CRISPR sequences and alignment-free similarity measures (\documentclass[12pt]{minimal}
\usepackage{amsmath}
\usepackage{wasysym}
\usepackage{amsfonts}
\usepackage{amssymb}
\usepackage{amsbsy}
\usepackage{upgreek}
\usepackage{mathrsfs}
\setlength{\oddsidemargin}{-69pt}
\begin{document}
}{}$s_2^*$\end{document} and WIsH). Evaluation of this method on a benchmark set of 1462 known virus–prokaryote pairs yielded host prediction accuracy of 59% and 86% at the genus and phylum levels, representing 16–27% and 6–10% improvement, respectively, over previous single-feature prediction approaches. We applied our host prediction tool to crAssphage, a human gut phage, and two metagenomic virus datasets: marine viruses and viral contigs recovered from globally distributed, diverse habitats. Host predictions were frequently consistent with those of previous studies, but more importantly, this new tool made many more confident predictions than previous tools, up to nearly 3-fold more (n > 27 000), greatly expanding the diversity of known virus–host interactions.
Collapse
Affiliation(s)
- Weili Wang
- Quantitative and Computational Biology Program, University of Southern California, Los Angeles, CA 90089, USA
| | - Jie Ren
- Quantitative and Computational Biology Program, University of Southern California, Los Angeles, CA 90089, USA
| | - Kujin Tang
- Quantitative and Computational Biology Program, University of Southern California, Los Angeles, CA 90089, USA
| | - Emily Dart
- Biology Department, Clark University, Worcester, MA 01610, USA
| | | | - Jed A Fuhrman
- Department of Biological Sciences, University of Southern California, Los Angeles, CA 90089, USA
| | - Jonathan Braun
- Inflammatory Bowel and Immunobiology Research Institute, Cedars-Sinai Medical Center, Los Angeles, CA 90048, USA
| | - Fengzhu Sun
- Quantitative and Computational Biology Program, University of Southern California, Los Angeles, CA 90089, USA
| | | |
Collapse
|
20
|
Young F, Rogers S, Robertson DL. Predicting host taxonomic information from viral genomes: A comparison of feature representations. PLoS Comput Biol 2020; 16:e1007894. [PMID: 32453718 PMCID: PMC7307784 DOI: 10.1371/journal.pcbi.1007894] [Citation(s) in RCA: 27] [Impact Index Per Article: 5.4] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/03/2019] [Revised: 06/22/2020] [Accepted: 04/21/2020] [Indexed: 12/13/2022] Open
Abstract
The rise in metagenomics has led to an exponential growth in virus discovery. However, the majority of these new virus sequences have no assigned host. Current machine learning approaches to predicting virus host interactions have a tendency to focus on nucleotide features, ignoring other representations of genomic information. Here we investigate the predictive potential of features generated from four different ‘levels’ of viral genome representation: nucleotide, amino acid, amino acid properties and protein domains. This more fully exploits the biological information present in the virus genomes. Over a hundred and eighty binary datasets for infecting versus non-infecting viruses at all taxonomic ranks of both eukaryote and prokaryote hosts were compiled. The viral genomes were converted into the four different levels of genome representation and twenty feature sets were generated by extracting k-mer compositions and predicted protein domains. We trained and tested Support Vector Machine, SVM, classifiers to compare the predictive capacity of each of these feature sets for each dataset. Our results show that all levels of genome representation are consistently predictive of host taxonomy and that prediction k-mer composition improves with increasing k-mer length for all k-mer based features. Using a phylogenetically aware holdout method, we demonstrate that the predictive feature sets contain signals reflecting both the evolutionary relationship between the viruses infecting related hosts, and host-mimicry. Our results demonstrate that incorporating a range of complementary features, generated purely from virus genome sequences, leads to improved accuracy for a range of virus host prediction tasks enabling computational assignment of host taxonomic information. Elucidating the host of a newly identified virus species is an important challenge, with applications from knowing the source species of a newly emerged pathogen to understanding the bacteriophage-host relationships within the microbiome of any of earth’s ecosystems. Current high throughput methods used to identify viruses within biological or environmental samples have resulted in an unprecedented increase in virus discovery. However, for the majority of these virus genomes the host species/taxonomic classification remains unknown. To address this gap in our knowledge there is a need for fast, accurate computational methods for the assignment of putative host taxonomic information. Machine learning is an ideal approach but to maximise predictive accuracy the viral genomes need to be represented in a format (sets of features) that makes the discriminative information available to the machine learning algorithm. Here, we compare different types of features derived from the same viral genomes for their ability to predict host information. Our results demonstrate that all these feature sets are predictive of host taxonomy and when combined have the potential to improve accuracy over the use of individual feature sets across many virus host prediction applications.
Collapse
Affiliation(s)
- Francesca Young
- MRC-University of Glasgow Centre For Virus Research, Glasgow, United Kingdom
| | - Simon Rogers
- School of Computing Science, University of Glasgow, Glasgow, United Kingdom
| | - David L. Robertson
- MRC-University of Glasgow Centre For Virus Research, Glasgow, United Kingdom
- * E-mail:
| |
Collapse
|
21
|
R-loop-forming Sequences Analysis in Thousands of Viral Genomes Identify A New Common Element in Herpesviruses. Sci Rep 2020; 10:6389. [PMID: 32286400 PMCID: PMC7156643 DOI: 10.1038/s41598-020-63101-9] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/18/2019] [Accepted: 03/20/2020] [Indexed: 11/16/2022] Open
Abstract
R-loops are RNA-DNA hybrid sequences that are emerging players in various biological processes, occurring in both prokaryotic and eukaryotic cells. In viruses, R-loop investigation is limited and functional importance is poorly understood. Here, we performed a computational approach to investigate prevalence, distribution, and location of R-loop forming sequences (RLFS) across more than 6000 viral genomes. A total of 14637 RLFS loci were identified in 1586 viral genomes. Over 70% of RLFS-positive genomes are dsDNA viruses. In the order Herpesvirales, RLFS were presented in all members whereas no RLFS was predicted in the order Ligamenvirales. Analysis of RLFS density in all RLFS-positive genomes revealed unusually high RLFS densities in herpesvirus genomes, with RLFS densities particularly enriched within repeat regions such as the terminal repeats (TRs). RLFS in TRs are positionally conserved between herpesviruses. Validating the computationally-identified RLFS, R-loop formation was experimentally confirmed in the TR and viral Bcl-2 promoter of Kaposi sarcoma-associated herpesvirus (KSHV). These predictions and validations support future analysis of RLFS in regulating the replication, transcription, and genome maintenance of herpesviruses.
Collapse
|
22
|
Borrayo E, May-Canche I, Paredes O, Morales JA, Romo-Vázquez R, Vélez-Pérez H. Whole-Genome k-mer Topic Modeling AssociatesBacterial Families. Genes (Basel) 2020; 11:genes11020197. [PMID: 32075081 PMCID: PMC7074292 DOI: 10.3390/genes11020197] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/07/2020] [Revised: 02/07/2020] [Accepted: 02/09/2020] [Indexed: 11/16/2022] Open
Abstract
Alignment-free k-mer-based algorithms in whole genome sequence comparisons remainan ongoing challenge. Here, we explore the possibility to use Topic Modeling for organismwhole-genome comparisons. We analyzed 30 complete genomes from three bacterial families bytopic modeling. For this, each genome was considered as a document and 13-mer nucleotiderepresentations as words. Latent Dirichlet allocation was used as the probabilistic modeling of thecorpus. We where able to identify the topic distribution among analyzed genomes, which is highlyconsistent with traditional hierarchical classification. It is possible that topic modeling may be appliedto establish relationships between genome's composition and biological phenomena.
Collapse
Affiliation(s)
- Ernesto Borrayo
- Electronics Department, CUCEI, Universidad de Guadalajara, Jalisco 44100, Mexico;
| | - Isaias May-Canche
- Computer Sciences Department, CUCEI, Universidad de Guadalajara, Jalisco 44100, Mexico; (I.M.-C.); (O.P.); (J.A.M.); (R.R.-V.)
- Instituto Tecnológico de Chetumal, Quintana Roo 77000, Mexico
| | - Omar Paredes
- Computer Sciences Department, CUCEI, Universidad de Guadalajara, Jalisco 44100, Mexico; (I.M.-C.); (O.P.); (J.A.M.); (R.R.-V.)
| | - J. Alejandro Morales
- Computer Sciences Department, CUCEI, Universidad de Guadalajara, Jalisco 44100, Mexico; (I.M.-C.); (O.P.); (J.A.M.); (R.R.-V.)
| | - Rebeca Romo-Vázquez
- Computer Sciences Department, CUCEI, Universidad de Guadalajara, Jalisco 44100, Mexico; (I.M.-C.); (O.P.); (J.A.M.); (R.R.-V.)
| | - Hugo Vélez-Pérez
- Computer Sciences Department, CUCEI, Universidad de Guadalajara, Jalisco 44100, Mexico; (I.M.-C.); (O.P.); (J.A.M.); (R.R.-V.)
- Correspondence:
| |
Collapse
|
23
|
Röhling S, Linne A, Schellhorn J, Hosseini M, Dencker T, Morgenstern B. The number of k-mer matches between two DNA sequences as a function of k and applications to estimate phylogenetic distances. PLoS One 2020; 15:e0228070. [PMID: 32040534 PMCID: PMC7010260 DOI: 10.1371/journal.pone.0228070] [Citation(s) in RCA: 20] [Impact Index Per Article: 4.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/07/2020] [Accepted: 01/08/2020] [Indexed: 12/14/2022] Open
Abstract
We study the number Nk of length-k word matches between pairs of evolutionarily related DNA sequences, as a function of k. We show that the Jukes-Cantor distance between two genome sequences-i.e. the number of substitutions per site that occurred since they evolved from their last common ancestor-can be estimated from the slope of a function F that depends on Nk and that is affine-linear within a certain range of k. Integers kmin and kmax can be calculated depending on the length of the input sequences, such that the slope of F in the relevant range can be estimated from the values F(kmin) and F(kmax). This approach can be generalized to so-called Spaced-word Matches (SpaM), where mismatches are allowed at positions specified by a user-defined binary pattern. Based on these theoretical results, we implemented a prototype software program for alignment-free sequence comparison called Slope-SpaM. Test runs on real and simulated sequence data show that Slope-SpaM can accurately estimate phylogenetic distances for distances up to around 0.5 substitutions per position. The statistical stability of our results is improved if spaced words are used instead of contiguous words. Unlike previous alignment-free methods that are based on the number of (spaced) word matches, Slope-SpaM produces accurate results, even if sequences share only local homologies.
Collapse
Affiliation(s)
- Sophie Röhling
- University of Göttingen, Department of Bioinformatics, Göttingen, Germany
| | - Alexander Linne
- University of Göttingen, Department of Bioinformatics, Göttingen, Germany
| | - Jendrik Schellhorn
- University of Göttingen, Department of Bioinformatics, Göttingen, Germany
| | | | - Thomas Dencker
- University of Göttingen, Department of Bioinformatics, Göttingen, Germany
| | - Burkhard Morgenstern
- University of Göttingen, Department of Bioinformatics, Göttingen, Germany
- Göttingen Center of Molecular Biosciences (GZMB), Göttingen, Germany
| |
Collapse
|
24
|
De Pierri CR, Voyceik R, Santos de Mattos LGC, Kulik MG, Camargo JO, Repula de Oliveira AM, de Lima Nichio BT, Marchaukoski JN, da Silva Filho AC, Guizelini D, Ortega JM, Pedrosa FO, Raittz RT. SWeeP: representing large biological sequences datasets in compact vectors. Sci Rep 2020; 10:91. [PMID: 31919449 PMCID: PMC6952362 DOI: 10.1038/s41598-019-55627-4] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/21/2019] [Accepted: 12/02/2019] [Indexed: 12/25/2022] Open
Abstract
Vectoral and alignment-free approaches to biological sequence representation have been explored in bioinformatics to efficiently handle big data. Even so, most current methods involve sequence comparisons via alignment-based heuristics and fail when applied to the analysis of large data sets. Here, we present “Spaced Words Projection (SWeeP)”, a method for representing biological sequences using relatively small vectors while preserving intersequence comparability. SWeeP uses spaced-words by scanning the sequences and generating indices to create a higher-dimensional vector that is later projected onto a smaller randomly oriented orthonormal base. We constructed phylogenetic trees for all organisms with mitochondrial and bacterial protein data in the NCBI database. SWeeP quickly built complete and accurate trees for these organisms with low computational cost. We compared SWeeP to other alignment-free methods and Sweep was 10 to 100 times quicker than the other techniques. A tool to build SWeeP vectors is available at https://sourceforge.net/projects/spacedwordsprojection/.
Collapse
Affiliation(s)
- Camilla Reginatto De Pierri
- Federal University of Paraná - SEPT, Graduate Program in Bioinformatics, Curitiba, Paraná, Brazil.,Federal University of Paraná, Department of Biochemistry and Molecular Biology, Curitiba, Paraná, Brazil
| | - Ricardo Voyceik
- Federal University of Minas Gerais, Institute of Biological Sciences (ICB), Belo Horizonte, Minas Gerais, Brazil
| | | | - Mariane Gonçalves Kulik
- Federal University of Paraná - SEPT, Graduate Program in Bioinformatics, Curitiba, Paraná, Brazil
| | - Josué Oliveira Camargo
- Federal University of Paraná - SEPT, Graduate Program in Bioinformatics, Curitiba, Paraná, Brazil.,Federal University of Paraná, Department of Biochemistry and Molecular Biology, Curitiba, Paraná, Brazil
| | - Aryel Marlus Repula de Oliveira
- Federal University of Paraná - SEPT, Graduate Program in Bioinformatics, Curitiba, Paraná, Brazil.,Federal University of Paraná, Department of Genetics, Curitiba, Paraná, Brazil
| | - Bruno Thiago de Lima Nichio
- Federal University of Paraná - SEPT, Graduate Program in Bioinformatics, Curitiba, Paraná, Brazil.,Federal University of Paraná, Department of Biochemistry and Molecular Biology, Curitiba, Paraná, Brazil
| | | | - Antonio Camilo da Silva Filho
- Federal University of Paraná - SEPT, Graduate Program in Bioinformatics, Curitiba, Paraná, Brazil.,Federal University of Paraná, Department of Pharmaceutical Sciences, Curitiba, Paraná, Brazil
| | - Dieval Guizelini
- Federal University of Paraná - SEPT, Graduate Program in Bioinformatics, Curitiba, Paraná, Brazil
| | - J Miguel Ortega
- Federal University of Minas Gerais, Institute of Biological Sciences (ICB), Belo Horizonte, Minas Gerais, Brazil
| | - Fabio O Pedrosa
- Federal University of Paraná - SEPT, Graduate Program in Bioinformatics, Curitiba, Paraná, Brazil.,Federal University of Paraná, Department of Biochemistry and Molecular Biology, Curitiba, Paraná, Brazil
| | - Roberto Tadeu Raittz
- Federal University of Paraná - SEPT, Graduate Program in Bioinformatics, Curitiba, Paraná, Brazil. .,Federal University of Minas Gerais, Institute of Biological Sciences (ICB), Belo Horizonte, Minas Gerais, Brazil. .,Federal University of Paraná, Department of Genetics, Curitiba, Paraná, Brazil.
| |
Collapse
|
25
|
Evolutionary Insight into the Trypanosomatidae Using Alignment-Free Phylogenomics of the Kinetoplast. Pathogens 2019; 8:pathogens8030157. [PMID: 31540520 PMCID: PMC6789588 DOI: 10.3390/pathogens8030157] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.2] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/23/2019] [Revised: 09/10/2019] [Accepted: 09/13/2019] [Indexed: 12/12/2022] Open
Abstract
Advancements in next-generation sequencing techniques have led to a substantial increase in the genomic information available for analyses in evolutionary biology. As such, this data requires the exponential growth in bioinformatic methods and expertise required to understand such vast quantities of genomic data. Alignment-free phylogenomics offer an alternative approach for large-scale analyses that may have the potential to address these challenges. The evolutionary relationships between various species within the trypanosomatid family, specifically members belonging to the genera Leishmania and Trypanosoma have been extensively studies over the last 30 years. However, there is a need for a more exhaustive analysis of the Trypanosomatidae, summarising the evolutionary patterns amongst the entire family of these important protists. The mitochondrial DNA of the trypanosomatids, better known as the kinetoplast, represents a valuable taxonomic marker given its unique presence across all kinetoplastid protozoans. The aim of this study was to validate the reliability and robustness of alignment-free approaches for phylogenomic analyses and its applicability to reconstruct the evolutionary relationships between the trypanosomatid family. In the present study, alignment-free analyses demonstrated the strength of these methods, particularly when dealing with large datasets compared to the traditional phylogenetic approaches. We present a maxicircle genome phylogeny of 46 species spanning the trypanosomatid family, demonstrating the superiority of the maxicircle for the analysis and taxonomic resolution of the Trypanosomatidae.
Collapse
|
26
|
Andrade-Martínez JS, Moreno-Gallego JL, Reyes A. Defining a Core Genome for the Herpesvirales and Exploring their Evolutionary Relationship with the Caudovirales. Sci Rep 2019; 9:11342. [PMID: 31383901 PMCID: PMC6683198 DOI: 10.1038/s41598-019-47742-z] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.2] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/08/2019] [Accepted: 07/19/2019] [Indexed: 12/21/2022] Open
Abstract
The order Herpesvirales encompasses a wide variety of important and broadly distributed human pathogens. During the last decades, similarities in the viral cycle and the structure of some of their proteins with those of the order Caudovirales, the tailed bacterial viruses, have brought speculation regarding the existence of an evolutionary relationship between these clades. To evaluate such hypothesis, we used over 600 Herpesvirales and 2000 Caudovirales complete genomes to search for the presence or absence of clusters of orthologous protein domains and constructed a dendrogram based on their compositional similarities. The results obtained strongly suggest an evolutionary relationship between the two orders. Furthermore, they allowed to propose a core genome for the Herpesvirales, composed of 4 proteins, including the ATPase subunit of the DNA-packaging terminase, the only protein with previously verified conservation. Accordingly, a phylogenetic tree constructed with sequences derived from the clusters associated to these proteins grouped the Herpesvirales strains accordingly to the established families and subfamilies. Overall, this work provides results supporting the hypothesis that the two orders are evolutionarily related and contributes to the understanding of the history of the Herpesvirales.
Collapse
Affiliation(s)
- Juan S Andrade-Martínez
- Research Group on Computational Biology and Microbial Ecology, Department of Biological Sciences, Universidad de los Andes, Bogota, Colombia
- Max Planck Tandem Group in Computational Biology, Universidad de los Andes, Bogota, Colombia
| | - J Leonardo Moreno-Gallego
- Research Group on Computational Biology and Microbial Ecology, Department of Biological Sciences, Universidad de los Andes, Bogota, Colombia
- Max Planck Tandem Group in Computational Biology, Universidad de los Andes, Bogota, Colombia
| | - Alejandro Reyes
- Research Group on Computational Biology and Microbial Ecology, Department of Biological Sciences, Universidad de los Andes, Bogota, Colombia.
- Max Planck Tandem Group in Computational Biology, Universidad de los Andes, Bogota, Colombia.
- Centre for Genome Sciences and Systems Biology, Department of Pathology and Immunology, Washington University in Saint Louis, Saint Louis, MO, 63108, USA.
| |
Collapse
|
27
|
Pratas D, Hosseini M, Grilo G, Pinho AJ, Silva RM, Caetano T, Carneiro J, Pereira F. Metagenomic Composition Analysis of an Ancient Sequenced Polar Bear Jawbone from Svalbard. Genes (Basel) 2018; 9:E445. [PMID: 30200636 PMCID: PMC6162538 DOI: 10.3390/genes9090445] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.1] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/29/2018] [Revised: 09/03/2018] [Accepted: 09/03/2018] [Indexed: 12/17/2022] Open
Abstract
The sequencing of ancient DNA samples provides a novel way to find, characterize, and distinguish exogenous genomes of endogenous targets. After sequencing, computational composition analysis enables filtering of undesired sources in the focal organism, with the purpose of improving the quality of assemblies and subsequent data analysis. More importantly, such analysis allows extinct and extant species to be identified without requiring a specific or new sequencing run. However, the identification of exogenous organisms is a complex task, given the nature and degradation of the samples, and the evident necessity of using efficient computational tools, which rely on algorithms that are both fast and highly sensitive. In this work, we relied on a fast and highly sensitive tool, FALCON-meta, which measures similarity against whole-genome reference databases, to analyse the metagenomic composition of an ancient polar bear (Ursus maritimus) jawbone fossil. The fossil was collected in Svalbard, Norway, and has an estimated age of 110,000 to 130,000 years. The FASTQ samples contained 349 GB of nonamplified shotgun sequencing data. We identified and localized, relative to the FASTQ samples, the genomes with significant similarities to reference microbial genomes, including those of viruses, bacteria, and archaea, and to fungal, mitochondrial, and plastidial sequences. Among other striking features, we found significant similarities between modern-human, some bacterial and viral sequences (contamination) and the organelle sequences of wild carrot and tomato relative to the whole samples. For each exogenous candidate, we ran a damage pattern analysis, which in addition to revealing shallow levels of damage in the plant candidates, identified the source as contamination.
Collapse
Affiliation(s)
- Diogo Pratas
- Institute of Electronics and Informatics Engineering of Aveiro, University of Aveiro, 3810-193 Aveiro, Portugal.
| | - Morteza Hosseini
- Institute of Electronics and Informatics Engineering of Aveiro, University of Aveiro, 3810-193 Aveiro, Portugal.
- Department of Electronics, Telecommunications and Informatics, University of Aveiro, 3810-193 Aveiro, Portugal.
| | - Gonçalo Grilo
- Institute of Electronics and Informatics Engineering of Aveiro, University of Aveiro, 3810-193 Aveiro, Portugal.
- Department of Electronics, Telecommunications and Informatics, University of Aveiro, 3810-193 Aveiro, Portugal.
| | - Armando J Pinho
- Institute of Electronics and Informatics Engineering of Aveiro, University of Aveiro, 3810-193 Aveiro, Portugal.
- Department of Electronics, Telecommunications and Informatics, University of Aveiro, 3810-193 Aveiro, Portugal.
| | - Raquel M Silva
- Institute of Electronics and Informatics Engineering of Aveiro, University of Aveiro, 3810-193 Aveiro, Portugal.
- Department of Medical Sciences, University of Aveiro, 3810-193 Aveiro, Portugal.
- Institute for Biomedicine, University of Aveiro, 3810-193 Aveiro, Portugal.
| | - Tânia Caetano
- Department of Biology, University of Aveiro, University of Aveiro, 3810-193 Aveiro, Portugal.
- Centre for Environmental and Marine Studies, University of Aveiro, 3810-193 Aveiro, Portugal.
| | - João Carneiro
- Interdisciplinary Centre of Marine and Environmental Research, University of Porto, 4450-208 Matosinhos, Portugal.
| | - Filipe Pereira
- Interdisciplinary Centre of Marine and Environmental Research, University of Porto, 4450-208 Matosinhos, Portugal.
| |
Collapse
|
28
|
Ren J, Bai X, Lu YY, Tang K, Wang Y, Reinert G, Sun F. Alignment-Free Sequence Analysis and Applications. Annu Rev Biomed Data Sci 2018; 1:93-114. [PMID: 31828235 PMCID: PMC6905628 DOI: 10.1146/annurev-biodatasci-080917-013431] [Citation(s) in RCA: 58] [Impact Index Per Article: 8.3] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/06/2023]
Abstract
Genome and metagenome comparisons based on large amounts of next generation sequencing (NGS) data pose significant challenges for alignment-based approaches due to the huge data size and the relatively short length of the reads. Alignment-free approaches based on the counts of word patterns in NGS data do not depend on the complete genome and are generally computationally efficient. Thus, they contribute significantly to genome and metagenome comparison. Recently, novel statistical approaches have been developed for the comparison of both long and shotgun sequences. These approaches have been applied to many problems including the comparison of gene regulatory regions, genome sequences, metagenomes, binning contigs in metagenomic data, identification of virus-host interactions, and detection of horizontal gene transfers. We provide an updated review of these applications and other related developments of word-count based approaches for alignment-free sequence analysis.
Collapse
Affiliation(s)
- Jie Ren
- Molecular and Computational Biology Program, University of Southern California, Los Angeles, California, USA
| | - Xin Bai
- Molecular and Computational Biology Program, University of Southern California, Los Angeles, California, USA
- Centre for Computational Systems Biology, School of Mathematical Sciences, Fudan University, Shanghai, China
| | - Yang Young Lu
- Molecular and Computational Biology Program, University of Southern California, Los Angeles, California, USA
| | - Kujin Tang
- Molecular and Computational Biology Program, University of Southern California, Los Angeles, California, USA
| | - Ying Wang
- Department of Automation, Xiamen University, Xiamen, Fujian, China
| | - Gesine Reinert
- Department of Statistics, University of Oxford, Oxford, United Kingdom
| | - Fengzhu Sun
- Molecular and Computational Biology Program, University of Southern California, Los Angeles, California, USA
- Centre for Computational Systems Biology, School of Mathematical Sciences, Fudan University, Shanghai, China
| |
Collapse
|
29
|
Comparison of Compression-Based Measures with Application to the Evolution of Primate Genomes. ENTROPY 2018; 20:e20060393. [PMID: 33265483 PMCID: PMC7512912 DOI: 10.3390/e20060393] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 03/03/2018] [Revised: 05/16/2018] [Accepted: 05/21/2018] [Indexed: 11/26/2022]
Abstract
An efficient DNA compressor furnishes an approximation to measure and compare information quantities present in, between and across DNA sequences, regardless of the characteristics of the sources. In this paper, we compare directly two information measures, the Normalized Compression Distance (NCD) and the Normalized Relative Compression (NRC). These measures answer different questions; the NCD measures how similar both strings are (in terms of information content) and the NRC (which, in general, is nonsymmetric) indicates the fraction of one of them that cannot be constructed using information from the other one. This leads to the problem of finding out which measure (or question) is more suitable for the answer we need. For computing both, we use a state of the art DNA sequence compressor that we benchmark with some top compressors in different compression modes. Then, we apply the compressor on DNA sequences with different scales and natures, first using synthetic sequences and then on real DNA sequences. The last include mitochondrial DNA (mtDNA), messenger RNA (mRNA) and genomic DNA (gDNA) of seven primates. We provide several insights into evolutionary acceleration rates at different scales, namely, the observation and confirmation across the whole genomes of a higher variation rate of the mtDNA relative to the gDNA. We also show the importance of relative compression for localizing similar information regions using mtDNA.
Collapse
|
30
|
Lin J, Wei J, Adjeroh D, Jiang BH, Jiang Y. SSAW: A new sequence similarity analysis method based on the stationary discrete wavelet transform. BMC Bioinformatics 2018; 19:165. [PMID: 29720081 PMCID: PMC5930706 DOI: 10.1186/s12859-018-2155-9] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.9] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/19/2017] [Accepted: 04/11/2018] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Alignment-free sequence similarity analysis methods often lead to significant savings in computational time over alignment-based counterparts. RESULTS A new alignment-free sequence similarity analysis method, called SSAW is proposed. SSAW stands for Sequence Similarity Analysis using the Stationary Discrete Wavelet Transform (SDWT). It extracts k-mers from a sequence, then maps each k-mer to a complex number field. Then, the series of complex numbers formed are transformed into feature vectors using the stationary discrete wavelet transform. After these steps, the original sequence is turned into a feature vector with numeric values, which can then be used for clustering and/or classification. CONCLUSIONS Using two different types of applications, namely, clustering and classification, we compared SSAW against the the-state-of-the-art alignment free sequence analysis methods. SSAW demonstrates competitive or superior performance in terms of standard indicators, such as accuracy, F-score, precision, and recall. The running time was significantly better in most cases. These make SSAW a suitable method for sequence analysis, especially, given the rapidly increasing volumes of sequence data required by most modern applications.
Collapse
Affiliation(s)
- Jie Lin
- College of Mathematics and Informatics, Fujian Normal University, Fuzhou, 350108, People's Republic of China
| | - Jing Wei
- College of Mathematics and Informatics, Fujian Normal University, Fuzhou, 350108, People's Republic of China
| | - Donald Adjeroh
- Lane Department of Computer Science and Electrical Engineering, West Virginia University, Morgantown, 26506, WV, USA
| | - Bing-Hua Jiang
- Department of Pathology, University of Iowa, Iowa city, 52242, Iowa, USA
| | - Yue Jiang
- College of Mathematics and Informatics, Fujian Normal University, Fuzhou, 350108, People's Republic of China.
| |
Collapse
|
31
|
Triant DA, Cinel SD, Kawahara AY. Lepidoptera genomes: current knowledge, gaps and future directions. CURRENT OPINION IN INSECT SCIENCE 2018; 25:99-105. [PMID: 29602369 DOI: 10.1016/j.cois.2017.12.004] [Citation(s) in RCA: 37] [Impact Index Per Article: 5.3] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Received: 09/28/2017] [Revised: 12/18/2017] [Accepted: 12/19/2017] [Indexed: 06/08/2023]
Abstract
Butterflies and moths (Lepidoptera) are one of the most ecologically diverse and speciose insect orders. With recent advances in genomics, new Lepidoptera genomes are regularly being sequenced, and many of them are playing principal roles in genomics studies, particularly in the fields of phylo-genomics and functional genomics. Thus far, assembled genomes are only available for <10 of the 43 Lepidoptera superfamilies. Nearly all are model species, found in the speciose clade Ditrysia. Community support for Lepidoptera genomics is growing with successful management and dissemination of data and analytical tools in centralized databases. With genomic studies quickly becoming integrated with ecological and evolutionary research, the Lepidoptera community will unquestionably benefit from new high-quality reference genomes that are more evenly distributed throughout the order.
Collapse
Affiliation(s)
- Deborah A Triant
- Florida Museum of Natural History, University of Florida, Gainesville, FL 32611, USA.
| | - Scott D Cinel
- Florida Museum of Natural History, University of Florida, Gainesville, FL 32611, USA; Department of Biology, University of Florida, Gainesville, FL 32611, USA
| | - Akito Y Kawahara
- Florida Museum of Natural History, University of Florida, Gainesville, FL 32611, USA
| |
Collapse
|