1
Bassetti N, Caarls L, Bouwmeester K, Verbaarschot P, van Eijden E, Zwaan BJ, Bonnema G, Schranz ME, Fatouros NE. A butterfly egg-killing hypersensitive response in Brassica nigra is controlled by a single locus, PEK, containing a cluster of TIR-NBS-LRR receptor genes. Plant Cell Environ 2024; 47:1009-1022. [PMID: 37961842 DOI: 10.1111/pce.14765] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/14/2023] [Revised: 10/26/2023] [Accepted: 11/01/2023] [Indexed: 11/15/2023]
Abstract
Knowledge of plant recognition of insects is largely limited to a few resistance (R) genes against sap-sucking insects. Hypersensitive response (HR) characterizes monogenic plant traits relying on R genes in several pathosystems. HR-like cell death can be triggered by eggs of cabbage white butterflies (Pieris spp.), pests of cabbage crops (Brassica spp.), reducing egg survival and representing an effective plant resistance trait before feeding damage occurs. Here, we performed genetic mapping of HR-like cell death induced by Pieris brassicae eggs in the black mustard Brassica nigra (B. nigra). We show that HR-like cell death segregates as a Mendelian trait and identified a single dominant locus on chromosome B3, named PEK (Pieris egg-killing). Eleven genes are located in an approximately 50 kb region, including a cluster of genes encoding intracellular TIR-NBS-LRR (TNL) receptor proteins. The PEK locus is highly polymorphic between the parental accessions of our mapping populations and among B. nigra reference genomes. Our study is the first to identify a single locus potentially involved in HR-like cell death induced by insect eggs in B. nigra. Further fine-mapping, comparative genomics and validation of the PEK locus will shed light on the role of these TNL receptors in egg-killing HR.
Affiliation(s)
- Niccolò Bassetti
  - Biosystematics Group, Wageningen University & Research, Wageningen, The Netherlands
- Lotte Caarls
  - Biosystematics Group, Wageningen University & Research, Wageningen, The Netherlands
  - Laboratory of Plant Breeding, Wageningen University & Research, Wageningen, The Netherlands
- Klaas Bouwmeester
  - Biosystematics Group, Wageningen University & Research, Wageningen, The Netherlands
  - Laboratory of Entomology, Wageningen University & Research, Wageningen, The Netherlands
- Patrick Verbaarschot
  - Biosystematics Group, Wageningen University & Research, Wageningen, The Netherlands
- Ewan van Eijden
  - Biosystematics Group, Wageningen University & Research, Wageningen, The Netherlands
- Bas J Zwaan
  - Laboratory of Genetics, Wageningen University & Research, Wageningen, The Netherlands
- Guusje Bonnema
  - Laboratory of Plant Breeding, Wageningen University & Research, Wageningen, The Netherlands
- M Eric Schranz
  - Biosystematics Group, Wageningen University & Research, Wageningen, The Netherlands
- Nina E Fatouros
  - Biosystematics Group, Wageningen University & Research, Wageningen, The Netherlands
2
Wang T, Yu ZG, Li J. CGRWDL: alignment-free phylogeny reconstruction method for viruses based on chaos game representation weighted by dynamical language model. Front Microbiol 2024; 15:1339156. [PMID: 38572227 PMCID: PMC10987876 DOI: 10.3389/fmicb.2024.1339156] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/15/2023] [Accepted: 02/23/2024] [Indexed: 04/05/2024] Open
Abstract
Traditional alignment-based methods meet serious challenges in genome sequence comparison and phylogeny reconstruction due to their high computational complexity. Here, we propose a new alignment-free method to analyze the phylogenetic relationships (classification) among species. In our method, the dynamical language (DL) model and the chaos game representation (CGR) method are used to characterize the frequency information and the context information of k-mers in a sequence, respectively. Then for each DNA sequence or protein sequence in a dataset, our method converts the sequence into a feature vector that represents the sequence information based on CGR weighted by the DL model to infer phylogenetic relationships. We name our method CGRWDL. Its performance was tested on both DNA and protein sequences of 8 datasets of viruses to construct the phylogenetic trees. We compared the Robinson-Foulds (RF) distance between the phylogenetic tree constructed by CGRWDL and the reference tree by other advanced methods for each dataset. The results show that the phylogenetic trees constructed by CGRWDL can accurately classify the viruses, and the RF scores between the trees and the reference trees are smaller than that with other methods.
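As a rough illustration of the chaos game representation (CGR) named in the abstract above, the sketch below maps a DNA sequence to points in the unit square. It shows plain CGR only; the dynamical language weighting that CGRWDL adds is not reproduced here. The corner assignment and the function name `cgr_points` are assumptions made for illustration, not taken from the paper.

```python
# Plain chaos game representation (CGR) of a DNA sequence.
# Assumption: a commonly used corner layout; CGRWDL's own layout may differ.
CORNERS = {"A": (0.0, 0.0), "C": (0.0, 1.0), "G": (1.0, 1.0), "T": (1.0, 0.0)}


def cgr_points(seq):
    """Return the list of CGR points visited while reading seq left to right."""
    x, y = 0.5, 0.5  # start at the centre of the unit square
    points = []
    for base in seq.upper():
        if base not in CORNERS:
            continue  # skip ambiguous symbols such as N
        cx, cy = CORNERS[base]
        # Move halfway from the current point towards the corner of this base.
        x, y = (x + cx) / 2, (y + cy) / 2
        points.append((x, y))
    return points


if __name__ == "__main__":
    print(cgr_points("AG"))
```

Each point depends on the whole prefix read so far, with the most recent bases contributing most, which is why CGR is often used to capture the context of k-mers.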
Affiliation(s)
- Ting Wang
  - National Center for Applied Mathematics in Hunan, Xiangtan University, Xiangtan, Hunan, China
  - Key Laboratory of Intelligent Computing and Information Processing of Ministry of Education, Xiangtan University, Xiangtan, Hunan, China
- Zu-Guo Yu
  - National Center for Applied Mathematics in Hunan, Xiangtan University, Xiangtan, Hunan, China
  - Key Laboratory of Intelligent Computing and Information Processing of Ministry of Education, Xiangtan University, Xiangtan, Hunan, China
- Jinyan Li
  - School of Computer Science and Control Engineering, Shenzhen Institute of Advanced Technology, Shenzhen, Guangdong, China
3
Fan J, Khan J, Singh NP, Pibiri GE, Patro R. Fulgor: a fast and compact k-mer index for large-scale matching and color queries. Algorithms Mol Biol 2024; 19:3. [PMID: 38254124 PMCID: PMC10810250 DOI: 10.1186/s13015-024-00251-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/31/2023] [Accepted: 01/03/2024] [Indexed: 01/24/2024] Open
Abstract
The problem of sequence identification or matching (determining the subset of reference sequences from a given collection that are likely to contain a short, queried nucleotide sequence) is relevant for many important tasks in Computational Biology, such as metagenomics and pangenome analysis. Due to the complex nature of such analyses and the large scale of the reference collections, a resource-efficient solution to this problem is of utmost importance. This poses the threefold challenge of representing the reference collection with a data structure that is efficient to query, has light memory usage, and scales well to large collections. To solve this problem, we describe an efficient colored de Bruijn graph index, arising as the combination of a k-mer dictionary with a compressed inverted index. The proposed index takes full advantage of the fact that unitigs in the colored compacted de Bruijn graph are monochromatic (i.e., all k-mers in a unitig have the same set of references of origin, or color). Specifically, the unitigs are kept in the dictionary in color order, thereby allowing for the encoding of the map from k-mers to their colors in as little as 1 + o(1) bits per unitig. Hence, one color per unitig is stored in the index with almost no space/time overhead. By combining this property with simple but effective compression methods for integer lists, the index achieves very small space. We implement these methods in a tool called Fulgor, and conduct an extensive experimental analysis to demonstrate the improvement of our tool over previous solutions. For example, compared to Themisto, the strongest competitor in terms of index space vs. query time trade-off, Fulgor requires significantly less space (up to 43% less space for a collection of 150,000 Salmonella enterica genomes), is at least twice as fast for color queries, and is 2-6× faster to construct.
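The "color order" idea in the abstract above can be sketched as follows: once unitigs are sorted so that unitigs sharing a color are adjacent, one bit per unitig marking where the color changes is enough to recover the color of any unitig by counting set bits. This is an illustrative sketch, not Fulgor's code; the function names are hypothetical, and the plain prefix-sum list stands in for the succinct rank structure that would account for the o(1) term.

```python
from itertools import accumulate


def build_color_boundaries(unitig_colors):
    """unitig_colors[i] is the color id of unitig i, with unitigs in color order.

    Returns (bits, rank): bits[i] is 1 when unitig i starts a new color run,
    and rank[i] counts the set bits in bits[0..i].
    """
    bits = [0] * len(unitig_colors)
    for i in range(1, len(unitig_colors)):
        if unitig_colors[i] != unitig_colors[i - 1]:
            bits[i] = 1
    # A real index would use a succinct rank structure here (assumption).
    rank = list(accumulate(bits))
    return bits, rank


def color_of(unitig_id, rank):
    """Color id of a unitig: the number of color changes up to and including it."""
    return rank[unitig_id]


if __name__ == "__main__":
    bits, rank = build_color_boundaries([0, 0, 1, 1, 1, 2])
    print(bits, color_of(4, rank))
```

The sketch assumes color ids are numbered 0, 1, 2, ... in the order in which they first appear in the sorted unitig list.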
Affiliation(s)
- Jason Fan
  - Department of Computer Science, University of Maryland, College Park, MD, 20742, USA
- Jamshed Khan
  - Department of Computer Science, University of Maryland, College Park, MD, 20742, USA
- Noor Pratap Singh
  - Department of Computer Science, University of Maryland, College Park, MD, 20742, USA
- Rob Patro
  - Department of Computer Science, University of Maryland, College Park, MD, 20742, USA.
4
Corut AK, Wallace JG. kGWASflow: a modular, flexible, and reproducible Snakemake workflow for k-mers-based GWAS. G3 (Bethesda) 2023; 14:jkad246. [PMID: 37976215 PMCID: PMC10755180 DOI: 10.1093/g3journal/jkad246] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/07/2023] [Accepted: 10/15/2023] [Indexed: 11/19/2023]
Abstract
Genome-wide association studies (GWAS) have been widely used to identify genetic variation associated with complex traits. Despite its success and popularity, the traditional GWAS approach comes with a variety of limitations. For this reason, newer methods for GWAS have been developed, including the use of pan-genomes instead of a reference genome and the utilization of markers beyond single-nucleotide polymorphisms, such as structural variations and k-mers. The k-mers-based GWAS approach has especially gained attention from researchers in recent years. However, these new methodologies can be complicated and challenging to implement. Here, we present kGWASflow, a modular, user-friendly, and scalable workflow to perform GWAS using k-mers. We adopted an existing kmersGWAS method into an easier and more accessible workflow using management tools like Snakemake and Conda and eliminated the challenges caused by missing dependencies and version conflicts. kGWASflow increases the reproducibility of the kmersGWAS method by automating each step with Snakemake and using containerization tools like Docker. The workflow encompasses supplemental components such as quality control, read-trimming procedures, and generating summary statistics. kGWASflow also offers post-GWAS analysis options to identify the genomic location and context of trait-associated k-mers. kGWASflow can be applied to any organism and requires minimal programming skills. kGWASflow is freely available on GitHub (https://github.com/akcorut/kGWASflow) and Bioconda (https://anaconda.org/bioconda/kgwasflow).
Affiliation(s)
- Adnan Kivanc Corut
  - Institute of Bioinformatics, University of Georgia, Athens, GA 30602, USA
- Jason G Wallace
  - Institute of Bioinformatics, University of Georgia, Athens, GA 30602, USA
  - Institute of Plant Breeding, Genetics, and Genomics, University of Georgia, Athens, GA 30602, USA
  - Department of Crop and Soil Sciences, University of Georgia, Athens, GA 30602, USA
5
Mouratidis I, Chantzi N, Khan U, Konnaris MA, Chan CSY, Mareboina M, Moeckel C, Georgakopoulos-Soares I. Frequentmers - a novel way to look at metagenomic next generation sequencing data and an application in detecting liver cirrhosis. BMC Genomics 2023; 24:768. [PMID: 38087204 PMCID: PMC10714505 DOI: 10.1186/s12864-023-09861-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/27/2023] [Accepted: 11/29/2023] [Indexed: 12/17/2023] Open
Abstract
Early detection of human disease is associated with improved clinical outcomes. However, many diseases are often detected at an advanced, symptomatic stage, when patients are past efficacious treatment periods, which can result in less favorable outcomes. Therefore, methods that can accurately detect human disease at a presymptomatic stage are urgently needed. Here, we introduce "frequentmers": short sequences that are specific to, and recurrently observed in, either patient or healthy control samples, but not both. We showcase the utility of frequentmers for the detection of liver cirrhosis using metagenomic Next Generation Sequencing data from stool samples of patients and controls. We develop classification models for the detection of liver cirrhosis and achieve an AUC score of 0.91 using ten-fold cross-validation. A small subset of 200 frequentmers can achieve comparable results in detecting liver cirrhosis. Finally, we identify the microbial organisms in liver cirrhosis samples that are associated with the most predictive frequentmer biomarkers.
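One way to read the frequentmer definition in the abstract above is as a set operation over per-sample k-mer sets: keep k-mers seen in many samples of one group and in no sample of the other. The sketch below is a toy under that reading; the recurrence threshold `min_samples`, the strict "absent from the other group" rule, and all function names are assumptions, not the authors' published criteria.

```python
from collections import Counter


def kmer_set(seq, k):
    """All distinct k-mers of a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def sample_recurrence(samples, k):
    """Count, for each k-mer, the number of samples in which it occurs."""
    counts = Counter()
    for seq in samples:
        counts.update(kmer_set(seq, k))
    return counts


def frequentmers(patients, controls, k, min_samples):
    """k-mers recurrent in one group (>= min_samples) and absent from the other."""
    p = sample_recurrence(patients, k)
    c = sample_recurrence(controls, k)
    patient_specific = {m for m, n in p.items() if n >= min_samples and m not in c}
    control_specific = {m for m, n in c.items() if n >= min_samples and m not in p}
    return patient_specific, control_specific


if __name__ == "__main__":
    pat, ctl = frequentmers(["ACGTAC", "TACGTT"], ["GGGTTT", "TTTGGG"], 3, 2)
    print(sorted(pat), sorted(ctl))
```

Real metagenomic samples consist of many reads each, so a practical version would build each sample's k-mer set from all of its reads with a dedicated k-mer counter.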
Affiliation(s)
- Ioannis Mouratidis
  - Department of Biochemistry and Molecular Biology, Institute for Personalized Medicine, Penn State College of Medicine, Hershey, PA, USA.
- Nikol Chantzi
  - Department of Biochemistry and Molecular Biology, Institute for Personalized Medicine, Penn State College of Medicine, Hershey, PA, USA
- Umair Khan
  - Bakar Computational Health Sciences Institute, University of California San Francisco, San Francisco, CA, USA
- Maxwell A Konnaris
  - Department of Biochemistry and Molecular Biology, Institute for Personalized Medicine, Penn State College of Medicine, Hershey, PA, USA
  - Department of Statistics, Penn State, University Park, PA, USA
  - Huck Institutes of the Life Sciences, Penn State, University Park, PA, USA
- Candace S Y Chan
  - Department of Bioengineering and Therapeutic Sciences, University of California San Francisco, San Francisco, CA, USA
- Manvita Mareboina
  - Department of Biochemistry and Molecular Biology, Institute for Personalized Medicine, Penn State College of Medicine, Hershey, PA, USA
- Camille Moeckel
  - Department of Biochemistry and Molecular Biology, Institute for Personalized Medicine, Penn State College of Medicine, Hershey, PA, USA
- Ilias Georgakopoulos-Soares
  - Department of Biochemistry and Molecular Biology, Institute for Personalized Medicine, Penn State College of Medicine, Hershey, PA, USA.
6
Ali S, Chourasia P, Tayebi Z, Bello B, Patterson M. ViralVectors: compact and scalable alignment-free virome feature generation. Med Biol Eng Comput 2023; 61:2607-2626. [PMID: 37395885 DOI: 10.1007/s11517-023-02837-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/23/2022] [Accepted: 03/29/2023] [Indexed: 07/04/2023]
Abstract
The amount of sequencing data for SARS-CoV-2 is several orders of magnitude larger than that for any other virus. This will continue to grow geometrically for SARS-CoV-2, and other viruses, as many countries heavily finance genomic surveillance efforts. Hence, we need methods for processing large amounts of sequence data to allow for effective yet timely decision-making. Such data will come from heterogeneous sources: aligned, unaligned, or even unassembled raw nucleotide or amino acid sequencing reads pertaining to the whole genome or regions (e.g., spike) of interest. In this work, we propose ViralVectors, a compact feature vector generation from virome sequencing data that allows effective downstream analysis. Such generation is based on minimizers, a type of lightweight "signature" of a sequence, used traditionally in assembly and read mapping; to our knowledge, this is the first use of minimizers in this way. We validate our approach on different types of sequencing data: (a) 2.5M SARS-CoV-2 spike sequences (to show scalability); (b) 3K Coronaviridae spike sequences (to show robustness to more genomic variability); and (c) 4K raw WGS read sets taken from nasal-swab PCR tests (to show the ability to process unassembled reads). Our results show that ViralVectors outperforms current benchmarks in most classification and clustering tasks. Graphical abstract: the steps of the proposed approach. Sequence-based data are collected, then cleaned and preprocessed. Feature embeddings are generated using the minimizer-based approach. Classification and clustering algorithms are then applied to the resulting data, and predictions are made on the test set.
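For readers unfamiliar with minimizers, the sketch below computes standard (w, k)-minimizers: in every window of w consecutive k-mers, the lexicographically smallest k-mer is selected. This is the textbook definition used in assembly and read mapping; the exact minimizer scheme and ordering used by ViralVectors may differ, and the function name and parameters are chosen here for illustration.

```python
def minimizers(seq, k, w):
    """Return (position, k-mer) pairs selected as (w, k)-minimizers of seq.

    In each window of w consecutive k-mers, the lexicographically smallest
    k-mer is picked (leftmost on ties). Consecutive windows that pick the
    same position contribute a single entry.
    """
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    picked = []
    for start in range(len(kmers) - w + 1):
        window = kmers[start:start + w]
        offset = min(range(w), key=lambda i: window[i])  # leftmost smallest
        pos = start + offset
        if not picked or picked[-1][0] != pos:
            picked.append((pos, kmers[pos]))
    return picked


if __name__ == "__main__":
    print(minimizers("ACGTACGT", k=3, w=3))
```

Because neighbouring windows usually share their minimizer, the selected set is much smaller than the full k-mer set, which is what makes minimizer-based signatures compact.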
Affiliation(s)
- Sarwan Ali
  - Georgia State University, Atlanta, GA, USA.
7
Ponsero AJ, Miller M, Hurwitz BL. Comparison of k-mer-based de novo comparative metagenomic tools and approaches. Microbiome Res Rep 2023; 2:27. [PMID: 38058765 PMCID: PMC10696585 DOI: 10.20517/mrr.2023.26] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/04/2023] [Revised: 06/28/2023] [Accepted: 07/12/2023] [Indexed: 12/08/2023]
Abstract
Aim: Comparative metagenomic analysis requires measuring a pairwise similarity between metagenomes in the dataset. Reference-based methods that compute a beta-diversity distance between two metagenomes are highly dependent on the quality and completeness of the reference database, and their application on less studied microbiota can be challenging. On the other hand, de-novo comparative metagenomic methods only rely on the sequence composition of metagenomes to compare datasets. While each one of these approaches has its strengths and limitations, their comparison is currently limited. Methods: We developed sets of simulated short-reads metagenomes to (1) compare k-mer-based and taxonomy-based distances and evaluate the impact of technical and biological variables on these metrics and (2) evaluate the effect of k-mer sketching and filtering. We used a real-world metagenomic dataset to provide an overview of the currently available tools for de novo metagenomic comparative analysis. Results: Using simulated metagenomes of known composition and controlled error rate, we showed that k-mer-based distance metrics were well correlated to the taxonomic distance metric for quantitative Beta-diversity metrics, but the correlation was low for presence/absence distances. The community complexity in terms of taxa richness and the sequencing depth significantly affected the quality of the k-mer-based distances, while the impact of low amounts of sequence contamination and sequencing error was limited. Finally, we benchmarked currently available de-novo comparative metagenomic tools and compared their output on two datasets of fecal metagenomes and showed that most k-mer-based tools were able to recapitulate the data structure observed using taxonomic approaches. 
Conclusion: This study expands our understanding of the strength and limitations of k-mer-based de novo comparative metagenomic approaches and aims to provide concrete guidelines for researchers interested in applying these approaches to their metagenomic datasets.
Affiliation(s)
- Alise Jany Ponsero
  - Human Microbiome Research Program, University of Helsinki, Helsinki 00290, Finland
  - Department of Biosystems Engineering, The University of Arizona, Tucson, AZ 85721, USA
  - BIO5 Institute, The University of Arizona, Tucson, AZ 85721, USA
- Matthew Miller
  - Department of Biosystems Engineering, The University of Arizona, Tucson, AZ 85721, USA
- Bonnie Louise Hurwitz
  - Department of Biosystems Engineering, The University of Arizona, Tucson, AZ 85721, USA
  - BIO5 Institute, The University of Arizona, Tucson, AZ 85721, USA
8
Pibiri GE. On weighted k-mer dictionaries. Algorithms Mol Biol 2023; 18:3. [PMID: 37328897 DOI: 10.1186/s13015-023-00226-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/30/2023] [Accepted: 05/13/2023] [Indexed: 06/18/2023] Open
Abstract
We consider the problem of representing a set of k-mers and their abundance counts, or weights, in compressed space so that assessing membership and retrieving the weight of a k-mer is efficient. The representation is called a weighted dictionary of k-mers and finds application in numerous tasks in Bioinformatics that usually count k-mers as a pre-processing step. In fact, k-mer counting tools produce very large outputs that may result in a severe bottleneck for subsequent processing. In this work we extend the recently introduced SSHash dictionary (Pibiri in Bioinformatics 38:185-194, 2022) to also store compactly the weights of the k-mers. From a technical perspective, we exploit the order of the k-mers represented in SSHash to encode runs of weights, hence allowing much better compression than the empirical entropy of the weights. We study the problem of reducing the number of runs in the weights to improve compression even further and give an optimal algorithm for this problem. Lastly, we corroborate our findings with experiments on real-world datasets and comparison with competitive alternatives. To date, SSHash is the only k-mer dictionary that is exact, weighted, associative, fast, and small.
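The benefit of encoding runs of weights can be seen in a small run-length sketch: when consecutive k-mers in the dictionary order tend to share the same weight, storing (weight, run length) pairs takes far fewer entries than one weight per k-mer. The code below is a generic run-length encoding given for illustration only; it is not the SSHash encoding, and the function names are hypothetical.

```python
from itertools import groupby


def encode_runs(weights):
    """Collapse consecutive equal weights into (weight, run_length) pairs."""
    return [(w, len(list(group))) for w, group in groupby(weights)]


def weight_at(runs, index):
    """Recover the weight of the k-mer at a given position in dictionary order."""
    for value, length in runs:
        if index < length:
            return value
        index -= length
    raise IndexError("position beyond the encoded weights")


if __name__ == "__main__":
    runs = encode_runs([1, 1, 1, 5, 5, 2])
    print(runs, weight_at(runs, 4))
```

The linear scan in `weight_at` keeps the sketch short; a practical structure would store cumulative run lengths and use binary search or a succinct select structure instead.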
Affiliation(s)
- Giulio Ermanno Pibiri
  - Department of Environmental Sciences, Informatics and Statistics (DAIS), Ca' Foscari University of Venice, Venice, Italy.
  - ISTI-CNR, Pisa, Italy.
9
Fan J, Singh NP, Khan J, Pibiri GE, Patro R. Fulgor: A fast and compact k-mer index for large-scale matching and color queries. bioRxiv 2023:2023.05.09.539895. [PMID: 37214944 PMCID: PMC10197524 DOI: 10.1101/2023.05.09.539895] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/24/2023]
Abstract
The problem of sequence identification or matching (determining the subset of references from a given collection that are likely to contain a query nucleotide sequence) is relevant for many important tasks in Computational Biology, such as metagenomics and pan-genome analysis. Due to the complex nature of such analyses and the large scale of the reference collections, a resource-efficient solution to this problem is of utmost importance. The reference collection should therefore be pre-processed into an index for fast queries. This poses the threefold challenge of designing an index that is efficient to query, has light memory usage, and scales well to large collections. To solve this problem, we describe how recent advancements in associative, order-preserving, k-mer dictionaries can be combined with a compressed inverted index to implement a fast and compact colored de Bruijn graph data structure. This index takes full advantage of the fact that unitigs in the colored de Bruijn graph are monochromatic (all k-mers in a unitig have the same set of references of origin, or "color"), leveraging the order-preserving property of its dictionary. In fact, k-mers are kept in unitig order by the dictionary, thereby allowing for the encoding of the map from k-mers to their inverted lists in as little as 1+o(1) bits per unitig. Hence, one inverted list per unitig is stored in the index with almost no space/time overhead. By combining this property with simple but effective compression methods for inverted lists, the index achieves very small space. We implement these methods in a tool called Fulgor. Compared to Themisto, the prior state of the art, Fulgor indexes a heterogeneous collection of 30,691 bacterial genomes in 3.8× less space, a collection of 150,000 Salmonella enterica genomes in approximately 2× less space, is at least twice as fast for color queries, and is 2-6× faster to construct.
Affiliation(s)
- Jason Fan
  - Department of Computer Science, University of Maryland, College Park, MD 20440, USA
- Noor Pratap Singh
  - Department of Computer Science, University of Maryland, College Park, MD 20440, USA
- Jamshed Khan
  - Department of Computer Science, University of Maryland, College Park, MD 20440, USA
- Rob Patro
  - Department of Computer Science, University of Maryland, College Park, MD 20440, USA
10
Chen MM, Shi GH, Dai Y, Fang WX, Wu Q. Identifying genetic variants associated with amphotericin B (AMB) resistance in Aspergillus fumigatus via k-mer-based GWAS. Front Genet 2023; 14:1133593. [PMID: 37229189 PMCID: PMC10203564 DOI: 10.3389/fgene.2023.1133593] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Received: 01/04/2023] [Accepted: 04/10/2023] [Indexed: 05/27/2023] Open
Abstract
Aspergillus fumigatus is one of the most common pathogenic fungi, which results in high morbidity and mortality in immunocompromised patients. Amphotericin B (AMB) is used as the core drug for the treatment of triazole-resistant A. fumigatus. Following the usage of amphotericin B drugs, the number of amphotericin B-resistant A. fumigatus isolates showed an increasing trend over the years, but the mechanism and mutations associated with amphotericin B sensitivity are not fully understood. In this study, we performed a k-mer-based genome-wide association study (GWAS) in 98 A. fumigatus isolates from public databases. Associations identified with k-mers not only recapitulate those with SNPs but also discover new associations with insertion/deletion (indel). Compared to SNP sites, the indel showed a stronger association with amphotericin B resistance, and a significant correlated indel is present in the exon region of AFUA_7G05160, encoding a fumarylacetoacetate hydrolase (FAH) family protein. Enrichment analysis revealed sphingolipid synthesis and transmembrane transport may be related to the resistance of A. fumigatus to amphotericin B. The expansion of variant types detected by the k-mer method increases opportunities to identify and exploit complex genetic variants that drive amphotericin B resistance, and these candidate variants help accelerate the selection of prospective gene markers for amphotericin B resistance screening in A. fumigatus.
Affiliation(s)
- Meng-Meng Chen
  - State Key Laboratory of Mycology, Institute of Microbiology, Chinese Academy of Sciences, Beijing, China
  - University of Chinese Academy of Sciences, Beijing, China
- Guo-Hui Shi
  - State Key Laboratory of Mycology, Institute of Microbiology, Chinese Academy of Sciences, Beijing, China
- Yi Dai
  - State Key Laboratory of Mycology, Institute of Microbiology, Chinese Academy of Sciences, Beijing, China
  - University of Chinese Academy of Sciences, Beijing, China
- Wen-Xia Fang
  - Guangxi Biological Sciences and Biotechnology Center, Guangxi Academy of Sciences, Nanning, Guangxi, China
- Qi Wu
  - State Key Laboratory of Mycology, Institute of Microbiology, Chinese Academy of Sciences, Beijing, China
  - University of Chinese Academy of Sciences, Beijing, China
11
Ali S, Bello B, Tayebi Z, Patterson M. Characterizing SARS-CoV-2 Spike Sequences Based on Geographical Location. J Comput Biol 2023; 30:432-445. [PMID: 36656554 DOI: 10.1089/cmb.2022.0391] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/20/2023]
Abstract
With the rapid spread of COVID-19 worldwide, viral genomic data are available in the order of millions of sequences on public databases such as GISAID. This Big Data creates a unique opportunity for analysis toward the research of effective vaccine development for current pandemics, and avoiding or mitigating future pandemics. One piece of information that comes with every such viral sequence is the geographical location where it was collected; the patterns found between viral variants and geographical location are surely an important part of this analysis. One major challenge that researchers face is processing such huge, high-dimensional data to obtain useful insights as quickly as possible. Most of the existing methods face scalability issues when dealing with the magnitude of such data. In this article, we propose an approach that first computes a numerical representation of the spike protein sequence of SARS-CoV-2 using k-mers (substrings) and then uses several machine learning models to classify the sequences based on geographical location. We show that our proposed model significantly outperforms the baselines. We also show the importance of different amino acids in the spike sequences by computing the information gain corresponding to the true class labels.
Affiliation(s)
- Sarwan Ali
  - Department of Computer Science, Georgia State University, Atlanta, Georgia, USA
- Babatunde Bello
  - Department of Computer Science, Georgia State University, Atlanta, Georgia, USA
- Zahra Tayebi
  - Department of Computer Science, Georgia State University, Atlanta, Georgia, USA
- Murray Patterson
  - Department of Computer Science, Georgia State University, Atlanta, Georgia, USA
12
Boddé M, Makunin A, Ayala D, Bouafou L, Diabaté A, Ekpo UF, Kientega M, Le Goff G, Makanga BK, Ngangue MF, Omitola OO, Rahola N, Tripet F, Durbin R, Lawniczak MKN. High-resolution species assignment of Anopheles mosquitoes using k-mer distances on targeted sequences. eLife 2022; 11:e78775. [PMID: 36222650 PMCID: PMC9648975 DOI: 10.7554/elife.78775] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/18/2022] [Accepted: 10/11/2022] [Indexed: 11/13/2022] Open
Abstract
The ANOSPP amplicon panel is a genus-wide targeted sequencing panel to facilitate large-scale monitoring of Anopheles species diversity. Combining information from the 62 nuclear amplicons present in the ANOSPP panel allows for a more sensitive and specific species assignment than single gene (e.g. COI) barcoding, which is desirable in the light of permeable species boundaries. Here, we present NNoVAE, a method using Nearest Neighbours (NN) and Variational Autoencoders (VAE), which we apply to k-mers resulting from the ANOSPP amplicon sequences in order to hierarchically assign species identity. The NN step assigns a sample to a species-group by comparing the k-mers arising from each haplotype's amplicon sequence to a reference database. The VAE step is required to distinguish between closely related species, and also has sufficient resolution to reveal population structure within species. In tests on independent samples with over 80% amplicon coverage, NNoVAE correctly classifies to species level 98% of samples within the An. gambiae complex and 89% of samples outside the complex. We apply NNoVAE to over two thousand new samples from Burkina Faso and Gabon, identifying unexpected species in Gabon. NNoVAE presents an approach that may be of value to other targeted sequencing panels, and is a method that will be used to survey Anopheles species diversity and Plasmodium transmission patterns through space and time on a large scale, with plans to analyse half a million mosquitoes in the next five years.
Affiliation(s)
- Marilou Boddé
  - Department of Genetics, University of Cambridge, Cambridge, United Kingdom
  - Wellcome Sanger Institute, Hinxton, United Kingdom
- Diego Ayala
  - Institut de Recherche pour le Développement, MIVEGEC, Univ. Montpellier, CNRS, IRD, Montpellier, France
- Lemonde Bouafou
  - Institut de Recherche pour le Développement, MIVEGEC, Univ. Montpellier, CNRS, IRD, Montpellier, France
- Abdoulaye Diabaté
  - Institut de Recherche en Sciences de la Santé, Direction Régionale de l'Ouest, Bobo-Dioulasso, Burkina Faso
- Mahamadi Kientega
  - Institut de Recherche en Sciences de la Santé, Direction Régionale de l'Ouest, Bobo-Dioulasso, Burkina Faso
- Gilbert Le Goff
  - Institut de Recherche pour le Développement, MIVEGEC, Univ. Montpellier, CNRS, IRD, Montpellier, France
- Marc F Ngangue
  - Centre International de Recherches Medicales de Franceville, Franceville, Gabon
- Nil Rahola
  - Institut de Recherche pour le Développement, MIVEGEC, Univ. Montpellier, CNRS, IRD, Montpellier, France
- Frederic Tripet
  - Centre for Applied Entomology and Parasitology, Keele University, Newcastle, United Kingdom
- Richard Durbin
  - Department of Genetics, University of Cambridge, Cambridge, United Kingdom
  - Wellcome Sanger Institute, Hinxton, United Kingdom
13
Kshirsagar M, Yuan H, Ferres JL, Leslie C. BindVAE: Dirichlet variational autoencoders for de novo motif discovery from accessible chromatin. Genome Biol 2022; 23:174. [PMID: 35971180 PMCID: PMC9380350 DOI: 10.1186/s13059-022-02723-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/07/2021] [Accepted: 06/28/2022] [Indexed: 11/10/2022]
Abstract
We present a novel unsupervised deep learning approach called BindVAE, based on Dirichlet variational autoencoders, for jointly decoding multiple TF binding signals from open chromatin regions. BindVAE can disentangle an input DNA sequence into distinct latent factors that encode cell-type specific in vivo binding signals for individual TFs, composite patterns for TFs involved in cooperative binding, and genomic context surrounding the binding sites. On the task of retrieving the motifs of expressed TFs in a given cell type, BindVAE is competitive with existing motif discovery approaches.
Affiliation(s)
- Han Yuan
  - Calico Life Sciences, South San Francisco, CA, USA
14
Becher H, Sampson J, Twyford AD. Measuring the Invisible: The Sequences Causal of Genome Size Differences in Eyebrights (Euphrasia) Revealed by k-mers. Front Plant Sci 2022; 13:818410. [PMID: 35968114 PMCID: PMC9372453 DOI: 10.3389/fpls.2022.818410] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/19/2021] [Accepted: 06/20/2022] [Indexed: 06/15/2023]
Abstract
Genome size variation within plant taxa is due to presence/absence variation, which may affect low-copy sequences or genomic repeats of various frequency classes. However, identifying the sequences underpinning genome size variation is challenging because genome assemblies commonly contain collapsed representations of repetitive sequences and because genome skimming studies by design miss low-copy number sequences. Here, we take a novel approach based on k-mers, short sub-sequences of equal length k, generated from whole-genome sequencing data of diploid eyebrights (Euphrasia), a group of plants that have considerable genome size variation within a ploidy level. We compare k-mer inventories within and between closely related species, and quantify the contribution of different copy number classes to genome size differences. We further match high-copy number k-mers to specific repeat types as retrieved from the RepeatExplorer2 pipeline. We find genome size differences of up to 230 Mbp, equivalent to more than 20% genome size variation. The largest contributions to these differences come from rDNA sequences, a 145-nt genomic satellite and a repeat associated with an Angela transposable element. We also find size differences in the low-copy number class (copy number ≤ 10×) of up to 27 Mbp, possibly indicating differences in gene space between our samples. We demonstrate that it is possible to pinpoint the sequences causing genome size variation within species without the use of a reference genome. Such sequences can serve as targets for future cytogenetic studies. We also show that studies of genome size variation should go beyond repeats if they aim to characterise the full range of genomic variants. To allow future work with other taxonomic groups, we share our k-mer analysis pipeline, which is straightforward to run, relying largely on standard GNU command line tools.
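The comparison of k-mer inventories by copy number class can be pictured with a small sketch. The code below is not the authors' pipeline (which the abstract says relies largely on GNU command line tools); it assumes that each sample is already summarised as a mapping from k-mer to per-genome copy number, and it uses a single hypothetical threshold (`low_max=10`, echoing the ≤ 10× class mentioned above) to split k-mers into a low and a high class.

```python
def copy_class(copies, low_max=10):
    """Label a copy number as 'low' (<= low_max) or 'high' (assumed two-class split)."""
    return "low" if copies <= low_max else "high"


def difference_by_class(copies_a, copies_b, low_max=10):
    """Sum per-k-mer copy-number differences (a minus b) within each class.

    copies_a, copies_b: dicts mapping k-mer -> copy number per genome.
    A k-mer is classified by the larger of its two copy numbers.
    """
    totals = {"low": 0, "high": 0}
    for kmer in copies_a.keys() | copies_b.keys():
        a = copies_a.get(kmer, 0)
        b = copies_b.get(kmer, 0)
        totals[copy_class(max(a, b), low_max)] += a - b
    return totals


# Toy inventories (hypothetical values).
sample_a = {"AAA": 50, "ACG": 2, "CGT": 1}
sample_b = {"AAA": 20, "ACG": 1}
print(difference_by_class(sample_a, sample_b))  # {'low': 2, 'high': 30}
```

In this toy case the high-copy class accounts for most of the difference, which mirrors the kind of statement the abstract makes about repeats versus low-copy sequences.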
Affiliation(s)
- Hannes Becher
  - School of Biological Sciences, Institute of Evolutionary Biology, University of Edinburgh, Edinburgh, United Kingdom
- Jacob Sampson
  - School of Biological Sciences, Institute of Evolutionary Biology, University of Edinburgh, Edinburgh, United Kingdom
- Alex D. Twyford
  - School of Biological Sciences, Institute of Evolutionary Biology, University of Edinburgh, Edinburgh, United Kingdom
  - Royal Botanic Garden Edinburgh, Edinburgh, United Kingdom
15
Lo R, Dougan KE, Chen Y, Shah S, Bhattacharya D, Chan CX. Alignment-Free Analysis of Whole-Genome Sequences From Symbiodiniaceae Reveals Different Phylogenetic Signals in Distinct Regions. Front Plant Sci 2022; 13:815714. [PMID: 35557718 PMCID: PMC9087856 DOI: 10.3389/fpls.2022.815714] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Received: 11/15/2021] [Accepted: 04/04/2022] [Indexed: 05/24/2023]
Abstract
Dinoflagellates of the family Symbiodiniaceae are predominantly essential symbionts of corals and other marine organisms. Recent research reveals extensive genome sequence divergence among Symbiodiniaceae taxa and high phylogenetic diversity hidden behind subtly different cell morphologies. Using an alignment-free phylogenetic approach based on sub-sequences of fixed length k (i.e. k-mers), we assessed the phylogenetic signal among whole-genome sequences from 16 Symbiodiniaceae taxa (including the genera of Symbiodinium, Breviolum, Cladocopium, Durusdinium and Fugacium) and two strains of Polarella glacialis as outgroup. Based on phylogenetic trees inferred from k-mers in distinct genomic regions (i.e. repeat-masked genome sequences, protein-coding sequences, introns and repeats) and in protein sequences, the phylogenetic signal associated with protein-coding DNA and the encoded amino acids is largely consistent with the Symbiodiniaceae phylogeny based on established markers, such as large subunit rRNA. The other genome sequences (introns and repeats) exhibit distinct phylogenetic signals, supporting the expected differential evolutionary pressure acting on these regions. Our analysis of conserved core k-mers revealed the prevalence of conserved k-mers (>95% core 23-mers among all 18 genomes) in annotated repeats and non-genic regions of the genomes. We observed 180 distinct repeat types that are significantly enriched in genomes of the symbiotic versus free-living Symbiodinium taxa, suggesting an enhanced activity of transposable elements linked to the symbiotic lifestyle. We provide evidence that representation of alignment-free phylogenies as dynamic networks enhances the ability to generate new hypotheses about genome evolution in Symbiodiniaceae. 
These results demonstrate the potential of alignment-free phylogenetic methods as a scalable approach for inferring comprehensive, unbiased whole-genome phylogenies of dinoflagellates and more broadly of microbial eukaryotes.
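The notion of core k-mers, those present in every genome of a set, can be sketched as a set intersection. This is a minimal illustration with hypothetical names and toy sequences; it ignores reverse complements (real analyses commonly use canonical k-mers) and is not the authors' workflow.

```python
def kmer_set(seq, k):
    """Return the set of all k-mers in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def core_kmers(genomes, k):
    """Return the k-mers shared by every sequence in `genomes` (name -> sequence)."""
    sets = [kmer_set(seq, k) for seq in genomes.values()]
    return set.intersection(*sets) if sets else set()


# Toy genomes (hypothetical).
toy = {"g1": "ACGTAC", "g2": "TACGTA", "g3": "GGACGT"}
print(sorted(core_kmers(toy, 3)))  # ['ACG', 'CGT']
```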
Affiliation(s)
- Rosalyn Lo
  - Australian Centre for Ecogenomics, School of Chemistry and Molecular Biosciences, University of Queensland, Brisbane, QLD, Australia
- Katherine E. Dougan
  - Australian Centre for Ecogenomics, School of Chemistry and Molecular Biosciences, University of Queensland, Brisbane, QLD, Australia
- Yibi Chen
  - Australian Centre for Ecogenomics, School of Chemistry and Molecular Biosciences, University of Queensland, Brisbane, QLD, Australia
- Sarah Shah
  - Australian Centre for Ecogenomics, School of Chemistry and Molecular Biosciences, University of Queensland, Brisbane, QLD, Australia
- Debashish Bhattacharya
  - Department of Biochemistry and Microbiology, Rutgers University, New Brunswick, NJ, United States
- Cheong Xin Chan
  - Australian Centre for Ecogenomics, School of Chemistry and Molecular Biosciences, University of Queensland, Brisbane, QLD, Australia
16
Shibuya Y, Belazzougui D, Kucherov G. Space-efficient representation of genomic k-mer count tables. Algorithms Mol Biol 2022; 17:5. [PMID: 35317833 PMCID: PMC8939220 DOI: 10.1186/s13015-022-00212-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/15/2021] [Accepted: 03/01/2022] [Indexed: 11/10/2022]
Abstract
Motivation: k-mer counting is a common task in bioinformatic pipelines, with many dedicated tools available. Many of these tools output k-mer count tables containing both k-mers and counts, easily reaching tens of GB. Furthermore, such tables do not generally support efficient random-access queries. Results: In this work, we design an efficient representation of k-mer count tables supporting fast random-access queries. We propose to apply Compressed Static Functions (CSFs), with space proportional to the empirical zero-order entropy of the counts. For very skewed distributions, like those of k-mer counts in whole genomes, the only currently available implementation of CSFs does not provide a compact enough representation. By adding a Bloom filter to a CSF we obtain a Bloom-enhanced CSF (BCSF), effectively overcoming this limitation. Furthermore, by combining BCSFs with minimizer-based bucketing of k-mers, we build even smaller representations breaking the empirical entropy lower bound, for large enough k. We also extend these representations to the approximate case, gaining additional space. We experimentally validate these techniques on k-mer count tables of whole genomes (E. coli and C. elegans) and unassembled reads, as well as on k-mer document frequency tables for 29 E. coli genomes. In the case of exact counts, our representation takes about half the space of the empirical entropy, for large enough k.
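The quantity that the space bound refers to, the empirical zero-order entropy of the counts, can be computed directly. The sketch below uses the standard definition H0 = -Σ p_c log2 p_c, where p_c is the fraction of k-mers whose count equals c; the function name and the toy count list are hypothetical, and nothing here implements a CSF or a Bloom filter.

```python
import math
from collections import Counter


def zero_order_entropy(counts):
    """Empirical zero-order entropy, in bits per value, of a list of k-mer counts."""
    n = len(counts)
    freq = Counter(counts)  # how many k-mers have each count value
    return -sum((f / n) * math.log2(f / n) for f in freq.values())


# Toy count column: four k-mers seen once, two seen twice, one each seen 3 and 7 times.
toy_counts = [1, 1, 1, 1, 2, 2, 3, 7]
h0 = zero_order_entropy(toy_counts)
print(h0)                    # 1.75
print(h0 * len(toy_counts))  # 14.0 bits for the whole column at the entropy bound
```

Highly skewed count distributions, where nearly every k-mer has count 1, push H0 well below one bit per k-mer, which is the regime the abstract singles out as problematic for the existing CSF implementation.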
17
Ali S, Bello B, Chourasia P, Punathil RT, Zhou Y, Patterson M. PWM2Vec: An Efficient Embedding Approach for Viral Host Specification from Coronavirus Spike Sequences. Biology (Basel) 2022; 11:418. [PMID: 35336792 DOI: 10.3390/biology11030418] [Citation(s) in RCA: 7] [Impact Index Per Article: 3.5] [Received: 01/04/2022] [Revised: 02/24/2022] [Accepted: 03/07/2022] [Indexed: 01/14/2023]
Abstract
Simple Summary: The family of coronaviruses comprises a diverse set of strains and variants which cause diseases from the common cold to COVID-19. Moreover, they infect a wide array of hosts, from bats, camels and birds to humans. Studying coronaviruses through the lens of host specificity provides a unique perspective on the evolution, diversity and dynamics of this family. In particular, this can reveal groups of different hosts infected by similar strains, giving clues about which strains were more likely to have evolved to jump from one host to another. In this work, we frame host specificity as a classification task, designing a very compact numerical representation of the spike sequences of different coronaviruses. Based on this numerical representation, classification methods are able to detect the target host with high accuracy. Such an approach can be used to scale efficiently to large volumes of sequences, in order to unveil trends in the host specificity of different coronavirus strains.
Abstract: The study of host specificity has important connections to the question of the origin of SARS-CoV-2 in humans, which led to the COVID-19 pandemic and remains an important open question. There are speculations that bats are a possible origin. Likewise, there are many closely related (corona)viruses, such as SARS, which was found to be transmitted through civets. The study of the different hosts which can be potential carriers and transmitters of deadly viruses to humans is crucial to understanding, mitigating, and preventing current and future pandemics. In coronaviruses, the surface (S) protein, or spike protein, is important in determining host specificity, since it is the point of contact between the virus and the host cell membrane. In this paper, we classify the hosts of over five thousand coronaviruses from their spike protein sequences, segregating them into clusters of distinct hosts among birds, bats, camels, swine, humans, and weasels, to name a few.
We propose a feature embedding based on the well-known position weight matrix (PWM), which we call PWM2Vec, and we use it to generate feature vectors from the spike protein sequences of these coronaviruses. While our embedding is inspired by the success of PWMs in biological applications, such as determining protein function and identifying transcription factor binding sites, we are the first (to the best of our knowledge) to use PWMs from viral sequences to generate fixed-length feature vector representations, and use them in the context of host classification. The results on real world data show that when using PWM2Vec, machine learning classifiers are able to perform comparably to the baseline models in terms of predictive performance and runtime—in some cases, the performance is better. We also measure the importance of different amino acids using information gain to show the amino acids which are important for predicting the host of a given coronavirus. Finally, we perform some statistical analyses on these results to show that our embedding is more compact than the embeddings of the baseline models.
18
Blanca A, Harris RS, Koslicki D, Medvedev P. The Statistics of k-mers from a Sequence Undergoing a Simple Mutation Process Without Spurious Matches. J Comput Biol 2022; 29:155-168. [PMID: 35108101 DOI: 10.1089/cmb.2021.0431] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Indexed: 01/22/2023]
Abstract
k-mer-based methods are widely used in bioinformatics, but there are many gaps in our understanding of their statistical properties. Here, we consider the simple model where a sequence S (e.g., a genome or a read) undergoes a simple mutation process through which each nucleotide is mutated independently with some probability r, under the assumption that there are no spurious k-mer matches. How does this process affect the k-mers of S? We derive the expectation and variance of the number of mutated k-mers and of the number of islands (a maximal interval of mutated k-mers) and oceans (a maximal interval of nonmutated k-mers). We then derive hypothesis tests and confidence intervals (CIs) for r given an observed number of mutated k-mers, or, alternatively, given the Jaccard similarity (with or without MinHash). We demonstrate the usefulness of our results using a few select applications: obtaining a CI to supplement the Mash distance point estimate, filtering out reads during alignment by Minimap2, and rating long-read alignments to a de Bruijn graph by Jabba.
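Under the model stated above (each nucleotide mutates independently with probability r, and there are no spurious matches), a given k-mer survives unmutated with probability (1 - r)^k. The expected number of mutated k-mers among L k-mers is therefore L(1 - (1 - r)^k), and inverting that expression gives a simple point estimate of r from an observed count. The sketch below covers only this expectation and its inversion; the variance, hypothesis tests and confidence intervals derived in the paper are not reproduced, and the function names are hypothetical.

```python
def expected_mutated_kmers(num_kmers, r, k):
    """Expected number of mutated k-mers among num_kmers, with per-base mutation rate r."""
    q = 1.0 - (1.0 - r) ** k  # probability that one k-mer contains at least one mutation
    return num_kmers * q


def estimate_mutation_rate(num_mutated, num_kmers, k):
    """Point estimate of r obtained by inverting the expectation above."""
    return 1.0 - (1.0 - num_mutated / num_kmers) ** (1.0 / k)


# Round trip with illustrative values: L = 1000 k-mers, k = 21, r = 0.01.
n_mut = expected_mutated_kmers(1000, 0.01, 21)
print(n_mut)                                    # about 190 of the 1000 k-mers
print(estimate_mutation_rate(n_mut, 1000, 21))  # recovers roughly 0.01
```

The example shows why k-mer similarity drops quickly with divergence: at 1% per-base mutation, almost one in five 21-mers is already affected.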
Affiliation(s)
- Antonio Blanca
  - Department of Computer Science and Engineering, The Pennsylvania State University, University Park, Pennsylvania, USA
- Robert S Harris
  - Department of Biology, The Pennsylvania State University, University Park, Pennsylvania, USA
- David Koslicki
  - Department of Computer Science and Engineering, The Pennsylvania State University, University Park, Pennsylvania, USA
  - Department of Biology, The Pennsylvania State University, University Park, Pennsylvania, USA
  - Huck Institutes of the Life Sciences, The Pennsylvania State University, University Park, Pennsylvania, USA
- Paul Medvedev
  - Department of Computer Science and Engineering, The Pennsylvania State University, University Park, Pennsylvania, USA
  - Huck Institutes of the Life Sciences, The Pennsylvania State University, University Park, Pennsylvania, USA
  - Department of Biochemistry and Molecular Biology, The Pennsylvania State University, University Park, Pennsylvania, USA
19
Bernardini G, Denti L, Previtali M. Alignment-Free Genotyping of Known Variations with MALVA. Methods Mol Biol 2022; 2493:247-256. [PMID: 35751819 DOI: 10.1007/978-1-0716-2293-3_15] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/15/2023]
Abstract
The discovery and characterization of sequence variations in human populations are crucial in genetic studies. Standard methods for addressing this problem are computationally expensive and highly time consuming, thus impractical for clinical applications, where time is often an issue. When the task is to genotype variations that have been previously annotated, alignment-free methods come to the aid. Here, we describe MALVA, an alignment-free approach for genotyping a set of known variations. MALVA is the first mapping-free tool which is able to genotype multi-allelic SNPs and indels, even in high-density genomic regions, and to effectively handle a huge number of variations.
Affiliation(s)
- Luca Denti
  - Department of Computational Biology, C3BI USR 3756 CNRS, Institut Pasteur, Paris, France
- Marco Previtali
  - Department of Informatics, Systems and Communication, University of Milano-Bicocca, Milan, Italy
20
Gangurde SS, Xavier A, Naik YD, Jha UC, Rangari SK, Kumar R, Reddy MSS, Channale S, Elango D, Mir RR, Zwart R, Laxuman C, Sudini HK, Pandey MK, Punnuri S, Mendu V, Reddy UK, Guo B, Gangarao NVPR, Sharma VK, Wang X, Zhao C, Thudi M. Two decades of association mapping: Insights on disease resistance in major crops. Front Plant Sci 2022; 13:1064059. [PMID: 37082513 PMCID: PMC10112529 DOI: 10.3389/fpls.2022.1064059] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Received: 10/07/2022] [Accepted: 11/10/2022] [Indexed: 05/03/2023]
Abstract
Climate change across the globe has an impact on the occurrence, prevalence, and severity of plant diseases. About 30% of yield losses in major crops are due to plant diseases; emerging diseases are likely to further threaten sustainable production in the coming years. Plant diseases have led to increased hunger and mass migration of human populations in the past and are thus a serious threat to global food security. Equipping modern varieties and hybrids with enhanced genetic resistance is the most economical, sustainable and environmentally friendly solution. Plant geneticists have done tremendous work in identifying stable resistance in primary genepools, and often beyond the primary genepools, to breed resistant varieties in different major crops. Over the last two decades, the availability of crop and pathogen genomes due to advances in next generation sequencing technologies has improved our understanding of trait genetics using different approaches. Genome-wide association studies have been effectively used to identify candidate genes and map loci associated with different diseases in crop plants. In this review, we highlight successful examples of the discovery of resistance genes to many important diseases. In addition, we cover major developments in association studies, statistical models and bioinformatic tools that improve the power, resolution and efficiency of identifying marker-trait associations. Overall, this review provides comprehensive insights into two decades of advances in GWAS and discusses the challenges and opportunities this research area provides for breeding resistant varieties.
Affiliation(s)
- Sunil S. Gangurde
  - Crop Genetics and Breeding Research, United States Department of Agriculture (USDA) - Agriculture Research Service (ARS), Tifton, GA, United States
  - Department of Plant Pathology, University of Georgia, Tifton, GA, United States
- Alencar Xavier
  - Department of Agronomy, Purdue University, West Lafayette, IN, United States
- Uday Chand Jha
  - Indian Council of Agricultural Research (ICAR), Indian Institute of Pulses Research (IIPR), Kanpur, Uttar Pradesh, India
- Raj Kumar
  - Dr. Rajendra Prasad Central Agricultural University (RPCAU), Bihar, India
- M. S. Sai Reddy
  - Dr. Rajendra Prasad Central Agricultural University (RPCAU), Bihar, India
- Sonal Channale
  - Crop Health Center, University of Southern Queensland (USQ), Toowoomba, QLD, Australia
- Dinakaran Elango
  - Department of Agronomy, Iowa State University, Ames, IA, United States
- Reyazul Rouf Mir
  - Faculty of Agriculture, Sher-e-Kashmir University of Agricultural Sciences and Technology (SKUAST), Sopore, India
- Rebecca Zwart
  - Crop Health Center, University of Southern Queensland (USQ), Toowoomba, QLD, Australia
- C. Laxuman
  - Zonal Agricultural Research Station (ZARS), Kalaburagi, University of Agricultural Sciences, Raichur, Karnataka, India
- Hari Kishan Sudini
  - International Crops Research Institute for the Semi-Arid Tropics (ICRISAT), Hyderabad, Telangana, India
- Manish K. Pandey
  - Crop Health Center, University of Southern Queensland (USQ), Toowoomba, QLD, Australia
  - International Crops Research Institute for the Semi-Arid Tropics (ICRISAT), Hyderabad, Telangana, India
- Somashekhar Punnuri
  - College of Agriculture, Family Sciences and Technology, Dr. Fort Valley State University, Fort Valley, GA, United States
- Venugopal Mendu
  - Department of Plant Science and Plant Pathology, Montana State University, Bozeman, MT, United States
- Umesh K. Reddy
  - Department of Biology, West Virginia State University, West Virginia, WV, United States
- Baozhu Guo
  - Crop Genetics and Breeding Research, United States Department of Agriculture (USDA) - Agriculture Research Service (ARS), Tifton, GA, United States
- Vinay K. Sharma
  - Dr. Rajendra Prasad Central Agricultural University (RPCAU), Bihar, India
- Xingjun Wang
  - Institute of Crop Germplasm Resources, Shandong Academy of Agricultural Sciences (SAAS), Jinan, China
- Chuanzhi Zhao
  - Institute of Crop Germplasm Resources, Shandong Academy of Agricultural Sciences (SAAS), Jinan, China
- Mahendar Thudi
  - Dr. Rajendra Prasad Central Agricultural University (RPCAU), Bihar, India
  - Crop Health Center, University of Southern Queensland (USQ), Toowoomba, QLD, Australia
  - Institute of Crop Germplasm Resources, Shandong Academy of Agricultural Sciences (SAAS), Jinan, China
- *Correspondence: Mahendar Thudi; Chuanzhi Zhao
21
Ju CJT, Jiang JY, Li R, Li Z, Wang W. TahcoRoll: fast genomic signature profiling via thinned automaton and rolling hash. Med Rev (2021) 2021; 1:114-125. [PMID: 35881666 PMCID: PMC9027990 DOI: 10.1515/mr-2021-0016] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 07/05/2021] [Accepted: 11/11/2021] [Indexed: 12/04/2022]
Abstract
Objectives: Genomic signatures like k-mers have become one of the most prominent approaches to describe genomic data. As a result, myriad real-world applications, such as the construction of de Bruijn graphs in genome assembly, have benefited from recognizing genomic signatures. In other words, an efficient approach to genomic signature profiling is essential for tackling high-throughput sequencing reads. However, most of the existing approaches only recognize fixed-size k-mers, while many research studies have shown the importance of considering variable-length k-mers. Methods: In this paper, we present a novel genomic signature profiling approach, TahcoRoll, by extending the Aho-Corasick algorithm (AC) for the task of profiling variable-length k-mers. We first group nucleotides into two clusters and represent each cluster with a bit. The rolling hash technique is further utilized to encode signatures and read patterns for efficient matching. Results: In extensive experiments, TahcoRoll significantly outperforms state-of-the-art k-mer counters and has the capability of processing reads across different sequencing platforms on a budget desktop computer. Conclusions: The single-thread version of TahcoRoll is as efficient as the eight-thread version of the state-of-the-art counter, JellyFish, while the eight-thread TahcoRoll outperforms the eight-thread JellyFish by at least four times.
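The two ingredients named in the Methods, one bit per nucleotide cluster and a rolling hash, can be illustrated with a short sketch. It is not TahcoRoll: it handles one fixed-length signature at a time and leaves out the Aho-Corasick automaton entirely. The grouping of A/G versus C/T is an assumption (the abstract does not say which nucleotides share a cluster), and the function names are hypothetical. Because the one-bit encoding discards information, every hash hit is checked against the actual sequence.

```python
# Assumed grouping: purines (A, G) -> 0, pyrimidines (C, T) -> 1.
BIT = {"A": 0, "G": 0, "C": 1, "T": 1}


def rolling_bit_hashes(seq, k):
    """Yield (start, hash) for every length-k window, updating the hash in O(1) per base."""
    mask = (1 << k) - 1  # keep only the k most recent bits
    h = 0
    for i, base in enumerate(seq):
        h = ((h << 1) | BIT[base]) & mask
        if i >= k - 1:
            yield i - k + 1, h


def find_signature(read, signature):
    """Return the start positions where `signature` occurs in `read`."""
    k = len(signature)
    target = next(rolling_bit_hashes(signature, k))[1]
    return [
        start
        for start, h in rolling_bit_hashes(read, k)
        if h == target and read[start:start + k] == signature  # verify candidate hits
    ]


print(find_signature("TTACGTACGA", "ACG"))  # [2, 6]
```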
Affiliation(s)
- Chelsea J.-T. Ju
  - Department of Computer Science, University of California, Los Angeles, USA
- Jyun-Yu Jiang
  - Department of Computer Science, University of California, Los Angeles, USA
- Ruirui Li
  - Department of Computer Science, University of California, Los Angeles, USA
- Zeyu Li
  - Department of Computer Science, University of California, Los Angeles, USA
- Wei Wang
  - Department of Computer Science, University of California, Los Angeles, USA
22
Gupta PK. GWAS for genetics of complex quantitative traits: Genome to pangenome and SNPs to SVs and k-mers. Bioessays 2021; 43:e2100109. [PMID: 34486143 DOI: 10.1002/bies.202100109] [Citation(s) in RCA: 10] [Impact Index Per Article: 3.3] [Received: 04/27/2021] [Revised: 08/21/2021] [Accepted: 08/23/2021] [Indexed: 12/22/2022]
Abstract
The development of improved methods for genome-wide association studies (GWAS) for genetics of quantitative traits has been an active area of research during the last 25 years. This activity initially started with the use of mixed linear model (MLM), which was variously modified. During the last decade, however, with the availability of high throughput next generation sequencing (NGS) technology, development and use of pangenomes and novel markers including structural variations (SVs) and k-mers for GWAS has taken over as a new thrust area of research. Pangenomes and SVs are now available in humans, livestock, and a number of plant species, so that these resources along with k-mers are being used in GWAS for exploring additional genetic variation that was hitherto not available for analysis. These developments have resulted in significant improvement in GWAS methodology for detection of marker-trait associations (MTAs) that are relevant to human healthcare and crop improvement.
Affiliation(s)
- Pushpendra K Gupta
  - Department of Genetics and Plant Breeding, Ch. Charan Singh University Meerut, Meerut, Uttar Pradesh, India
23
Abstract
For identification of marker-trait associations (MTAs) for complex traits in animals and plants, thousands of genome-wide association studies (GWAS) were conducted during the past two decades. This involved regular improvement in methodology. Initially, a reference genome and SNPs were used; more recently, pan-genomes and markers based on structural variations (SVs) and k-mers are also being used.
Affiliation(s)
- Pushpendra K Gupta
  - Molecular Biology Laboratory, Department of Genetics and Plant Breeding, CCS University Meerut, Meerut, India
24
Tay AP, Hosking B, Hosking C, Bauer DC, Wilson LO. INSIDER: alignment-free detection of foreign DNA sequences. Comput Struct Biotechnol J 2021; 19:3810-3816. [PMID: 34285780 PMCID: PMC8273350 DOI: 10.1016/j.csbj.2021.06.045] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 03/15/2021] [Revised: 06/28/2021] [Accepted: 06/28/2021] [Indexed: 11/21/2022]
Abstract
External DNA sequences can be inserted into an organism's genome either through natural processes such as gene transfer, or through targeted genome engineering strategies. Being able to robustly identify such foreign DNA is a crucial capability for health and biosecurity applications, such as anti-microbial resistance (AMR) detection or monitoring gene drives. This capability does not exist for poorly characterised host genomes or with limited information about the integrated sequence. To address this, we developed the INserted Sequence Information DEtectoR (INSIDER). INSIDER analyses whole genome sequencing data and identifies segments of potentially foreign origin by their significant shift in k-mer signatures. We demonstrate the power of INSIDER to separate integrated DNA sequences from normal genomic sequences on a synthetic dataset simulating the insertion of a CRISPR-Cas gene drive into wild-type yeast. As a proof-of-concept, we use INSIDER to detect the exact AMR plasmid in whole genome sequencing data from a Citrobacter freundii patient isolate. INSIDER streamlines the process of identifying integrated DNA in poorly characterised wild species or when the insert is of unknown origin, thus enhancing the monitoring of emerging biosecurity threats.
Affiliation(s)
- Aidan P. Tay
  - Australian e-Health Research Centre, Commonwealth Scientific and Industrial Research Organisation, New South Wales, Sydney, Australia
  - Applied BioSciences, Faculty of Science and Engineering, Macquarie University, New South Wales, Sydney, Australia
- Brendan Hosking
  - Australian e-Health Research Centre, Commonwealth Scientific and Industrial Research Organisation, New South Wales, Sydney, Australia
- Cameron Hosking
  - Australian e-Health Research Centre, Commonwealth Scientific and Industrial Research Organisation, New South Wales, Sydney, Australia
- Denis C. Bauer
  - Australian e-Health Research Centre, Commonwealth Scientific and Industrial Research Organisation, New South Wales, Sydney, Australia
  - Department of Biomedical Sciences, Macquarie University, New South Wales, Sydney, Australia
  - Applied BioSciences, Faculty of Science and Engineering, Macquarie University, New South Wales, Sydney, Australia
- Laurence O.W. Wilson
  - Australian e-Health Research Centre, Commonwealth Scientific and Industrial Research Organisation, New South Wales, Sydney, Australia
  - Applied BioSciences, Faculty of Science and Engineering, Macquarie University, New South Wales, Sydney, Australia
25
Wang Y, Xue H, Pourcel C, Du Y, Gautheret D. 2-kupl: mapping-free variant detection from DNA-seq data of matched samples. BMC Bioinformatics 2021; 22:304. [PMID: 34090332 PMCID: PMC8180056 DOI: 10.1186/s12859-021-04185-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/03/2021] [Accepted: 05/11/2021] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND The detection of genome variants, including point mutations, indels and structural variants, is a fundamental and challenging computational problem. We address here the problem of variant detection between two deep-sequencing (DNA-seq) samples, such as two human samples from an individual patient, or two samples from distinct bacterial strains. The preferred strategy in such a case is to align each sample to a common reference genome, collect all variants and compare these variants between samples. Such mapping-based protocols have several limitations. DNA sequences with large indels, aggregated mutations and structural variants are hard to map to the reference. Furthermore, DNA sequences cannot be mapped reliably to genomic low complexity regions and repeats. RESULTS We introduce 2-kupl, a k-mer based, mapping-free protocol to detect variants between two DNA-seq samples. On simulated and actual data, 2-kupl achieves higher accuracy than other mapping-free protocols. Applying 2-kupl to prostate cancer whole exome sequencing data, we identify a number of candidate variants in hard-to-map regions and propose potential novel recurrent variants in this disease. CONCLUSIONS We developed a mapping-free protocol for variant calling between matched DNA-seq samples. Our protocol is suitable for variant detection in unmappable genome regions or in the absence of a reference genome.
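The core idea of a mapping-free comparison between two matched samples can be sketched as a search for k-mers that are well supported in one sample and absent from the other. This is only an illustration under simplifying assumptions (forward strand only, no error model); it is not the 2-kupl implementation, and `specific_kmers` and `min_count` are hypothetical names.

```python
from collections import Counter

def count_kmers(reads, k):
    """Count every k-mer across a list of reads (forward strand only)."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts

def specific_kmers(case_reads, control_reads, k, min_count=2):
    """k-mers seen at least min_count times in the case sample and never in the control."""
    case = count_kmers(case_reads, k)
    control = count_kmers(control_reads, k)
    return {kmer for kmer, c in case.items() if c >= min_count and kmer not in control}

# Toy reads (hypothetical): the case sample carries a T>A substitution at index 4.
control_reads = ["ACGTTGCAGT", "ACGTTGCAGT"]
case_reads = ["ACGTAGCAGT", "ACGTAGCAGT"]

variant_kmers = specific_kmers(case_reads, control_reads, k=5)
print(sorted(variant_kmers))
```

A single substitution produces up to k sample-specific k-mers, all overlapping the variant position. Here five 5-mers are reported, and the shared 5-mer `GCAGT` is not.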
Affiliation(s)
- Yunfeng Wang
  - Institute of Integrative Cell Biology (I2BC), Université Paris-Saclay, CNRS, CEA, 1 avenue de la Terrasse, 91190 Gif-sur-Yvette, France
  - Annoroad Gene Technology Co., Ltd, Beijing, 100176 China
- Haoliang Xue
  - Institute of Integrative Cell Biology (I2BC), Université Paris-Saclay, CNRS, CEA, 1 avenue de la Terrasse, 91190 Gif-sur-Yvette, France
- Christine Pourcel
  - Institute of Integrative Cell Biology (I2BC), Université Paris-Saclay, CNRS, CEA, 1 avenue de la Terrasse, 91190 Gif-sur-Yvette, France
- Yang Du
  - Annoroad Gene Technology Co., Ltd, Beijing, 100176 China
- Daniel Gautheret
  - Institute of Integrative Cell Biology (I2BC), Université Paris-Saclay, CNRS, CEA, 1 avenue de la Terrasse, 91190 Gif-sur-Yvette, France
  - IHU PRISM, Gustave Roussy, 114 rue Edouard Vaillant, 94800 Villejuif, France
|
26
|
Pechlivanis N, Togkousidis A, Tsagiopoulou M, Sgardelis S, Kappas I, Psomopoulos F. A Computational Framework for Pattern Detection on Unaligned Sequences: An Application on SARS-CoV-2 Data. Front Genet 2021; 12:618170. [PMID: 34122498 PMCID: PMC8194296 DOI: 10.3389/fgene.2021.618170] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/16/2020] [Accepted: 05/04/2021] [Indexed: 11/13/2022] Open
Abstract
The exponential growth of available genome sequences has spurred research on pattern detection with the aim of extracting evolutionary signal. Traditional approaches, such as multiple sequence alignment, rely on positional homology in order to reconstruct the phylogenetic history of taxa. Yet mining information from the plethora of biological data and delineating species on a genetic basis remains an extremely difficult problem. Multiple algorithms and techniques have been developed in order to approach the problem multidimensionally. Here, we propose a computational framework for identifying potentially meaningful features based on k-mers retrieved from unaligned sequence data. Specifically, we have developed a process which makes use of unsupervised learning techniques in order to identify characteristic k-mers of the input dataset across a range of different k-values and within a reasonable time frame. We use these k-mers as features for clustering the input sequences and identifying differences between the distributions of k-mers across the dataset. The developed algorithm is part of an innovative and promising approach both to the problem of grouping sequence data based on their inherent characteristic features and to the study of changes in the distributions of k-mers as the k-value varies within a range of values. Our framework is developed entirely in Python as open-source software licensed under the MIT License, and is freely available at https://github.com/BiodataAnalysisGroup/kmerAnalyzer.
Affiliation(s)
- Nikolaos Pechlivanis
  - Institute of Applied Biosciences, Centre for Research and Technology Hellas, Thessaloniki, Greece
  - Department of Genetics, Development and Molecular Biology, School of Biology, Aristotle University of Thessaloniki, Thessaloniki, Greece
- Anastasios Togkousidis
  - Institute of Applied Biosciences, Centre for Research and Technology Hellas, Thessaloniki, Greece
- Maria Tsagiopoulou
  - Institute of Applied Biosciences, Centre for Research and Technology Hellas, Thessaloniki, Greece
- Stefanos Sgardelis
  - Department of Ecology, School of Biology, Aristotle University of Thessaloniki, Thessaloniki, Greece
- Ilias Kappas
  - Department of Genetics, Development and Molecular Biology, School of Biology, Aristotle University of Thessaloniki, Thessaloniki, Greece
- Fotis Psomopoulos
  - Institute of Applied Biosciences, Centre for Research and Technology Hellas, Thessaloniki, Greece
|
27
|
Abstract
de Bruijn graphs play an essential role in bioinformatics, yet they lack a universal scalable representation. Here, we introduce simplitigs as a compact, efficient, and scalable representation, and ProphAsm, a fast algorithm for their computation. For the example of assemblies of model organisms and two bacterial pan-genomes, we compare simplitigs to unitigs, the best existing representation, and demonstrate that simplitigs provide a substantial improvement in the cumulative sequence length and their number. When combined with the commonly used Burrows-Wheeler Transform index, simplitigs reduce memory, and index loading and query times, as demonstrated with large-scale examples of GenBank bacterial pan-genomes.
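A greedy construction in the spirit of simplitigs can be sketched as follows: pick any unused k-mer, extend it to the right and then to the left for as long as an unused neighbouring k-mer exists, and repeat until every k-mer is covered exactly once. This is a simplified illustration, not ProphAsm itself. It assumes a single strand (no reverse complements or canonical k-mers) and k of at least 2, and the function names are hypothetical.

```python
def kmers_of(seqs, k):
    """Set of all k-mers occurring in a collection of sequences."""
    return {s[i:i + k] for s in seqs for i in range(len(s) - k + 1)}

def greedy_simplitigs(kmers):
    """Cover a k-mer set with strings so that each k-mer appears exactly once."""
    remaining = set(kmers)
    simplitigs = []
    while remaining:
        seed = min(remaining)  # deterministic choice; any unused k-mer would do
        remaining.remove(seed)
        k = len(seed)
        contig = seed
        # Extend to the right while an unused successor k-mer exists.
        extended = True
        while extended:
            extended = False
            for base in "ACGT":
                candidate = contig[-(k - 1):] + base
                if candidate in remaining:
                    remaining.remove(candidate)
                    contig += base
                    extended = True
                    break
        # Extend to the left while an unused predecessor k-mer exists.
        extended = True
        while extended:
            extended = False
            for base in "ACGT":
                candidate = base + contig[:k - 1]
                if candidate in remaining:
                    remaining.remove(candidate)
                    contig = base + contig
                    extended = True
                    break
        simplitigs.append(contig)
    return simplitigs

kmer_set = kmers_of(["ACGTTGCA"], 3)
result = greedy_simplitigs(kmer_set)
print(result)
```

For the six 3-mers of `ACGTTGCA` the sketch returns a single string of 8 characters, compared with 18 characters if each k-mer were stored separately. This is the kind of reduction in cumulative sequence length that the abstract refers to.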
Affiliation(s)
- Karel Břinda
  - Department of Biomedical Informatics and Laboratory of Systems Pharmacology, Harvard Medical School, Boston, USA and Broad Institute of MIT and Harvard, Cambridge, USA
  - Center for Communicable Disease Dynamics, Department of Epidemiology, Harvard T.H. Chan School of Public Health, Boston, USA
- Michael Baym
  - Department of Biomedical Informatics and Laboratory of Systems Pharmacology, Harvard Medical School, Boston, USA and Broad Institute of MIT and Harvard, Cambridge, USA
- Gregory Kucherov
  - CNRS/LIGM Univ Gustave Eiffel, Marne-la-Vallée, France
  - Skolkovo Institute of Science and Technology, Moscow, Russia
|
28
|
Kaplinski L, Möls M, Puurand T, Pajuste FD, Remm M. KATK: Fast genotyping of rare variants directly from unmapped sequencing reads. Hum Mutat 2021; 42:777-786. [PMID: 33715282 DOI: 10.1002/humu.24197] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/14/2021] [Revised: 03/04/2021] [Accepted: 03/05/2021] [Indexed: 11/06/2022]
Abstract
KATK is a fast and accurate software tool for calling variants directly from raw next-generation sequencing reads. It uses predefined k-mers to retrieve only the reads of interest from the FASTQ file and calls genotypes by aligning retrieved reads locally. KATK does not use data about known polymorphisms and has NC (no call) as the default genotype. The reference or variant allele is called only if there is sufficient evidence for their presence in data. Thus it is not biased against rare variants or de-novo mutations. With simulated datasets, we achieved a false-negative rate of 0.23% (sensitivity 99.77%) and a false discovery rate of 0.19%. Calling all human exonic regions with KATK requires 1-2 h, depending on sequencing coverage.
Affiliation(s)
- Lauris Kaplinski
  - Department of Bioinformatics, Institute of Molecular and Cell Biology, University of Tartu, Tartu, Estonia
- Märt Möls
  - Department of Bioinformatics, Institute of Molecular and Cell Biology, University of Tartu, Tartu, Estonia
- Tarmo Puurand
  - Department of Bioinformatics, Institute of Molecular and Cell Biology, University of Tartu, Tartu, Estonia
- Fanny-Dhelia Pajuste
  - Department of Bioinformatics, Institute of Molecular and Cell Biology, University of Tartu, Tartu, Estonia
- Maido Remm
  - Department of Bioinformatics, Institute of Molecular and Cell Biology, University of Tartu, Tartu, Estonia
|
29
|
Shokrof M, Brown CT, Mansour TA. MQF and buffered MQF: quotient filters for efficient storage of k-mers with their counts and metadata. BMC Bioinformatics 2021; 22:71. [PMID: 33593271 PMCID: PMC7885209 DOI: 10.1186/s12859-021-03996-x] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 09/15/2020] [Accepted: 02/04/2021] [Indexed: 11/30/2022] Open
Abstract
Background Specialized data structures are required for online algorithms to efficiently handle large sequencing datasets. The counting quotient filter (CQF), a compact hashtable, can efficiently store k-mers with a skewed distribution.
Result Here, we present the mixed-counters quotient filter (MQF) as a new variant of the CQF with novel counting and labeling systems. The new counting system adapts to a wider range of data distributions for increased space efficiency and is faster than the CQF for insertions and queries in most of the tested scenarios. A buffered version of the MQF can offload storage to disk, trading speed of insertions and queries for a significant memory reduction. The labeling system provides a flexible framework for assigning labels to member items while maintaining good data locality and a concise memory representation. These labels serve as a minimal perfect hash function but are ~ tenfold faster than BBhash, with no need to re-analyze the original data for further insertions or deletions. Conclusions The MQF is a flexible and efficient data structure that extends our ability to work with high throughput sequencing data.
Affiliation(s)
- Moustafa Shokrof
  - Department of Computer Science, University of California, Davis, CA, USA
- C Titus Brown
  - Department of Population Health and Reproduction, School of Veterinary Medicine, University of California, Davis, CA, USA
- Tamer A Mansour
  - Department of Population Health and Reproduction, School of Veterinary Medicine, University of California, Davis, CA, USA
  - Department of Clinical Pathology, School of Medicine, University of Mansoura, Mansoura, Egypt
|
30
|
Abstract
Minimizers are widely used to select subsets of fixed-length substrings (k-mers) from biological sequences in applications ranging from read mapping to taxonomy prediction and indexing of large datasets. The minimizer of a string of w consecutive k-mers is the k-mer with smallest value according to an ordering of all k-mers. Syncmers are defined here as a family of alternative methods which select k-mers by inspecting the position of the smallest-valued substring of length s < k within the k-mer. For example, a closed syncmer is selected if its smallest s-mer is at the start or end of the k-mer. At least one closed syncmer must be found in every window of length (k - s) k-mers. Unlike a minimizer, a syncmer is identified by its sequence alone, and is therefore synchronized in the following sense: if a given k-mer is selected from one sequence, it will also be selected from any other sequence. Also, minimizers can be deleted by mutations in flanking sequence, which cannot happen with syncmers. Experiments on minimizers with parameters used in the minimap2 read mapper and Kraken taxonomy prediction algorithm respectively show that syncmers can simultaneously achieve both lower density and higher conservation compared to minimizers.
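The two selection rules described above can be compared in a few lines. This sketch assumes plain lexicographic order for both k-mers and s-mers (real tools usually use a hash-based order) and, in the case of ties, treats a k-mer as a closed syncmer if any smallest s-mer sits at its start or end. The function names are hypothetical.

```python
def kmers(seq, k):
    """All k-mers of seq, in order of their start position."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def minimizer_positions(seq, k, w):
    """Start positions of the minimizer of every window of w consecutive k-mers."""
    km = kmers(seq, k)
    picked = set()
    for start in range(len(km) - w + 1):
        # Smallest k-mer in the window; ties go to the leftmost position.
        picked.add(min(range(start, start + w), key=lambda i: (km[i], i)))
    return sorted(picked)

def is_closed_syncmer(kmer, s):
    """True if a smallest s-mer of kmer lies at its first or last position."""
    smers = kmers(kmer, s)
    smallest = min(smers)
    return smers[0] == smallest or smers[-1] == smallest

def closed_syncmer_positions(seq, k, s):
    """Start positions of all k-mers in seq that are closed syncmers."""
    return [i for i, km in enumerate(kmers(seq, k)) if is_closed_syncmer(km, s)]

print(minimizer_positions("ACGTACGT", k=3, w=2))
print(is_closed_syncmer("ACGTT", 2), is_closed_syncmer("CGACT", 2))
```

Note the difference in what each function needs: `is_closed_syncmer` looks only at the k-mer itself, whereas `minimizer_positions` must see the neighbouring k-mers in the window. That is the context independence the abstract describes.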
|
31
|
Bernard G, Stephens TG, González-Pech RA, Chan CX. Inferring Phylogenomic Relationship of Microbes Using Scalable Alignment-Free Methods. Methods Mol Biol 2021; 2242:69-76. [PMID: 33961218 DOI: 10.1007/978-1-0716-1099-2_5] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Indexed: 12/27/2022]
Abstract
Inferring phylogenetic relationships among hundreds or thousands of microbial genomes is an increasingly common task. The conventional phylogenetic approach adopts multiple sequence alignment to compare gene-by-gene, concatenated multigene or whole-genome sequences, from which a phylogenetic tree would be inferred. These alignments follow the implicit assumption of full-length contiguity among homologous sequences. However, common events in microbial genome evolution (e.g., structural rearrangements and genetic recombination) violate this assumption. Moreover, aligning hundreds or thousands of sequences is computationally intensive and not scalable to the rate at which genome data are generated. Therefore, alignment-free methods present an attractive alternative strategy. Here we describe a scalable alignment-free strategy to infer phylogenetic relationships using complete genome sequences of bacteria and archaea, based on short subsequences of length k (k-mers). We describe how this strategy can be extended to infer evolutionary relationships beyond a tree-like structure, to better capture both vertical and lateral signals of microbial evolution.
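As a minimal sketch of how k-mers can replace alignment as the basis for a distance matrix, the example below uses the Jaccard distance between k-mer sets. The choice of Jaccard distance is an assumption made for illustration; the abstract does not state which k-mer statistic the chapter uses, and the genome names and sequences are invented.

```python
def kmer_set(seq, k):
    """Set of distinct k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def jaccard_distance(a, b):
    """1 minus the fraction of k-mers shared between two sets."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)

def distance_matrix(genomes, k):
    """Pairwise k-mer distances for a dict mapping genome name to sequence."""
    sets = {name: kmer_set(seq, k) for name, seq in genomes.items()}
    return {a: {b: jaccard_distance(sets[a], sets[b]) for b in sets} for a in sets}

# Toy genomes (hypothetical): A and B differ at one base, C is unrelated.
genomes = {"A": "ACGTACGTGG", "B": "ACGTACGTGC", "C": "TTTTTTTTTT"}
dist = distance_matrix(genomes, k=4)
print(round(dist["A"]["B"], 3), dist["A"]["C"])
```

The resulting matrix could then be passed to a distance-based tree method such as neighbour joining, or to a network method when lateral signal is of interest.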
|
32
|
Song K, Wright FA, Zhou YH. Systematic Comparisons for Composition Profiles, Taxonomic Levels, and Machine Learning Methods for Microbiome-Based Disease Prediction. Front Mol Biosci 2020; 7:610845. [PMID: 33392266 PMCID: PMC7772236 DOI: 10.3389/fmolb.2020.610845] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.0] [Received: 09/27/2020] [Accepted: 11/25/2020] [Indexed: 12/12/2022] Open
Abstract
Microbiome composition profiles generated from 16S rRNA sequencing have been extensively studied for their usefulness in phenotype trait prediction, including for complex diseases such as diabetes and obesity. These microbiome compositions have typically been quantified in the form of Operational Taxonomic Unit (OTU) count matrices. However, alternate approaches such as Amplicon Sequence Variants (ASV) have been used, as well as the direct use of k-mer sequence counts. The overall effect of these different types of predictors when used in concert with various machine learning methods has been difficult to assess, due to varied combinations described in the literature. Here we provide an in-depth investigation of more than 1,000 combinations of these three clustering/counting methods, in combination with varied choices for normalization and filtering, grouping at various taxonomic levels, and the use of more than ten commonly used machine learning methods for phenotype prediction. The use of short k-mers, which have computational advantages and conceptual simplicity, is shown to be effective as a source for microbiome-based prediction. Among machine-learning approaches, tree-based methods show consistent, though modest, advantages in prediction accuracy. We describe the various advantages and disadvantages of combinations in analysis approaches, and provide general observations to serve as a useful guide for future trait-prediction explorations using microbiome data.
Affiliation(s)
- Kuncheng Song
  - Bioinformatics Research Center, North Carolina State University, Raleigh, NC, United States
- Fred A Wright
  - Departments of Statistics and Biological Sciences, North Carolina State University, Raleigh, NC, United States
- Yi-Hui Zhou
  - Department of Biological Sciences, North Carolina State University, Raleigh, NC, United States
|
33
|
Sen R, Fallmann J, Walter MEMT, Stadler PF. Are spliced ncRNA host genes distinct classes of lncRNAs? Theory Biosci 2020; 139:349-59. [PMID: 33219910 DOI: 10.1007/s12064-020-00330-6] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 09/13/2020] [Accepted: 11/10/2020] [Indexed: 12/03/2022]
Abstract
Many small nucleolar RNAs and many of the hairpin precursors of miRNAs are processed from long non-protein-coding host genes. In contrast to their highly conserved and heavily structured payload, the host genes feature poorly conserved sequences. Nevertheless, there is mounting evidence that the host genes have biological functions beyond their primary task of carrying a ncRNA as payload. So far, no connections between the function of the host genes and the function of their payloads have been reported. Here we investigate whether there is evidence for an association of host gene function or mechanisms with the type of payload. To assess this hypothesis we test whether the miRNA host genes (MIRHGs), snoRNA host genes (SNHGs), and other lncRNA host genes can be distinguished based on sequence and/or structure features unrelated to their payload. A positive answer would imply a functional and mechanistic correlation between host genes and their payload, provided the classification does not depend on the presence and type of the payload. A negative answer would indicate that to the extent that secondary functions are acquired, they are not strongly constrained by the prior, primary function of the payload. We find that the three classes can be distinguished reliably when the classifier is allowed to extract features from the payloads. They become virtually indistinguishable, however, as soon as only sequence and structure of parts of the host gene distal from the snoRNAs or miRNA payload is used for classification. This indicates that the functions of MIRHGs and SNHGs are largely independent of the functions of their payloads. Furthermore, there is no evidence that the MIRHGs and SNHGs form coherent classes of long non-coding RNAs distinguished by features other than their payloads.
|
34
|
Abstract
Alignment-free classification of sequences has enabled high-throughput processing of sequencing data in many bioinformatics pipelines. Much work has been done to speed up the indexing of k-mers through hash tables and other data structures. These efforts have led to very fast indexes, but because they are k-mer based, they often lack sensitivity due to sequencing errors or polymorphisms. Spaced seeds are a special type of pattern that accounts for errors or mutations. They improve sensitivity and are now routinely used instead of k-mers in many applications. The major drawback of spaced seeds is that they cannot be hashed efficiently, so their use substantially increases computation time. In this article we address the problem of efficient spaced seed hashing. We propose an iterative algorithm that combines multiple spaced seed hashes by exploiting the similarity of adjacent hash values to efficiently compute the next hash. We report a series of experiments on HTS reads hashing, with several spaced seeds. Our algorithm can compute the hashing values of spaced seeds with a speedup in the range of 3.5× to 7×, outperforming previous methods. Software and data sets are available at Iterative Spaced Seed Hashing.
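To make the notion of a spaced seed hash concrete, the sketch below computes it the straightforward way, from scratch at every position, which is the baseline that the iterative algorithm aims to improve on. It does not implement the iterative method itself. The 2-bit base encoding, the mask syntax ('1' for a required position, '0' for a wildcard) and the function names are assumptions.

```python
ENCODE = {"A": 0, "C": 1, "G": 2, "T": 3}  # 2 bits per base

def spaced_seed_hash(seq, pos, mask):
    """Hash of the bases under the '1' positions of mask, starting at pos."""
    h = 0
    for offset, bit in enumerate(mask):
        if bit == "1":
            h = (h << 2) | ENCODE[seq[pos + offset]]
    return h

def all_hashes(seq, mask):
    """Spaced seed hash at every valid position, each computed from scratch."""
    return [spaced_seed_hash(seq, p, mask) for p in range(len(seq) - len(mask) + 1)]

# A mismatch under the wildcard leaves the spaced seed hash unchanged,
# whereas the contiguous 4-mer hash (mask "1111") differs.
print(spaced_seed_hash("ACGT", 0, "1101"), spaced_seed_hash("ACAT", 0, "1101"))
print(spaced_seed_hash("ACGT", 0, "1111"), spaced_seed_hash("ACAT", 0, "1111"))
```

The first pair of hashes is identical and the second is not, which illustrates why spaced seeds tolerate errors better than contiguous k-mers. It also shows why they are harder to hash incrementally: consecutive positions no longer share a simple sliding window.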
Affiliation(s)
- Enrico Petrucci
  - Department of Information Engineering, University of Padova, Padova, Italy
- Laurent Noé
  - CRIStAL UMR9189, Université de Lille, Lille, France
- Cinzia Pizzi
  - Department of Information Engineering, University of Padova, Padova, Italy
- Matteo Comin
  - Department of Information Engineering, University of Padova, Padova, Italy
|
35
|
Panyukov VV, Kiselev SS, Ozoline ON. Unique k-mers as Strain-Specific Barcodes for Phylogenetic Analysis and Natural Microbiome Profiling. Int J Mol Sci 2020; 21:ijms21030944. [PMID: 32023871 PMCID: PMC7037511 DOI: 10.3390/ijms21030944] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Received: 11/29/2019] [Revised: 01/21/2020] [Accepted: 01/28/2020] [Indexed: 02/07/2023] Open
Abstract
The need for a comparative analysis of natural metagenomes stimulated the development of new methods for their taxonomic profiling. Alignment-free approaches based on the search for marker k-mers turned out to be capable of identifying not only species, but also strains of microorganisms with known genomes. Here, we evaluated the ability of genus-specific k-mers to distinguish eight phylogroups of Escherichia coli (A, B1, C, E, D, F, G, B2) and assessed the presence of their unique 22-mers in clinical samples from microbiomes of four healthy people and four patients with Crohn's disease. We found that a phylogenetic tree inferred from the pairwise distance matrix for unique 18-mers and 22-mers of 124 genomes was fully consistent with the topology of the tree, obtained with concatenated aligned sequences of orthologous genes. Therefore, we propose strain-specific "barcodes" for rapid phylotyping. Using unique 22-mers for taxonomic analysis, we detected microbes of all groups in human microbiomes; however, their presence in the five samples was significantly different. Pointing to the intraspecies heterogeneity of E. coli in the natural microflora, this also indicates the feasibility of further studies of the role of this heterogeneity in maintaining population homeostasis.
Affiliation(s)
- Valery V. Panyukov
  - Institute of Mathematical Problems of Biology RAS—the Branch of Keldysh Institute of Applied Mathematics of Russian Academy of Sciences, 142290 Pushchino, Russia
  - Structural and Functional Genomics Group, Federal Research Center “Pushchino Scientific Center for Biological Research of the Russian Academy of Sciences”, 142290 Pushchino, Russia
- Sergey S. Kiselev
  - Structural and Functional Genomics Group, Federal Research Center “Pushchino Scientific Center for Biological Research of the Russian Academy of Sciences”, 142290 Pushchino, Russia
  - Institute of Cell Biophysics of the Russian Academy of Sciences, 142290 Pushchino, Russia
- Olga N. Ozoline
  - Structural and Functional Genomics Group, Federal Research Center “Pushchino Scientific Center for Biological Research of the Russian Academy of Sciences”, 142290 Pushchino, Russia
  - Institute of Cell Biophysics of the Russian Academy of Sciences, 142290 Pushchino, Russia
|
36
|
Smith KN, Miller SC, Varani G, Calabrese JM, Magnuson T. Multimodal Long Noncoding RNA Interaction Networks: Control Panels for Cell Fate Specification. Genetics 2019; 213:1093-1110. [PMID: 31796550 PMCID: PMC6893379 DOI: 10.1534/genetics.119.302661] [Citation(s) in RCA: 20] [Impact Index Per Article: 4.0] [Received: 09/11/2019] [Accepted: 10/03/2019] [Indexed: 12/20/2022] Open
Abstract
Lineage specification in early development is the basis for the exquisitely precise body plan of multicellular organisms. It is therefore critical to understand cell fate decisions in early development. Moreover, for regenerative medicine, the accurate specification of cell types to replace damaged/diseased tissue is strongly dependent on identifying determinants of cell identity. Long noncoding RNAs (lncRNAs) have been shown to regulate cellular plasticity, including pluripotency establishment and maintenance, differentiation and development, yet broad phenotypic analysis and the mechanistic basis of their function remain lacking. As components of molecular condensates, lncRNAs interact with almost all classes of cellular biomolecules, including proteins, DNA, mRNAs, and microRNAs. With functions ranging from controlling alternative splicing of mRNAs to providing scaffolding upon which chromatin modifiers are assembled, it is clear that at least a subset of lncRNAs are far from the transcriptional noise they were once deemed. This review highlights the diversity of lncRNA interactions in the context of cell fate specification, and provides examples of each type of interaction in relevant developmental contexts. Also highlighted are experimental and computational approaches to study lncRNAs.
Affiliation(s)
- Keriayn N Smith
  - Department of Genetics, University of North Carolina, Chapel Hill, North Carolina 27599
- Sarah C Miller
  - Department of Genetics, University of North Carolina, Chapel Hill, North Carolina 27599
- Gabriele Varani
  - Department of Chemistry, University of Washington, Seattle, Washington 98195
- J Mauro Calabrese
  - Department of Pharmacology, University of North Carolina, Chapel Hill, North Carolina 27599
- Terry Magnuson
  - Department of Genetics, University of North Carolina, Chapel Hill, North Carolina 27599
|
37
|
Zhan ZH, Jia LN, Zhou Y, Li LP, Yi HC. BGFE: A Deep Learning Model for ncRNA-Protein Interaction Predictions Based on Improved Sequence Information. Int J Mol Sci 2019; 20:E978. [PMID: 30813451 PMCID: PMC6412311 DOI: 10.3390/ijms20040978] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.2] [Received: 01/01/2019] [Revised: 02/19/2019] [Accepted: 02/20/2019] [Indexed: 11/26/2022] Open
Abstract
The interactions between ncRNAs and proteins are critical for regulating various cellular processes in organisms, such as gene expression. However, because current experimental methods for identifying ncRNA-protein interactions are limited by their financial and material costs, a practical computational approach with convincing prediction accuracy is needed. In this study, based on protein sequences from a biological perspective, we put forward an effective deep learning method, named BGFE, to predict ncRNA-protein interactions. Protein sequences are represented by bi-gram probability features extracted from the Position Specific Scoring Matrix (PSSM), and ncRNA sequences are represented by k-mer sparse matrices. Furthermore, to extract hidden high-level feature information, a stacked auto-encoder network is employed with the stacked ensemble integration strategy. We evaluate the performance of the proposed method by using three datasets and five-fold cross-validation after classifying the features with a random forest classifier. The experimental results clearly demonstrate the effectiveness and the prediction accuracy of our approach. In general, the proposed method is helpful for ncRNA-protein interaction prediction and provides useful guidance for future biological research.
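One simple way to turn an ncRNA sequence into fixed-length k-mer features is a normalised count vector with one slot per possible k-mer. This is a sketch of the general idea only: BGFE's exact "k-mers sparse matrix" encoding is not specified in the abstract, and the dense vector, the ACGU alphabet ordering and the function names are assumptions.

```python
from itertools import product

def kmer_index(k, alphabet="ACGU"):
    """Map each possible k-mer to a fixed column index."""
    return {"".join(p): i for i, p in enumerate(product(alphabet, repeat=k))}

def kmer_frequency_vector(rna, k, alphabet="ACGU"):
    """Relative k-mer frequencies of an RNA sequence as a list of length 4**k."""
    index = kmer_index(k, alphabet)
    vec = [0.0] * len(index)
    total = 0
    for i in range(len(rna) - k + 1):
        kmer = rna[i:i + k]
        if kmer in index:  # skip k-mers containing ambiguous bases such as N
            vec[index[kmer]] += 1.0
            total += 1
    if total:
        vec = [v / total for v in vec]
    return vec

features = kmer_frequency_vector("ACGU", 2)
print(len(features), features[1], features[6], features[11])
```

For k = 2 the vector has 16 entries. Most entries are zero for short sequences, which is why sparse storage becomes attractive as k grows.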
Affiliation(s)
- Zhao-Hui Zhan
  - China University of Mining and Technology, Xuzhou 221116, China
- Li-Na Jia
  - College of Information Science and Engineering, Zaozhuang University, Zaozhuang 277100, Shandong, China
- Yong Zhou
  - China University of Mining and Technology, Xuzhou 221116, China
- Li-Ping Li
  - Xinjiang Technical Institute of Physics and Chemistry, Chinese Academy of Sciences, Urumqi 830011, China
- Hai-Cheng Yi
  - Xinjiang Technical Institute of Physics and Chemistry, Chinese Academy of Sciences, Urumqi 830011, China
|
38
|
Prodhomme C, Esselink D, Borm T, Visser RGF, van Eck HJ, Vossen JH. Comparative Subsequence Sets Analysis (CoSSA) is a robust approach to identify haplotype specific SNPs; mapping and pedigree analysis of a potato wart disease resistance gene Sen3. Plant Methods 2019; 15:60. [PMID: 31160919 PMCID: PMC6540404 DOI: 10.1186/s13007-019-0445-5] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Received: 01/18/2019] [Accepted: 05/23/2019] [Indexed: 05/21/2023]
Abstract
BACKGROUND Standard strategies to identify genomic regions involved in a specific trait variation are often limited by time- and resource-consuming genotyping methods. Other limiting pre-requisites are the phenotyping of large segregating populations or of diversity panels and the availability and quality of a closely related reference genome. To overcome these limitations, we designed efficient Comparative Subsequence Sets Analysis (CoSSA) workflows to identify haplotype specific SNPs linked to a trait of interest from Whole Genome Sequencing data. RESULTS As a model, we used the resistance to Synchytrium endobioticum pathotypes 2, 6 and 18 that co-segregated in a tetraploid full sib population. Genomic DNA from both parents, pedigree genotypes, unrelated potato varieties lacking the wart resistance traits and pools of resistant and susceptible siblings were sequenced. Set algebra and depth filtering of subsequences (k-mers) were used to delete unlinked and common SNPs and to enrich for SNPs from the haplotype(s) harboring the resistance gene(s). Using CoSSA, we identified a major and a minor effect locus. Upon comparison to the reference genome, it was inferred that the major resistance locus, referred to as Sen3, was located on the north arm of chromosome 11 between 1,259,552 and 1,519,485 bp. Furthermore, we could anchor the unanchored superscaffold DMB734 from the potato reference genome to a syntenic interval. CoSSA was also successful in identifying Sen3 in a reference genome independent way thanks to the de novo assembly of paired end reads matching haplotype specific k-mers. The de novo assembly provided more R haplotype specific polymorphisms than the corresponding region of the reference genome. CoSSA also offers possibilities for pedigree analysis. The origin of Sen3 was traced back to Ora. Finally, the diagnostic power of the haplotype specific markers was shown using a panel of 56 tetraploid varieties.
CONCLUSIONS CoSSA is an efficient, robust and versatile set of workflows for the genetic analysis of a trait of interest using WGS data. Because the WGS data are used without intermediate reads mapping, CoSSA does not require the use of a reference genome. This approach allowed the identification of Sen3 and the design of haplotype specific, diagnostic markers.
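The combination of set algebra and depth filtering on k-mers can be sketched with plain Python counters. This is a toy illustration of the principle, not the CoSSA workflow: the particular filter (present in the resistant bulk and resistant parent, absent from the susceptible bulk and susceptible parent, within a depth range) and all names and thresholds are assumptions.

```python
from collections import Counter

def count_kmers(reads, k):
    """Count every k-mer across a list of reads (forward strand only)."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts

def linked_kmers(resistant_bulk, susceptible_bulk, resistant_parent,
                 susceptible_parent, min_depth=2, max_depth=1000):
    """k-mers consistent with a haplotype found only in resistant material."""
    candidates = set()
    for kmer, depth in resistant_bulk.items():
        if not (min_depth <= depth <= max_depth):
            continue  # depth filter: drop likely errors and repeats
        if kmer in susceptible_bulk or kmer in susceptible_parent:
            continue  # set difference: remove common and unlinked k-mers
        if kmer not in resistant_parent:
            continue  # set intersection: must be inherited from the resistant parent
        candidates.add(kmer)
    return candidates

# Toy reads (hypothetical): the R haplotype has G where the S haplotype has T.
k = 4
r_bulk = count_kmers(["ACGTGACC"] * 3, k)
s_bulk = count_kmers(["ACGTTACC"] * 3, k)
r_parent = count_kmers(["ACGTGACC", "ACGTTACC"], k)  # heterozygous parent
s_parent = count_kmers(["ACGTTACC"], k)

markers = linked_kmers(r_bulk, s_bulk, r_parent, s_parent)
print(sorted(markers))
```

Only the four k-mers spanning the variant base survive the filters, and the shared k-mer `ACGT` is removed. No reference genome is consulted at any step.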
Affiliation(s)
- Charlotte Prodhomme
  - Wageningen UR Plant Breeding, Droevendaalsesteeg 1, 6708 PB Wageningen, The Netherlands
- Danny Esselink
  - Wageningen UR Plant Breeding, Droevendaalsesteeg 1, 6708 PB Wageningen, The Netherlands
- Theo Borm
  - Wageningen UR Plant Breeding, Droevendaalsesteeg 1, 6708 PB Wageningen, The Netherlands
- Richard G. F. Visser
  - Wageningen UR Plant Breeding, Droevendaalsesteeg 1, 6708 PB Wageningen, The Netherlands
- Herman J. van Eck
  - Wageningen UR Plant Breeding, Droevendaalsesteeg 1, 6708 PB Wageningen, The Netherlands
- Jack H. Vossen
  - Wageningen UR Plant Breeding, Droevendaalsesteeg 1, 6708 PB Wageningen, The Netherlands
|
39
|
Abstract
Background Spaced seeds, i.e. patterns in which some fixed positions are allowed to be wildcards, play a crucial role in several bioinformatics applications involving substring counting and indexing, often providing better sensitivity than k-mer-based approaches. K-mer-based approaches are usually fast, being based on efficient hashing and indexing that exploits the large overlap between consecutive k-mers. Spaced-seed hashing is not as straightforward, and it is usually computed from scratch for each position in the input sequence. Recently, the FSH (Fast Spaced seed Hashing) approach was proposed to reduce the time required to compute the spaced-seed hashing of DNA sequences, with a speed-up of about 1.5 with respect to standard hashing computation. Results In this work we propose a novel algorithm, Fast Indexing for Spaced seed Hashing (FISH), based on the indexing of small blocks that can be combined to obtain the hashing of spaced seeds of any length. The method exploits the fast computation of the hashing of runs of consecutive 1s in the spaced seeds, which basically correspond to k-mers of the length of the run. Conclusions We ran several experiments, on NGS data from simulated and synthetic metagenomic experiments, to assess the time required for the computation of the hashing for each position in each read with respect to several spaced seeds. In our experiments, FISH can compute the hashing values of spaced seeds with a speedup, with respect to the traditional approach, of between 1.9x and 6.03x, depending on the structure of the spaced seeds. Electronic supplementary material The online version of this article (10.1186/s12859-018-2415-8) contains supplementary material, which is available to authorized users.
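To make the idea of combining hashes of runs of consecutive 1s concrete, here is a small Python sketch. It is an illustration under stated assumptions, not the authors' FISH code: the 2-bit nucleotide encoding, the function names and the way run hashes are concatenated are all choices made here for clarity.

```python
ENCODE = {"A": 0, "C": 1, "G": 2, "T": 3}  # assumed 2-bit encoding

def spaced_seed_hash(seq, start, seed):
    """Naive hash: concatenate 2-bit codes of the positions where seed has '1'."""
    h = 0
    for offset, flag in enumerate(seed):
        if flag == "1":
            h = (h << 2) | ENCODE[seq[start + offset]]
    return h

def all_spaced_hashes(seq, seed):
    """Recompute the hash from scratch at every position (the baseline)."""
    return [spaced_seed_hash(seq, i, seed) for i in range(len(seq) - len(seed) + 1)]

def seed_runs(seed):
    """Split a seed into runs of consecutive 1s, as (offset, length) pairs."""
    runs, start = [], None
    for i, flag in enumerate(seed + "0"):
        if flag == "1" and start is None:
            start = i
        elif flag != "1" and start is not None:
            runs.append((start, i - start))
            start = None
    return runs

def run_based_hashes(seq, seed):
    """Same hashes, built by combining rolling k-mer hashes of each run."""
    runs = seed_runs(seed)
    kmer_hash = {}
    for length in {length for _, length in runs}:
        mask = (1 << (2 * length)) - 1
        hashes, h = [], 0
        for i, base in enumerate(seq):
            h = ((h << 2) | ENCODE[base]) & mask  # rolling update
            if i >= length - 1:
                hashes.append(h)  # hashes[j] is the hash of seq[j:j+length]
        kmer_hash[length] = hashes
    out = []
    for i in range(len(seq) - len(seed) + 1):
        h = 0
        for offset, length in runs:
            h = (h << (2 * length)) | kmer_hash[length][i + offset]
        out.append(h)
    return out

print(run_based_hashes("ACGTACGT", "1101") == all_spaced_hashes("ACGTACGT", "1101"))
```

Each run of 1s is an ordinary k-mer, so its hash can be updated in constant time from the previous position; the spaced-seed hash is then assembled from a handful of run hashes instead of one lookup per selected position.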
Affiliation(s)
- Samuele Girotto
  - Department of Information Engineering, University of Padova, via Gradenigo 6/A, Padova, Italy
- Matteo Comin
  - Department of Information Engineering, University of Padova, via Gradenigo 6/A, Padova, Italy
- Cinzia Pizzi
  - Department of Information Engineering, University of Padova, via Gradenigo 6/A, Padova, Italy
|
40
|
Abstract
Microbial genomes have been shaped by parent-to-offspring (vertical) descent and lateral genetic transfer. These processes can be distinguished by alignment-based inference and comparison of phylogenetic trees for individual gene families, but this approach is not scalable to whole-genome sequences, and a tree-like structure does not adequately capture how these processes impact microbial physiology. Here we adopted alignment-free approaches based on k-mer statistics to infer phylogenomic networks involving 2,783 completely sequenced bacterial and archaeal genomes and compared the contributions of rRNA, protein-coding, and plasmid sequences to these networks. Our results show that the phylogenomic signal arising from ribosomal RNAs is strong and extends broadly across all taxa, whereas that from plasmids is strong but restricted to closely related groups, particularly Proteobacteria. However, the signal from the other chromosomal regions is restricted in breadth. We show that mean k-mer similarity can correlate with taxonomic rank. We also link the implicated k-mers to genome annotation (thus, functions) and define core k-mers (thus, core functions) in specific phyletic groups. Highly conserved functions in most phyla include amino acid metabolism and transport as well as energy production and conversion. Intracellular trafficking and secretion are the most prominent core functions among Spirochaetes, whereas energy production and conversion are not highly conserved among the largely parasitic or commensal Tenericutes. These observations suggest that differential conservation of functions relates to niche specialization and evolutionary diversification of microbes. Our results demonstrate that k-mer approaches can be used to efficiently identify phylogenomic signals and conserved core functions at the multigenome scale. IMPORTANCE Genome evolution of microbes involves parent-to-offspring descent, and lateral genetic transfer that convolutes the phylogenomic signal. 
This study investigated phylogenomic signals among thousands of microbial genomes based on short subsequences without using multiple-sequence alignment. The signal from ribosomal RNAs is strong across all taxa, and the signal of plasmids is strong only in closely related groups, particularly Proteobacteria. However, the signal from other chromosomal regions (∼99% of the genomes) is remarkably restricted in breadth. The similarity of subsequences is found to correlate with taxonomic rank and informs on conserved and differential core functions relative to niche specialization and evolutionary diversification of microbes. These results provide a comprehensive, alignment-free view of microbial genome evolution as a network, beyond a tree-like structure.
Affiliation(s)
- Guillaume Bernard
  - Institute for Molecular Bioscience, The University of Queensland, Brisbane, QLD, Australia
- Paul Greenfield
  - Commonwealth Scientific and Industrial Research Organisation (CSIRO), North Ryde, NSW, Australia
- Mark A. Ragan
  - Institute for Molecular Bioscience, The University of Queensland, Brisbane, QLD, Australia
- Cheong Xin Chan
  - Institute for Molecular Bioscience, The University of Queensland, Brisbane, QLD, Australia
  - School of Chemistry and Molecular Biosciences, The University of Queensland, Brisbane, QLD, Australia
|
41
|
Mahé P, Tournoud M. Predicting bacterial resistance from whole-genome sequences using k-mers and stability selection. BMC Bioinformatics 2018; 19:383. [PMID: 30332990 PMCID: PMC6192184 DOI: 10.1186/s12859-018-2403-z] [Citation(s) in RCA: 24] [Impact Index Per Article: 4.0] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/13/2018] [Accepted: 10/01/2018] [Indexed: 12/29/2022] Open
Abstract
Background Several studies demonstrated the feasibility of predicting bacterial antibiotic resistance phenotypes from whole-genome sequences, the prediction process usually amounting to detecting the presence of genes involved in antibiotic resistance mechanisms, or of specific mutations, previously identified from a training panel of strains, within these genes. We address the problem from the supervised statistical learning perspective, not relying on prior information about such resistance factors. We rely on a k-mer based genotyping scheme and a logistic regression model, thereby combining several k-mers into a probabilistic model. To identify a small yet predictive set of k-mers, we rely on the stability selection approach (Meinshausen et al., J R Stat Soc Ser B 72:417–73, 2010), that consists in penalizing logistic regression models with a Lasso penalty, coupled with extensive resampling procedures. Results Using public datasets, we applied the resulting classifiers to two bacterial species and achieved predictive performance equivalent to state of the art. The models are extremely sparse, involving 1 to 8 k-mers per antibiotic, hence are remarkably easy and fast to evaluate on new genomes (from raw reads to assemblies). Conclusion Our proof of concept therefore demonstrates that stability selection is a powerful approach to investigate bacterial genotype-phenotype relationships. Electronic supplementary material The online version of this article (10.1186/s12859-018-2403-z) contains supplementary material, which is available to authorized users.
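The claim that such sparse models are easy to evaluate on new genomes can be sketched as follows. This is only an illustration: the k-mers, coefficients and intercept below are invented, and the stability-selection training step itself is not shown.

```python
import math

def kmer_set(sequence, k):
    """Set of all k-mers present in a sequence (presence/absence genotyping)."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}

def resistance_probability(sequence, weights, intercept):
    """Evaluate a sparse logistic model whose features are k-mer presence flags.

    `weights` maps each selected k-mer to its coefficient; all k-mers are
    assumed to have the same length.
    """
    k = len(next(iter(weights)))
    present = kmer_set(sequence, k)
    score = intercept + sum(w for kmer, w in weights.items() if kmer in present)
    return 1.0 / (1.0 + math.exp(-score))

# Hypothetical model with two k-mers (values are made up for illustration).
weights = {"ACGT": 2.0, "TTTT": -1.0}
intercept = -1.0

p_with = resistance_probability("GGACGTGG", weights, intercept)
p_without = resistance_probability("GGGGGGGG", weights, intercept)
print(round(p_with, 3), round(p_without, 3))
```

With only a few k-mers in the model, prediction reduces to a handful of set lookups and one logistic transform, which is why evaluation works equally well on assemblies or on k-mers extracted from raw reads.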
Affiliation(s)
- Pierre Mahé
  - bioMérieux, Chemin de l'Orme, Marcy l'Etoile, 69280, France
- Maud Tournoud
  - bioMérieux, Chemin de l'Orme, Marcy l'Etoile, 69280, France
|
42
|
Zhan ZH, You ZH, Li LP, Zhou Y, Yi HC. Accurate Prediction of ncRNA-Protein Interactions From the Integration of Sequence and Evolutionary Information. Front Genet 2018; 9:458. [PMID: 30349558 PMCID: PMC6186793 DOI: 10.3389/fgene.2018.00458] [Citation(s) in RCA: 23] [Impact Index Per Article: 3.8] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/07/2018] [Accepted: 09/19/2018] [Indexed: 12/18/2022] Open
Abstract
Non-coding RNA (ncRNA) plays a crucial role in numerous biological processes including gene expression and post-transcriptional gene regulation. The biological function of ncRNA is mostly realized by binding with related proteins. Therefore, an accurate understanding of interactions between ncRNA and protein has a significant impact on current biological research. The major challenge at this stage is the large amount of time and resources that traditional interaction-pattern prediction methods spend on classification. Fortunately, an efficient classifier named LightGBM can alleviate this long running time. In this study, we employed LightGBM as the integrated classifier and proposed a novel computational model for predicting ncRNA and protein interactions. More specifically, the pseudo-Zernike Moments and singular value decomposition algorithm are employed to extract the discriminative features from protein and ncRNA sequences. On four widely used datasets, RPI369, RPI488, RPI1807, and RPI2241, we evaluated the performance of LGBM and obtained superior performance, with AUCs of 0.799, 0.914, 0.989, and 0.762, respectively. The experimental results of 10-fold cross-validation showed that the proposed method performs much better than existing methods in predicting ncRNA-protein interaction patterns and could be used as a useful tool in proteomics research.
Affiliation(s)
- Zhao-Hui Zhan
  - School of Computer Science and Technology, China University of Mining and Technology, Xuzhou, China
- Zhu-Hong You
  - Xinjiang Technical Institute of Physics and Chemistry, Chinese Academy of Sciences, Ürümqi, China
- Li-Ping Li
  - Xinjiang Technical Institute of Physics and Chemistry, Chinese Academy of Sciences, Ürümqi, China
- Yong Zhou
  - School of Computer Science and Technology, China University of Mining and Technology, Xuzhou, China
- Hai-Cheng Yi
  - Xinjiang Technical Institute of Physics and Chemistry, Chinese Academy of Sciences, Ürümqi, China
|
43
|
Abstract
Genome-wide association studies (GWAS) rely on microarrays, or more recently mapping of sequencing reads, to genotype individuals. The reliance on prior sequencing of a reference genome limits the scope of association studies, and also precludes mapping associations outside of the reference. We present an alignment-free method for association studies of categorical phenotypes based on counting k-mers in whole-genome sequencing reads, testing for associations directly between k-mers and the trait of interest, and local assembly of the statistically significant k-mers to identify sequence differences. An analysis of the 1000 Genomes data shows that sequences identified by our method largely agree with results obtained using the standard approach. However, unlike standard GWAS, our method identifies associations with structural variations and sites not present in the reference genome. We also demonstrate that population stratification can be inferred from k-mers. Finally, application to an E. coli dataset on ampicillin resistance validates the approach.
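The step of testing k-mers directly against a categorical phenotype can be sketched in Python. The abstract does not name the test statistic, so a one-sided Fisher exact test on k-mer presence/absence is substituted here purely for illustration; the function names, the significance threshold and the toy sequences are assumptions, and neither multiple-testing correction nor the local assembly step is included.

```python
from math import comb

def kmers(seq, k):
    """Set of k-mers present in one sample's sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def fisher_one_sided(a, b, c, d):
    """P(X >= a) for the 2x2 table [[a, b], [c, d]], where
    a/b = cases with/without the k-mer and c/d = controls with/without it."""
    n_cases, n_with, total = a + b, a + c, a + b + c + d
    upper = min(n_cases, n_with)
    tail = sum(comb(n_cases, x) * comb(total - n_cases, n_with - x)
               for x in range(a, upper + 1))
    return tail / comb(total, n_with)

def associated_kmers(cases, controls, k, alpha=0.05):
    """Return {k-mer: p-value} for k-mers enriched in cases at level alpha."""
    case_sets = [kmers(s, k) for s in cases]
    control_sets = [kmers(s, k) for s in controls]
    hits = {}
    for kmer in set().union(*case_sets, *control_sets):
        a = sum(kmer in s for s in case_sets)
        c = sum(kmer in s for s in control_sets)
        p = fisher_one_sided(a, len(case_sets) - a, c, len(control_sets) - c)
        if p <= alpha:
            hits[kmer] = p
    return hits

# Toy data: "CGT" and "GTT" occur only in cases, "ACG" occurs everywhere.
hits = associated_kmers(["ACGTT"] * 4, ["ACGAA"] * 4, k=3)
print(sorted(hits))
```

Because the test runs on k-mers rather than on positions in a reference, a significant k-mer can point to sequence that is absent from the reference genome altogether.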
Affiliation(s)
- Atif Rahman
  - Department of Electrical Engineering and Computer Sciences, University of California, Berkeley, Berkeley, United States
- Michael Eisen
  - Department of Molecular and Cell Biology, University of California, Berkeley, Berkeley, United States
  - Howard Hughes Medical Institute, University of California, Berkeley, Berkeley, United States
- Lior Pachter
  - Department of Electrical Engineering and Computer Sciences, University of California, Berkeley, Berkeley, United States
  - Department of Molecular and Cell Biology, University of California, Berkeley, Berkeley, United States
  - Department of Mathematics, University of California, Berkeley, Berkeley, United States
|
44
|
Lin J, Wei J, Adjeroh D, Jiang BH, Jiang Y. SSAW: A new sequence similarity analysis method based on the stationary discrete wavelet transform. BMC Bioinformatics 2018; 19:165. [PMID: 29720081 PMCID: PMC5930706 DOI: 10.1186/s12859-018-2155-9] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/19/2017] [Accepted: 04/11/2018] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Alignment-free sequence similarity analysis methods often lead to significant savings in computational time over alignment-based counterparts. RESULTS A new alignment-free sequence similarity analysis method, called SSAW, is proposed. SSAW stands for Sequence Similarity Analysis using the Stationary Discrete Wavelet Transform (SDWT). It extracts k-mers from a sequence, then maps each k-mer to a complex number field. Then, the series of complex numbers formed are transformed into feature vectors using the stationary discrete wavelet transform. After these steps, the original sequence is turned into a feature vector with numeric values, which can then be used for clustering and/or classification. CONCLUSIONS Using two different types of applications, namely, clustering and classification, we compared SSAW against the state-of-the-art alignment-free sequence analysis methods. SSAW demonstrates competitive or superior performance in terms of standard indicators, such as accuracy, F-score, precision, and recall. The running time was significantly better in most cases. These results make SSAW a suitable method for sequence analysis, especially given the rapidly increasing volumes of sequence data required by most modern applications.
Affiliation(s)
- Jie Lin
  - College of Mathematics and Informatics, Fujian Normal University, Fuzhou, 350108, People's Republic of China
- Jing Wei
  - College of Mathematics and Informatics, Fujian Normal University, Fuzhou, 350108, People's Republic of China
- Donald Adjeroh
  - Lane Department of Computer Science and Electrical Engineering, West Virginia University, Morgantown, 26506, WV, USA
- Bing-Hua Jiang
  - Department of Pathology, University of Iowa, Iowa City, 52242, Iowa, USA
- Yue Jiang
  - College of Mathematics and Informatics, Fujian Normal University, Fuzhou, 350108, People's Republic of China
|
45
|
Adjeroh D, Allaga M, Tan J, Lin J, Jiang Y, Abbasi A, Zhou X. Feature-Based and String-Based Models for Predicting RNA-Protein Interaction. Molecules 2018; 23:E697. [PMID: 29562711 PMCID: PMC6017419 DOI: 10.3390/molecules23030697] [Citation(s) in RCA: 12] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/21/2017] [Revised: 02/17/2018] [Accepted: 02/21/2018] [Indexed: 12/13/2022] Open
Abstract
In this work, we study two approaches for the problem of RNA-Protein Interaction (RPI). In the first approach, we use a feature-based technique by combining extracted features from both sequences and secondary structures. The feature-based approach enhanced the prediction accuracy as it included much more available information about the RNA-protein pairs. In the second approach, we apply search algorithms and data structures to extract effective string patterns for prediction of RPI, using both sequence information (protein and RNA sequences), and structure information (protein and RNA secondary structures). This led to different string-based models for predicting interacting RNA-protein pairs. We show results that demonstrate the effectiveness of the proposed approaches, including comparative results against leading state-of-the-art methods.
Affiliation(s)
- Donald Adjeroh
  - Lane Department of Computer Science and Electrical Engineering, West Virginia University, Morgantown, WV 26508, USA
- Maen Allaga
  - Lane Department of Computer Science and Electrical Engineering, West Virginia University, Morgantown, WV 26508, USA
- Jun Tan
  - Lane Department of Computer Science and Electrical Engineering, West Virginia University, Morgantown, WV 26508, USA
- Jie Lin
  - Faculty of Software, Fujian Normal University, Fuzhou 350108, China
- Yue Jiang
  - Faculty of Software, Fujian Normal University, Fuzhou 350108, China
- Ahmed Abbasi
  - McIntire School of Commerce, University of Virginia, Charlottesville, VA 22904, USA
- Xiaobo Zhou
  - McGovern Medical School, and School of Biomedical Informatics, The University of Texas Health Science Center at Houston (UTHealth), Houston, TX 77030, USA
|
46
|
Amado Cattáneo RM, Diambra L, McCarthy AN. Phylogenomics of tomato chloroplasts using assembly and alignment-free method. Mitochondrial DNA A DNA Mapp Seq Anal 2018; 29:1128-1138. [PMID: 29338473 DOI: 10.1080/24701394.2017.1419214] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/18/2022]
Abstract
Phylogenetics and population genetics are central disciplines in evolutionary biology. Both are based on the comparison of single DNA sequences, or a concatenation of a number of these. However, with the advent of next-generation DNA sequencing technologies, the approaches that consider large genomic data sets are of growing importance for the elucidation of evolutionary relationships among species. Among these approaches, the assembly and alignment-free methods which allow an efficient distance computation and phylogeny reconstruction are of great importance. However, it is not yet clear under what quality conditions and abundance of genomic data such methods are able to infer phylogenies accurately. In the present study we assess the method originally proposed by Fan et al. for whole genome data, in the elucidation of Tomatoes' chloroplast phylogenetics using short read sequences. We find that this assembly and alignment-free method is capable of reproducing previous results under conditions of high coverage, given that low frequency k-mers (i.e. error prone data) are effectively filtered out. Finally, we present a complete chloroplast phylogeny for the best data quality candidates of the recently published 360 tomato genomes.
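The filtering of low-frequency (error-prone) k-mers mentioned in this abstract can be illustrated with a short Python sketch. The threshold value and function names are assumptions; the abstract does not state which cutoff the authors used.

```python
from collections import Counter

def count_kmers(reads, k):
    """Count k-mers across a collection of short reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts

def filter_low_frequency(counts, min_count):
    """Drop k-mers seen fewer than `min_count` times (likely sequencing errors)."""
    return {kmer: n for kmer, n in counts.items() if n >= min_count}

# Toy reads: "CGA" appears once (as if from a sequencing error) and is removed.
reads = ["ACGT", "ACGT", "ACGA"]
kept = filter_low_frequency(count_kmers(reads, k=3), min_count=2)
print(kept)
```

At high coverage, genuine k-mers are observed many times while k-mers created by sequencing errors tend to be rare, so a simple count threshold separates the two reasonably well.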
Affiliation(s)
- Luis Diambra
  - Facultad de Ciencias Exactas-UNLP, CREG, La Plata, Argentina
  - CONICET, Buenos Aires, Argentina
- Andrés Norman McCarthy
  - Facultad de Ciencias Exactas-UNLP, CREG, La Plata, Argentina
  - CICPBA, La Plata, Argentina
|
47
|
Abstract
The size distribution of complete 16S-rRNA sequences from the SILVA-database and nucleotide shifts that might interfere with the secondary structure of the molecules were evaluated. Overall, 513,309 sequences recorded in SILVA were used to estimate the size of hypervariable regions of the gene. Redundant sequences were treated as a single sequence to achieve a better representation of the molecular diversity. Nucleotides found in each position in 95% of the sequences were considered the consensus sequences for different size-groups (consensus95). The sizes of different regions ranged from 96.7 to 283.1 nucleotides and had similar distribution patterns, except for the V3 region, which exhibited a bimodal distribution composed of 2 main peaks of 161 and 186 nt. The alignment of Consensuses95 of fractions 161 and 186 showed a high degree of similarity and conservation, except for the central positions (gap zone), where the sequence was highly variable and several deletions were observed. Structurally, the gap zone forms the central part of helix 17 (H17), and its extension was directly reflected in the size of this helix. H17 is part of a multihelix conjunction known as the 5-way junction (5 WJ), which is indispensable for 30 S ribosome assembly. However, because a drastic variation in the sequence size of V3 region occurs at a central position in loop H17 without affecting the base of the loop, it has no apparent effect on 5 WJ. Finally, considering that these differences were detected in non-redundant sequences, it can be concluded that this is not an uncommon or isolated event and that the V3 region is possibly more likely to mutate than are other regions.
Affiliation(s)
- Francisco Vargas-Albores
  - Centro de Investigación en Alimentación y Desarrollo, Carretera a La Victoria, Hermosillo, Sonora, México
- Marcel Martínez-Porchas
  - Centro de Investigación en Alimentación y Desarrollo, Carretera a La Victoria, Hermosillo, Sonora, México
|
48
|
Abstract
Ernst Haeckel based his landmark Tree of Life on the supposed ontogenic recapitulation of phylogeny, i.e. that successive embryonic stages during the development of an organism re-trace the morphological forms of its ancestors over the course of evolution. Much of this idea has since been discredited. Today, phylogenies are often based on families of molecular sequences. The standard approach starts with a multiple sequence alignment, in which the sequences are arranged relative to each other in a way that maximises a measure of similarity position-by-position along their entire length. A tree (or sometimes a network) is then inferred. Rigorous multiple sequence alignment is computationally demanding, and evolutionary processes that shape the genomes of many microbes (bacteria, archaea and some morphologically simple eukaryotes) can add further complications. In particular, recombination, genome rearrangement and lateral genetic transfer undermine the assumptions that underlie multiple sequence alignment, and imply that a tree-like structure may be too simplistic. Here, using genome sequences of 143 bacterial and archaeal genomes, we construct a network of phylogenetic relatedness based on the number of shared k-mers (subsequences at fixed length k). Our findings suggest that the network captures not only key aspects of microbial genome evolution as inferred from a tree, but also features that are not treelike. The method is highly scalable, allowing for investigation of genome evolution across a large number of genomes. Instead of using specific regions or sequences from genome sequences, or indeed Haeckel’s idea of ontogeny, we argue that genome phylogenies can be inferred using k-mers from whole-genome sequences. Representing these networks dynamically allows biological questions of interest to be formulated and addressed quickly and in a visually intuitive manner.
Affiliation(s)
- Guillaume Bernard
  - Institute for Molecular Bioscience, The University of Queensland, Brisbane, Australia
- Mark A Ragan
  - Institute for Molecular Bioscience, The University of Queensland, Brisbane, Australia
- Cheong Xin Chan
  - Institute for Molecular Bioscience, The University of Queensland, Brisbane, Australia
|
49
|
Villarroel J, Kleinheinz KA, Jurtz VI, Zschach H, Lund O, Nielsen M, Larsen MV. HostPhinder: A Phage Host Prediction Tool. Viruses 2016; 8:E116. [PMID: 27153081 PMCID: PMC4885074 DOI: 10.3390/v8050116] [Citation(s) in RCA: 78] [Impact Index Per Article: 9.8] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/23/2015] [Revised: 04/14/2016] [Accepted: 04/19/2016] [Indexed: 01/11/2023] Open
Abstract
The current dramatic increase of antibiotic resistant bacteria has revitalised the interest in bacteriophages as alternative antibacterial treatment. Meanwhile, the development of bioinformatics methods for analysing genomic data places high-throughput approaches for phage characterization within reach. Here, we present HostPhinder, a tool aimed at predicting the bacterial host of phages by examining the phage genome sequence. Using a reference database of 2196 phages with known hosts, HostPhinder predicts the host species of a query phage as the host of the most genomically similar reference phages. As a measure of genomic similarity the number of co-occurring k-mers (DNA sequences of length k) is used. Using an independent evaluation set, HostPhinder was able to correctly predict host genus and species for 81% and 74% of the phages respectively, giving predictions for more phages than BLAST and significantly outperforming BLAST on phages for which both had predictions. HostPhinder predictions on phage draft genomes from the INTESTI phage cocktail corresponded well with the advertised targets of the cocktail. Our study indicates that for most phages genomic similarity correlates well with related bacterial hosts. HostPhinder is available as an interactive web service [1] and as a stand alone download from the Docker registry [2].
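The prediction rule described in this abstract, assigning the host of the reference phage that shares the most k-mers with the query, can be sketched as follows. This is a simplified illustration and not the HostPhinder code: the function names, the tie handling and the toy genomes are assumptions, and any scoring or thresholds the actual tool applies are not reproduced.

```python
def kmer_set(seq, k):
    """Set of all k-mers (DNA substrings of length k) in a genome sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def predict_host(query, references, k):
    """Return the host of the reference genome sharing the most k-mers with
    `query`, or None if no reference shares any k-mer.

    `references` is a list of (genome_sequence, host_name) pairs.
    """
    query_kmers = kmer_set(query, k)
    best_host, best_shared = None, 0
    for genome, host in references:
        shared = len(query_kmers & kmer_set(genome, k))
        if shared > best_shared:  # the first reference wins on ties
            best_host, best_shared = host, shared
    return best_host

# Toy reference database with made-up sequences.
references = [
    ("ACGTACGTAA", "Escherichia coli"),
    ("TTTTGGGGCC", "Bacillus subtilis"),
]
print(predict_host("ACGTACGG", references, k=4))
```

Returning None when nothing is shared mirrors the fact, noted in the abstract, that a tool of this kind gives predictions for some but not all query phages.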
Affiliation(s)
- Julia Villarroel
  - Center for Biological Sequence Analysis, Department of Systems Biology, Technical University of Denmark, 2800 Kgs. Lyngby, Denmark
- Kortine Annina Kleinheinz
  - Center for Biological Sequence Analysis, Department of Systems Biology, Technical University of Denmark, 2800 Kgs. Lyngby, Denmark
- Vanessa Isabell Jurtz
  - Center for Biological Sequence Analysis, Department of Systems Biology, Technical University of Denmark, 2800 Kgs. Lyngby, Denmark
- Henrike Zschach
  - Center for Biological Sequence Analysis, Department of Systems Biology, Technical University of Denmark, 2800 Kgs. Lyngby, Denmark
- Ole Lund
  - Center for Biological Sequence Analysis, Department of Systems Biology, Technical University of Denmark, 2800 Kgs. Lyngby, Denmark
- Morten Nielsen
  - Center for Biological Sequence Analysis, Department of Systems Biology, Technical University of Denmark, 2800 Kgs. Lyngby, Denmark
  - Instituto de Investigaciones Biotecnológicas, Universidad de San Martín, CP(1650) San Martín, Prov. de Buenos Aires, Argentina
- Mette Voldby Larsen
  - Center for Biological Sequence Analysis, Department of Systems Biology, Technical University of Denmark, 2800 Kgs. Lyngby, Denmark
|
50
|
Karimi R, Hajdu A. HTSFinder: Powerful Pipeline of DNA Signature Discovery by Parallel and Distributed Computing. Evol Bioinform Online 2016; 12:73-85. [PMID: 26884678 PMCID: PMC4750899 DOI: 10.4137/ebo.s35545] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/28/2015] [Revised: 11/05/2015] [Accepted: 12/05/2015] [Indexed: 11/06/2022] Open
Abstract
Comprehensive efforts toward low-cost sequencing in the past few years have led to the growth of complete genome databases. In parallel with this effort, and in response to a strong need, fast and cost-effective methods and applications have been developed to accelerate sequence analysis. Identification is the very first step of this task. Due to the difficulties, high costs, and computational challenges of alignment-based approaches, an alternative universal identification method is highly desirable. As an alignment-free approach, DNA signatures have provided new opportunities for the rapid identification of species. In this paper, we present an effective pipeline, HTSFinder (high-throughput signature finder), with a corresponding k-mer generator, GkmerG (genome k-mers generator). Using this pipeline, we determine the frequency of k-mers from the available complete genome databases for the detection of extensive DNA signatures in a reasonably short time. Our application can detect both unique and common signatures in arbitrarily selected target and nontarget databases. Hadoop and MapReduce, as parallel and distributed computing tools running on commodity hardware, are used in this pipeline. This approach brings the power of high-performance computing to ordinary desktop personal computers for discovering DNA signatures in large databases such as bacterial genomes. The considerable number of unique and common DNA signatures detected in the target database offers opportunities to improve the identification process not only for polymerase chain reaction and microarray assays but also for more complex scenarios such as metagenomics and next-generation sequencing analysis.
Affiliation(s)
- Ramin Karimi
  - Faculty of Informatics, Department of Computer Graphics and Image Processing, University of Debrecen, Debrecen, Hungary
- Andras Hajdu
  - Faculty of Informatics, Department of Computer Graphics and Image Processing, University of Debrecen, Debrecen, Hungary
  - Bioinformatics Research Group, University of Debrecen, Debrecen, Hungary
|