1
Narechania A, Bobo D, DeSalle R, Mathema B, Kreiswirth B, Planet PJ. What Do We Gain When Tolerating Loss? The Information Bottleneck Wrings Out Recombination. Mol Biol Evol 2025; 42:msaf029. [PMID: 39899343 PMCID: PMC11890988 DOI: 10.1093/molbev/msaf029] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/30/2023] [Revised: 12/03/2024] [Accepted: 01/14/2025] [Indexed: 02/04/2025] Open
Abstract
Most microbes have the capacity to acquire genetic material from their environment. Recombination of foreign DNA yields genomes that are, at least in part, incongruent with the vertical history of their species. Dominant approaches for detecting these transfers are phylogenetic, requiring a painstaking series of analyses including alignment and tree reconstruction. But these methods do not scale. Here, we propose an unsupervised, alignment-free, and tree-free technique based on the sequential information bottleneck, an optimization procedure designed to extract some portion of relevant information from 1 random variable conditioned on another. In our case, this joint probability distribution tabulates occurrence counts of k-mers against their genomes of origin with the expectation that recombination will create a strong signal that unifies certain sets of co-occurring k-mers. We conceptualize the technique as a rate-distortion problem, measuring distortion in the relevance information as k-mers are compressed into clusters based on their co-occurrence in the source genomes. The result is fast, model-free, lossy compression of k-mers into learned groups of shared genome sequence, differentiating recombined elements from the vertically inherited core. We show that the technique yields a new recombination measure based purely on information, divorced from any biases and limitations inherent to alignment and phylogeny.
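As a rough illustration of the input this abstract describes, the sketch below tabulates k-mer counts per genome, normalizes them into a joint distribution p(k-mer, genome), and computes the mutual information between k-mers and genomes, which is the relevance information a bottleneck-style clustering would try to preserve while compressing k-mers. The function names, the toy sequences, and the choice of k are assumptions made for illustration; this is not the authors' implementation, and the sequential clustering step itself is omitted.

```python
from collections import Counter
from math import log2


def kmer_counts(seq, k):
    """Count every overlapping k-mer in one sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def joint_distribution(genomes, k):
    """Normalize k-mer-by-genome counts into a joint probability table.

    `genomes` maps a genome name to its sequence (hypothetical input format).
    Returns {(kmer, genome_name): probability}.
    """
    counts = {name: kmer_counts(seq, k) for name, seq in genomes.items()}
    total = sum(sum(c.values()) for c in counts.values())
    return {
        (kmer, name): n / total
        for name, c in counts.items()
        for kmer, n in c.items()
    }


def mutual_information(joint):
    """I(K; G) in bits for a joint table keyed by (kmer, genome)."""
    p_kmer, p_genome = Counter(), Counter()
    for (kmer, genome), p in joint.items():
        p_kmer[kmer] += p
        p_genome[genome] += p
    return sum(
        p * log2(p / (p_kmer[kmer] * p_genome[genome]))
        for (kmer, genome), p in joint.items()
    )
```

With two toy genomes that share no k-mers, such as `{"a": "AAAA", "b": "CCCC"}` at k = 2, each k-mer fully identifies its genome and the mutual information is 1 bit; with two identical genomes it is 0 bits.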
Affiliation(s)
- Apurva Narechania
  - Institute for Comparative Genomics, American Museum of Natural History, New York, NY, USA
  - Section for Hologenomics, The Globe Institute, University of Copenhagen, Copenhagen, Denmark
- Dean Bobo
  - Institute for Comparative Genomics, American Museum of Natural History, New York, NY, USA
  - Department of Ecology, Evolution, and Environmental Biology, Columbia University, New York, NY, USA
- Rob DeSalle
  - Institute for Comparative Genomics, American Museum of Natural History, New York, NY, USA
- Barun Mathema
  - Department of Epidemiology, Mailman School of Public Health, Columbia University, New York, NY, USA
- Barry Kreiswirth
  - Center for Discovery and Innovation, Hackensack Meridian Health, Nutley, NJ, USA
- Paul J Planet
  - Institute for Comparative Genomics, American Museum of Natural History, New York, NY, USA
  - Division of Infectious Diseases, Children's Hospital of Philadelphia, Philadelphia, PA, USA
  - Department of Pediatrics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA

2
Wijaya AJ, Anžel A, Richard H, Hattab G. Current state and future prospects of Horizontal Gene Transfer detection. NAR Genom Bioinform 2025; 7:lqaf005. [PMID: 39935761 PMCID: PMC11811736 DOI: 10.1093/nargab/lqaf005] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/11/2024] [Revised: 12/26/2024] [Accepted: 02/04/2025] [Indexed: 02/13/2025] Open
Abstract
Artificial intelligence (AI) has been shown to be beneficial in a wide range of bioinformatics applications. Horizontal Gene Transfer (HGT) is a driving force of evolutionary changes in prokaryotes. It is widely recognized that it contributes to the emergence of antimicrobial resistance (AMR), which poses a particularly serious threat to public health. Many computational approaches have been developed to study and detect HGT. However, the application of AI in this field has not been investigated. In this work, we conducted a review to provide information on the current trend of existing computational approaches for detecting HGT and to decipher the use of AI in this field. Here, we show a growing interest in HGT detection, characterized by a surge in the number of computational approaches, including AI-based approaches, in recent years. We organize existing computational approaches into a hierarchical structure of computational groups based on their computational methods and show how each computational group evolved. We make recommendations and discuss the challenges of HGT detection in general and the adoption of AI in particular. Moreover, we provide future directions for the field of HGT detection.
Affiliation(s)
- Andre Jatmiko Wijaya
  - Center for Artificial Intelligent in Public Health Research (ZKI-PH), Robert Koch Institute, Nordufer 20, 13353 Berlin, Germany
  - Department of Mathematics and Computer Science, Freie Universität, Arnimallee 14, 14195 Berlin, Germany
  - Genome Competence Center (MF1), Robert Koch Institute, Nordufer 20, 13353 Berlin, Germany
- Aleksandar Anžel
  - Center for Artificial Intelligent in Public Health Research (ZKI-PH), Robert Koch Institute, Nordufer 20, 13353 Berlin, Germany
- Hugues Richard
  - Genome Competence Center (MF1), Robert Koch Institute, Nordufer 20, 13353 Berlin, Germany
- Georges Hattab
  - Center for Artificial Intelligent in Public Health Research (ZKI-PH), Robert Koch Institute, Nordufer 20, 13353 Berlin, Germany
  - Department of Mathematics and Computer Science, Freie Universität, Arnimallee 14, 14195 Berlin, Germany

3
Pimenta-Zanon MH, Kashiwabara AY, Vanzela ALL, Lopes FM. GRAMEP: an alignment-free method based on the maximum entropy principle for identifying SNPs. BMC Bioinformatics 2025; 26:66. [PMID: 40000933 PMCID: PMC11863517 DOI: 10.1186/s12859-025-06037-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/21/2024] [Accepted: 01/06/2025] [Indexed: 02/27/2025] Open
Abstract
BACKGROUND Advances in high throughput sequencing technologies provide a huge number of genomes to be analyzed. Thus, computational methods play a crucial role in analyzing and extracting knowledge from the data generated. Investigating genomic mutations is critical because of their impact on chromosomal evolution, genetic disorders, and diseases. It is common to adopt aligning sequences for analyzing genomic variations. However, this approach can be computationally expensive and restrictive in scenarios with large datasets.
RESULTS We present a novel method for identifying single nucleotide polymorphisms (SNPs) in DNA sequences from assembled genomes. This study proposes GRAMEP, an alignment-free approach that adopts the principle of maximum entropy to discover the most informative k-mers specific to a genome or set of sequences under investigation. The informative k-mers enable the detection of variant-specific mutations in comparison to a reference genome or other set of sequences. In addition, our method offers the possibility of classifying novel sequences with no need for organism-specific information. GRAMEP demonstrated high accuracy in both in silico simulations and analyses of viral genomes, including Dengue, HIV, and SARS-CoV-2. Our approach maintained accurate SARS-CoV-2 variant identification while demonstrating a lower computational cost compared to methods with the same purpose.
CONCLUSIONS GRAMEP is an open and user-friendly software based on maximum entropy that provides an efficient alignment-free approach to identifying and classifying unique genomic subsequences and SNPs with high accuracy, offering advantages over comparative methods. The instructions for use, applicability, and usability of GRAMEP are open access at https://github.com/omatheuspimenta/GRAMEP.
Affiliation(s)
- Matheus Henrique Pimenta-Zanon
  - Computer Science Department, Universidade Tecnológica Federal do Paraná (UTFPR), Alberto Carazzai, 1640, Cornélio Procópio, Paraná, 86300-000, Brazil
- André Yoshiaki Kashiwabara
  - Computer Science Department, Universidade Tecnológica Federal do Paraná (UTFPR), Alberto Carazzai, 1640, Cornélio Procópio, Paraná, 86300-000, Brazil
- André Luís Laforga Vanzela
  - Laboratory of Cytogenetics and Plant Diversity, Department of General Biology, Universidade Estadual de Londrina (UEL), Rodovia Celso Garcia Cid, PR-445, Km 380, Londrina, Paraná, 86057-970, Brazil
- Fabricio Martins Lopes
  - Computer Science Department, Universidade Tecnológica Federal do Paraná (UTFPR), Alberto Carazzai, 1640, Cornélio Procópio, Paraná, 86300-000, Brazil

4
Van Etten J, Stephens TG, Bhattacharya D. Genetic Transfer in Action: Uncovering DNA Flow in an Extremophilic Microbial Community. Environ Microbiol 2025; 27:e70048. [PMID: 39900484 PMCID: PMC11790422 DOI: 10.1111/1462-2920.70048] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/12/2024] [Revised: 01/15/2025] [Accepted: 01/18/2025] [Indexed: 02/05/2025]
Abstract
Horizontal genetic transfer (HGT) is a significant driver of genomic novelty in all domains of life. HGT has been investigated in many studies; however, the focus has been on conspicuous protein-coding DNA transfers that often prove to be adaptive in recipient organisms and are therefore fixed longer-term in lineages. These results comprise a subclass of HGTs and do not represent exhaustive (coding and non-coding) DNA transfer and its impact on ecology. Uncovering exhaustive HGT can provide key insights into the connectivity of genomes in communities and how these transfers may occur. In this study, we use the term frequency-inverse document frequency (TF-IDF) technique, which has been used successfully to mine DNA transfers within real and simulated high-quality prokaryote genomes, to search for exhaustive HGTs within an extremophilic microbial community. We establish a pipeline for validating transfers identified using this approach. We find that most DNA transfers are within-domain and involve non-coding DNA. A relatively high proportion of the predicted protein-coding HGTs appear to encode transposase activity, restriction-modification system components, and biofilm formation functions. Our study demonstrates the utility of the TF-IDF approach for HGT detection and provides insights into the mechanisms of recent DNA transfer.
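To make the TF-IDF idea concrete, the sketch below treats each genome as a document and each k-mer as a term. It assumes the plain textbook weighting (raw term frequency times the unsmoothed logarithm of inverse document frequency); the weighting, the function names, and the input format are assumptions, and the authors' pipeline may differ in all of them.

```python
from collections import Counter
from math import log


def kmer_counts(seq, k):
    """Count every overlapping k-mer in one sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def kmer_tfidf(genomes, k):
    """Score k-mers per genome with a basic TF-IDF weighting.

    `genomes` maps a genome name to its sequence (hypothetical input format).
    Returns {genome_name: {kmer: score}}.
    """
    counts = {name: kmer_counts(seq, k) for name, seq in genomes.items()}
    n_genomes = len(counts)

    # Document frequency: in how many genomes does each k-mer occur?
    doc_freq = Counter()
    for c in counts.values():
        doc_freq.update(c.keys())

    scores = {}
    for name, c in counts.items():
        total = sum(c.values())
        scores[name] = {
            kmer: (n / total) * log(n_genomes / doc_freq[kmer])
            for kmer, n in c.items()
        }
    return scores
```

Under this weighting, a k-mer present in every genome scores exactly zero, while a k-mer confined to a small subset of genomes scores higher; how such scores are thresholded and validated as transfers is not specified here.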
Affiliation(s)
- Julia Van Etten
  - Department of Biochemistry and Microbiology, Rutgers, The State University of New Jersey, New Brunswick, New Jersey, USA
- Timothy G. Stephens
  - Department of Biochemistry and Microbiology, Rutgers, The State University of New Jersey, New Brunswick, New Jersey, USA
- Debashish Bhattacharya
  - Department of Biochemistry and Microbiology, Rutgers, The State University of New Jersey, New Brunswick, New Jersey, USA

5
Strzoda T, Cruz-Garcia L, Najim M, Badie C, Polanska J. A mapping-free natural language processing-based technique for sequence search in nanopore long-reads. BMC Bioinformatics 2024; 25:354. [PMID: 39538122 PMCID: PMC11562635 DOI: 10.1186/s12859-024-05980-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/21/2024] [Accepted: 11/06/2024] [Indexed: 11/16/2024] Open
Abstract
BACKGROUND In unforeseen situations, such as nuclear power plant or civilian radiation accidents, there is a need for effective and computationally inexpensive methods to determine the expression level of a selected gene panel, allowing for rough dose estimates in thousands of donors. A new-generation in-situ mapper that is fast, consumes little energy, and works at the level of single nanopore output is in demand. We aim to create a sequence identification tool that utilizes natural language processing techniques and ensures a high level of negative predictive value (NPV) compared to the classical approach.
RESULTS The training dataset consisted of RNA sequencing data from 6 samples. Multiple natural language processing models were examined, differing in the type of dictionary components (word length, step, context) as well as the encoding length and number of sequences required for algorithm training. The best configuration analyses the entire sequence and uses a word length of 3 base pairs with a one-word neighbor on each side. For the considered FDXR gene, the achieved mean balanced accuracy (BACC) was 98.29% and NPV was 99.25%, compared to minimap2's performance in a cross-validation scenario. The next stage focused on exploring the dictionary components and attempting to optimize the dictionary, employing statistical techniques as well as those relying on the explainability of the decisions made. Reducing the dictionary from 1024 to 145 changed BACC to 96.49% and the NPV to 98.15%. The obtained model, validated on an external independent genome sequencing dataset, gave an NPV of 99.64% for the complete and 95.87% for the reduced dictionary. The salmon-estimated read counts differed from the classical approach on average by 3.48% for the complete dictionary and by 5.82% for the reduced one.
CONCLUSIONS We conclude that for long Oxford nanopore reads, a natural language processing-based approach can reliably replace classical mapping when there is a need for fast, reliable, and energy- and computationally efficient targeted mapping of a pre-defined subset of transcripts. The developed model can be easily retrained to identify selected transcripts and/or work with various long-read sequencing techniques. Our results clearly demonstrate the potential of applying techniques known from classical text processing to nucleotide sequences.
Affiliation(s)
- Tomasz Strzoda
  - Department of Data Science and Engineering, Silesian University of Technology, Gliwice, Poland
- Lourdes Cruz-Garcia
  - Cancer Mechanisms and Biomarkers Group, Centre for Radiation, Chemical and Environmental Hazards, UK Health Security Agency, Oxfordshire, OX11 0RQ, United Kingdom
- Mustafa Najim
  - Cancer Mechanisms and Biomarkers Group, Centre for Radiation, Chemical and Environmental Hazards, UK Health Security Agency, Oxfordshire, OX11 0RQ, United Kingdom
- Christophe Badie
  - Cancer Mechanisms and Biomarkers Group, Centre for Radiation, Chemical and Environmental Hazards, UK Health Security Agency, Oxfordshire, OX11 0RQ, United Kingdom
- Joanna Polanska
  - Department of Data Science and Engineering, Silesian University of Technology, Gliwice, Poland

6
Anjum N, Nabil RL, Rafi RI, Bayzid MS, Rahman MS. CD-MAWS: An Alignment-Free Phylogeny Estimation Method Using Cosine Distance on Minimal Absent Word Sets. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2023; 20:196-205. [PMID: 34928803 DOI: 10.1109/tcbb.2021.3136792] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/04/2023]
Abstract
Multiple sequence alignment has been the traditional and well-established approach to sequence analysis and comparison, though it is time- and memory-consuming. As the scale of sequencing data is increasing day by day, the importance of faster yet accurate alignment-free methods is on the rise. Several alignment-free sequence analysis methods have been established in the literature in recent years, which extract numerical features from genomic data to analyze sequences and also to estimate phylogenetic relationships among genes and species. Minimal Absent Word (MAW) is an effective concept for representing characteristics of a sequence in an alignment-free manner. In this study, we present CD-MAWS, a distance measure based on the cosine of the angle between composition vectors constructed using minimal absent words, for sequence analysis in a computationally inexpensive manner. We have benchmarked CD-MAWS using several AFProject datasets, such as the Fish mtDNA, E. coli, Plants, Shigella and Yersinia datasets, and found it to perform quite well. Applied to several other biological datasets such as mammal mtDNA, bacterial genomes and viral genomes, CD-MAWS resolved phylogenetic relationships similar to or better than state-of-the-art alignment-free methods such as Mash, Skmer, Co-phylog and kSNP3.
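The distance described above can be sketched for the simplest possible composition vector, a binary presence/absence vector over words, where the cosine reduces to the shared-word count divided by the geometric mean of the two set sizes. The abstract does not say how CD-MAWS weights its composition vectors, so the binary encoding is an assumption, as are the function name and the expectation that the minimal absent word sets have already been computed elsewhere.

```python
from math import sqrt


def cosine_distance(words_a, words_b):
    """Cosine distance between binary presence vectors of two word sets.

    For 0/1 vectors the dot product is the number of shared words and each
    norm is the square root of the set size.
    """
    a, b = set(words_a), set(words_b)
    if not a or not b:
        raise ValueError("both word sets must be non-empty")
    shared = len(a & b)
    return 1.0 - shared / sqrt(len(a) * len(b))
```

Identical word sets give a distance of 0 and disjoint sets give 1; a matrix of such pairwise distances could then be passed to a standard distance-based tree-building method.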

7
Sun J, Liu Y, Cui J, He H. Deep learning-based methods for natural hazard named entity recognition. Sci Rep 2022; 12:4598. [PMID: 35301387 PMCID: PMC8931008 DOI: 10.1038/s41598-022-08667-2] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 01/13/2022] [Accepted: 03/09/2022] [Indexed: 12/20/2022] Open
Abstract
Natural hazard named entity recognition is a technique used to recognize natural hazard entities from a large number of texts. The method of natural hazard named entity recognition can facilitate acquisition of natural hazards information and provide reference for natural hazard mitigation. Named entity recognition faces many challenges, such as the fast change, multiple types and various forms of named entities. This can introduce difficulties in research on natural hazard named entity recognition. To address the above problem, this paper constructed an annotated natural disaster corpus for model training and evaluation, and selected and compared several deep learning methods based on word vector features. A deep learning method for natural hazard named entity recognition can automatically mine text features and reduce the dependence on manual rules. This paper compares and analyzes the deep learning models from three aspects: pretraining, feature extraction and decoding. A natural hazard named entity recognition method based on deep learning is proposed, namely the XLNet-BiLSTM-CRF model. Finally, the research hotspots of natural hazards papers in the past 10 years were obtained through this model. After training, the precision of the XLNet-BiLSTM-CRF model is 92.80%, the recall rate is 91.74%, and the F1-score is 92.27%. The results show that this method, which is superior to other methods, can effectively recognize natural hazard named entities.
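The three reported metrics are related by the standard definition of the F1-score as the harmonic mean of precision and recall. The short check below, with a hypothetical helper name, recomputes the F1-score from the precision and recall quoted in the abstract.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (both as fractions in [0, 1])."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported in the abstract above.
reported_f1 = f1_score(0.9280, 0.9174)
print(f"{reported_f1:.2%}")  # prints 92.27%
```

The recomputed value agrees with the 92.27% stated in the abstract.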
Affiliation(s)
- Junlin Sun
  - School of Resources and Environment, Anhui Agricultural University, Hefei, 230036, China
- Yanrong Liu
  - School of Resources and Environment, Anhui Agricultural University, Hefei, 230036, China
- Jing Cui
  - School of Resources and Environment, Anhui Agricultural University, Hefei, 230036, China
- Handong He
  - School of Resources and Environment, Anhui Agricultural University, Hefei, 230036, China

8
Evolutionary Processes Driving the Rise and Fall of Staphylococcus aureus ST239, a Dominant Hybrid Pathogen. mBio 2021; 12:e0216821. [PMID: 34903061 PMCID: PMC8669471 DOI: 10.1128/mbio.02168-21] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.8] [Indexed: 02/07/2023] Open
Abstract
Selection plays a key role in the spread of antibiotic resistance, but the evolutionary drivers of clinically important resistant strains remain poorly understood. Here, we use genomic analyses and competition experiments to study Staphylococcus aureus ST239, a prominent MRSA strain that is thought to have been formed by large-scale recombination between ST8 and ST30. Genomic analyses allowed us to refine the hybrid model for the origin of ST239 and to date the origin of ST239 to 1920 to 1945, which predates the clinical introduction of methicillin in 1959. Although purifying selection has dominated the evolution of ST239, parallel evolution has occurred in genes involved in antibiotic resistance and virulence, suggesting that ST239 has evolved toward an increasingly pathogenic lifestyle. Crucially, ST239 isolates have low competitive fitness relative to both ST8 and ST30 isolates, supporting the idea that fitness costs have driven the demise of this once-dominant pathogen strain.

9
Tay AP, Hosking B, Hosking C, Bauer DC, Wilson LO. INSIDER: alignment-free detection of foreign DNA sequences. Comput Struct Biotechnol J 2021; 19:3810-3816. [PMID: 34285780 PMCID: PMC8273350 DOI: 10.1016/j.csbj.2021.06.045] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 03/15/2021] [Revised: 06/28/2021] [Accepted: 06/28/2021] [Indexed: 11/21/2022] Open
Abstract
External DNA sequences can be inserted into an organism's genome either through natural processes such as gene transfer, or through targeted genome engineering strategies. Being able to robustly identify such foreign DNA is a crucial capability for health and biosecurity applications, such as anti-microbial resistance (AMR) detection or monitoring gene drives. This capability does not exist for poorly characterised host genomes or with limited information about the integrated sequence. To address this, we developed the INserted Sequence Information DEtectoR (INSIDER). INSIDER analyses whole genome sequencing data and identifies segments of potentially foreign origin by their significant shift in k-mer signatures. We demonstrate the power of INSIDER to separate integrated DNA sequences from normal genomic sequences on a synthetic dataset simulating the insertion of a CRISPR-Cas gene drive into wild-type yeast. As a proof-of-concept, we use INSIDER to detect the exact AMR plasmid in whole genome sequencing data from a Citrobacter freundii patient isolate. INSIDER streamlines the process of identifying integrated DNA in poorly characterised wild species or when the insert is of unknown origin, thus enhancing the monitoring of emerging biosecurity threats.
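The notion of a segment standing out by a shift in its k-mer signature can be illustrated with a sliding-window comparison against the whole-sequence background. The abstract does not give INSIDER's actual statistic or significance test, so the Euclidean distance between k-mer frequency vectors used here, along with the function names and parameters, is an assumption chosen only to show the general idea.

```python
from collections import Counter
from math import sqrt


def kmer_freqs(seq, k):
    """Relative frequency of every overlapping k-mer in a sequence."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    return {kmer: n / total for kmer, n in counts.items()}


def window_shift_scores(seq, k, window, step):
    """Score each window by how far its k-mer profile is from the background.

    Returns a list of (window_start, distance) pairs; larger distances mark
    windows whose composition differs most from the sequence as a whole.
    """
    background = kmer_freqs(seq, k)
    scores = []
    for start in range(0, len(seq) - window + 1, step):
        local = kmer_freqs(seq[start:start + window], k)
        kmers = set(background) | set(local)
        dist = sqrt(sum(
            (local.get(x, 0.0) - background.get(x, 0.0)) ** 2 for x in kmers
        ))
        scores.append((start, dist))
    return scores
```

On a toy sequence built from a repetitive "AC" background with a run of "G" inserted in the middle, the window covering the insert receives the largest score. Real genomes would need a calibrated threshold rather than a simple maximum.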
Affiliation(s)
- Aidan P. Tay
  - Australian e-Health Research Centre, Commonwealth Scientific and Industrial Research Organisation, New South Wales, Sydney, Australia
  - Applied BioSciences, Faculty of Science and Engineering, Macquarie University, New South Wales, Sydney, Australia
- Brendan Hosking
  - Australian e-Health Research Centre, Commonwealth Scientific and Industrial Research Organisation, New South Wales, Sydney, Australia
- Cameron Hosking
  - Australian e-Health Research Centre, Commonwealth Scientific and Industrial Research Organisation, New South Wales, Sydney, Australia
- Denis C. Bauer
  - Australian e-Health Research Centre, Commonwealth Scientific and Industrial Research Organisation, New South Wales, Sydney, Australia
  - Department of Biomedical Sciences, Macquarie University, New South Wales, Sydney, Australia
  - Applied BioSciences, Faculty of Science and Engineering, Macquarie University, New South Wales, Sydney, Australia
- Laurence O.W. Wilson
  - Australian e-Health Research Centre, Commonwealth Scientific and Industrial Research Organisation, New South Wales, Sydney, Australia
  - Applied BioSciences, Faculty of Science and Engineering, Macquarie University, New South Wales, Sydney, Australia

10
An Integrative Computational Approach for the Prediction of Human-Plasmodium Protein-Protein Interactions. BIOMED RESEARCH INTERNATIONAL 2021; 2020:2082540. [PMID: 33426052 PMCID: PMC7771252 DOI: 10.1155/2020/2082540] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 07/20/2020] [Revised: 11/08/2020] [Accepted: 12/04/2020] [Indexed: 12/27/2022]
Abstract
Host-pathogen molecular cross-talks are critical in determining the pathophysiology of a specific infection. Most of these cross-talks are mediated via protein-protein interactions between the host and the pathogen (HP-PPI). Thus, it is essential to know how pathogens interact with their hosts to understand the mechanism of infections. Malaria is a life-threatening disease caused by an obligate intracellular parasite belonging to the Plasmodium genus, of which P. falciparum is the most prevalent. Several previous studies that predicted human-Plasmodium protein-protein interactions using computational methods have demonstrated the utility, accuracy, and efficiency of these methods for identifying interacting partners, thereby complementing experimental efforts to characterize host-pathogen interaction networks. To predict putative HP-PPIs, we use an integrative computational approach based on the combination of multiple OMICS-based methods including human red blood cells (RBC) and Plasmodium falciparum 3D7 strain expressed proteins, domain-domain based PPI, similarity of gene ontology terms, structure similarity method homology identification, and machine learning prediction. Our analysis yielded a set of 716 protein interactions involving 302 human proteins and 130 Plasmodium proteins. This work provides a list of potential human-Plasmodium interacting proteins. These findings will contribute to a better understanding of the mechanisms underlying the molecular determinism of malaria disease and potentially to the identification of candidate pharmacological targets.

11
Lozoya OA, McClelland KS, Papas BN, Li JL, Yao HHC. Patterns, Profiles, and Parsimony: Dissecting Transcriptional Signatures From Minimal Single-Cell RNA-Seq Output With SALSA. Front Genet 2020; 11:511286. [PMID: 33193599 PMCID: PMC7586319 DOI: 10.3389/fgene.2020.511286] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Received: 11/11/2019] [Accepted: 09/18/2020] [Indexed: 11/23/2022] Open
Abstract
Single-cell RNA sequencing (scRNA-seq) technologies have precipitated the development of bioinformatic tools to reconstruct cell lineage specification and differentiation processes with single-cell precision. However, current start-up costs and recommended data volumes for statistical analysis remain prohibitively expensive, preventing scRNA-seq technologies from becoming mainstream. Here, we introduce single-cell amalgamation by latent semantic analysis (SALSA), a versatile workflow that combines measurement reliability metrics with latent variable extraction to infer robust expression profiles from ultra-sparse sc-RNAseq data. SALSA uses a matrix focusing approach that starts by identifying facultative genes with expression levels greater than experimental measurement precision and ends with cell clustering based on a minimal set of Profiler genes, each one a putative biomarker of cluster-specific expression profiles. To benchmark how SALSA performs in experimental settings, we used the publicly available 10X Genomics PBMC 3K dataset, a pre-curated silver standard from human frozen peripheral blood comprising 2,700 single-cell barcodes, and identified 7 major cell groups matching transcriptional profiles of peripheral blood cell types and driven agnostically by < 500 Profiler genes. Finally, we demonstrate successful implementation of SALSA in a replicative scRNA-seq scenario by using previously published DropSeq data from a multi-batch mouse retina experimental design, thereby identifying 10 transcriptionally distinct cell types from > 64,000 single cells across 7 independent biological replicates based on < 630 Profiler genes. With these results, SALSA demonstrates that robust pattern detection from scRNA-seq expression matrices only requires a fraction of the accrued data, suggesting that single-cell sequencing technologies can become affordable and widespread if meant as hypothesis-generation tools to extract large-scale differential expression effects.
Affiliation(s)
- Oswaldo A. Lozoya
  - Genomic Integrity & Structural Biology Laboratory, National Institute of Environmental Health Sciences, National Institutes of Health, Research Triangle Park, NC, United States
- Kathryn S. McClelland
  - Reproductive and Developmental Biology Laboratory, National Institute of Environmental Health Sciences, National Institutes of Health, Research Triangle Park, NC, United States
- Brian N. Papas
  - Integrative Bioinformatics Support Group, National Institute of Environmental Health Sciences, National Institutes of Health, Research Triangle Park, NC, United States
- Jian-Liang Li
  - Integrative Bioinformatics Support Group, National Institute of Environmental Health Sciences, National Institutes of Health, Research Triangle Park, NC, United States
- Humphrey H.-C. Yao
  - Reproductive and Developmental Biology Laboratory, National Institute of Environmental Health Sciences, National Institutes of Health, Research Triangle Park, NC, United States

12
Guzik M, Witko T, Steinbüchel A, Wojnarowska M, Sołtysik M, Wawak S. What Has Been Trending in the Research of Polyhydroxyalkanoates? A Systematic Review. Front Bioeng Biotechnol 2020; 8:959. [PMID: 33014998 PMCID: PMC7513618 DOI: 10.3389/fbioe.2020.00959] [Citation(s) in RCA: 18] [Impact Index Per Article: 3.6] [Received: 06/15/2020] [Accepted: 07/24/2020] [Indexed: 12/24/2022] Open
Abstract
Over the past decades, enormous progress has been achieved with regard to research on environmentally friendly polymers. One of the most prominent families of such biopolymers is the bacterially synthesized polyhydroxyalkanoates (PHAs), which have been known since the 1920s. However, only since the 1990s has the number of studies on them grown exponentially. Since then, different areas of exploration of these intriguing materials have been uncovered. However, no systematic review of undertaken efforts has been conducted so far. Therefore, we have performed an unbiased search of up-to-date literature to reveal trending topics in the research of PHAs over the past three decades by data mining of 2,227 publications. This allowed us to identify eight past and current trends in this area. Our study provides a comprehensive review of these trends and speculates where PHA research is heading.
Affiliation(s)
- Maciej Guzik
  - Jerzy Haber Institute of Catalysis and Surface Chemistry Polish Academy of Sciences, Kraków, Poland
- Tomasz Witko
  - Jerzy Haber Institute of Catalysis and Surface Chemistry Polish Academy of Sciences, Kraków, Poland
- Alexander Steinbüchel
  - Institut für Molekulare Mikrobiologie und Biotechnologie, Westfälische Wilhelms-Universität Münster, Münster, Germany
  - Environmental Sciences Department, King Abdulaziz University, Jeddah, Saudi Arabia
- Magdalena Wojnarowska
  - Department of Product Technology and Ecology, Cracow University of Economics, Kraków, Poland
- Mariusz Sołtysik
  - Department of Management Process, Cracow University of Economics, Kraków, Poland
- Sławomir Wawak
  - Department of Management Process, Cracow University of Economics, Kraków, Poland

13
Yilahun H, Abdurahman K, Imam S, Hamdulla A. Automatic extraction of Uyghur domain concepts based on multi‐feature for ontology extension. IET NETWORKS 2020. [DOI: 10.1049/iet-net.2018.5240] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Indexed: 11/19/2022]
Affiliation(s)
- Hankiz Yilahun
- College of Mathematics and System Science, Xinjiang University, Urumqi, People's Republic of China
- Institute of Information Science and Engineering, Xinjiang University, Urumqi, People's Republic of China
| | - Kudret Abdurahman
- Institute of Information Science and Engineering, Xinjiang University, Urumqi, People's Republic of China
| | - Seyyare Imam
- Institute of Information Science and Engineering, Xinjiang University, Urumqi, People's Republic of China
| | - Askar Hamdulla
- Institute of Information Science and Engineering, Xinjiang University, Urumqi, People's Republic of China
| |
|
14
|
Wawak S, Rogala P, Dahlgaard-Park SM. Research trends in quality management in years 2000-2019. INTERNATIONAL JOURNAL OF QUALITY AND SERVICE SCIENCES 2020. [DOI: 10.1108/ijqss-12-2019-0133] [Citation(s) in RCA: 12] [Impact Index Per Article: 2.4] [Indexed: 11/17/2022]
Abstract
Purpose
This study aims to demonstrate the suitability of a text-mining toolset for the discovery of trends in quality management (QM) literature in 2000-2019. The hypothesis was formulated that, as the field of study is mature, the most important trends are related to the deepening and broadening of knowledge.
Design/methodology/approach
A novel approach to trend discovery was proposed. The computer-aided analysis of the full texts of papers increased the reliability and level of detail of the results and helped significantly reduce researchers’ bias. Overall, 4,833 papers from 8 journals dedicated to QM were analysed.
Findings
Trends discovery led to the identification of 45 trends: 17 long-lasting trends, 4 declining trends, 11 emerging trends and 13 ephemeris trends. They were compared to the results of earlier studies. New trends and potential gaps were discussed.
Practical implications
The results highlight the trends that gain or lose popularity, thus they can be used to focus studies, as well as find new subjects, which are not so popular yet. The knowledge about emerging trends is also important for those quality managers who strive for improvement of their efficiency.
Originality/value
The research was designed to bypass the limitations of previous studies. The use of text-mining methods and analysis of the full texts of papers delivered more detailed and reliable data. Forgoing predefined classification criteria significantly reduced researchers’ bias and allowed the discovery of new trends not identified in previous studies.
|
15
|
Bernard G, Chan CX, Chan YB, Chua XY, Cong Y, Hogan JM, Maetschke SR, Ragan MA. Alignment-free inference of hierarchical and reticulate phylogenomic relationships. Brief Bioinform 2019; 20:426-435. [PMID: 28673025 PMCID: PMC6433738 DOI: 10.1093/bib/bbx067] [Citation(s) in RCA: 60] [Impact Index Per Article: 10.0] [Received: 02/08/2017] [Revised: 05/04/2017] [Indexed: 11/22/2022] Open
Abstract
We are amidst an ongoing flood of sequence data arising from the application of high-throughput technologies, and a concomitant fundamental revision in our understanding of how genomes evolve individually and within the biosphere. Workflows for phylogenomic inference must accommodate data that are not only much larger than before, but often more error prone and perhaps misassembled, or not assembled in the first place. Moreover, genomes of microbes, viruses and plasmids evolve not only by tree-like descent with modification but also by incorporating stretches of exogenous DNA. Thus, next-generation phylogenomics must address computational scalability while rethinking the nature of orthogroups, the alignment of multiple sequences and the inference and comparison of trees. New phylogenomic workflows have begun to take shape based on so-called alignment-free (AF) approaches. Here, we review the conceptual foundations of AF phylogenetics for the hierarchical (vertical) and reticulate (lateral) components of genome evolution, focusing on methods based on k-mers. We reflect on what seems to be successful, and on where further development is needed.
|
16
|
Huang GD, Liu XM, Huang TL, Xia LC. The statistical power of k-mer based aggregative statistics for alignment-free detection of horizontal gene transfer. Synth Syst Biotechnol 2019; 4:150-156. [PMID: 31508512 PMCID: PMC6723412 DOI: 10.1016/j.synbio.2019.08.001] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Received: 05/06/2019] [Revised: 07/14/2019] [Accepted: 08/05/2019] [Indexed: 12/21/2022] Open
Abstract
Alignment-based database search and sequence comparison are commonly used to detect horizontal gene transfer (HGT). However, with the rapid increase of sequencing depth, hundreds of thousands of contigs are routinely assembled from metagenomics studies, which challenges alignment-based HGT analysis by overwhelming the known reference sequences. Detecting HGT by k-mer statistics thus becomes an attractive alternative. These alignment-free statistics have demonstrated high performance and efficiency in whole-genome and transcriptome comparisons. To adapt k-mer statistics for HGT detection, we developed two aggregative statistics, TsumS and Tsum*, which subsample metagenome contigs by their representative regions and summarize the regional D2S and D2* metrics by their upper bounds. We systematically studied the power of the aggregative statistics at different k-mer sizes using simulations. Our analysis showed that, in general, the power of TsumS and Tsum* increases with sequencing coverage and reaches a maximum power >80% at k = 6, with 5% Type-I error and a coverage ratio >0.2x. The statistical power of TsumS and Tsum* was evaluated with realistic simulations of HGT mechanism, sequencing depth, read length, and base error. We expect these statistics to be useful distance metrics for identifying HGT in metagenomic studies.
Affiliation(s)
- Guan-Da Huang
- School of Physics and Optoelectronics, South China University of Technology, Guangzhou, 510640, China
| | - Xue-Mei Liu
- School of Physics and Optoelectronics, South China University of Technology, Guangzhou, 510640, China
| | - Tian-Lai Huang
- School of Physics and Optoelectronics, South China University of Technology, Guangzhou, 510640, China
| | - Li-C Xia
- Department of Medicine, Stanford University School of Medicine, Stanford, CA, 94305, USA
| |
|
17
|
Zielezinski A, Girgis HZ, Bernard G, Leimeister CA, Tang K, Dencker T, Lau AK, Röhling S, Choi JJ, Waterman MS, Comin M, Kim SH, Vinga S, Almeida JS, Chan CX, James BT, Sun F, Morgenstern B, Karlowski WM. Benchmarking of alignment-free sequence comparison methods. Genome Biol 2019; 20:144. [PMID: 31345254 PMCID: PMC6659240 DOI: 10.1186/s13059-019-1755-7] [Citation(s) in RCA: 113] [Impact Index Per Article: 18.8] [Received: 05/16/2019] [Accepted: 07/03/2019] [Indexed: 11/22/2022] Open
Abstract
BACKGROUND Alignment-free (AF) sequence comparison is attracting persistent interest driven by data-intensive applications. Hence, many AF procedures have been proposed in recent years, but a lack of a clearly defined benchmarking consensus hampers their performance assessment. RESULTS Here, we present a community resource (http://afproject.org) to establish standards for comparing alignment-free approaches across different areas of sequence-based research. We characterize 74 AF methods available in 24 software tools for five research applications, namely, protein sequence classification, gene tree inference, regulatory element detection, genome-based phylogenetic inference, and reconstruction of species trees under horizontal gene transfer and recombination events. CONCLUSION The interactive web service allows researchers to explore the performance of alignment-free tools relevant to their data types and analytical goals. It also allows method developers to assess their own algorithms and compare them with current state-of-the-art tools, accelerating the development of new, more accurate AF solutions.
Affiliation(s)
- Andrzej Zielezinski
- Department of Computational Biology, Faculty of Biology, Adam Mickiewicz University Poznan, Uniwersytetu Poznańskiego 6, 61-614, Poznan, Poland
| | - Hani Z Girgis
- Tandy School of Computer Science, The University of Tulsa, 800 South Tucker Drive, Tulsa, OK, 74104, USA
| | | | - Chris-Andre Leimeister
- Department of Bioinformatics, Institute of Microbiology and Genetics, University of Göttingen, Goldschmidtstr. 1, 37077, Göttingen, Germany
| | - Kujin Tang
- Department of Biological Sciences, Quantitative and Computational Biology Program, University of Southern California, Los Angeles, CA, 90089, USA
| | - Thomas Dencker
- Department of Bioinformatics, Institute of Microbiology and Genetics, University of Göttingen, Goldschmidtstr. 1, 37077, Göttingen, Germany
| | - Anna Katharina Lau
- Department of Bioinformatics, Institute of Microbiology and Genetics, University of Göttingen, Goldschmidtstr. 1, 37077, Göttingen, Germany
| | - Sophie Röhling
- Department of Bioinformatics, Institute of Microbiology and Genetics, University of Göttingen, Goldschmidtstr. 1, 37077, Göttingen, Germany
| | - Jae Jin Choi
- Department of Chemistry, University of California, Berkeley, CA, 94720, USA
- Molecular Biophysics & Integrated Bioimaging Division, Lawrence Berkeley National Laboratory, Berkeley, CA, 94720, USA
| | - Michael S Waterman
- Department of Biological Sciences, Quantitative and Computational Biology Program, University of Southern California, Los Angeles, CA, 90089, USA
- Centre for Computational Systems Biology, School of Mathematical Sciences, Fudan University, Shanghai, 200433, China
| | - Matteo Comin
- Department of Information Engineering, University of Padova, Padova, Italy
| | - Sung-Hou Kim
- Department of Chemistry, University of California, Berkeley, CA, 94720, USA
- Molecular Biophysics & Integrated Bioimaging Division, Lawrence Berkeley National Laboratory, Berkeley, CA, 94720, USA
| | - Susana Vinga
- INESC-ID, Instituto Superior Técnico, Universidade de Lisboa, Av. Rovisco Pais 1, 1049-001, Lisbon, Portugal
- IDMEC, Instituto Superior Técnico, Universidade de Lisboa, Av. Rovisco Pais 1, 1049-001, Lisbon, Portugal
| | - Jonas S Almeida
- Division of Cancer Epidemiology and Genetics (DCEG), National Cancer Institute (NIH/NCI), Bethesda, USA
| | - Cheong Xin Chan
- Institute for Molecular Bioscience, and School of Chemistry and Molecular Biosciences, The University of Queensland, Brisbane, QLD, 4072, Australia
| | - Benjamin T James
- Tandy School of Computer Science, The University of Tulsa, 800 South Tucker Drive, Tulsa, OK, 74104, USA
| | - Fengzhu Sun
- Department of Biological Sciences, Quantitative and Computational Biology Program, University of Southern California, Los Angeles, CA, 90089, USA
- Centre for Computational Systems Biology, School of Mathematical Sciences, Fudan University, Shanghai, 200433, China
| | - Burkhard Morgenstern
- Department of Bioinformatics, Institute of Microbiology and Genetics, University of Göttingen, Goldschmidtstr. 1, 37077, Göttingen, Germany
| | - Wojciech M Karlowski
- Department of Computational Biology, Faculty of Biology, Adam Mickiewicz University Poznan, Uniwersytetu Poznańskiego 6, 61-614, Poznan, Poland.
| |
|
18
|
Abstract
Background Pan-genome approaches afford the discovery of homology relations in a set of genomes, by determining how some gene families are distributed among a given set of genomes. The retrieval of a complete gene distribution among a class of genomes is an NP-hard problem because computational costs increase with the number of analyzed genomes; in fact, all-against-all gene comparisons are required to completely solve the problem. In the presence of phylogenetically distant genomes, due to the variability introduced in gene duplication and transmission, the task of recognizing homologous genes becomes even more difficult. A challenge in this field is designing fast and adaptive similarity measures in order to find a suitable pan-genome structure of homology relations. Results We present PanDelos, a stand-alone tool for the discovery of pan-genome contents among phylogenetically distant genomes. The methodology is based on information theory and network analysis. It is parameter-free because thresholds are automatically deduced from the context. PanDelos avoids sequence alignment by introducing a measure based on k-mer multiplicity. The k-mer length is defined according to general arguments rather than empirical considerations. Homology candidate relations are integrated into a global network and groups of homologous genes are extracted by applying a community detection algorithm. Conclusions PanDelos outperforms existing approaches, Roary and EDGAR, in terms of running times and quality content discovery. Tests were run on collections of real genomes, previously used in analogous studies, and in synthetic benchmarks that represent fully trusted golden truth. The software is available at https://github.com/GiugnoLab/PanDelos. Electronic supplementary material The online version of this article (10.1186/s12859-018-2417-6) contains supplementary material, which is available to authorized users.
Affiliation(s)
- Vincenzo Bonnici
- Department of Computer Science, University of Verona, Strada le Grazie, 15, Verona, 37134, Italy.
| | - Rosalba Giugno
- Department of Computer Science, University of Verona, Strada le Grazie, 15, Verona, 37134, Italy
| | - Vincenzo Manca
- Department of Computer Science, University of Verona, Strada le Grazie, 15, Verona, 37134, Italy
| |
|
19
|
Bernard G, Greenfield P, Ragan MA, Chan CX. k-mer Similarity, Networks of Microbial Genomes, and Taxonomic Rank. mSystems 2018; 3:e00257-18. [PMID: 30505941 PMCID: PMC6247013 DOI: 10.1128/msystems.00257-18] [Citation(s) in RCA: 26] [Impact Index Per Article: 3.7] [Received: 10/12/2018] [Accepted: 11/02/2018] [Indexed: 01/27/2023] Open
Abstract
Microbial genomes have been shaped by parent-to-offspring (vertical) descent and lateral genetic transfer. These processes can be distinguished by alignment-based inference and comparison of phylogenetic trees for individual gene families, but this approach is not scalable to whole-genome sequences, and a tree-like structure does not adequately capture how these processes impact microbial physiology. Here we adopted alignment-free approaches based on k-mer statistics to infer phylogenomic networks involving 2,783 completely sequenced bacterial and archaeal genomes and compared the contributions of rRNA, protein-coding, and plasmid sequences to these networks. Our results show that the phylogenomic signal arising from ribosomal RNAs is strong and extends broadly across all taxa, whereas that from plasmids is strong but restricted to closely related groups, particularly Proteobacteria. However, the signal from the other chromosomal regions is restricted in breadth. We show that mean k-mer similarity can correlate with taxonomic rank. We also link the implicated k-mers to genome annotation (thus, functions) and define core k-mers (thus, core functions) in specific phyletic groups. Highly conserved functions in most phyla include amino acid metabolism and transport as well as energy production and conversion. Intracellular trafficking and secretion are the most prominent core functions among Spirochaetes, whereas energy production and conversion are not highly conserved among the largely parasitic or commensal Tenericutes. These observations suggest that differential conservation of functions relates to niche specialization and evolutionary diversification of microbes. Our results demonstrate that k-mer approaches can be used to efficiently identify phylogenomic signals and conserved core functions at the multigenome scale. IMPORTANCE Genome evolution of microbes involves parent-to-offspring descent, and lateral genetic transfer that convolutes the phylogenomic signal. 
This study investigated phylogenomic signals among thousands of microbial genomes based on short subsequences without using multiple-sequence alignment. The signal from ribosomal RNAs is strong across all taxa, and the signal of plasmids is strong only in closely related groups, particularly Proteobacteria. However, the signal from other chromosomal regions (∼99% of the genomes) is remarkably restricted in breadth. The similarity of subsequences is found to correlate with taxonomic rank and informs on conserved and differential core functions relative to niche specialization and evolutionary diversification of microbes. These results provide a comprehensive, alignment-free view of microbial genome evolution as a network, beyond a tree-like structure.
Affiliation(s)
- Guillaume Bernard
- Institute for Molecular Bioscience, The University of Queensland, Brisbane, QLD, Australia
| | - Paul Greenfield
- Commonwealth Scientific and Industrial Research Organisation (CSIRO), North Ryde, NSW, Australia
| | - Mark A. Ragan
- Institute for Molecular Bioscience, The University of Queensland, Brisbane, QLD, Australia
| | - Cheong Xin Chan
- Institute for Molecular Bioscience, The University of Queensland, Brisbane, QLD, Australia
- School of Chemistry and Molecular Biosciences, The University of Queensland, Brisbane, QLD, Australia
| |
|
20
|
Ren J, Bai X, Lu YY, Tang K, Wang Y, Reinert G, Sun F. Alignment-Free Sequence Analysis and Applications. Annu Rev Biomed Data Sci 2018; 1:93-114. [PMID: 31828235 PMCID: PMC6905628 DOI: 10.1146/annurev-biodatasci-080917-013431] [Citation(s) in RCA: 58] [Impact Index Per Article: 8.3] [Indexed: 01/06/2023]
Abstract
Genome and metagenome comparisons based on large amounts of next generation sequencing (NGS) data pose significant challenges for alignment-based approaches due to the huge data size and the relatively short length of the reads. Alignment-free approaches based on the counts of word patterns in NGS data do not depend on the complete genome and are generally computationally efficient. Thus, they contribute significantly to genome and metagenome comparison. Recently, novel statistical approaches have been developed for the comparison of both long and shotgun sequences. These approaches have been applied to many problems including the comparison of gene regulatory regions, genome sequences, metagenomes, binning contigs in metagenomic data, identification of virus-host interactions, and detection of horizontal gene transfers. We provide an updated review of these applications and other related developments of word-count based approaches for alignment-free sequence analysis.
Affiliation(s)
- Jie Ren
- Molecular and Computational Biology Program, University of Southern California, Los Angeles, California, USA
| | - Xin Bai
- Molecular and Computational Biology Program, University of Southern California, Los Angeles, California, USA
- Centre for Computational Systems Biology, School of Mathematical Sciences, Fudan University, Shanghai, China
| | - Yang Young Lu
- Molecular and Computational Biology Program, University of Southern California, Los Angeles, California, USA
| | - Kujin Tang
- Molecular and Computational Biology Program, University of Southern California, Los Angeles, California, USA
| | - Ying Wang
- Department of Automation, Xiamen University, Xiamen, Fujian, China
| | - Gesine Reinert
- Department of Statistics, University of Oxford, Oxford, United Kingdom
| | - Fengzhu Sun
- Molecular and Computational Biology Program, University of Southern California, Los Angeles, California, USA
- Centre for Computational Systems Biology, School of Mathematical Sciences, Fudan University, Shanghai, China
| |
|
21
|
Cipcigan F, Carrieri AP, Pyzer-Knapp EO, Krishna R, Hsiao YW, Winn M, Ryadnov MG, Edge C, Martyna G, Crain J. Accelerating molecular discovery through data and physical sciences: Applications to peptide-membrane interactions. J Chem Phys 2018; 148:241744. [DOI: 10.1063/1.5027261] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.3] [Indexed: 01/30/2023] Open
Affiliation(s)
- Flaviu Cipcigan
- IBM Research UK, Hartree Centre, Daresbury WA4 4AD, United Kingdom
| | | | | | - Ritesh Krishna
- IBM Research UK, Hartree Centre, Daresbury WA4 4AD, United Kingdom
| | - Ya-Wen Hsiao
- STFC Daresbury Laboratories, Daresbury WA4 4AD, United Kingdom
| | - Martyn Winn
- STFC Daresbury Laboratories, Daresbury WA4 4AD, United Kingdom
| | - Maxim G. Ryadnov
- National Physical Laboratory, Hampton Road, Teddington, United Kingdom
| | - Colin Edge
- GSK Medicines Research Centre, Stevenage SG1 2NY, United Kingdom
| | - Glenn Martyna
- IBM T. J. Watson Research Center, Yorktown Heights, New York 10598, USA
| | - Jason Crain
- IBM Research UK, Hartree Centre, Daresbury WA4 4AD, United Kingdom
- Maxwell Centre, University of Cambridge, Cambridge CB3 0HE, United Kingdom
| |
|
22
|
Lin J, Wei J, Adjeroh D, Jiang BH, Jiang Y. SSAW: A new sequence similarity analysis method based on the stationary discrete wavelet transform. BMC Bioinformatics 2018; 19:165. [PMID: 29720081 PMCID: PMC5930706 DOI: 10.1186/s12859-018-2155-9] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.9] [Received: 11/19/2017] [Accepted: 04/11/2018] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Alignment-free sequence similarity analysis methods often lead to significant savings in computational time over alignment-based counterparts. RESULTS A new alignment-free sequence similarity analysis method, called SSAW, is proposed. SSAW stands for Sequence Similarity Analysis using the Stationary Discrete Wavelet Transform (SDWT). It extracts k-mers from a sequence, then maps each k-mer to a complex number field. Then, the series of complex numbers formed are transformed into feature vectors using the stationary discrete wavelet transform. After these steps, the original sequence is turned into a feature vector with numeric values, which can then be used for clustering and/or classification. CONCLUSIONS Using two different types of applications, namely, clustering and classification, we compared SSAW against state-of-the-art alignment-free sequence analysis methods. SSAW demonstrates competitive or superior performance in terms of standard indicators, such as accuracy, F-score, precision, and recall. The running time was significantly better in most cases. These make SSAW a suitable method for sequence analysis, especially given the rapidly increasing volumes of sequence data required by most modern applications.
Affiliation(s)
- Jie Lin
- College of Mathematics and Informatics, Fujian Normal University, Fuzhou, 350108, People's Republic of China
| | - Jing Wei
- College of Mathematics and Informatics, Fujian Normal University, Fuzhou, 350108, People's Republic of China
| | - Donald Adjeroh
- Lane Department of Computer Science and Electrical Engineering, West Virginia University, Morgantown, 26506, WV, USA
| | - Bing-Hua Jiang
- Department of Pathology, University of Iowa, Iowa city, 52242, Iowa, USA
| | - Yue Jiang
- College of Mathematics and Informatics, Fujian Normal University, Fuzhou, 350108, People's Republic of China.
| |
|
23
|
Tang K, Lu YY, Sun F. Background Adjusted Alignment-Free Dissimilarity Measures Improve the Detection of Horizontal Gene Transfer. Front Microbiol 2018; 9:711. [PMID: 29713314 PMCID: PMC5911508 DOI: 10.3389/fmicb.2018.00711] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.1] [Received: 01/05/2018] [Accepted: 03/27/2018] [Indexed: 11/20/2022] Open
Abstract
Horizontal gene transfer (HGT) plays an important role in the evolution of microbial organisms including bacteria. Alignment-free methods based on single genome compositional information have been used to detect HGT. Currently, Manhattan and Euclidean distances based on tetranucleotide frequencies are the most commonly used alignment-free dissimilarity measures to detect HGT. By testing on simulated bacterial sequences and real data sets with known horizontally transferred genomic regions, we found that more advanced alignment-free dissimilarity measures such as CVTree and d2* that take into account the background Markov sequences can solve HGT detection problems with significantly improved performance. We also studied the influence of different factors such as evolutionary distance between host and donor sequences, size of sliding window, and host genome composition on the performance of alignment-free methods to detect HGT. Our study showed that alignment-free methods can predict HGT accurately when host and donor genomes belong to different taxonomic orders. Among all methods, CVTree with word length of 3, d2* with word length 3, Markov order 1 and d2* with word length 4, Markov order 1 outperform others in terms of their highest F1-score and their robustness under the influence of different factors.
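The baseline measure named in this abstract, Manhattan distance on tetranucleotide frequencies over a sliding window, can be illustrated with a minimal Python sketch. This is not the authors' implementation and it does not cover CVTree or d2*; the function names (`kmer_freqs`, `manhattan`, `window_scores`), the window and step sizes, and the toy genome are all assumptions made for illustration.

```python
from collections import Counter
from itertools import product


def kmer_freqs(seq, k=4):
    """Relative frequency of every DNA k-mer in seq, in a fixed word order."""
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    words = ["".join(p) for p in product("ACGT", repeat=k)]
    total = sum(counts[w] for w in words)
    return [counts[w] / total if total else 0.0 for w in words]


def manhattan(p, q):
    """Manhattan (L1) distance between two frequency vectors."""
    return sum(abs(a - b) for a, b in zip(p, q))


def window_scores(genome, window=500, step=250, k=4):
    """Distance of each sliding window's k-mer profile from the whole genome."""
    background = kmer_freqs(genome, k)
    scores = []
    for start in range(0, len(genome) - window + 1, step):
        profile = kmer_freqs(genome[start:start + window], k)
        scores.append((start, manhattan(profile, background)))
    return scores


# Hypothetical toy genome: an AT-rich host with a GC-rich insert at 1000-1199.
genome = "AT" * 500 + "GC" * 100 + "AT" * 500
scores = window_scores(genome, window=200, step=100)
best_start, best_score = max(scores, key=lambda s: s[1])
print(best_start, round(best_score, 3))
```

Windows whose composition departs most from the genome-wide profile score highest; in this toy case the top-scoring window is the one that lies entirely inside the insert.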
Affiliation(s)
- Kujin Tang
- Molecular and Computational Biology Program, Department of Biological Sciences, University of Southern California, Los Angeles, CA, United States
| | - Yang Young Lu
- Molecular and Computational Biology Program, Department of Biological Sciences, University of Southern California, Los Angeles, CA, United States
| | - Fengzhu Sun
- Molecular and Computational Biology Program, Department of Biological Sciences, University of Southern California, Los Angeles, CA, United States
- Centre for Computational Systems Biology, School of Mathematical Sciences, Fudan University, Shanghai, China
| |
|
24
|
Zielezinski A, Vinga S, Almeida J, Karlowski WM. Alignment-free sequence comparison: benefits, applications, and tools. Genome Biol 2017; 18:186. [PMID: 28974235 PMCID: PMC5627421 DOI: 10.1186/s13059-017-1319-7] [Citation(s) in RCA: 285] [Impact Index Per Article: 35.6] [Indexed: 01/30/2023] Open
Abstract
Alignment-free sequence analyses have been applied to problems ranging from whole-genome phylogeny to the classification of protein families, identification of horizontally transferred genes, and detection of recombined sequences. The strength of these methods makes them particularly useful for next-generation sequencing data processing and analysis. However, many researchers are unclear about how these methods work, how they compare to alignment-based methods, and what their potential is for use for their research. We address these questions and provide a guide to the currently available alignment-free sequence analysis tools.
Affiliation(s)
- Andrzej Zielezinski
- Department of Computational Biology, Faculty of Biology, Adam Mickiewicz University in Poznan, Umultowska 89, 61-614, Poznan, Poland
| | - Susana Vinga
- IDMEC, Instituto Superior Técnico, Universidade de Lisboa, Av. Rovisco Pais 1, 1049-001, Lisbon, Portugal
| | - Jonas Almeida
- Stony Brook University (SUNY), 101 Nicolls Road, Stony Brook, NY, 11794, USA
| | - Wojciech M Karlowski
- Department of Computational Biology, Faculty of Biology, Adam Mickiewicz University in Poznan, Umultowska 89, 61-614, Poznan, Poland.
| |
|
25
|
Cong Y, Chan YB, Phillips CA, Langston MA, Ragan MA. Robust Inference of Genetic Exchange Communities from Microbial Genomes Using TF-IDF. Front Microbiol 2017; 8:21. [PMID: 28154557 PMCID: PMC5243798 DOI: 10.3389/fmicb.2017.00021] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.1] [Received: 10/25/2016] [Accepted: 01/04/2017] [Indexed: 11/13/2022] Open
Abstract
Bacteria and archaea can exchange genetic material across lineages through processes of lateral genetic transfer (LGT). Collectively, these exchange relationships can be modeled as a network and analyzed using concepts from graph theory. In particular, densely connected regions within an LGT network have been defined as genetic exchange communities (GECs). However, it has been problematic to construct networks in which edges solely represent LGT. Here we apply term frequency-inverse document frequency (TF-IDF), an alignment-free method originating from document analysis, to infer regions of lateral origin in bacterial genomes. We examine four empirical datasets of different size (number of genomes) and phyletic breadth, varying a key parameter (word length k) within bounds established in previous work. We map the inferred lateral regions to genes in recipient genomes, and construct networks in which the nodes are groups of genomes, and the edges natively represent LGT. We then extract maximum and maximal cliques (i.e., GECs) from these graphs, and identify nodes that belong to GECs across a wide range of k. Most surviving lateral transfer has happened within these GECs. Using Gene Ontology enrichment tests we demonstrate that biological processes associated with metabolism, regulation and transport are often over-represented among the genes affected by LGT within these communities. These enrichments are largely robust to change of k.
Affiliation(s)
- Yingnan Cong
- Institute for Molecular Bioscience and ARC Centre of Excellence in Bioinformatics, University of Queensland, St Lucia QLD, Australia
| | - Yao-Ban Chan
- School of Mathematics and Statistics, University of Melbourne, Parkville VIC, Australia
| | - Charles A Phillips
- Department of Electrical Engineering and Computer Science, University of Tennessee, Knoxville TN, USA
| | - Michael A Langston
- Department of Electrical Engineering and Computer Science, University of Tennessee, Knoxville TN, USA
| | - Mark A Ragan
- Institute for Molecular Bioscience and ARC Centre of Excellence in Bioinformatics, University of Queensland, St Lucia QLD, Australia
| |
|
26
|
Abstract
Ernst Haeckel based his landmark Tree of Life on the supposed ontogenic recapitulation of phylogeny, i.e. that successive embryonic stages during the development of an organism re-trace the morphological forms of its ancestors over the course of evolution. Much of this idea has since been discredited. Today, phylogenies are often based on families of molecular sequences. The standard approach starts with a multiple sequence alignment, in which the sequences are arranged relative to each other in a way that maximises a measure of similarity position-by-position along their entire length. A tree (or sometimes a network) is then inferred. Rigorous multiple sequence alignment is computationally demanding, and evolutionary processes that shape the genomes of many microbes (bacteria, archaea and some morphologically simple eukaryotes) can add further complications. In particular, recombination, genome rearrangement and lateral genetic transfer undermine the assumptions that underlie multiple sequence alignment, and imply that a tree-like structure may be too simplistic. Here, using genome sequences of 143 bacterial and archaeal genomes, we construct a network of phylogenetic relatedness based on the number of shared k-mers (subsequences at fixed length k). Our findings suggest that the network captures not only key aspects of microbial genome evolution as inferred from a tree, but also features that are not treelike. The method is highly scalable, allowing for investigation of genome evolution across a large number of genomes. Instead of using specific regions or sequences from genome sequences, or indeed Haeckel’s idea of ontogeny, we argue that genome phylogenies can be inferred using k-mers from whole-genome sequences. Representing these networks dynamically allows biological questions of interest to be formulated and addressed quickly and in a visually intuitive manner.
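As a rough illustration of a relatedness network built from shared k-mers, the sketch below counts the distinct k-mers that each pair of sequences has in common and stores the count as an edge weight. The toy sequences, the function names, and the choice to count distinct k-mers rather than occurrences are all assumptions made for the example, not details taken from the study.

```python
from itertools import combinations


def kmers(seq, k):
    """Return the set of distinct k-mers (length-k substrings) in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def shared_kmer_network(genomes, k):
    """Map each genome pair to the number of distinct k-mers the two share."""
    sets = {name: kmers(seq, k) for name, seq in genomes.items()}
    return {
        (a, b): len(sets[a] & sets[b])
        for a, b in combinations(sorted(sets), 2)
    }


# Hypothetical toy "genomes"; real inputs would be whole-genome sequences.
genomes = {"g1": "ACGTACGT", "g2": "ACGTTTTT", "g3": "GGGGCCCC"}
network = shared_kmer_network(genomes, k=4)
```

Here only g1 and g2 share a 4-mer (ACGT), so they are the only pair joined by a non-zero edge. No alignment is needed at any point, which is the property that lets this kind of comparison scale.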
Affiliation(s)
- Guillaume Bernard, Institute for Molecular Bioscience, The University of Queensland, Brisbane, Australia
- Mark A Ragan, Institute for Molecular Bioscience, The University of Queensland, Brisbane, Australia
- Cheong Xin Chan, Institute for Molecular Bioscience, The University of Queensland, Brisbane, Australia
|
27
|
Exploring lateral genetic transfer among microbial genomes using TF-IDF. Sci Rep 2016; 6:29319. [PMID: 27452976 PMCID: PMC4958990 DOI: 10.1038/srep29319] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.2] [Received: 04/06/2016] [Accepted: 06/13/2016] [Indexed: 11/17/2022] Open
Abstract
Many microbes can acquire genetic material from their environment and incorporate it into their genome, a process known as lateral genetic transfer (LGT). Computational approaches have been developed to detect genomic regions of lateral origin, but typically lack sensitivity, ability to distinguish donor from recipient, and scalability to very large datasets. To address these issues we have introduced an alignment-free method based on ideas from document analysis, term frequency-inverse document frequency (TF-IDF). Here we examine the performance of TF-IDF on three empirical datasets: 27 genomes of Escherichia coli and Shigella, 110 genomes of enteric bacteria, and 143 genomes across 12 bacterial and three archaeal phyla. We investigate the effect of k-mer size, gap size and delineation of groups on the inference of genomic regions of lateral origin, finding an interplay among these parameters and sequence divergence. Because TF-IDF identifies donor groups and delineates regions of lateral origin within recipient genomes, aggregating these regions by gene enables us to explore, for the first time, the mosaic nature of lateral genes including the multiplicity of biological sources, ancestry of transfer and over-writing by subsequent transfers. We carry out Gene Ontology enrichment tests to investigate which biological processes are potentially affected by LGT.
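To make the document-analysis analogy concrete, here is a minimal sketch of generic TF-IDF scoring in which each genome is treated as a document and each k-mer as a term. It uses the textbook formulation (relative term frequency times the log of inverse document frequency). The published method additionally works with groups of genomes, gap sizes, and donor inference, none of which is modelled here. The sequences and function names are hypothetical.

```python
import math
from collections import Counter


def kmer_counts(seq, k):
    """Count occurrences of every k-mer in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def tf_idf(genomes, k):
    """Score each k-mer in each genome with textbook TF-IDF."""
    counts = {name: kmer_counts(seq, k) for name, seq in genomes.items()}
    n_docs = len(counts)
    # Document frequency: in how many genomes does each k-mer occur?
    doc_freq = Counter()
    for c in counts.values():
        doc_freq.update(c.keys())
    scores = {}
    for name, c in counts.items():
        total = sum(c.values())
        scores[name] = {
            kmer: (n / total) * math.log(n_docs / doc_freq[kmer])
            for kmer, n in c.items()
        }
    return scores


# Hypothetical toy "genomes" standing in for real sequence data.
genomes = {"g1": "ACGTACGT", "g2": "ACGTTTTT", "g3": "GGGGCCCC"}
scores = tf_idf(genomes, k=4)
```

Under this weighting a k-mer present in every genome scores zero, while a k-mer that is frequent in one genome but rare across the collection scores highly, which is the intuition behind using TF-IDF to flag sequence that is atypical for its genome.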
|