1
Yang Q, Wang X, Han M, Sheng H, Sun Y, Su L, Lu W, Li M, Wang S, Chen J, Cui S, Yang BW. Bacterial genome-wide association studies: exploring the genetic variation underlying bacterial phenotypes. Appl Environ Microbiol 2025:e0251224. [PMID: 40377303 DOI: 10.1128/aem.02512-24] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/18/2025]
Abstract
With the continuous advancements in high-throughput genome sequencing technologies and the development of innovative bioinformatics tools, bacterial genome-wide association studies (BGWAS) have emerged as a transformative approach for investigating the genetic variations underlying diverse bacterial phenotypes at the population genome level. This review provides a comprehensive overview of the application of BGWAS in elucidating genetic determinants of bacterial drug resistance, pathogenicity, host specificity, biofilm formation, and probiotic fermentation characteristics. We systematically summarize the BGWAS workflow, including study design, data analysis pipelines, and the bioinformatics software employed at various stages. Furthermore, we highlight specialized tools tailored for BGWAS and discuss their unique features and applications. We also discuss confounding factors that can influence the accuracy and reliability of BGWAS results, including population structure, linkage disequilibrium, and multiple testing. By incorporating recent advancements, this review serves as a comprehensive reference for researchers utilizing BGWAS to uncover the genetic basis of bacterial phenotypes.
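The abstract names multiple testing as one of the confounding factors in BGWAS. As a minimal sketch of what such a correction can look like, the block below applies the Bonferroni and Benjamini-Hochberg procedures to a list of per-variant p-values. The function names and the choice of these two procedures are illustrative assumptions; the review itself is not quoted here as prescribing either one.

```python
def bonferroni(p_values, alpha=0.05):
    """Flag tests that pass the Bonferroni threshold alpha / m."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


def benjamini_hochberg(p_values, alpha=0.05):
    """Flag tests rejected by the Benjamini-Hochberg step-up procedure."""
    m = len(p_values)
    # Indices of the p-values, sorted from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k such that p_(k) <= (k / m) * alpha.
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            cutoff_rank = rank
    # Reject every hypothesis whose rank is at or below that cutoff.
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= cutoff_rank:
            rejected[idx] = True
    return rejected


# Hypothetical p-values for ten variants (made up for illustration).
example_p = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
bonferroni_hits = bonferroni(example_p)
bh_hits = benjamini_hochberg(example_p)
```

Bonferroni controls the family-wise error rate and is the stricter of the two; Benjamini-Hochberg controls the false discovery rate and typically retains more candidate variants when many tests are run.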
Affiliation(s)
- Qiuping Yang
  - College of Food Science and Engineering, Northwest A&F University, Shaanxi, China
- Xiaoqi Wang
  - College of Food Science and Engineering, Northwest A&F University, Shaanxi, China
- Mengting Han
  - College of Food Science and Engineering, Northwest A&F University, Shaanxi, China
- Huanjing Sheng
  - College of Food Science and Engineering, Northwest A&F University, Shaanxi, China
- Yulu Sun
  - College of Food Science and Engineering, Northwest A&F University, Shaanxi, China
- Li Su
  - College of Food Science and Engineering, Northwest A&F University, Shaanxi, China
- Wenjing Lu
  - College of Food Science and Engineering, Northwest A&F University, Shaanxi, China
- Mei Li
  - College of Food Science and Engineering, Northwest A&F University, Shaanxi, China
- Siyue Wang
  - College of Food Science and Engineering, Northwest A&F University, Shaanxi, China
- Jia Chen
  - College of Chemical Technology, Shijiazhuang University, Shijiazhuang, China
- Shenghui Cui
  - National Institutes for Food and Drug Control, Beijing, China
- Bao-Wei Yang
  - College of Food Science and Engineering, Northwest A&F University, Shaanxi, China
2
Roberts MD, Davis O, Josephs EB, Williamson RJ. K-mer-based Approaches to Bridging Pangenomics and Population Genetics. Mol Biol Evol 2025; 42:msaf047. [PMID: 40111256 PMCID: PMC11925024 DOI: 10.1093/molbev/msaf047] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/18/2024] [Revised: 01/10/2025] [Accepted: 02/04/2025] [Indexed: 03/12/2025]
Abstract
Many commonly studied species now have more than one chromosome-scale genome assembly, revealing a large amount of genetic diversity previously missed by approaches that map short reads to a single reference. However, many species still lack multiple reference genomes and correctly aligning references to build pangenomes can be challenging for many species, limiting our ability to study this missing genomic variation in population genetics. Here, we argue that k-mers are a very useful but underutilized tool for bridging the reference-focused paradigms of population genetics with the reference-free paradigms of pangenomics. We review current literature on the uses of k-mers for performing three core components of most population genetics analyses: identifying, measuring, and explaining patterns of genetic variation. We also demonstrate how different k-mer-based measures of genetic variation behave in population genetic simulations according to the choice of k, depth of sequencing coverage, and degree of data compression. Overall, we find that k-mer-based measures of genetic diversity scale consistently with pairwise nucleotide diversity (π) up to values of about π=0.025 (R²=0.97) for neutrally evolving populations. For populations with even more variation, using shorter k-mers will maintain the scalability up to at least π=0.1. Furthermore, in our simulated populations, k-mer dissimilarity values can be reliably approximated from counting bloom filters, highlighting a potential avenue to decreasing the memory burden of k-mer-based genomic dissimilarity analyses. For future studies, there is a great opportunity to further develop methods for identifying selected loci using k-mers.
Affiliation(s)
- Miles D Roberts
  - Genetics and Genome Sciences Program, Michigan State University, East Lansing, MI 48824, USA
- Olivia Davis
  - Department of Computer Science and Software Engineering, Rose-Hulman Institute of Technology, Terre Haute, IN 47803, USA
- Emily B Josephs
  - Department of Plant Biology, Michigan State University, East Lansing, MI 48824, USA
  - Ecology, Evolution, and Behavior Program, Michigan State University, East Lansing, MI 48824, USA
  - Plant Resilience Institute, Michigan State University, East Lansing, MI 48824, USA
- Robert J Williamson
  - Department of Computer Science and Software Engineering, Rose-Hulman Institute of Technology, Terre Haute, IN 47803, USA
  - Department of Biology and Biomedical Engineering, Rose-Hulman Institute of Technology, Terre Haute, IN 47803, USA
3
Roberts M, Josephs EB. Previously unmeasured genetic diversity explains part of Lewontin's paradox in a k-mer-based meta-analysis of 112 plant species. bioRxiv 2024:2024.05.17.594778. [PMID: 38798362 PMCID: PMC11118579 DOI: 10.1101/2024.05.17.594778] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/29/2024]
Abstract
At the molecular level, most evolution is expected to be neutral. A key prediction of this expectation is that the level of genetic diversity in a population should scale with population size. However, as was noted by Richard Lewontin in 1974 and reaffirmed by later studies, the slope of the population size-diversity relationship in nature is much weaker than expected under neutral theory. We hypothesize that one contributor to this paradox is that current methods relying on single nucleotide polymorphisms (SNPs) called from aligning short reads to a reference genome underestimate levels of genetic diversity in many species. To test this idea, we calculated nucleotide diversity (π) and k-mer-based metrics of genetic diversity across 112 plant species, amounting to over 205 terabases of DNA sequencing data from 27,488 individual plants. We then compared how these different metrics correlated with proxies of population size that account for both range size and population density variation across species. We found that our population size proxies scaled anywhere from about 3 to over 20 times faster with k-mer diversity than nucleotide diversity after adjusting for evolutionary history, mating system, life cycle habit, cultivation status, and invasiveness. The relationship between k-mer diversity and population size proxies also remains significant after correcting for genome size, whereas the analogous relationship for nucleotide diversity does not. These results suggest that variation not captured by common SNP-based analyses explains part of Lewontin's paradox in plants.
Affiliation(s)
- Miles Roberts
  - Genetics and Genome Sciences Program, Michigan State University, East Lansing, MI
- Emily B. Josephs
  - Department of Plant Biology, Michigan State University, East Lansing, MI
  - Ecology, Evolution, and Behavior Program, Michigan State University, East Lansing, MI
  - Plant Resilience Institute, Michigan State University, East Lansing, MI
4
Corut AK, Wallace JG. kGWASflow: a modular, flexible, and reproducible Snakemake workflow for k-mers-based GWAS. G3 (Bethesda) 2023; 14:jkad246. [PMID: 37976215 PMCID: PMC10755180 DOI: 10.1093/g3journal/jkad246] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/07/2023] [Accepted: 10/15/2023] [Indexed: 11/19/2023]
Abstract
Genome-wide association studies (GWAS) have been widely used to identify genetic variation associated with complex traits. Despite its success and popularity, the traditional GWAS approach comes with a variety of limitations. For this reason, newer methods for GWAS have been developed, including the use of pan-genomes instead of a reference genome and the utilization of markers beyond single-nucleotide polymorphisms, such as structural variations and k-mers. The k-mers-based GWAS approach has especially gained attention from researchers in recent years. However, these new methodologies can be complicated and challenging to implement. Here, we present kGWASflow, a modular, user-friendly, and scalable workflow to perform GWAS using k-mers. We adopted an existing kmersGWAS method into an easier and more accessible workflow using management tools like Snakemake and Conda and eliminated the challenges caused by missing dependencies and version conflicts. kGWASflow increases the reproducibility of the kmersGWAS method by automating each step with Snakemake and using containerization tools like Docker. The workflow encompasses supplemental components such as quality control, read-trimming procedures, and generating summary statistics. kGWASflow also offers post-GWAS analysis options to identify the genomic location and context of trait-associated k-mers. kGWASflow can be applied to any organism and requires minimal programming skills. kGWASflow is freely available on GitHub (https://github.com/akcorut/kGWASflow) and Bioconda (https://anaconda.org/bioconda/kgwasflow).
Affiliation(s)
- Adnan Kivanc Corut
  - Institute of Bioinformatics, University of Georgia, Athens, GA 30602, USA
- Jason G Wallace
  - Institute of Bioinformatics, University of Georgia, Athens, GA 30602, USA
  - Institute of Plant Breeding, Genetics, and Genomics, University of Georgia, Athens, GA 30602, USA
  - Department of Crop and Soil Sciences, University of Georgia, Athens, GA 30602, USA
5
Lemane T, Chikhi R, Peterlongo P. kmdiff, large-scale and user-friendly differential k-mer analyses. Bioinformatics 2022; 38:5443-5445. [PMID: 36315078 PMCID: PMC9750116 DOI: 10.1093/bioinformatics/btac689] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 06/24/2022] [Revised: 09/23/2022] [Accepted: 10/28/2022] [Indexed: 12/25/2022]
Abstract
SUMMARY: Genome wide association studies elucidate links between genotypes and phenotypes. Recent studies point out the interest of conducting such experiments using k-mers as the base signal instead of single-nucleotide polymorphisms. We propose a tool, kmdiff, that performs differential k-mer analyses on large sequencing cohorts in an order of magnitude less time and memory than previously possible. AVAILABILITY AND IMPLEMENTATION: https://github.com/tlemane/kmdiff. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Téo Lemane
  - Univ. Rennes, Inria, CNRS, IRISA - UMR 6074, Rennes, F-35000 France
- Rayan Chikhi
  - Institut Pasteur, Université Paris Cité, Sequence Bioinformatics, Paris, F-75015, France
6
Gupta PK. GWAS for genetics of complex quantitative traits: Genome to pangenome and SNPs to SVs and k-mers. Bioessays 2021; 43:e2100109. [PMID: 34486143 DOI: 10.1002/bies.202100109] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.5] [Received: 04/27/2021] [Revised: 08/21/2021] [Accepted: 08/23/2021] [Indexed: 12/22/2022]
Abstract
The development of improved methods for genome-wide association studies (GWAS) for genetics of quantitative traits has been an active area of research during the last 25 years. This activity initially started with the use of mixed linear model (MLM), which was variously modified. During the last decade, however, with the availability of high throughput next generation sequencing (NGS) technology, development and use of pangenomes and novel markers including structural variations (SVs) and k-mers for GWAS has taken over as a new thrust area of research. Pangenomes and SVs are now available in humans, livestock, and a number of plant species, so that these resources along with k-mers are being used in GWAS for exploring additional genetic variation that was hitherto not available for analysis. These developments have resulted in significant improvement in GWAS methodology for detection of marker-trait associations (MTAs) that are relevant to human healthcare and crop improvement.
Affiliation(s)
- Pushpendra K Gupta
  - Department of Genetics and Plant Breeding, Ch. Charan Singh University Meerut, Meerut, Uttar Pradesh, India
7
Mehrab Z, Mobin J, Tahmid IA, Pachter L, Rahman A. Reference-free Association Mapping from Sequencing Reads Using k-mers. Bio Protoc 2020; 10:e3815. [PMID: 33659468 PMCID: PMC7842384 DOI: 10.21769/bioprotoc.3815] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Received: 05/29/2020] [Revised: 09/21/2020] [Accepted: 10/14/2020] [Indexed: 11/02/2022]
Abstract
Association mapping is the process of linking phenotypes with genotypes. In genome wide association studies (GWAS), individuals are first genotyped using microarrays or by aligning sequenced reads to reference genomes. However, both these approaches rely on reference genomes which limits their application to organisms with no or incomplete reference genomes. To address this, reference free association mapping methods have been developed. Here we present the protocol of an alignment free method for association studies which is based on counting k-mers in sequenced reads, testing for associations between k-mers and the phenotype of interest, and local assembly of the k-mers of statistical significance. The method can map associations of categorical phenotypes to sequence and structural variations without requiring prior sequencing of reference genomes.
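The first two steps of the protocol described above (counting k-mers per sample, then testing each k-mer against a categorical phenotype) can be sketched roughly as follows. This is not the protocol's own statistic: the sketch swaps in a two-sided Fisher's exact test on k-mer presence/absence, chosen only because it fits in a few lines of standard-library Python. All function names and the toy reads are hypothetical, the local assembly step is omitted, and no correction for population structure or multiple testing is applied.

```python
from math import comb

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical_kmers(seq, k):
    """Return the set of canonical k-mers (strand-independent) found in a sequence."""
    seq = seq.upper()
    kmers = set()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) <= set("ACGT"):
            kmers.add(min(kmer, kmer.translate(_COMPLEMENT)[::-1]))
    return kmers


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of seeing x in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum the probabilities of all tables at least as extreme as the observed one.
    total = sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9))
    return min(1.0, total)


def kmer_association(samples, is_case, k):
    """Map each canonical k-mer to a p-value for presence/absence versus phenotype."""
    presence = {name: canonical_kmers(seq, k) for name, seq in samples.items()}
    n_case = sum(1 for name in samples if is_case[name])
    n_ctrl = len(samples) - n_case
    p_values = {}
    for kmer in set().union(*presence.values()):
        case_with = sum(1 for n in samples if is_case[n] and kmer in presence[n])
        ctrl_with = sum(1 for n in samples if not is_case[n] and kmer in presence[n])
        p_values[kmer] = fisher_exact_two_sided(
            case_with, n_case - case_with, ctrl_with, n_ctrl - ctrl_with
        )
    return p_values


# Toy data (made up): cases carry GATTACA, controls carry GATTGCA instead.
toy_samples = {
    "case1": "AAGATTACAGG", "case2": "CCGATTACATT", "case3": "TTGATTACACC",
    "ctrl1": "AAGATTGCAGG", "ctrl2": "CCGATTGCATT", "ctrl3": "TTGATTGCACC",
}
toy_is_case = {name: name.startswith("case") for name in toy_samples}
toy_p = kmer_association(toy_samples, toy_is_case, k=7)
```

In a real analysis the per-sample k-mer sets would come from a dedicated k-mer counter run on sequencing reads rather than from short strings held in memory, and the resulting p-values would still need to be corrected for the number of k-mers tested.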
Affiliation(s)
- Zakaria Mehrab
  - Department of Computer Science and Engineering, Bangladesh University of Engineering and Technology, Dhaka, Bangladesh
  - Department of Computer Science and Engineering, United International University, Dhaka, Bangladesh
- Jaiaid Mobin
  - Department of Computer Science and Engineering, Bangladesh University of Engineering and Technology, Dhaka, Bangladesh
- Ibrahim Asadullah Tahmid
  - Department of Computer Science and Engineering, Bangladesh University of Engineering and Technology, Dhaka, Bangladesh
- Lior Pachter
  - Departments of Biology and Computing & Mathematical Sciences, California Institute of Technology, Pasadena, United States
- Atif Rahman
  - Department of Computer Science and Engineering, Bangladesh University of Engineering and Technology, Dhaka, Bangladesh