1
Roberts MD, Davis O, Josephs EB, Williamson RJ. K-mer-based Approaches to Bridging Pangenomics and Population Genetics. Mol Biol Evol 2025; 42:msaf047. [PMID: 40111256 PMCID: PMC11925024 DOI: 10.1093/molbev/msaf047] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/18/2024] [Revised: 01/10/2025] [Accepted: 02/04/2025] [Indexed: 03/12/2025]
Abstract
Many commonly studied species now have more than one chromosome-scale genome assembly, revealing a large amount of genetic diversity previously missed by approaches that map short reads to a single reference. However, many species still lack multiple reference genomes, and correctly aligning references to build pangenomes can be challenging, limiting our ability to study this missing genomic variation in population genetics. Here, we argue that k-mers are a very useful but underutilized tool for bridging the reference-focused paradigms of population genetics with the reference-free paradigms of pangenomics. We review current literature on the uses of k-mers for performing three core components of most population genetics analyses: identifying, measuring, and explaining patterns of genetic variation. We also demonstrate how different k-mer-based measures of genetic variation behave in population genetic simulations according to the choice of k, depth of sequencing coverage, and degree of data compression. Overall, we find that k-mer-based measures of genetic diversity scale consistently with pairwise nucleotide diversity (π) up to values of about π = 0.025 (R² = 0.97) for neutrally evolving populations. For populations with even more variation, using shorter k-mers will maintain the scalability up to at least π = 0.1. Furthermore, in our simulated populations, k-mer dissimilarity values can be reliably approximated from counting Bloom filters, highlighting a potential avenue for decreasing the memory burden of k-mer-based genomic dissimilarity analyses. For future studies, there is a great opportunity to further develop methods for identifying selected loci using k-mers.
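As a rough illustration of what a k-mer-based dissimilarity between two samples can look like, the sketch below computes a Jaccard dissimilarity (presence/absence) and a Bray-Curtis dissimilarity (abundance) from exact k-mer counts. The function names, the toy sequences, and the choice of these two measures are assumptions made for illustration; the abstract does not specify which measures the authors evaluated, and no Bloom filter is used here.

```python
from collections import Counter


def kmer_counts(seq: str, k: int) -> Counter:
    """Count every overlapping substring of length k in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def jaccard_dissimilarity(a: Counter, b: Counter) -> float:
    """1 - |A ∩ B| / |A ∪ B|, using only k-mer presence/absence."""
    union = set(a) | set(b)
    if not union:
        return 0.0
    return 1.0 - len(set(a) & set(b)) / len(union)


def bray_curtis_dissimilarity(a: Counter, b: Counter) -> float:
    """1 - 2 * (sum of shared minimum counts) / (total count in both samples)."""
    total = sum(a.values()) + sum(b.values())
    if total == 0:
        return 0.0
    shared = sum(min(a[km], b[km]) for km in set(a) & set(b))
    return 1.0 - 2.0 * shared / total


# Hypothetical toy haplotypes that differ by a single substitution (T -> A).
hap1 = "ACGTTGCAAT"
hap2 = "ACGTAGCAAT"
c1, c2 = kmer_counts(hap1, 3), kmer_counts(hap2, 3)
print(jaccard_dissimilarity(c1, c2))
print(bray_curtis_dissimilarity(c1, c2))
```

A single substitution changes up to k overlapping k-mers, which is one intuition for why shorter k-mers saturate more slowly as divergence grows.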
Affiliation(s)
- Miles D Roberts
  - Genetics and Genome Sciences Program, Michigan State University, East Lansing, MI 48824, USA
- Olivia Davis
  - Department of Computer Science and Software Engineering, Rose-Hulman Institute of Technology, Terre Haute, IN 47803, USA
- Emily B Josephs
  - Department of Plant Biology, Michigan State University, East Lansing, MI 48824, USA
  - Ecology, Evolution, and Behavior Program, Michigan State University, East Lansing, MI 48824, USA
  - Plant Resilience Institute, Michigan State University, East Lansing, MI 48824, USA
- Robert J Williamson
  - Department of Computer Science and Software Engineering, Rose-Hulman Institute of Technology, Terre Haute, IN 47803, USA
  - Department of Biology and Biomedical Engineering, Rose-Hulman Institute of Technology, Terre Haute, IN 47803, USA
2
Jenike KM, Campos-Domínguez L, Boddé M, Cerca J, Hodson CN, Schatz MC, Jaron KS. k-mer approaches for biodiversity genomics. Genome Res 2025; 35:219-230. [PMID: 39890468 PMCID: PMC11874746 DOI: 10.1101/gr.279452.124] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/11/2024] [Accepted: 01/09/2025] [Indexed: 02/03/2025]
Abstract
The wide array of currently available genomes displays a wonderful diversity in size, composition, and structure and is quickly expanding thanks to several global biodiversity genomics initiatives. However, sequencing of genomes, even with the latest technologies, can still be challenging for both technical (e.g., small physical size, contaminated samples, or access to appropriate sequencing platforms) and biological reasons (e.g., germline-restricted DNA, variable ploidy levels, sex chromosomes, or very large genomes). In recent years, k-mer-based techniques have become popular to overcome some of these challenges. They are based on the simple process of dividing the analyzed sequences (e.g., raw reads or genomes) into a set of subsequences of length k, called k-mers, and then analyzing the frequency or sequences of those k-mers. Analyses based on k-mers allow for a rapid and intuitive assessment of complex sequencing data sets. Here, we provide a comprehensive review of the theoretical properties and practical applications of k-mers in biodiversity genomics, with a special focus on genome modeling.
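The decomposition described above can be sketched in a few lines: slide a window of length k along each sequence, count the resulting k-mers, and then tabulate how many distinct k-mers occur at each multiplicity (the k-mer spectrum that genome-modeling approaches typically start from). This is a minimal sketch with hypothetical function names and toy reads; it does not merge a k-mer with its reverse complement, which real k-mer counters usually do.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def extract_kmers(seq: str, k: int) -> List[str]:
    """Return all overlapping substrings of length k, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def kmer_spectrum(reads: Iterable[str], k: int) -> Tuple[Counter, Counter]:
    """Return (k-mer -> count, multiplicity -> number of distinct k-mers)."""
    counts: Counter = Counter()
    for read in reads:
        counts.update(extract_kmers(read, k))
    spectrum = Counter(counts.values())
    return counts, spectrum


# Two hypothetical overlapping reads.
reads = ["ACGTAC", "CGTACG"]
counts, spectrum = kmer_spectrum(reads, k=3)
print(dict(counts))
print(dict(spectrum))
```

A sequence of length L yields L - k + 1 k-mers, so reads shorter than k contribute nothing.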
Affiliation(s)
- Katharine M Jenike
  - Johns Hopkins University, School of Medicine, Baltimore, Maryland 21205, USA
- Lucía Campos-Domínguez
  - Centre for Research in Agricultural Genomics, CRAG (CSIC-IRTA-UAB-UB), Campus UAB, Cerdanyola del Vallès, 08193 Barcelona, Spain
- Marilou Boddé
  - Tree of Life, Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SA, United Kingdom
- José Cerca
  - Center for Ecological and Evolutionary Synthesis, Department of Biosciences, University of Oslo, 0313 Oslo, Norway
- Christina N Hodson
  - University College London, UCL Department of Genetics, Evolution & Environment, London, WC1E 6BT, United Kingdom
- Michael C Schatz
  - Johns Hopkins University, School of Medicine, Baltimore, Maryland 21205, USA
- Kamil S Jaron
  - Tree of Life, Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SA, United Kingdom
3
Moeckel C, Mareboina M, Konnaris MA, Chan CS, Mouratidis I, Montgomery A, Chantzi N, Pavlopoulos GA, Georgakopoulos-Soares I. A survey of k-mer methods and applications in bioinformatics. Comput Struct Biotechnol J 2024; 23:2289-2303. [PMID: 38840832 PMCID: PMC11152613 DOI: 10.1016/j.csbj.2024.05.025] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/13/2024] [Revised: 05/14/2024] [Accepted: 05/15/2024] [Indexed: 06/07/2024]
Abstract
The rapid progression of genomics and proteomics has been driven by the advent of advanced sequencing technologies, large, diverse, and readily available omics datasets, and the evolution of computational data processing capabilities. The vast amount of data generated by these advancements necessitates efficient algorithms to extract meaningful information. K-mers serve as a valuable tool when working with large sequencing datasets, offering several advantages in computational speed and memory efficiency and carrying the potential for intrinsic biological functionality. This review provides an overview of the methods, applications, and significance of k-mers in genomic and proteomic data analyses, as well as the utility of absent sequences, including nullomers and nullpeptides, in disease detection, vaccine development, therapeutics, and forensic science. Therefore, the review highlights the pivotal role of k-mers in addressing current genomic and proteomic problems and underscores their potential for future breakthroughs in research.
Affiliation(s)
- Camille Moeckel
  - Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, USA
- Manvita Mareboina
  - Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, USA
- Maxwell A. Konnaris
  - Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, USA
- Candace S.Y. Chan
  - Department of Bioengineering and Therapeutic Sciences, University of California San Francisco, San Francisco, CA, USA
- Ioannis Mouratidis
  - Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, USA
  - Huck Institute of the Life Sciences, Penn State University, University Park, Pennsylvania, USA
- Austin Montgomery
  - Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, USA
- Nikol Chantzi
  - Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, USA
- Ilias Georgakopoulos-Soares
  - Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, USA
  - Huck Institute of the Life Sciences, Penn State University, University Park, Pennsylvania, USA
4
Bouhouch Y, Aggad D, Richet N, Rehman S, Al-Jaboobi M, Kehel Z, Esmaeel Q, Hafidi M, Jacquard C, Sanchez L. Early Detection of Both Pyrenophora teres f. teres and f. maculata in Asymptomatic Barley Leaves Using Digital Droplet PCR (ddPCR). Int J Mol Sci 2024; 25:11980. [PMID: 39596050 PMCID: PMC11593351 DOI: 10.3390/ijms252211980] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/30/2024] [Revised: 10/28/2024] [Accepted: 11/01/2024] [Indexed: 11/28/2024]
Abstract
Efficient early pathogen detection, before symptoms appear, is crucial for optimizing disease management. In barley, the fungal pathogen Pyrenophora teres is the causative agent of net blotch disease, which exists in two forms: P. teres f. sp. teres (Ptt), causing the net form of net blotch (NTNB), and P. teres f. sp. maculata (Ptm), responsible for the spot form of net blotch (STNB). In this study, we developed primers and a TaqMan probe to detect both Ptt and Ptm. A comprehensive k-mer-based analysis was performed across a collection of P. teres genomes to identify the conserved regions that had potential as universal genetic markers. These regions were then analyzed for their prevalence and copy number across diverse Moroccan P. teres strains, using both a k-mer analysis for sequence identification and a phylogenetic assessment to establish genetic relatedness. The designed primer-probe set was successfully validated through qPCR, and early disease detection, prior to symptom development, was achieved using ddPCR. The k-mer analysis performed across the available P. teres genomes suggests the potential for these sequences to serve as universal markers for P. teres, transcending environmental variations.
Affiliation(s)
- Yassine Bouhouch
  - INRAE, RIBP, Université de Reims Champagne-Ardenne, USC 1488, BP 1039 Reims, France
  - Plateformes Technologiques URCATech, Plateau MOBICYTE, Université de Reims Champagne-Ardenne, BP 1039 Reims, France
- Dina Aggad
  - Plateformes Technologiques URCATech, Plateau MOBICYTE, Université de Reims Champagne-Ardenne, BP 1039 Reims, France
- Nicolas Richet
  - INRAE, RIBP, Université de Reims Champagne-Ardenne, USC 1488, BP 1039 Reims, France
- Sajid Rehman
  - Biodiversity and Crop Improvement Program, International Center for Agricultural Research in the Dry Areas, Rabat BP 6202, Morocco
- Muamar Al-Jaboobi
  - Biodiversity and Crop Improvement Program, International Center for Agricultural Research in the Dry Areas, Rabat BP 6202, Morocco
- Zakaria Kehel
  - Biodiversity and Crop Improvement Program, International Center for Agricultural Research in the Dry Areas, Rabat BP 6202, Morocco
- Qassim Esmaeel
  - INRAE, RIBP, Université de Reims Champagne-Ardenne, USC 1488, BP 1039 Reims, France
- Majida Hafidi
  - Laboratoire de Biotechnologie Végétale et de Biologie Moléculaire, Faculté des Sciences, Université Moulay Ismail, Zitoune, Meknès BP 11201, Morocco
- Cédric Jacquard
  - INRAE, RIBP, Université de Reims Champagne-Ardenne, USC 1488, BP 1039 Reims, France
- Lisa Sanchez
  - INRAE, RIBP, Université de Reims Champagne-Ardenne, USC 1488, BP 1039 Reims, France
5
He C, Washburn JD, Schleif N, Hao Y, Kaeppler H, Kaeppler SM, Zhang Z, Yang J, Liu S. Trait association and prediction through integrative k-mer analysis. THE PLANT JOURNAL : FOR CELL AND MOLECULAR BIOLOGY 2024; 120:833-850. [PMID: 39259496 DOI: 10.1111/tpj.17012] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/03/2024] [Revised: 08/14/2024] [Accepted: 08/22/2024] [Indexed: 09/13/2024]
Abstract
Genome-wide association study (GWAS) with single nucleotide polymorphisms (SNPs) has been widely used to explore genetic controls of phenotypic traits. Alternatively, GWAS can use counts of substrings of length k from longer sequencing reads, k-mers, as genotyping data. Using maize cob and kernel color traits, we demonstrated that k-mer GWAS can effectively identify associated k-mers. Co-expression analysis of kernel color k-mers and genes directly found k-mers from known causal genes. Analyzing complex traits of kernel oil and leaf angle resulted in k-mers from both known and candidate genes. A gene encoding a MADS transcription factor was functionally validated by showing that ectopic expression of the gene led to less upright leaves. Evolution analysis revealed most k-mers positively correlated with kernel oil were strongly selected against in maize populations, while most k-mers for upright leaf angle were positively selected. In addition, genomic prediction of kernel oil, leaf angle, and flowering time using k-mer data resulted in a similarly high prediction accuracy to the standard SNP-based method. Collectively, we showed k-mer GWAS is a powerful approach for identifying trait-associated genetic elements. Further, our results demonstrated the bridging role of k-mers for data integration and functional gene discovery.
Affiliation(s)
- Cheng He
  - Department of Plant Pathology, Kansas State University, Manhattan, Kansas, 66506, USA
- Jacob D Washburn
  - Plant Genetics Research Unit, USDA-ARS, Columbia, Missouri, 65211, USA
- Nathaniel Schleif
  - Department of Agronomy, University of Wisconsin-Madison, Madison, Wisconsin, 53706, USA
- Yangfan Hao
  - Department of Plant Pathology, Kansas State University, Manhattan, Kansas, 66506, USA
- Heidi Kaeppler
  - Department of Agronomy, University of Wisconsin-Madison, Madison, Wisconsin, 53706, USA
- Shawn M Kaeppler
  - Department of Agronomy, University of Wisconsin-Madison, Madison, Wisconsin, 53706, USA
- Zhiwu Zhang
  - Department of Crop and Soil Sciences, Washington State University, Pullman, Washington, 99164, USA
- Jinliang Yang
  - Department of Agronomy and Horticulture, University of Nebraska-Lincoln, Lincoln, Nebraska, 68583-0915, USA
  - Center for Plant Science Innovation, University of Nebraska-Lincoln, Lincoln, Nebraska, 68583, USA
- Sanzhen Liu
  - Department of Plant Pathology, Kansas State University, Manhattan, Kansas, 66506, USA
6
Santoro D, Pellegrina L, Comin M, Vandin F. SPRISS: approximating frequent k-mers by sampling reads, and applications. Bioinformatics 2022; 38:3343-3350. [PMID: 35583271 PMCID: PMC9237683 DOI: 10.1093/bioinformatics/btac180] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/16/2021] [Revised: 02/25/2022] [Accepted: 05/16/2022] [Indexed: 11/29/2022]
Abstract
MOTIVATION: The extraction of k-mers is a fundamental component in many complex analyses of large next-generation sequencing datasets, including reads classification in genomics and the characterization of RNA-seq datasets. The extraction of all k-mers and their frequencies is extremely demanding in terms of running time and memory, owing to the size of the data and to the exponential number of k-mers to be considered. However, in several applications, only frequent k-mers, which are k-mers appearing in a relatively high proportion of the data, are required by the analysis.
RESULTS: In this work, we present SPRISS, a new efficient algorithm to approximate frequent k-mers and their frequencies in next-generation sequencing data. SPRISS uses a simple yet powerful reads sampling scheme, which makes it possible to extract a representative subset of the dataset that can be used, in combination with any k-mer counting algorithm, to perform downstream analyses in a fraction of the time required by the analysis of the whole data, while obtaining comparable answers. Our extensive experimental evaluation demonstrates the efficiency and accuracy of SPRISS in approximating frequent k-mers, and shows that it can be used in various scenarios, such as the comparison of metagenomic datasets, the identification of discriminative k-mers, and SNP (single nucleotide polymorphism) genotyping, to extract insights in a fraction of the time required by the analysis of the whole dataset.
AVAILABILITY AND IMPLEMENTATION: SPRISS [a preliminary version (Santoro et al., 2021) of this work was presented at RECOMB 2021] is available at https://github.com/VandinLab/SPRISS.
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
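The general idea of estimating frequent k-mers from a sample of reads, rather than from the whole dataset, can be illustrated with the sketch below. It is not SPRISS: it draws a plain uniform sample of reads without replacement and offers none of the approximation guarantees of the published algorithm. The function names, the threshold `theta`, and the `sample_fraction` parameter are assumptions chosen for illustration.

```python
import random
from collections import Counter
from typing import Dict, List


def kmer_frequencies(reads: List[str], k: int) -> Dict[str, float]:
    """Relative frequency of each k-mer among all k-mer occurrences in reads."""
    counts: Counter = Counter()
    for read in reads:
        counts.update(read[i:i + k] for i in range(len(read) - k + 1))
    total = sum(counts.values())
    if total == 0:
        return {}
    return {kmer: c / total for kmer, c in counts.items()}


def approx_frequent_kmers(reads: List[str], k: int, theta: float,
                          sample_fraction: float, seed: int = 0) -> Dict[str, float]:
    """Estimate k-mers with relative frequency >= theta from a random read sample."""
    if not reads:
        return {}
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = min(len(reads), max(1, round(sample_fraction * len(reads))))
    sample = rng.sample(reads, n)  # uniform sample without replacement
    freqs = kmer_frequencies(sample, k)
    return {kmer: f for kmer, f in freqs.items() if f >= theta}


# Hypothetical dataset: 90 reads from one sequence, 10 from another.
reads = ["AAAA"] * 90 + ["CCCC"] * 10
print(approx_frequent_kmers(reads, k=3, theta=0.5, sample_fraction=0.5))
```

Because only the sampled reads are counted, the counting cost scales with the sample size; the price is sampling error in the estimated frequencies, which matters most for k-mers whose true frequency is close to the threshold.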
Affiliation(s)
- Diego Santoro
  - Department of Information Engineering, University of Padova, 35131 Padova, Italy
- Leonardo Pellegrina
  - Department of Information Engineering, University of Padova, 35131 Padova, Italy
- Matteo Comin
  - Department of Information Engineering, University of Padova, 35131 Padova, Italy
- Fabio Vandin
  - Department of Information Engineering, University of Padova, 35131 Padova, Italy
7
Repetitive Sequence Barcode Probe for Karyotype Analysis in Tripidium arundinaceum. Int J Mol Sci 2022; 23:ijms23126726. [PMID: 35743180 PMCID: PMC9224303 DOI: 10.3390/ijms23126726] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/22/2022] [Revised: 06/10/2022] [Accepted: 06/14/2022] [Indexed: 11/17/2022]
Abstract
The barcode probe is a convenient and efficient tool for molecular cytogenetics. Tripidium arundinaceum, as a polyploid wild allied genus of Saccharum, is a useful genetic resource that confers biotic and abiotic stress resistance for sugarcane breeding. Unfortunately, the basic cytogenetic information is still unclear due to the complex genome. We constructed the Cot-20 library for screening moderately and highly repetitive sequences from T. arundinaceum, and the chromosomal distribution of these repetitive sequences was explored. We used the barcode of repetitive sequence probes to distinguish the ten chromosome types of T. arundinaceum by fluorescence in situ hybridization (FISH) with Ea-0907, Ea-0098, and 45S rDNA. Furthermore, the distinction among homologous chromosomes based on repetitive sequences was constructed in T. arundinaceum by repeated FISH using the barcode probes including Ea-0663, Ea-0267, EaCent, 5S rDNA, Ea-0265, Ea-0070, and 45S rDNA. We combined these probes to distinguish 37 different chromosome types, suggesting that the repetitive sequences may have different distributions on homologous chromosomes of T. arundinaceum. In summary, this method provides a basis for the development of similar applications for cytogenetic analysis in other species.
8
Lin G, He C, Zheng J, Koo DH, Le H, Zheng H, Tamang TM, Lin J, Liu Y, Zhao M, Hao Y, McFraland F, Wang B, Qin Y, Tang H, McCarty DR, Wei H, Cho MJ, Park S, Kaeppler H, Kaeppler SM, Liu Y, Springer N, Schnable PS, Wang G, White FF, Liu S. Chromosome-level genome assembly of a regenerable maize inbred line A188. Genome Biol 2021; 22:175. [PMID: 34108023 PMCID: PMC8188678 DOI: 10.1186/s13059-021-02396-x] [Citation(s) in RCA: 34] [Impact Index Per Article: 8.5] [Received: 11/21/2020] [Accepted: 05/28/2021] [Indexed: 01/08/2023]
Abstract
BACKGROUND: The maize inbred line A188 is an attractive model for elucidation of gene function and improvement due to its high embryogenic capacity and many contrasting traits to the first maize reference genome, B73, and other elite lines. The lack of a genome assembly of A188 limits its use as a model for functional studies.
RESULTS: Here, we present a chromosome-level genome assembly of A188 using long reads and optical maps. Comparison of A188 with B73 using both whole-genome alignments and read depths from sequencing reads identifies approximately 1.1 Gb of syntenic sequences as well as extensive structural variation, including a 1.8-Mb duplication containing the Gametophyte factor1 locus for unilateral cross-incompatibility, and six inversions of 0.7 Mb or greater. Increased copy number of carotenoid cleavage dioxygenase 1 (ccd1) in A188 is associated with elevated expression during seed development. High ccd1 expression in seeds together with low expression of yellow endosperm 1 (y1) reduces carotenoid accumulation, accounting for the white seed phenotype of A188. Furthermore, transcriptome and epigenome analyses reveal enhanced expression of defense pathways and altered DNA methylation patterns of the embryonic callus.
CONCLUSIONS: The A188 genome assembly provides a high-resolution sequence for a complex genome species and a foundational resource for analyses of genome variation and gene function in maize. The genome, in comparison to B73, contains extensive intra-species structural variations and other genetic differences. Expression and network analyses identify discrete profiles for embryonic callus and other tissues.
Affiliation(s)
- Guifang Lin
  - Department of Plant Pathology, Kansas State University, 4024 Throckmorton Center, Manhattan, KS, 66506-5502, USA
- Cheng He
  - Department of Plant Pathology, Kansas State University, 4024 Throckmorton Center, Manhattan, KS, 66506-5502, USA
- Jun Zheng
  - Institute of Crop Sciences, Chinese Academy of Agricultural Sciences, Beijing, 100081, China
- Dal-Hoe Koo
  - Department of Plant Pathology, Kansas State University, 4024 Throckmorton Center, Manhattan, KS, 66506-5502, USA
- Ha Le
  - Department of Plant Pathology, Kansas State University, 4024 Throckmorton Center, Manhattan, KS, 66506-5502, USA
- Huakun Zheng
  - Department of Plant Pathology, Kansas State University, 4024 Throckmorton Center, Manhattan, KS, 66506-5502, USA
- Tej Man Tamang
  - Department of Horticulture and Natural Resources, Kansas State University, Manhattan, KS, 66506-5502, USA
- Jinguang Lin
  - Department of Plant Pathology, Kansas State University, 4024 Throckmorton Center, Manhattan, KS, 66506-5502, USA
  - Present Address, Corvallis, OR, 97330, USA
- Yan Liu
  - Institute of Crop Sciences, Chinese Academy of Agricultural Sciences, Beijing, 100081, China
- Mingxia Zhao
  - Department of Plant Pathology, Kansas State University, 4024 Throckmorton Center, Manhattan, KS, 66506-5502, USA
- Yangfan Hao
  - Department of Plant Pathology, Kansas State University, 4024 Throckmorton Center, Manhattan, KS, 66506-5502, USA
- Frank McFraland
  - Department of Agronomy, University of Wisconsin-Madison, Madison, WI, 53706, USA
- Bo Wang
  - Cold Spring Harbor Laboratory, Cold Spring Harbor, NY, 11724, USA
- Yang Qin
  - Institute of Crop Sciences, Chinese Academy of Agricultural Sciences, Beijing, 100081, China
- Haibao Tang
  - Center for Genomics and Biotechnology and Fujian Provincial Key Laboratory of Haixia Applied Plant Systems Biology, Fujian Agriculture and Forestry University, Fuzhou, 350002, Fujian, China
- Donald R McCarty
  - Department of Horticulture, University of Florida, Gainesville, FL, 32611-0680, USA
- Hairong Wei
  - College of Forest Resources and Environmental Science, Michigan Technological University, Houghton, MI, 49931, USA
- Myeong-Je Cho
  - Innovative Genomics Institute, University of California-Berkeley, Sunnyvale, CA, 94704, USA
- Sunghun Park
  - Department of Horticulture and Natural Resources, Kansas State University, Manhattan, KS, 66506-5502, USA
- Heidi Kaeppler
  - Department of Agronomy, University of Wisconsin-Madison, Madison, WI, 53706, USA
- Shawn M Kaeppler
  - Department of Agronomy, University of Wisconsin-Madison, Madison, WI, 53706, USA
- Yunjun Liu
  - Institute of Crop Sciences, Chinese Academy of Agricultural Sciences, Beijing, 100081, China
- Nathan Springer
  - Department of Plant Biology, University of Minnesota, Saint Paul, MN, 55108, USA
- Patrick S Schnable
  - Department of Agronomy, Iowa State University, Ames, IA, 50011-3605, USA
- Guoying Wang
  - Institute of Crop Sciences, Chinese Academy of Agricultural Sciences, Beijing, 100081, China
- Frank F White
  - Department of Plant Pathology, University of Florida, Gainesville, FL, 32611-0680, USA
- Sanzhen Liu
  - Department of Plant Pathology, Kansas State University, 4024 Throckmorton Center, Manhattan, KS, 66506-5502, USA
9
He C, Lin G, Wei H, Tang H, White FF, Valent B, Liu S. Factorial estimating assembly base errors using k-mer abundance difference (KAD) between short reads and genome assembled sequences. NAR Genom Bioinform 2020; 2:lqaa075. [PMID: 33575622 PMCID: PMC7671381 DOI: 10.1093/nargab/lqaa075] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Received: 04/18/2020] [Revised: 08/02/2020] [Accepted: 09/01/2020] [Indexed: 12/25/2022]
Abstract
Genome sequences provide genomic maps with a single-base resolution for exploring genetic contents. Sequencing technologies, particularly long reads, have revolutionized genome assemblies for producing highly continuous genome sequences. However, current long-read sequencing technologies generate inaccurate reads that contain many errors. Some errors are retained in assembled sequences, which are typically not completely corrected by using either long reads or more accurate short reads. The issue commonly exists, but few tools are dedicated to computing error rates or determining error locations. In this study, we developed a novel approach, referred to as k-mer abundance difference (KAD), to compare the inferred copy number of each k-mer indicated by short reads and the observed copy number in the assembly. Simple KAD metrics make it possible to classify k-mers into categories that reflect the quality of the assembly. Specifically, the KAD method can be used to identify base errors and estimate the overall error rate. In addition, sequence insertion and deletion as well as sequence redundancy can also be detected. Collectively, KAD is valuable for quality evaluation of genome assemblies and, potentially, provides a diagnostic tool to aid in precise error correction. KAD software has been developed to facilitate public use.
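To make the comparison concrete, the sketch below scores each k-mer with a log-ratio between the copy number implied by short-read counts and the copy number observed in the assembly. The specific expression, log2((c + m) / (m * (n + 1))), where c is the read count, n the assembly copy number, and m the most common read count among k-mers that are single-copy in the assembly, is an assumption used here for illustration; the abstract does not give the formula, so the exact definition should be taken from the paper or the KAD software. The function name and the toy counts are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, Mapping, Tuple


def kad_like_scores(read_counts: Mapping[str, int],
                    asm_counts: Mapping[str, int]) -> Tuple[int, Dict[str, float]]:
    """Score agreement between read-inferred and assembly copy number per k-mer.

    Returns (m, scores). A score near 0 means the two agree, a negative score
    means the assembly holds more copies than the reads support, and a
    positive score means the reads support more copies than the assembly holds.
    """
    # Read depths of k-mers that occur exactly once in the assembly.
    single_copy_depths = [
        read_counts.get(kmer, 0)
        for kmer, n in asm_counts.items()
        if n == 1 and read_counts.get(kmer, 0) > 0
    ]
    if not single_copy_depths:
        raise ValueError("no single-copy assembly k-mers with read support")
    # m: the most common read depth among those single-copy k-mers.
    m = Counter(single_copy_depths).most_common(1)[0][0]

    scores: Dict[str, float] = {}
    for kmer in set(read_counts) | set(asm_counts):
        c = read_counts.get(kmer, 0)
        n = asm_counts.get(kmer, 0)
        scores[kmer] = math.log2((c + m) / (m * (n + 1)))
    return m, scores


# Hypothetical counts: GGG is in the assembly only, TTT is in the reads only.
read_counts = {"AAA": 10, "AAC": 10, "ACG": 20, "TTT": 10}
asm_counts = {"AAA": 1, "AAC": 1, "ACG": 2, "GGG": 1}
m, scores = kad_like_scores(read_counts, asm_counts)
print(m, scores)
```

Under this form, a k-mer present in the assembly but unsupported by reads scores -1 (a candidate base error), while a k-mer seen at single-copy depth in reads but missing from the assembly scores +1.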
Affiliation(s)
- Cheng He
  - Department of Plant Pathology, Kansas State University, 4024 Throckmorton Center, Manhattan, KS 66506-5502, USA
- Guifang Lin
  - Department of Plant Pathology, Kansas State University, 4024 Throckmorton Center, Manhattan, KS 66506-5502, USA
- Hairong Wei
  - College of Forest Resources and Environmental Science, Michigan Technological University, Houghton, MI 49931, USA
- Haibao Tang
  - Center for Genomics and Biotechnology and Fujian Provincial Key Laboratory of Haixia Applied Plant Systems Biology, Fujian Agriculture and Forestry University, Fujian 350002, China
- Frank F White
  - Department of Plant Pathology, University of Florida, Gainesville, FL 32611-0680, USA
- Barbara Valent
  - Department of Plant Pathology, Kansas State University, 4024 Throckmorton Center, Manhattan, KS 66506-5502, USA
- Sanzhen Liu
  - Department of Plant Pathology, Kansas State University, 4024 Throckmorton Center, Manhattan, KS 66506-5502, USA
10
Beier S, Ulpinnis C, Schwalbe M, Münch T, Hoffie R, Koeppel I, Hertig C, Budhagatapalli N, Hiekel S, Pathi KM, Hensel G, Grosse M, Chamas S, Gerasimova S, Kumlehn J, Scholz U, Schmutzer T. Kmasker plants - a tool for assessing complex sequence space in plant species. THE PLANT JOURNAL : FOR CELL AND MOLECULAR BIOLOGY 2020; 102:631-642. [PMID: 31823436 DOI: 10.1111/tpj.14645] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Received: 06/25/2019] [Revised: 11/27/2019] [Accepted: 11/28/2019] [Indexed: 06/10/2023]
Abstract
Many plant genomes display high levels of repetitive sequences. The assembly of these complex genomes using short high-throughput sequence reads is still a challenging task. Underestimation or disregard of repeat complexity in these datasets can easily misguide downstream analysis. Detection of repetitive regions by k-mer counting methods has proved to be reliable. Easy-to-use applications utilizing k-mer counting are in high demand, especially in the domain of plants. We present Kmasker plants, a tool that uses k-mer count information as an assistant throughout the analytical workflow of genome data and that is provided as a command-line and web-based solution. Besides its core competence to screen and mask repetitive sequences, we have integrated features that enable comparative studies between different cultivars or closely related species and methods that estimate target specificity of guide RNAs for application of site-directed mutagenesis using Cas9 endonuclease. In addition, we have set up a web service for Kmasker plants that maintains pre-computed indices for 10 of the economically most important cultivated plants. Source code for Kmasker plants has been made publicly available at https://github.com/tschmutzer/kmasker. The web service is accessible at https://kmasker.ipk-gatersleben.de.
Affiliation(s)
- Sebastian Beier
  - Leibniz Institute of Plant Genetics and Crop Plant Research (IPK) Gatersleben, 06466, Seeland, Germany
- Chris Ulpinnis
  - Leibniz Institute of Plant Biochemistry, Bioinformatics and Scientific Data, 06120, Halle, Germany
- Markus Schwalbe
  - Leibniz Institute of Plant Genetics and Crop Plant Research (IPK) Gatersleben, 06466, Seeland, Germany
- Thomas Münch
  - Leibniz Institute of Plant Genetics and Crop Plant Research (IPK) Gatersleben, 06466, Seeland, Germany
- Robert Hoffie
  - Leibniz Institute of Plant Genetics and Crop Plant Research (IPK) Gatersleben, 06466, Seeland, Germany
- Iris Koeppel
  - Leibniz Institute of Plant Genetics and Crop Plant Research (IPK) Gatersleben, 06466, Seeland, Germany
- Christian Hertig
  - Leibniz Institute of Plant Genetics and Crop Plant Research (IPK) Gatersleben, 06466, Seeland, Germany
- Nagaveni Budhagatapalli
  - Leibniz Institute of Plant Genetics and Crop Plant Research (IPK) Gatersleben, 06466, Seeland, Germany
- Stefan Hiekel
  - Leibniz Institute of Plant Genetics and Crop Plant Research (IPK) Gatersleben, 06466, Seeland, Germany
- Krishna M Pathi
  - Leibniz Institute of Plant Genetics and Crop Plant Research (IPK) Gatersleben, 06466, Seeland, Germany
- Goetz Hensel
  - Leibniz Institute of Plant Genetics and Crop Plant Research (IPK) Gatersleben, 06466, Seeland, Germany
- Martin Grosse
  - Leibniz Institute of Plant Genetics and Crop Plant Research (IPK) Gatersleben, 06466, Seeland, Germany
- Sindy Chamas
  - Leibniz Institute of Plant Genetics and Crop Plant Research (IPK) Gatersleben, 06466, Seeland, Germany
- Sophia Gerasimova
  - Leibniz Institute of Plant Genetics and Crop Plant Research (IPK) Gatersleben, 06466, Seeland, Germany
- Jochen Kumlehn
  - Leibniz Institute of Plant Genetics and Crop Plant Research (IPK) Gatersleben, 06466, Seeland, Germany
- Uwe Scholz
  - Leibniz Institute of Plant Genetics and Crop Plant Research (IPK) Gatersleben, 06466, Seeland, Germany
- Thomas Schmutzer
  - Department of Natural Sciences III, Institute for Agricultural and Nutritional Sciences, Martin Luther University Halle-Wittenberg, 06120, Halle, Germany
Collapse
|
11
|
On the Close Relatedness of Two Rice-Parasitic Root-Knot Nematode Species and the Recent Expansion of Meloidogyne graminicola in Southeast Asia. Genes (Basel) 2019; 10:genes10020175. [PMID: 30823612 PMCID: PMC6410229 DOI: 10.3390/genes10020175] [Citation(s) in RCA: 12] [Impact Index Per Article: 2.0] [Received: 12/26/2018] [Revised: 02/13/2019] [Accepted: 02/20/2019] [Indexed: 12/20/2022] Open
Abstract
Meloidogyne graminicola is a facultative meiotic parthenogenetic root-knot nematode (RKN) that seriously threatens agriculture worldwide. We have little understanding of its origin, genomic structure, and intraspecific diversity. Such information would offer better knowledge of how this nematode successfully damages rice in many different environments. Previous studies on nuclear ribosomal DNA (nrDNA) suggested a close phylogenetic relationship between M. graminicola and Meloidogyne oryzae, despite their different modes of reproduction and geographical distribution. In order to clarify the evolutionary history of these two species and explore their molecular intraspecific diversity, we sequenced the genomes of 12 M. graminicola isolates, representing populations of worldwide origins, and two South American isolates of M. oryzae. k-mer analysis of their nuclear genomes and the detection of divergent homologous genomic sequences indicate that both species show a high proportion of heterozygous sites (ca. 1–2%), which had never been previously reported in facultative meiotic parthenogenetic RKNs. These analyses also point to a distinct ploidy level in each species, compatible with a diploid M. graminicola and a triploid M. oryzae. Phylogenetic analyses of mitochondrial genomes and three nuclear genomic sequences confirm close relationships between these two species, with M. graminicola being a putative parent of M. oryzae. In addition, comparative mitogenomics of those 12 M. graminicola isolates with a published Chinese isolate reveals only 15 polymorphisms that are phylogenetically non-informative. Eight mitotypes are distinguished, the most common one being shared by distant populations from Asia and America. This low intraspecific diversity, coupled with a lack of phylogeographic signal, suggests a recent worldwide expansion of M. graminicola.
|
12
|
Hoang PNT, Michael TP, Gilbert S, Chu P, Motley ST, Appenroth KJ, Schubert I, Lam E. Generating a high-confidence reference genome map of the Greater Duckweed by integration of cytogenomic, optical mapping, and Oxford Nanopore technologies. THE PLANT JOURNAL : FOR CELL AND MOLECULAR BIOLOGY 2018; 96:670-684. [PMID: 30054939 DOI: 10.1111/tpj.14049] [Citation(s) in RCA: 45] [Impact Index Per Article: 6.4] [Received: 02/16/2018] [Revised: 06/29/2018] [Accepted: 07/06/2018] [Indexed: 06/08/2023]
Abstract
Duckweeds are the fastest growing angiosperms and have the potential to become a new generation of sustainable crops. Although a seed plant, Spirodela polyrhiza clones rarely flower and multiply mainly through vegetative propagation. Whole-genome sequencing using different approaches and clones yielded two reference maps: one for clone 9509, supported in its assembly by optical mapping of single DNA molecules, and one for clone 7498, supported by cytogenetic assignment of 96 fingerprinted bacterial artificial chromosomes (BACs) to its 20 chromosomes. However, these maps differ in the composition of several individual chromosome models. We validated both maps further to resolve these differences and addressed whether they could be due to chromosome rearrangements in different clones. For this purpose, we applied sequential multicolor fluorescence in situ hybridization (mcFISH) to seven S. polyrhiza clones, using 106 BACs that were mapped onto the 39 pseudomolecules for clone 7498. Furthermore, we integrated high-depth Oxford Nanopore (ON) sequence data for clone 9509 to validate and revise the previously assembled chromosome models. We found no major structural rearrangements between these seven clones, and identified seven chimeric pseudomolecules and Illumina assembly errors in the previous maps. A new S. polyrhiza genome map with high contiguity was produced with the ON sequence data, and genome-wide synteny analysis supported the occurrence of two whole-genome duplication events during its evolution. This work generated a high-confidence genome map for S. polyrhiza at the chromosome scale, and illustrates the complementarity of independent approaches to produce whole-genome assemblies in the absence of a genetic map.
Affiliation(s)
- Phuong N T Hoang
  - Leibniz Institute of Plant Genetics and Crop Plant Research (IPK), Gatersleben, Stadt Seeland, D-06466, Germany
  - Dalat University, Lamdong Province, Vietnam
- Sarah Gilbert
  - Department of Plant Biology, Rutgers the State University of New Jersey, New Brunswick, NJ, 08901, USA
- Philomena Chu
  - Department of Plant Biology, Rutgers the State University of New Jersey, New Brunswick, NJ, 08901, USA
- Klaus J Appenroth
  - Department of Plant Physiology, Matthias-Schleiden-Institute, Friedrich-Schiller-University of Jena, Jena, D-07743, Germany
- Ingo Schubert
  - Leibniz Institute of Plant Genetics and Crop Plant Research (IPK), Gatersleben, Stadt Seeland, D-06466, Germany
- Eric Lam
  - Department of Plant Biology, Rutgers the State University of New Jersey, New Brunswick, NJ, 08901, USA
|
13
|
Hu Y, Ren J, Peng Z, Umana AA, Le H, Danilova T, Fu J, Wang H, Robertson A, Hulbert SH, White FF, Liu S. Analysis of Extreme Phenotype Bulk Copy Number Variation (XP-CNV) Identified the Association of rp1 with Resistance to Goss's Wilt of Maize. FRONTIERS IN PLANT SCIENCE 2018; 9:110. [PMID: 29479358 PMCID: PMC5812337 DOI: 10.3389/fpls.2018.00110] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.9] [Received: 11/02/2017] [Accepted: 01/19/2018] [Indexed: 05/19/2023]
Abstract
Goss's wilt (GW) of maize is caused by the Gram-positive bacterium Clavibacter michiganensis subsp. nebraskensis (Cmn) and has spread in recent years throughout the Great Plains, posing a threat to production. The genetic basis of plant resistance is unknown. Here, a simple method for quantifying disease symptoms was developed and used to select cohorts of highly resistant and highly susceptible lines known as extreme phenotypes (XP). Copy number variation (CNV) analyses using whole genome sequences of bulked XP revealed 141 genes containing CNV between the two XP groups. The CNV genes include the previously identified common rust resistance locus rp1. Multiple Rp1 accessions with distinct rp1 haplotypes in an otherwise susceptible accession exhibited hypersensitive responses upon inoculation. GW provides an excellent system for the genetic dissection of diseases caused by closely related subspecies of C. michiganensis. Further work will facilitate breeding strategies to control GW and provide needed insight into the resistance mechanism of important related diseases such as bacterial canker of tomato and bacterial ring rot of potato.
Affiliation(s)
- Ying Hu
  - Department of Plant Pathology, Kansas State University, Manhattan, KS, United States
- Jie Ren
  - Department of Plant Pathology, Kansas State University, Manhattan, KS, United States
- Zhao Peng
  - Department of Plant Pathology, University of Florida, Gainesville, FL, United States
- Arnoldo A. Umana
  - Department of Plant Pathology, Kansas State University, Manhattan, KS, United States
- Ha Le
  - Department of Plant Pathology, Kansas State University, Manhattan, KS, United States
- Tatiana Danilova
  - Department of Plant Pathology, Kansas State University, Manhattan, KS, United States
- Junjie Fu
  - Institute of Crop Science, Chinese Academy of Agricultural Sciences, Beijing, China
- Haiyan Wang
  - Department of Statistics, Kansas State University, Manhattan, KS, United States
- Alison Robertson
  - Department of Plant Pathology and Microbiology, Iowa State University, Ames, IA, United States
- Scot H. Hulbert
  - Department of Plant Pathology, Washington State University, Pullman, WA, United States
- Frank F. White
  - Department of Plant Pathology, University of Florida, Gainesville, FL, United States
- Sanzhen Liu
  - Department of Plant Pathology, Kansas State University, Manhattan, KS, United States
|
14
|
|