1
van Westerhoven A, Fokkens L, Wissink K, Kema GJ, Rep M, Seidl M. Reference-free identification and pangenome analysis of accessory chromosomes in a major fungal plant pathogen. NAR Genom Bioinform 2025; 7:lqaf034. [PMID: 40176926 PMCID: PMC11963757 DOI: 10.1093/nargab/lqaf034] [Received: 12/12/2024] [Revised: 02/19/2025] [Accepted: 03/14/2025] [Indexed: 04/05/2025] Open
Abstract
Accessory chromosomes, found in some but not all individuals of a species, play an important role in pathogenicity and host specificity in fungal plant pathogens. However, their variability complicates reference-based analysis, especially when these chromosomes are missing in the reference genome. Pangenome variation graphs offer a reference-free alternative for studying these chromosomes. Here, we constructed a pangenome variation graph for 73 diverse Fusarium oxysporum genomes, a major fungal plant pathogen with a compartmentalized genome that includes conserved core as well as variable accessory chromosomes. To obtain insights into accessory chromosome dynamics, we first constructed a chromosome similarity network using all-vs-all similarity mapping. We identified eleven core chromosomes conserved across all strains and a substantial number of highly variable accessory chromosomes. Some of these accessory chromosomes are host-specific and likely play a role in determining host range. Using a k-mer based approach, we further identified the presence of these accessory chromosomes in all available (581) F. oxysporum assemblies and corroborated the occurrence of host-specific accessory chromosomes. To further analyze the evolution of chromosomes in F. oxysporum, we constructed a pangenome variation graph per group of homologous chromosomes. This reveals that accessory chromosomes are composed of different stretches of accessory regions, and possibly rearrangements between accessory regions gave rise to these mosaic accessory chromosomes. Furthermore, we show that accessory chromosomes are likely horizontally transferred in natural populations. Our findings demonstrate that a pangenome variation graph is a powerful approach to elucidate the evolutionary dynamics of accessory chromosomes in F. oxysporum, which is not only a useful resource for Fusarium but also provides a framework for similar analyses in other species containing accessory chromosomes.
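The abstract above mentions a k-mer based approach for detecting accessory chromosomes in assemblies but does not spell out the method. As a rough illustration only, one common way to test whether a sequence is present in an assembly is k-mer containment: the fraction of the query's k-mers that also occur in the assembly. The function names, the value of `k`, and the `threshold` below are assumptions chosen for this sketch, not the authors' implementation or parameters.

```python
from typing import Dict, Set

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an uppercase ACGT sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def canonical_kmers(seq: str, k: int) -> Set[str]:
    """Collect canonical k-mers (the lexicographic minimum of a k-mer and its
    reverse complement), skipping windows with non-ACGT characters."""
    seq = seq.upper()
    result: Set[str] = set()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) <= set("ACGT"):
            result.add(min(kmer, reverse_complement(kmer)))
    return result


def containment(query_seq: str, assembly_seq: str, k: int) -> float:
    """Fraction of the query's canonical k-mers that also occur in the assembly."""
    query = canonical_kmers(query_seq, k)
    if not query:
        return 0.0
    assembly = canonical_kmers(assembly_seq, k)
    return len(query & assembly) / len(query)


def detect_presence(chromosomes: Dict[str, str], assembly_seq: str,
                    k: int = 21, threshold: float = 0.8) -> Dict[str, bool]:
    """Flag each chromosome as present when its containment in the assembly
    reaches the (hypothetical) threshold."""
    return {name: containment(seq, assembly_seq, k) >= threshold
            for name, seq in chromosomes.items()}
```

Using canonical k-mers makes the comparison independent of strand, so a chromosome assembled in reverse orientation still counts as present. Real genomes would call for a larger `k` and a memory-efficient k-mer index rather than Python sets.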
Affiliation(s)
- Anouk C van Westerhoven
- Theoretical Biology and Bioinformatics, Utrecht University, Padualaan 8, 3583CH, Utrecht, the Netherlands
- Laboratory of Phytopathology, Wageningen University & Research, Droevendaalsesteeg 1, 6708PB, Wageningen, the Netherlands
- Like Fokkens
- Laboratory of Phytopathology, Wageningen University & Research, Droevendaalsesteeg 1, 6708PB, Wageningen, the Netherlands
- Kyran Wissink
- Theoretical Biology and Bioinformatics, Utrecht University, Padualaan 8, 3583CH, Utrecht, the Netherlands
- Gert H J Kema
- Laboratory of Phytopathology, Wageningen University & Research, Droevendaalsesteeg 1, 6708PB, Wageningen, the Netherlands
- Martijn Rep
- Molecular Plant Pathology, Swammerdam Institute of Life Sciences, University of Amsterdam, 1090GE, Amsterdam, the Netherlands
- Michael F Seidl
- Theoretical Biology and Bioinformatics, Utrecht University, Padualaan 8, 3583CH, Utrecht, the Netherlands
2
Cicherski A, Lisiecka A, Dojer N. AlfaPang: alignment free algorithm for pangenome graph construction. Algorithms Mol Biol 2025; 20:7. [PMID: 40375333 DOI: 10.1186/s13015-025-00277-7] [Received: 10/31/2024] [Accepted: 04/09/2025] [Indexed: 05/18/2025] Open
Abstract
The success of pangenome-based approaches to genomics analysis depends largely on the existence of efficient methods for constructing pangenome graphs that are applicable to large genome collections. In the current paper we present AlfaPang, a new pangenome graph building algorithm. AlfaPang is based on a novel alignment-free approach that allows pangenome graphs to be constructed using significantly fewer computational resources than state-of-the-art tools. The code of AlfaPang is freely available at https://github.com/AdamCicherski/AlfaPang.
Affiliation(s)
- Adam Cicherski
- Institute of Informatics, University of Warsaw, Banacha 2, 02-097, Warsaw, Poland.
- Anna Lisiecka
- Institute of Informatics, University of Warsaw, Banacha 2, 02-097, Warsaw, Poland.
- Norbert Dojer
- Institute of Informatics, University of Warsaw, Banacha 2, 02-097, Warsaw, Poland.
3
Quah FX, Almeida MV, Blumer M, Yuan CU, Fischer B, See K, Jackson B, Zatha R, Rusuwa B, Turner GF, Santos ME, Svardal H, Hemberg M, Durbin R, Miska E. Lake Malawi cichlid pangenome graph reveals extensive structural variation driven by transposable elements. Genome Res 2025; 35:1094-1107. [PMID: 40210437 PMCID: PMC12047535 DOI: 10.1101/gr.279674.124] [Received: 06/16/2024] [Accepted: 02/06/2025] [Indexed: 04/12/2025]
Abstract
Pangenome methods have the potential to uncover hitherto undiscovered sequences missing from established reference genomes, making them useful to study evolutionary and speciation processes in diverse organisms. The cichlid fishes of the East African Rift Lakes represent one of nature's most phenotypically diverse vertebrate radiations, but single-nucleotide polymorphism (SNP)-based studies have revealed little sequence difference, with 0.1%-0.25% pairwise divergence between Lake Malawi species. These were based on aligning short reads to a single linear reference genome and ignored the contribution of larger-scale structural variants (SVs). We constructed a pangenome graph that integrates six new and two existing long-read genome assemblies of Lake Malawi haplochromine cichlids. This graph intuitively represents complex and nested variation between the genomes and reveals that the SV landscape is dominated by large insertions, many exclusive to individual assemblies. The graph incorporates a substantial amount of extra sequence across seven species, the total size of which is 33.1% longer than that of a single cichlid genome. Approximately 4.73% to 9.86% of the assembly lengths are estimated as interspecies structural variation between cichlids, suggesting substantial genomic diversity underappreciated in SNP studies. Although coding regions remain highly conserved, our analysis uncovers a significant proportion of SV sequences as transposable element (TE) insertions, especially DNA, LINE, and LTR TEs. These findings underscore that the cichlid genome is shaped both by small-nucleotide mutations and large, TE-derived sequence alterations, both of which merit study to understand their interplay in cichlid evolution.
Affiliation(s)
- Fu Xiang Quah
- Department of Biochemistry, University of Cambridge, Cambridge CB2 1GA, United Kingdom;
- Department of Genetics, University of Cambridge, Cambridge CB2 3EH, United Kingdom
- Moritz Blumer
- Department of Genetics, University of Cambridge, Cambridge CB2 3EH, United Kingdom
- Chengwei Ulrika Yuan
- Department of Biochemistry, University of Cambridge, Cambridge CB2 1GA, United Kingdom
- Department of Genetics, University of Cambridge, Cambridge CB2 3EH, United Kingdom
- Bettina Fischer
- Department of Genetics, University of Cambridge, Cambridge CB2 3EH, United Kingdom
- Kirsten See
- Department of Biochemistry, University of Cambridge, Cambridge CB2 1GA, United Kingdom
- Ben Jackson
- Department of Genetics, University of Cambridge, Cambridge CB2 3EH, United Kingdom
- Richard Zatha
- Department of Biological Sciences, University of Malawi, P.O. Box 280, Zomba, Malawi
- Bosco Rusuwa
- Department of Biological Sciences, University of Malawi, P.O. Box 280, Zomba, Malawi
- George F Turner
- School of Environmental and Natural Sciences, Bangor University, Bangor, Gwynedd LL57 2TH, United Kingdom
- M Emília Santos
- Department of Zoology, University of Cambridge, Cambridge CB2 3EJ, United Kingdom
- Hannes Svardal
- Department of Biology, University of Antwerp, 2610 Wilrijk, Belgium
- Martin Hemberg
- The Gene Lay Institute of Immunology and Inflammation, Brigham and Women's Hospital and Harvard Medical School, Boston, Massachusetts 02115, USA
- Richard Durbin
- Department of Genetics, University of Cambridge, Cambridge CB2 3EH, United Kingdom
- Eric Miska
- Department of Biochemistry, University of Cambridge, Cambridge CB2 1GA, United Kingdom;
- Department of Genetics, University of Cambridge, Cambridge CB2 3EH, United Kingdom
4
Abstract
A single reference genome does not fully capture species diversity. By contrast, a pangenome incorporates multiple genomes to capture the entire set of nonredundant genes in a given species, along with its genome diversity. New sequencing technologies enable researchers to produce multiple high-quality genome sequences and catalog diverse genetic variations with better precision. Pangenomic studies have detected structural variants in plant genomes, dissected the genetic architecture of agronomic traits, and helped unravel molecular underpinnings and evolutionary origins of plant phenotypes. The pangenome concept has further evolved into a so-called super-pangenome that includes wild relatives within a genus or clade and shifted to graph-based reference systems. Nevertheless, building pangenomes and representing complex structural variants remain challenging in many crops. Standardized computing pipelines and common data structures are needed to compare and interpret pangenomes. The growing body of plant pangenomics data requires new algorithms, huge data storage capacity, and training to help researchers and breeders take advantage of newly discovered genes and genetic variants.
Affiliation(s)
- Murukarthick Jayakodi
- Department of Soil and Crop Sciences, Texas A&M University, College Station, Texas, USA;
- Texas A&M AgriLife Research Center at Dallas, Texas A&M University System, Dallas, Texas, USA
- Hyeonah Shim
- Leibniz Institute of Plant Genetics and Crop Plant Research (IPK), Gatersleben, Seeland, Germany
- Martin Mascher
- German Centre for Integrative Biodiversity Research (iDiv), Halle-Jena-Leipzig, Leipzig, Germany;
- Leibniz Institute of Plant Genetics and Crop Plant Research (IPK), Gatersleben, Seeland, Germany
5
Mahmoud M, Agustinho DP, Sedlazeck FJ. A Hitchhiker's Guide to long-read genomic analysis. Genome Res 2025; 35:545-558. [PMID: 40228901 PMCID: PMC12047252 DOI: 10.1101/gr.279975.124] [Indexed: 04/16/2025]
Abstract
Over the past decade, long-read sequencing has evolved into a pivotal technology for uncovering the hidden and complex regions of the genome. Significant cost efficiency, scalability, and accuracy advancements have driven this evolution. Concurrently, novel analytical methods have emerged to harness the full potential of long reads. These advancements have enabled milestones such as the first fully completed human genome, enhanced identification and understanding of complex genomic variants, and deeper insights into the interplay between epigenetics and genomic variation. This mini-review provides a comprehensive overview of the latest developments in long-read DNA sequencing analysis, encompassing reference-based and de novo assembly approaches. We explore the entire workflow, from initial data processing to variant calling and annotation, focusing on how these methods improve our ability to interpret a wide array of genomic variants. Additionally, we discuss the current challenges, limitations, and future directions in the field, offering a detailed examination of the state-of-the-art bioinformatics methods for long-read sequencing.
Affiliation(s)
- Medhat Mahmoud
- Human Genome Sequencing Center, Baylor College of Medicine, Houston, Texas 77030, USA
- Daniel P Agustinho
- Human Genome Sequencing Center, Baylor College of Medicine, Houston, Texas 77030, USA
- Fritz J Sedlazeck
- Human Genome Sequencing Center, Baylor College of Medicine, Houston, Texas 77030, USA;
- Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, Texas 77030, USA
- Department of Computer Science, Rice University, Houston, Texas 77005, USA
6
Milia S, Leonard AS, Mapel XM, Bernal Ulloa SM, Drögemüller C, Pausch H. Taurine pangenome uncovers a segmental duplication upstream of KIT associated with depigmentation in white-headed cattle. Genome Res 2025; 35:1041-1052. [PMID: 39694857 PMCID: PMC12047182 DOI: 10.1101/gr.279064.124] [Received: 02/02/2024] [Accepted: 12/02/2024] [Indexed: 12/20/2024]
Abstract
Cattle have been selectively bred for coat color, spotting, and depigmentation patterns. The assumed autosomal dominant inherited genetic variants underlying the characteristic white head of Fleckvieh, Simmental, and Hereford cattle have not been identified yet, although the contribution of structural variation upstream of the KIT gene has been proposed. Here, we construct a graph pangenome from 24 haplotype assemblies representing seven taurine cattle breeds to identify and characterize the white-head-associated locus for the first time based on long-read sequencing data and pangenome analyses. We introduce a pangenome-wide association mapping approach that examines assembly path similarities within the graph to reveal an association between two most likely serial alleles of a complex structural variant (SV) 66 kb upstream of KIT and facial depigmentation. The complex SV contains a variable number of tandemly duplicated 14.3 kb repeats, consisting of LTRs, LINEs, and other repetitive elements, leading to misleading alignments of short and long reads when using a linear reference. We align 250 short-read sequencing samples spanning 15 cattle breeds to the pangenome graph, further validating that the alleles of the SV segregate with head depigmentation. We estimate an increased count of repeats in Hereford relative to Simmental and other white-headed cattle breeds from the graph alignment coverage, suggesting a large under-assembly in the current Hereford-based cattle reference genome, which had fewer copies. Our work shows that exploiting assembly path similarities within graph pangenomes can reveal trait-associated complex SVs.
Affiliation(s)
- Sotiria Milia
- Animal Genomics, ETH Zurich, Zurich 8092, Switzerland
- Cord Drögemüller
- Institute of Genetics, Vetsuisse Faculty, University of Bern, Bern 3012, Switzerland
- Hubert Pausch
- Animal Genomics, ETH Zurich, Zurich 8092, Switzerland;
7
Cheng L, Wang N, Bao Z, Zhou Q, Guarracino A, Yang Y, Wang P, Zhang Z, Tang D, Zhang P, Wu Y, Zhou Y, Zheng Y, Hu Y, Lian Q, Ma Z, Lassois L, Zhang C, Lucas WJ, Garrison E, Stein N, Städler T, Zhou Y, Huang S. Leveraging a phased pangenome for haplotype design of hybrid potato. Nature 2025; 640:408-417. [PMID: 39843749 PMCID: PMC11981936 DOI: 10.1038/s41586-024-08476-9] [Received: 01/04/2024] [Accepted: 12/02/2024] [Indexed: 01/24/2025]
Abstract
The tetraploid genome and clonal propagation of the cultivated potato (Solanum tuberosum L.) [1,2] dictate a slow, non-accumulative breeding mode of the most important tuber crop. Transitioning potato breeding to a seed-propagated hybrid system based on diploid inbred lines has the potential to greatly accelerate its improvement [3]. Crucially, the development of inbred lines is impeded by manifold deleterious variants; explaining their nature and finding ways to eliminate them is the current focus of hybrid potato research [4-10]. However, most published diploid potato genomes are unphased, concealing crucial information on haplotype diversity and heterozygosity [11-13]. Here we develop a phased potato pangenome graph of 60 haplotypes from cultivated diploids and the ancestral wild species, and find evidence for the prevalence of transposable elements in generating structural variants. Compared with the linear reference, the graph pangenome represents a broader diversity (3,076 Mb versus 742 Mb). Notably, we observe enhanced heterozygosity in cultivated diploids compared with wild ones (14.0% versus 9.5%), indicating extensive hybridization during potato domestication. Using conservative criteria, we identify 19,625 putatively deleterious structural variants (dSVs) and reveal a biased accumulation of deleterious single nucleotide polymorphisms (dSNPs) around dSVs in coupling phase. Based on the graph pangenome, we computationally design ideal potato haplotypes with minimal dSNPs and dSVs. These advances provide critical insights into the genomic basis of clonal propagation and will guide breeders to develop a suite of promising inbred lines.
Affiliation(s)
- Lin Cheng
- National Key Laboratory of Tropical Crop Breeding, Shenzhen Branch, Guangdong Laboratory of Lingnan Modern Agriculture, Genome Analysis Laboratory of the Ministry of Agriculture and Rural Affairs, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Shenzhen, China
- Plant Genetics and Rhizosphere Processes Laboratory, TERRA Teaching and Research Center, Gembloux Agro-Bio Tech, University of Liège, Gembloux, Belgium
- Nan Wang
- National Key Laboratory of Tropical Crop Breeding, Shenzhen Branch, Guangdong Laboratory of Lingnan Modern Agriculture, Genome Analysis Laboratory of the Ministry of Agriculture and Rural Affairs, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Shenzhen, China
- National Key Laboratory of Tropical Crop Breeding, Chinese Academy of Tropical Agricultural Sciences, Haikou, China
- Zhigui Bao
- National Key Laboratory of Tropical Crop Breeding, Shenzhen Branch, Guangdong Laboratory of Lingnan Modern Agriculture, Genome Analysis Laboratory of the Ministry of Agriculture and Rural Affairs, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Shenzhen, China
- Department of Molecular Biology, Max Planck Institute for Biology Tübingen, Tübingen, Germany
- Qian Zhou
- School of Agriculture and Biotechnology, Sun Yat-Sen University, Shenzhen, China
- Andrea Guarracino
- Department of Genetics, Genomics and Informatics, University of Tennessee Health Science Center, Memphis, TN, USA
- Yuting Yang
- National Key Laboratory of Tropical Crop Breeding, Shenzhen Branch, Guangdong Laboratory of Lingnan Modern Agriculture, Genome Analysis Laboratory of the Ministry of Agriculture and Rural Affairs, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Shenzhen, China
- Pei Wang
- National Key Laboratory of Tropical Crop Breeding, Shenzhen Branch, Guangdong Laboratory of Lingnan Modern Agriculture, Genome Analysis Laboratory of the Ministry of Agriculture and Rural Affairs, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Shenzhen, China
- Zhiyang Zhang
- National Key Laboratory of Tropical Crop Breeding, Shenzhen Branch, Guangdong Laboratory of Lingnan Modern Agriculture, Genome Analysis Laboratory of the Ministry of Agriculture and Rural Affairs, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Shenzhen, China
- Dié Tang
- National Key Laboratory of Tropical Crop Breeding, Shenzhen Branch, Guangdong Laboratory of Lingnan Modern Agriculture, Genome Analysis Laboratory of the Ministry of Agriculture and Rural Affairs, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Shenzhen, China
- Department of Genetics, Yale University School of Medicine, New Haven, CT, USA
- Pingxian Zhang
- National Key Laboratory of Tropical Crop Breeding, Shenzhen Branch, Guangdong Laboratory of Lingnan Modern Agriculture, Genome Analysis Laboratory of the Ministry of Agriculture and Rural Affairs, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Shenzhen, China
- Yaoyao Wu
- National Key Laboratory of Tropical Crop Breeding, Shenzhen Branch, Guangdong Laboratory of Lingnan Modern Agriculture, Genome Analysis Laboratory of the Ministry of Agriculture and Rural Affairs, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Shenzhen, China
- College of Horticulture, Nanjing Agricultural University, Nanjing, China
- Yao Zhou
- National Key Laboratory of Tropical Crop Breeding, Shenzhen Branch, Guangdong Laboratory of Lingnan Modern Agriculture, Genome Analysis Laboratory of the Ministry of Agriculture and Rural Affairs, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Shenzhen, China
- Key Laboratory of Plant Molecular Physiology, Institute of Botany, Chinese Academy of Sciences, University of Chinese Academy of Sciences, Beijing, China
- Yi Zheng
- National Key Laboratory of Tropical Crop Breeding, Shenzhen Branch, Guangdong Laboratory of Lingnan Modern Agriculture, Genome Analysis Laboratory of the Ministry of Agriculture and Rural Affairs, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Shenzhen, China
- Yong Hu
- National Key Laboratory of Tropical Crop Breeding, Shenzhen Branch, Guangdong Laboratory of Lingnan Modern Agriculture, Genome Analysis Laboratory of the Ministry of Agriculture and Rural Affairs, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Shenzhen, China
- Qun Lian
- National Key Laboratory of Tropical Crop Breeding, Shenzhen Branch, Guangdong Laboratory of Lingnan Modern Agriculture, Genome Analysis Laboratory of the Ministry of Agriculture and Rural Affairs, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Shenzhen, China
- Zhaoxu Ma
- National Key Laboratory of Tropical Crop Breeding, Shenzhen Branch, Guangdong Laboratory of Lingnan Modern Agriculture, Genome Analysis Laboratory of the Ministry of Agriculture and Rural Affairs, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Shenzhen, China
- Ludivine Lassois
- Plant Genetics and Rhizosphere Processes Laboratory, TERRA Teaching and Research Center, Gembloux Agro-Bio Tech, University of Liège, Gembloux, Belgium
- Chunzhi Zhang
- National Key Laboratory of Tropical Crop Breeding, Shenzhen Branch, Guangdong Laboratory of Lingnan Modern Agriculture, Genome Analysis Laboratory of the Ministry of Agriculture and Rural Affairs, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Shenzhen, China
- William J Lucas
- Department of Plant Biology, College of Biological Sciences, University of California, Davis, Davis, CA, USA
- Erik Garrison
- Department of Genetics, Genomics and Informatics, University of Tennessee Health Science Center, Memphis, TN, USA
- Nils Stein
- Leibniz Institute of Plant Genetics and Crop Plant Research (IPK) Gatersleben, Seeland, Germany
- Crop Plant Genetics, Institute of Agricultural and Nutritional Sciences, Martin-Luther-University of Halle-Wittenberg, Halle (Saale), Germany
- Thomas Städler
- Institute of Integrative Biology and Zurich-Basel Plant Science Center, ETH Zurich, Zurich, Switzerland
- Yongfeng Zhou
- National Key Laboratory of Tropical Crop Breeding, Shenzhen Branch, Guangdong Laboratory of Lingnan Modern Agriculture, Genome Analysis Laboratory of the Ministry of Agriculture and Rural Affairs, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Shenzhen, China
- National Key Laboratory of Tropical Crop Breeding, Chinese Academy of Tropical Agricultural Sciences, Haikou, China
- Sanwen Huang
- National Key Laboratory of Tropical Crop Breeding, Shenzhen Branch, Guangdong Laboratory of Lingnan Modern Agriculture, Genome Analysis Laboratory of the Ministry of Agriculture and Rural Affairs, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Shenzhen, China.
- National Key Laboratory of Tropical Crop Breeding, Chinese Academy of Tropical Agricultural Sciences, Haikou, China.
8
Jiang M, Qian Q, Lu M, Chen M, Fan Z, Shang Y, Bu C, Du Z, Song S, Zeng J, Xiao J. PlantPan: A comprehensive multi-species plant pan-genome database. The Plant Journal: For Cell and Molecular Biology 2025; 122:e70144. [PMID: 40219973 DOI: 10.1111/tpj.70144] [Received: 10/22/2024] [Revised: 02/17/2025] [Accepted: 03/24/2025] [Indexed: 04/14/2025]
Abstract
The pan-genome represents the complete genomic diversity of specific species, serving as a valuable resource for studying species evolution, crop domestication, and guiding crop breeding and improvement. While there are several single-species-specific plant pan-genome databases, the availability of multi-species pan-genome databases is limited. Additionally, variations in methods and data types used for plant pan-genome analysis across different databases hinder the comparison and integration of pan-genome information from various projects at multi-species or single-species levels. To tackle this challenge, we introduce PlantPan, a comprehensive database housing the results of pan-genome analysis for 195 genomes from 11 plant species. PlantPan aims to provide extensive information, including gene-centric and sequence-centric pan-genome information, graph-based pan-genome, pan-genome openness profiles, gene functions and their variation characteristics, homologous genes, and gene clusters across different species. Statistically, PlantPan incorporates 9 163 011 genes, 694 191 gene clusters, 526 973 370 genome variations, and 1 616 089 non-redundant genome variation groups at the species level, 33 455 098 genome synteny, and 177 827 non-redundant genome synteny groups at the species level. Regarding functional genes, PlantPan contains 5 222 720 genes related to transcription factors, 395 247 literature-reported resistance genes, 455 748 predicted microbial/disease resistance genes, and 1 612 112 genes related to molecular pathways. In summary, PlantPan is a vital platform for advancing the application of pan-genomes in molecular breeding for crops and evolutionary research for plants.
Affiliation(s)
- Meiye Jiang
- National Genomics Data Center, China National Center for Bioinformation, Beijing, 100101, China
- Beijing Institute of Genomics, Chinese Academy of Sciences, Beijing, 100101, China
- University of Chinese Academy of Sciences, Beijing, 100049, China
- Qiheng Qian
- National Genomics Data Center, China National Center for Bioinformation, Beijing, 100101, China
- Beijing Institute of Genomics, Chinese Academy of Sciences, Beijing, 100101, China
- University of Chinese Academy of Sciences, Beijing, 100049, China
- Mingming Lu
- National Genomics Data Center, China National Center for Bioinformation, Beijing, 100101, China
- Beijing Institute of Genomics, Chinese Academy of Sciences, Beijing, 100101, China
- University of Chinese Academy of Sciences, Beijing, 100049, China
- Meili Chen
- National Genomics Data Center, China National Center for Bioinformation, Beijing, 100101, China
- Beijing Institute of Genomics, Chinese Academy of Sciences, Beijing, 100101, China
- Zhuojing Fan
- National Genomics Data Center, China National Center for Bioinformation, Beijing, 100101, China
- Beijing Institute of Genomics, Chinese Academy of Sciences, Beijing, 100101, China
- Yunfei Shang
- Beijing Institute of Genomics, Chinese Academy of Sciences, Beijing, 100101, China
- University of Chinese Academy of Sciences, Beijing, 100049, China
- Congfan Bu
- National Genomics Data Center, China National Center for Bioinformation, Beijing, 100101, China
- Beijing Institute of Genomics, Chinese Academy of Sciences, Beijing, 100101, China
- ZhengLin Du
- National Genomics Data Center, China National Center for Bioinformation, Beijing, 100101, China
- Beijing Institute of Genomics, Chinese Academy of Sciences, Beijing, 100101, China
- Shuhui Song
- National Genomics Data Center, China National Center for Bioinformation, Beijing, 100101, China
- Beijing Institute of Genomics, Chinese Academy of Sciences, Beijing, 100101, China
- Jingyao Zeng
- National Genomics Data Center, China National Center for Bioinformation, Beijing, 100101, China
- Beijing Institute of Genomics, Chinese Academy of Sciences, Beijing, 100101, China
- Jingfa Xiao
- National Genomics Data Center, China National Center for Bioinformation, Beijing, 100101, China
- Beijing Institute of Genomics, Chinese Academy of Sciences, Beijing, 100101, China
- University of Chinese Academy of Sciences, Beijing, 100049, China
9
Li L, Wu Z, Guarracino A, Villani F, Kong D, Mancieri A, Zhang A, Saba L, Chen H, Brozka H, Vales K, Senko AN, Kempermann G, Stuchlik A, Pravenec M, Lechner J, Prins P, Mathur R, Lu L, Yang K, Peng J, Williams RW, Wang X. Genetic modulation of protein expression in rat brain. iScience 2025; 28:112079. [PMID: 40124499 PMCID: PMC11930185 DOI: 10.1016/j.isci.2025.112079] [Received: 02/12/2024] [Revised: 09/05/2024] [Accepted: 02/18/2025] [Indexed: 03/25/2025] Open
Abstract
Genetic variations in protein expression are implicated in a broad spectrum of common diseases and complex traits but remain less explored compared to mRNA and classical phenotypes. This study systematically analyzed brain proteomes in a rat family using tandem mass tag (TMT)-based quantitative mass spectrometry. We quantified 8,119 proteins across two parental strains (SHR/Olalpcv and BN-Lx/Cub) and 29 HXB/BXH recombinant inbred (RI) strains, identifying 597 proteins with differential expression and 464 proteins linked to cis-acting quantitative trait loci (pQTLs). Proteogenomics identified 95 variant peptides, and sex-specific analyses revealed both shared and distinct cis-pQTLs. We improved the ability to pinpoint candidate genes underlying pQTLs by utilizing the rat pangenome and explored the connections between pQTLs in rats and human disorders. Collectively, this study highlights the value of large proteo-genetic datasets in elucidating protein modulation in the brain and its links to complex central nervous system (CNS) traits.
Affiliation(s)
- Ling Li
- Department of Neurology, University of Tennessee Health Science Center, Memphis, TN 38163, USA
- Department of Genetics, Genomics and Informatics, University of Tennessee Health Science Center, Memphis, TN 38163, USA
| | - Zhiping Wu
- Department of Structural Biology, St. Jude Children’s Research Hospital, Memphis, TN 38105, USA
- Department of Developmental Neurobiology, St. Jude Children’s Research Hospital, Memphis, TN 38105, USA
| | - Andrea Guarracino
- Department of Genetics, Genomics and Informatics, University of Tennessee Health Science Center, Memphis, TN 38163, USA
- Human Technopole, Viale Rita Levi-Montalcini, 20157 Milan, Italy
| | - Flavia Villani
- Department of Genetics, Genomics and Informatics, University of Tennessee Health Science Center, Memphis, TN 38163, USA
| | - Dehui Kong
- Department of Neurology, University of Tennessee Health Science Center, Memphis, TN 38163, USA
- Department of Genetics, Genomics and Informatics, University of Tennessee Health Science Center, Memphis, TN 38163, USA
| | - Ariana Mancieri
- Department of Structural Biology, St. Jude Children’s Research Hospital, Memphis, TN 38105, USA
- Department of Developmental Neurobiology, St. Jude Children’s Research Hospital, Memphis, TN 38105, USA
| | - Aijun Zhang
- Department of Neurology, University of Tennessee Health Science Center, Memphis, TN 38163, USA
- Department of Genetics, Genomics and Informatics, University of Tennessee Health Science Center, Memphis, TN 38163, USA
| | - Laura Saba
- Department of Pharmaceutical Sciences, University of Colorado Denver, Aurora, CO 80045, USA
| | - Hao Chen
- Department of Pharmacology, Addiction Science, and Toxicology, University of Tennessee Health Science Center, Memphis, TN 38103, USA
| | - Hana Brozka
- Institute of Physiology of the Czech Academy of Sciences, Prague 14200, Czech Republic
| | - Karel Vales
- Institute of Physiology of the Czech Academy of Sciences, Prague 14200, Czech Republic
| | - Anna N. Senko
- Genomics of Regeneration of the Central Nervous System, Center for Regenerative Therapies Dresden, Dresden University of Technology, 01307 Dresden, Germany
| | - Gerd Kempermann
- Genomics of Regeneration of the Central Nervous System, Center for Regenerative Therapies Dresden, Dresden University of Technology, 01307 Dresden, Germany
| | - Ales Stuchlik
- Institute of Physiology of the Czech Academy of Sciences, Prague 14200, Czech Republic
| | - Michal Pravenec
- Institute of Physiology of the Czech Academy of Sciences, Prague 14200, Czech Republic
| | - Joseph Lechner
- Department of Pediatrics and the Herman B Wells Center for Pediatric Research, Indiana University School of Medicine, Indianapolis, IN 46202, USA
- Department of Microbiology and Immunology, Indiana University School of Medicine, Indianapolis, IN 46202, USA
| | - Pjotr Prins
- Department of Genetics, Genomics and Informatics, University of Tennessee Health Science Center, Memphis, TN 38163, USA
| | - Ramkumar Mathur
- Department of Geriatrics, School of Medicine and Health Sciences, University of North Dakota, Grand Forks, ND 58202, USA
| | - Lu Lu
- Department of Genetics, Genomics and Informatics, University of Tennessee Health Science Center, Memphis, TN 38163, USA
| | - Kai Yang
- Department of Pediatrics and the Herman B Wells Center for Pediatric Research, Indiana University School of Medicine, Indianapolis, IN 46202, USA
- Department of Microbiology and Immunology, Indiana University School of Medicine, Indianapolis, IN 46202, USA
| | - Junmin Peng
- Department of Structural Biology, St. Jude Children’s Research Hospital, Memphis, TN 38105, USA
- Department of Developmental Neurobiology, St. Jude Children’s Research Hospital, Memphis, TN 38105, USA
| | - Robert W. Williams
- Department of Genetics, Genomics and Informatics, University of Tennessee Health Science Center, Memphis, TN 38163, USA
| | - Xusheng Wang
- Department of Neurology, University of Tennessee Health Science Center, Memphis, TN 38163, USA
- Department of Genetics, Genomics and Informatics, University of Tennessee Health Science Center, Memphis, TN 38163, USA
| |
Collapse
|
10
|
Zytnicki M. Assessing genome conservation on pangenome graphs with PanSel. Bioinformatics Advances 2025; 5:vbaf018. [PMID: 40092526 PMCID: PMC11908644 DOI: 10.1093/bioadv/vbaf018] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/26/2024] [Revised: 12/21/2024] [Accepted: 02/03/2025] [Indexed: 03/19/2025]
Abstract
Motivation: With more and more telomere-to-telomere genomes assembled, pangenomes make it possible to capture the genomic diversity of a species. Because they introduce fewer biases, pangenomes represented as graphs tend to supplant the usual linear representation of a reference genome augmented with variations. However, this major change requires new tools adapted to this data structure. Among the numerous questions that can be addressed with a pangenome graph is the search for conserved or divergent genes. Results: In this article, we present a new tool, named PanSel, which computes a conservation score for each segment of the genome and finds genomic regions that are significantly conserved or divergent. PanSel can be used on prokaryotes and eukaryotes, with a sequence identity of at least 98%. Availability and implementation: PanSel, written in C++11 with no dependency, is available at https://github.com/mzytnicki/pansel.
Affiliation(s)
- Matthias Zytnicki
  - Unité de Mathématiques et Informatique Appliquées, INRAE, 31 326 Castanet-Tolosan, France
11
MacNish TR, Al‐Mamun HA, Bayer PE, McPhan C, Fernandez CGT, Upadhyaya SR, Liu S, Batley J, Parkin IAP, Sharpe AG, Edwards D. Brassica Panache: A multi-species graph pangenome representing presence absence variation across forty-one Brassica genomes. The Plant Genome 2025; 18:e20535. [PMID: 39648684 PMCID: PMC11730171 DOI: 10.1002/tpg2.20535] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/03/2024] [Revised: 10/20/2024] [Accepted: 11/01/2024] [Indexed: 12/10/2024]
Abstract
Brassicas are economically important crop species that provide a source of healthy oil and vegetables. With the rising population and the impact of climate change on agriculture, there is an increasing need to improve agronomically important traits of crops such as Brassica. The genomes of plant species have significant sequence presence-absence variation (PAV), which is a source of genetic variation that can be used for crop improvement, and this species variation can be captured through the construction of pangenomes. Graph pangenomes are a recent reference format that represents the genomic variation within a species or population as alternate paths in a sequence graph. Graph pangenomes contain information on alignment, PAV, and annotation. Here we present the first multi-species graph pangenome for Brassica, visualized with the pangenome analyzer with chromosomal exploration (Panache).
Affiliation(s)
- Tessa R. MacNish
  - School of Biological Sciences, The University of Western Australia, Perth, Western Australia, Australia
  - Center for Applied Bioinformatics, The University of Western Australia, Perth, Western Australia, Australia
- Hawlader A. Al‐Mamun
  - School of Biological Sciences, The University of Western Australia, Perth, Western Australia, Australia
  - Center for Applied Bioinformatics, The University of Western Australia, Perth, Western Australia, Australia
- Philipp E. Bayer
  - School of Biological Sciences, The University of Western Australia, Perth, Western Australia, Australia
  - Center for Applied Bioinformatics, The University of Western Australia, Perth, Western Australia, Australia
  - Minderoo Foundation, Perth, Western Australia, Australia
- Connor McPhan
  - School of Biological Sciences, The University of Western Australia, Perth, Western Australia, Australia
  - Center for Applied Bioinformatics, The University of Western Australia, Perth, Western Australia, Australia
- Cassandria G. Tay Fernandez
  - School of Biological Sciences, The University of Western Australia, Perth, Western Australia, Australia
  - Center for Applied Bioinformatics, The University of Western Australia, Perth, Western Australia, Australia
- Shriprabha R. Upadhyaya
  - School of Biological Sciences, The University of Western Australia, Perth, Western Australia, Australia
  - Center for Applied Bioinformatics, The University of Western Australia, Perth, Western Australia, Australia
- Shengyi Liu
  - Oil Crops Research Institute, CAAS, Wuhan, China
- Jacqueline Batley
  - School of Biological Sciences, The University of Western Australia, Perth, Western Australia, Australia
- David Edwards
  - School of Biological Sciences, The University of Western Australia, Perth, Western Australia, Australia
  - Center for Applied Bioinformatics, The University of Western Australia, Perth, Western Australia, Australia
12
Villani F, Guarracino A, Ward RR, Green T, Emms M, Pravenec M, Sharp B, Prins P, Garrison E, Williams RW, Chen H, Colonna V. Pangenome reconstruction in rats enhances genotype-phenotype mapping and variant discovery. iScience 2025; 28:111835. [PMID: 40034122 PMCID: PMC11875200 DOI: 10.1016/j.isci.2025.111835] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/08/2024] [Revised: 04/16/2024] [Accepted: 01/15/2025] [Indexed: 03/05/2025] Open
Abstract
The HXB/BXH family of recombinant inbred rat strains is a unique genetic resource that has been extensively phenotyped over 25 years, resulting in a vast dataset of quantitative molecular and physiological phenotypes. We built a pangenome graph from 10x Genomics Linked-Read data for 31 recombinant inbred rats to study genetic variation and association mapping. The pangenome includes 0.2 Gb of sequence that is not present in the reference mRatBN7.2, confirming the capture of substantial additional variation. We validated variants in challenging regions, including complex structural variants resolving into multiple haplotypes. Phenome-wide association analysis of validated SNPs uncovered variants associated with glucose/insulin levels and hippocampal gene expression. We propose an interaction between Pirl1l1, chromogranin expression, TNF-α levels, and insulin regulation. This study demonstrates the utility of linked-read pangenomes for comprehensive variant detection and mapping phenotypic diversity in a widely used rat genetic reference panel.
Affiliation(s)
- Flavia Villani
  - Department of Genetics, Genomics, and Informatics, University of Tennessee Health Science Center, Memphis, TN, USA
- Andrea Guarracino
  - Department of Genetics, Genomics, and Informatics, University of Tennessee Health Science Center, Memphis, TN, USA
- Rachel R. Ward
  - Department of Pharmacology, Addiction Science, and Toxicology, University of Tennessee Health Science Center, Memphis, TN, USA
- Tomomi Green
  - Department of Pharmacology, Addiction Science, and Toxicology, University of Tennessee Health Science Center, Memphis, TN, USA
- Madeleine Emms
  - Institute of Genetics and Biophysics, National Research Council, 80111 Naples, Italy
- Michal Pravenec
  - Institute of Physiology, Czech Academy of Sciences, 14200 Prague, Czech Republic
- Burt Sharp
  - Department of Genetics, Genomics, and Informatics, University of Tennessee Health Science Center, Memphis, TN, USA
- Pjotr Prins
  - Department of Genetics, Genomics, and Informatics, University of Tennessee Health Science Center, Memphis, TN, USA
- Erik Garrison
  - Department of Genetics, Genomics, and Informatics, University of Tennessee Health Science Center, Memphis, TN, USA
- Robert W. Williams
  - Department of Genetics, Genomics, and Informatics, University of Tennessee Health Science Center, Memphis, TN, USA
- Hao Chen
  - Department of Pharmacology, Addiction Science, and Toxicology, University of Tennessee Health Science Center, Memphis, TN, USA
- Vincenza Colonna
  - Department of Genetics, Genomics, and Informatics, University of Tennessee Health Science Center, Memphis, TN, USA
  - Institute of Genetics and Biophysics, National Research Council, 80111 Naples, Italy
13
Miao Z, Yue JX. Interactive visualization and interpretation of pangenome graphs by linear reference-based coordinate projection and annotation integration. Genome Res 2025; 35:296-310. [PMID: 39805704 PMCID: PMC11874961 DOI: 10.1101/gr.279461.124] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/13/2024] [Accepted: 01/08/2025] [Indexed: 01/16/2025]
Abstract
With the increasing availability of high-quality genome assemblies, pangenome graphs emerged as a new paradigm in the genomic field for identifying, encoding, and presenting genomic variation at both the population and species level. However, it remains challenging to truly dissect and interpret pangenome graphs via biologically informative visualization. To facilitate better exploration and understanding of pangenome graphs toward novel biological insights, here we present a web-based interactive visualization and interpretation framework for linear reference-projected pangenome graphs (VRPG). VRPG provides efficient and intuitive support for exploring and annotating pangenome graphs along a linear-genome-based coordinate system (e.g., that of a primary linear reference genome). Moreover, VRPG offers many unique features such as in-graph path highlighting for graph-constituent input assemblies, copy number characterization for graph-embedding nodes, and graph-based mapping for query sequences, all of which are highly valuable for researchers working with pangenome graphs. Additionally, VRPG enables side-by-side visualization between the graph-based pangenome representation and the conventional primary linear reference genome-based feature annotations, therefore seamlessly bridging the graph and linear genomic contexts. To further demonstrate its functionality and scalability, we applied VRPG to the cutting-edge yeast and human reference pangenome graphs derived from hundreds of high-quality genome assemblies via a dedicated web portal and examined their local genome diversity in the graph contexts.
Affiliation(s)
- Zepu Miao
  - State Key Laboratory of Oncology in South China, Guangdong Key Laboratory of Nasopharyngeal Carcinoma Diagnosis and Therapy, Guangdong Provincial Clinical Research Center for Cancer, Sun Yat-sen University Cancer Center, Guangzhou 510060, China
- Jia-Xing Yue
  - State Key Laboratory of Oncology in South China, Guangdong Key Laboratory of Nasopharyngeal Carcinoma Diagnosis and Therapy, Guangdong Provincial Clinical Research Center for Cancer, Sun Yat-sen University Cancer Center, Guangzhou 510060, China
14
Edwards SV, Fang B, Khost D, Kolyfetis GE, Cheek RG, DeRaad DA, Chen N, Fitzpatrick JW, McCormack JE, Funk WC, Ghalambor CK, Garrison E, Guarracino A, Li H, Sackton TB. Comparative population pangenomes reveal unexpected complexity and fitness effects of structural variants. bioRxiv 2025:2025.02.11.637762. [PMID: 39990470 PMCID: PMC11844517 DOI: 10.1101/2025.02.11.637762] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 02/25/2025]
Abstract
Structural variants (SVs) are widespread in vertebrate genomes, yet their evolutionary dynamics remain poorly understood. Using 45 long-read de novo genome assemblies and pangenome tools, we analyze SVs within three closely related species of North American jays (Aphelocoma, scrub-jays) displaying a 60-fold range in effective population size. We find rapid evolution of genome architecture, including ~100 Mb variation in genome size driven by dynamic satellite landscapes with unexpectedly long (> 10 kb) repeat units and widespread variation in gene content, influencing gene expression. SVs exhibit slightly deleterious dynamics modulated by variant length and population size, with strong evidence of adaptive fixation only in large populations. Our results demonstrate how population size shapes the distribution of SVs and the importance of pangenomes to characterizing genomic diversity.
Affiliation(s)
- Scott V. Edwards
  - Department of Organismic and Evolutionary Biology, Harvard University, 26 Oxford Street, Cambridge, MA, 02138, USA
  - Museum of Comparative Zoology, Harvard University, 26 Oxford Street, Cambridge, MA, 02138, USA
- Bohao Fang
  - Department of Organismic and Evolutionary Biology, Harvard University, 26 Oxford Street, Cambridge, MA, 02138, USA
  - Museum of Comparative Zoology, Harvard University, 26 Oxford Street, Cambridge, MA, 02138, USA
- Danielle Khost
  - Informatics Group, Harvard University, 52 Oxford St, Cambridge, MA, 02138, USA
- George E Kolyfetis
  - Department of Organismic and Evolutionary Biology, Harvard University, 26 Oxford Street, Cambridge, MA, 02138, USA
- Rebecca G Cheek
  - Department of Biology, Graduate Degree Program in Ecology, Colorado State University, 1878 Campus Delivery, Fort Collins, CO, 80523, USA
- Devon A DeRaad
  - Moore Laboratory of Zoology, Occidental College, 1600 Campus Rd, Los Angeles, CA, 90041, USA
- Nancy Chen
  - Department of Biology, University of Rochester, 477 Hutchison Hall, Box 270211, Rochester, NY, 14627, USA
- John W Fitzpatrick
  - Cornell Lab of Ornithology, Cornell University, 159 Sapsucker Woods Rd, Ithaca, NY, 14850, USA
- John E. McCormack
  - Moore Laboratory of Zoology, Occidental College, 1600 Campus Rd, Los Angeles, CA, 90041, USA
- W. Chris Funk
  - Department of Biology, Graduate Degree Program in Ecology, Colorado State University, 1878 Campus Delivery, Fort Collins, CO, 80523, USA
- Cameron K Ghalambor
  - Department of Biology, Norwegian University of Science and Technology, Høgskoleringen 5, Realfagbygget D1-137, Trondheim, 7491, Norway
- Erik Garrison
  - Department of Genetics, Genomics and Informatics, University of Tennessee Health Science Center, 71 S. Manassas Street, Memphis, TN, 38163, USA
- Andrea Guarracino
  - Department of Genetics, Genomics and Informatics, University of Tennessee Health Science Center, 71 S. Manassas Street, Memphis, TN, 38163, USA
- Heng Li
  - Department of Data Science, Dana-Farber Cancer Institute, 450 Brookline Ave, Mailstop: CLSB 11007, Boston, MA, 02215
- Timothy B Sackton
  - Informatics Group, Harvard University, 52 Oxford St, Cambridge, MA, 02138, USA
15
Ruperao P, Rangan P, Shah T, Sharma V, Rathore A, Mayes S, Pandey MK. Developing pangenomes for large and complex plant genomes and their representation formats. J Adv Res 2025:S2090-1232(25)00071-2. [PMID: 39894347 DOI: 10.1016/j.jare.2025.01.052] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/25/2024] [Revised: 01/27/2025] [Accepted: 01/27/2025] [Indexed: 02/04/2025] Open
Abstract
BACKGROUND: The development of pangenomes has revolutionized genomic studies by capturing the complete genetic diversity within a species. Pangenome assembly integrates data from multiple individuals to construct a comprehensive genomic landscape, revealing both core and accessory genomic elements. This approach enables the identification of novel genes, structural variations, and gene presence-absence variations, providing insights into species evolution, adaptation, and trait variation. Representing pangenomes requires innovative visualization formats that effectively convey the complex genomic structures and variations.
AIM: This review delves into contemporary methodologies and recent advancements in constructing pangenomes, particularly in plant genomes. It examines the structure of pangenome representation, including format comparison, conversion, visualization techniques, and their implications for enhancing crop improvement strategies.
KEY SCIENTIFIC CONCEPTS OF REVIEW: Earlier comparative studies have illuminated novel gene sequences, copy number variations, and presence-absence variations across diverse crop species. The concept of a pangenome, which captures multiple genetic variations from a broad spectrum of genotypes, offers a holistic perspective of a species' genetic makeup. However, constructing a pangenome for plants with larger genomes poses challenges, including managing vast genome sequence data and comprehending the genetic variations within the germplasm. To address these challenges, researchers have explored cost-effective alternatives to encapsulate species diversity in a single assembly known as a pangenome. This involves reducing the volume of genome sequences while focusing on genetic variations. With the growing prominence of the pangenome concept in plant genomics, several software tools have emerged to facilitate pangenome construction. This review sheds light on developing and utilizing software tools tailored for constructing pangenomes in plants. It also discusses representation formats suitable for downstream analyses, offering valuable insights into the genetic landscape and evolutionary dynamics of plant species. In summary, this review underscores the significance of pangenome construction and representation formats in resolving the genetic architecture of plants, particularly those with complex genomes. It provides a comprehensive overview of recent advancements, aiding in exploring and understanding plant genetic diversity.
Affiliation(s)
- Pradeep Ruperao
  - Center of Excellence in Genomics and Systems Biology (CEGSB) and Center for Pre-Breeding Research (CPBR), International Crops Research Institute for the Semi-Arid Tropics (ICRISAT), Hyderabad, India
- Parimalan Rangan
  - ICAR-National Bureau of Plant Genetic Resources (NBPGR), New Delhi, India; Queensland Alliance for Agriculture and Food Innovation, The University of Queensland, St Lucia, Australia
- Trushar Shah
  - International Institute of Tropical Agriculture (IITA), Nairobi, Kenya
- Vinay Sharma
  - Center of Excellence in Genomics and Systems Biology (CEGSB) and Center for Pre-Breeding Research (CPBR), International Crops Research Institute for the Semi-Arid Tropics (ICRISAT), Hyderabad, India
- Abhishek Rathore
  - International Maize and Wheat Improvement Center (CIMMYT), Nairobi, Kenya
- Sean Mayes
  - Center of Excellence in Genomics and Systems Biology (CEGSB) and Center for Pre-Breeding Research (CPBR), International Crops Research Institute for the Semi-Arid Tropics (ICRISAT), Hyderabad, India
- Manish K Pandey
  - Center of Excellence in Genomics and Systems Biology (CEGSB) and Center for Pre-Breeding Research (CPBR), International Crops Research Institute for the Semi-Arid Tropics (ICRISAT), Hyderabad, India
16
Hu H, Zhao J, Thomas WJW, Batley J, Edwards D. The role of pangenomics in orphan crop improvement. Nat Commun 2025; 16:118. [PMID: 39746989 PMCID: PMC11696220 DOI: 10.1038/s41467-024-55260-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/11/2024] [Accepted: 12/05/2024] [Indexed: 01/04/2025] Open
Abstract
Global food security depends heavily on a few staple crops, while orphan crops, despite being less studied, offer the potential benefits of environmental adaptation and enhanced nutritional traits, especially in a changing climate. Major crops have benefited from genomics-based breeding, initially using single genomes and later pangenomes. Recent advances in DNA sequencing have enabled pangenome construction for several orphan crops, offering a more comprehensive understanding of genetic diversity. Orphan crop research has now entered the pangenomics era, and applying these pangenomes with advanced selection methods and genome editing technologies can transform these neglected species into crops of broader agricultural significance.
Affiliation(s)
- Haifei Hu
  - Rice Research Institute, Guangdong Academy of Agricultural Sciences & Key Laboratory of Genetics and Breeding of High Quality Rice in Southern China (Co-construction by Ministry and Province), Ministry of Agriculture and Rural Affairs & Guangdong Key Laboratory of Rice Science and Technology, Guangzhou, China
- Junliang Zhao
  - Rice Research Institute, Guangdong Academy of Agricultural Sciences & Key Laboratory of Genetics and Breeding of High Quality Rice in Southern China (Co-construction by Ministry and Province), Ministry of Agriculture and Rural Affairs & Guangdong Key Laboratory of Rice Science and Technology, Guangzhou, China
- William J W Thomas
  - School of Biological Sciences, University of Western Australia, Perth, WA, Australia
- Jacqueline Batley
  - School of Biological Sciences, University of Western Australia, Perth, WA, Australia
- David Edwards
  - School of Biological Sciences, University of Western Australia, Perth, WA, Australia
  - Centre for Applied Bioinformatics, University of Western Australia, Perth, WA, Australia
17
Secomandi S, Gallo GR, Rossi R, Rodríguez Fernandes C, Jarvis ED, Bonisoli-Alquati A, Gianfranceschi L, Formenti G. Pangenome graphs and their applications in biodiversity genomics. Nat Genet 2025; 57:13-26. [PMID: 39779953 DOI: 10.1038/s41588-024-02029-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/23/2024] [Accepted: 11/08/2024] [Indexed: 01/11/2025]
Abstract
Complete datasets of genetic variants are key to biodiversity genomic studies. Long-read sequencing technologies allow the routine assembly of highly contiguous, haplotype-resolved reference genomes. However, even when complete, reference genomes from a single individual may bias downstream analyses and fail to adequately represent genetic diversity within a population or species. Pangenome graphs assembled from aligned collections of high-quality genomes can overcome representation bias by integrating sequence information from multiple genomes from the same population, species or genus into a single reference. Here, we review the available tools and data structures to build, visualize and manipulate pangenome graphs while providing practical examples and discussing their applications in biodiversity and conservation genomics across the tree of life.
Affiliation(s)
- Simona Secomandi
  - Laboratory of Neurogenetics of Language, the Rockefeller University, New York, NY, USA
- Riccardo Rossi
  - Department of Biotechnology and Biosciences, University of Milano-Bicocca, Milan, Italy
- Carlos Rodríguez Fernandes
  - Centre for Ecology, Evolution and Environmental Changes (CE3C) and CHANGE, Global Change and Sustainability Institute, Departamento de Biologia Animal, Faculdade de Ciências, Universidade de Lisboa, Lisboa, Portugal
  - Faculdade de Psicologia, Universidade de Lisboa, Lisboa, Portugal
- Erich D Jarvis
  - Laboratory of Neurogenetics of Language, the Rockefeller University, New York, NY, USA
  - The Vertebrate Genome Laboratory, New York, NY, USA
- Andrea Bonisoli-Alquati
  - Department of Biological Sciences, California State Polytechnic University, Pomona, Pomona, CA, USA
18
Vorbrugg S, Bezrukov I, Bao Z, Weigel D. Gretl-variation GRaph Evaluation TooLkit. Bioinformatics 2024; 41:btae755. [PMID: 39719064 PMCID: PMC11729725 DOI: 10.1093/bioinformatics/btae755] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/24/2024] [Revised: 11/15/2024] [Accepted: 12/21/2024] [Indexed: 12/26/2024] Open
Abstract
MOTIVATION: As genome graphs are powerful data structures for representing the genetic diversity within populations, they can help identify genomic variations that traditional linear references miss, but their complexity and size make the analysis of genome graphs challenging. We sought to develop a genome graph analysis tool that makes these analyses more accessible by addressing the limitations of existing tools. Specifically, we improve scalability and user-friendliness, and we provide many new statistics tailored to variation graphs for graph evaluation, including sample-specific features. RESULTS: We developed an efficient, comprehensive, and integrated tool, gretl, to analyze genome graphs and gain insights into their structure and composition by providing a wide range of statistics. gretl can be utilized to evaluate different graphs, compare the output of graph construction pipelines with different parameters, as well as perform an in-depth analysis of individual graphs, including sample-specific analysis. With the assistance of gretl, novel patterns of genetic variation and potential regions of interest can be identified for later, more detailed inspection. We demonstrate that gretl outperforms other tools in terms of speed, particularly for larger genome graphs. AVAILABILITY AND IMPLEMENTATION: Commented Rust source code and documentation are available under the MIT license at https://github.com/MoinSebi/gretl together with Python scripts and step-by-step usage examples. The package is available at Bioconda for easy installation.
Affiliation(s)
- Sebastian Vorbrugg
  - Department of Molecular Biology, Max Planck Institute for Biology Tübingen, 72076 Tübingen, Germany
- Ilja Bezrukov
  - Department of Molecular Biology, Max Planck Institute for Biology Tübingen, 72076 Tübingen, Germany
- Zhigui Bao
  - Department of Molecular Biology, Max Planck Institute for Biology Tübingen, 72076 Tübingen, Germany
- Detlef Weigel
  - Department of Molecular Biology, Max Planck Institute for Biology Tübingen, 72076 Tübingen, Germany
19
van Westerhoven AC, Dijkstra J, Aznar Palop JL, Wissink K, Bell J, Kema GHJ, Seidl MF. Frequent genetic exchanges revealed by a pan-mitogenome graph of a fungal plant pathogen. mBio 2024; 15:e0275824. [PMID: 39535230 PMCID: PMC11633160 DOI: 10.1128/mbio.02758-24] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/10/2024] [Accepted: 10/15/2024] [Indexed: 11/16/2024] Open
Abstract
Mitochondria are present in almost all eukaryotic lineages. The mitochondrial genomes (mitogenomes) evolve separately from nuclear genomes, and they can therefore provide relevant insights into the evolution of their host species. Fusarium oxysporum is a major fungal plant pathogen that is assumed to reproduce clonally. However, horizontal chromosome transfer between strains can occur through heterokaryon formation, and recently, signs of sexual recombination have been observed. Similarly, signs of recombination in F. oxysporum mitogenomes challenged the prevailing assumption of clonal reproduction in this species. Here, we construct, to our knowledge, the first fungal pan-mitogenome graph of nearly 500 F. oxysporum mitogenome assemblies to uncover their variation and evolution. In general, the gene order of fungal mitogenomes is not well conserved, yet the mitogenomes of F. oxysporum and related species are highly colinear. We observed two strikingly contrasting regions in the F. oxysporum pan-mitogenome, comprising a highly conserved core mitogenome and a long variable region (6-16 kb in size), of which we identified three distinct types. The pan-mitogenome graph reveals that only five intron insertions occurred in the core mitogenome and that the long variable regions drive the difference between mitogenomes. Moreover, we observed that their evolution is neither concurrent with the core mitogenome nor with the nuclear genome. Our large-scale analysis of long variable regions uncovers frequent recombination between mitogenomes, even between strains that belong to different taxonomic clades. This challenges the common assumption of incompatibility between genetically diverse F. oxysporum strains and provides new insights into the evolution of this fungal species.
IMPORTANCE: Insights into plant pathogen evolution are essential for the understanding and management of disease. Fusarium oxysporum is a major fungal pathogen that can infect many economically important crops. Pathogenicity can be transferred between strains by the horizontal transfer of pathogenicity chromosomes. The fungus has been thought to evolve clonally, yet recent evidence suggests active sexual recombination between related isolates, which could at least partially explain the horizontal transfer of pathogenicity chromosomes. By constructing a pan-genome graph of nearly 500 mitochondrial genomes, we describe the genetic variation of mitochondria in unprecedented detail and demonstrate frequent mitochondrial recombination. Importantly, recombination can occur between genetically diverse isolates from distinct taxonomic clades and thus can shed light on genetic exchange between fungal strains.
Affiliation(s)
- Anouk C. van Westerhoven
  - Theoretical Biology and Bioinformatics, Utrecht University, Utrecht, Netherlands
  - Laboratory of Phytopathology, Wageningen University and Research, Wageningen, Netherlands
- Jelmer Dijkstra
  - Laboratory of Phytopathology, Wageningen University and Research, Wageningen, Netherlands
- Jose L. Aznar Palop
  - Theoretical Biology and Bioinformatics, Utrecht University, Utrecht, Netherlands
- Kyran Wissink
  - Theoretical Biology and Bioinformatics, Utrecht University, Utrecht, Netherlands
- Jasper Bell
  - Theoretical Biology and Bioinformatics, Utrecht University, Utrecht, Netherlands
- Gert H. J. Kema
  - Laboratory of Phytopathology, Wageningen University and Research, Wageningen, Netherlands
- Michael F. Seidl
  - Theoretical Biology and Bioinformatics, Utrecht University, Utrecht, Netherlands

20
Guo M, Bi G, Wang H, Ren H, Chen J, Lian Q, Wang X, Fang W, Zhang J, Dong Z, Pang Y, Zhang Q, Huang S, Yan J, Zhao X. Genomes of autotetraploid wild and cultivated Ziziphus mauritiana reveal polyploid evolution and crop domestication. Plant Physiology 2024; 196:2701-2720. [PMID: 39325737 DOI: 10.1093/plphys/kiae512] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/29/2024] [Revised: 08/28/2024] [Accepted: 09/12/2024] [Indexed: 09/28/2024]
Abstract
Indian jujube (Ziziphus mauritiana) holds a prominent position in the global fruit and pharmaceutical markets. Here, we report the assemblies of haplotype-resolved, telomere-to-telomere genomes of autotetraploid wild and cultivated Indian jujube plants using a 2-stage assembly strategy. The generation of these genomes permitted in-depth investigations into the divergence and evolutionary history of this important fruit crop. Using a graph-based pan-genome constructed from 8 monoploid genomes, we identified structural variation (SV)-FST hotspots and SV hotspots. Gap-free genomes provide a means to obtain a global view of centromere structures. We identified presence-absence variation-related genes in 4 monoploid genomes (cI, cIII, wI, and wIII) and resequencing populations. We also present the population structure and domestication trajectory of the Indian jujube based on the resequencing of 73 wild and cultivated accessions. Metabolomic and transcriptomic analyses of mature fruits of wild and cultivated accessions unveiled the genetic basis underlying loss of fruit astringency during domestication of Indian jujube. This study reveals mechanisms underlying the divergence, evolution, and domestication of the autotetraploid Indian jujube and provides rich and reliable genetic resources for future research.
Affiliation(s)
- Mingxin Guo
  - College of Life Sciences, Luoyang Normal University, Luoyang 471934, China
- Guiqi Bi
  - Shenzhen Branch, Guangdong Laboratory of Lingnan Modern Agriculture, Key Laboratory of Synthetic Biology, Ministry of Agriculture and Rural Affairs, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Shenzhen 518124, China
- Huan Wang
  - Shenzhen Branch, Guangdong Laboratory of Lingnan Modern Agriculture, Key Laboratory of Synthetic Biology, Ministry of Agriculture and Rural Affairs, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Shenzhen 518124, China
  - Guangdong Province Key Laboratory of Microbial Signals and Disease Control, Integrative Microbiology Research Centre, and College of Plant Protection, South China Agricultural University, Guangzhou 510642, China
- Hui Ren
  - Horticultural Research Institute, Guangxi Academy of Agricultural Sciences, Nanning 530007, China
- Jiaying Chen
  - South Subtropical Crops Research Institute, Chinese Academy of Tropical Agricultural Sciences, Zhanjiang 524000, China
- Qun Lian
  - Shenzhen Branch, Guangdong Laboratory of Lingnan Modern Agriculture, Key Laboratory of Synthetic Biology, Ministry of Agriculture and Rural Affairs, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Shenzhen 518124, China
- Xiaomei Wang
  - Horticultural Research Institute, Guangxi Academy of Agricultural Sciences, Nanning 530007, China
- Weikuan Fang
  - Horticultural Research Institute, Guangxi Academy of Agricultural Sciences, Nanning 530007, China
- Jiangjiang Zhang
  - Shenzhen Branch, Guangdong Laboratory of Lingnan Modern Agriculture, Key Laboratory of Synthetic Biology, Ministry of Agriculture and Rural Affairs, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Shenzhen 518124, China
- Zhaonian Dong
  - Shenzhen Branch, Guangdong Laboratory of Lingnan Modern Agriculture, Key Laboratory of Synthetic Biology, Ministry of Agriculture and Rural Affairs, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Shenzhen 518124, China
- Yi Pang
  - College of Life Sciences, Luoyang Normal University, Luoyang 471934, China
- Quanling Zhang
  - Shenzhen Branch, Guangdong Laboratory of Lingnan Modern Agriculture, Key Laboratory of Synthetic Biology, Ministry of Agriculture and Rural Affairs, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Shenzhen 518124, China
- Sanwen Huang
  - Shenzhen Branch, Guangdong Laboratory of Lingnan Modern Agriculture, Key Laboratory of Synthetic Biology, Ministry of Agriculture and Rural Affairs, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Shenzhen 518124, China
- Jianbin Yan
  - Shenzhen Branch, Guangdong Laboratory of Lingnan Modern Agriculture, Key Laboratory of Synthetic Biology, Ministry of Agriculture and Rural Affairs, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Shenzhen 518124, China
- Xusheng Zhao
  - College of Life Sciences, Luoyang Normal University, Luoyang 471934, China

21
|
Kalbfleisch TS, Smith ML, Ciosek JL, Li K, Doris PA. Three decades of rat genomics: approaching the finish(ed) line. Physiol Genomics 2024; 56:807-818. [PMID: 39348459 PMCID: PMC11573253 DOI: 10.1152/physiolgenomics.00110.2024] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/26/2024] [Revised: 09/11/2024] [Accepted: 09/26/2024] [Indexed: 10/02/2024] Open
Abstract
The rat, Rattus norvegicus, has provided an important model for investigation of a range of characteristics of biomedical importance. Here we survey the origins of this species, its introduction into laboratory research, and the emergence of genetic and genomic methods that utilize this model organism. Genomic studies have yielded important progress and provided new insight into several biologically important traits. However, some studies have been impeded by the lack of a complete and accurate reference genome for this species. New sequencing and genome assembly methods applied to the rat have resulted in a new reference genome assembly, GRCr8, which is a near telomere-to-telomere assembly of high base-level accuracy that incorporates several elements not captured in prior assemblies. As genome assembly methods continue to advance and production costs become a less significant obstacle, genome assemblies for multiple inbred rat strains are emerging. These assemblies will allow a rat pangenome assembly to be constructed that captures all the genetic variations in strains selected for their utility in research and will overcome reference bias, a limitation associated with reliance on a single reference assembly. By this means, the full utility of this model organism to genomic studies will begin to be revealed.
Affiliation(s)
- Theodore S Kalbfleisch
  - Gluck Equine Research Center, University of Kentucky, Lexington, Kentucky, United States
- Melissa L Smith
  - Department of Biochemistry and Molecular Biology, University of Louisville School of Medicine, Louisville, Kentucky, United States
- Julia L Ciosek
  - Gluck Equine Research Center, University of Kentucky, Lexington, Kentucky, United States
- Kai Li
  - Gluck Equine Research Center, University of Kentucky, Lexington, Kentucky, United States
- Peter A Doris
  - Center for Human Genetics, Brown Foundation Institute of Molecular Medicine, McGovern Medical School, University of Texas Health Science Center, Houston, Texas, United States

22
Jayakodi M, Lu Q, Pidon H, Rabanus-Wallace MT, Bayer M, Lux T, Guo Y, Jaegle B, Badea A, Bekele W, Brar GS, Braune K, Bunk B, Chalmers KJ, Chapman B, Jørgensen ME, Feng JW, Feser M, Fiebig A, Gundlach H, Guo W, Haberer G, Hansson M, Himmelbach A, Hoffie I, Hoffie RE, Hu H, Isobe S, König P, Kale SM, Kamal N, Keeble-Gagnère G, Keller B, Knauft M, Koppolu R, Krattinger SG, Kumlehn J, Langridge P, Li C, Marone MP, Maurer A, Mayer KFX, Melzer M, Muehlbauer GJ, Murozuka E, Padmarasu S, Perovic D, Pillen K, Pin PA, Pozniak CJ, Ramsay L, Pedas PR, Rutten T, Sakuma S, Sato K, Schüler D, Schmutzer T, Scholz U, Schreiber M, Shirasawa K, Simpson C, Skadhauge B, Spannagl M, Steffenson BJ, Thomsen HC, Tibbits JF, Nielsen MTS, Trautewig C, Vequaud D, Voss C, Wang P, Waugh R, Westcott S, Rasmussen MW, Zhang R, Zhang XQ, Wicker T, Dockter C, Mascher M, Stein N. Structural variation in the pangenome of wild and domesticated barley. Nature 2024; 636:654-662. [PMID: 39537924 PMCID: PMC11655362 DOI: 10.1038/s41586-024-08187-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/09/2023] [Accepted: 10/09/2024] [Indexed: 11/16/2024]
Abstract
Pangenomes are collections of annotated genome sequences of multiple individuals of a species¹. The structural variants uncovered by these datasets are a major asset to genetic analysis in crop plants². Here we report a pangenome of barley comprising long-read sequence assemblies of 76 wild and domesticated genomes and short-read sequence data of 1,315 genotypes. An expanded catalogue of sequence variation in the crop includes structurally complex loci that are rich in gene copy number variation. To demonstrate the utility of the pangenome, we focus on four loci involved in disease resistance, plant architecture, nutrient release and trichome development. Novel allelic variation at a powdery mildew resistance locus and population-specific copy number gains in a regulator of vegetative branching were found. Expansion of a family of starch-cleaving enzymes in elite malting barleys was linked to shifts in enzymatic activity in micro-malting trials. Deletion of an enhancer motif is likely to change the developmental trajectory of the hairy appendages on barley grains. Our findings indicate that allelic diversity at structurally complex loci may have helped crop plants to adapt to new selective regimes in agricultural ecosystems.
Affiliation(s)
- Murukarthick Jayakodi
  - Leibniz Institute of Plant Genetics and Crop Plant Research (IPK) Gatersleben, Seeland, Germany
  - Department of Soil and Crop Sciences, Texas A&M AgriLife Research-Dallas, Dallas, TX, USA
- Qiongxian Lu
  - Carlsberg Research Laboratory, Copenhagen, Denmark
- Hélène Pidon
  - Leibniz Institute of Plant Genetics and Crop Plant Research (IPK) Gatersleben, Seeland, Germany
  - IPSiM, University of Montpellier, CNRS, INRAE, Institut Agro, Montpellier, France
- Thomas Lux
  - PGSB-Plant Genome and Systems Biology, Helmholtz Center Munich-German Research Center for Environmental Health, Neuherberg, Germany
- Yu Guo
  - Leibniz Institute of Plant Genetics and Crop Plant Research (IPK) Gatersleben, Seeland, Germany
- Benjamin Jaegle
  - Department of Plant and Microbial Biology, University of Zurich, Zurich, Switzerland
- Ana Badea
  - Brandon Research and Development Centre, Agriculture et Agri-Food Canada, Brandon, Manitoba, Canada
- Wubishet Bekele
  - Ottawa Research and Development Centre, Agriculture et Agri-Food Canada, Ottawa, Ontario, Canada
- Gurcharn S Brar
  - Faculty of Land and Food Systems, The University of British Columbia, Vancouver, British Columbia, Canada
  - Faculty of Agricultural, Life and Environmental Sciences (ALES), University of Alberta, Edmonton, Alberta, Canada
- Boyke Bunk
  - DSMZ-German Collection of Microorganisms and Cell Cultures GmbH, Braunschweig, Germany
- Kenneth J Chalmers
  - School of Agriculture, Food and Wine, University of Adelaide, Urrbrae, South Australia, Australia
- Brett Chapman
  - Western Crop Genetics Alliance, Food Futures Institute/School of Agriculture, Murdoch University, Murdoch, Western Australia, Australia
- Jia-Wu Feng
  - Leibniz Institute of Plant Genetics and Crop Plant Research (IPK) Gatersleben, Seeland, Germany
- Manuel Feser
  - Leibniz Institute of Plant Genetics and Crop Plant Research (IPK) Gatersleben, Seeland, Germany
- Anne Fiebig
  - Leibniz Institute of Plant Genetics and Crop Plant Research (IPK) Gatersleben, Seeland, Germany
- Heidrun Gundlach
  - PGSB-Plant Genome and Systems Biology, Helmholtz Center Munich-German Research Center for Environmental Health, Neuherberg, Germany
- Georg Haberer
  - PGSB-Plant Genome and Systems Biology, Helmholtz Center Munich-German Research Center for Environmental Health, Neuherberg, Germany
- Mats Hansson
  - Department of Biology, Lund University, Lund, Sweden
- Axel Himmelbach
  - Leibniz Institute of Plant Genetics and Crop Plant Research (IPK) Gatersleben, Seeland, Germany
- Iris Hoffie
  - Leibniz Institute of Plant Genetics and Crop Plant Research (IPK) Gatersleben, Seeland, Germany
- Robert E Hoffie
  - Leibniz Institute of Plant Genetics and Crop Plant Research (IPK) Gatersleben, Seeland, Germany
- Haifei Hu
  - Western Crop Genetics Alliance, Food Futures Institute/School of Agriculture, Murdoch University, Murdoch, Western Australia, Australia
  - Rice Research Institute, Guangdong Academy of Agricultural Sciences, Guangzhou, China
- Patrick König
  - Leibniz Institute of Plant Genetics and Crop Plant Research (IPK) Gatersleben, Seeland, Germany
- Sandip M Kale
  - Carlsberg Research Laboratory, Copenhagen, Denmark
  - Department of Agroecology, Aarhus University, Slagelse, Denmark
- Nadia Kamal
  - PGSB-Plant Genome and Systems Biology, Helmholtz Center Munich-German Research Center for Environmental Health, Neuherberg, Germany
- Gabriel Keeble-Gagnère
  - Agriculture Victoria, Department of Jobs, Precincts and Regions, Agribio, La Trobe University, Bundoora, Victoria, Australia
- Beat Keller
  - Department of Plant and Microbial Biology, University of Zurich, Zurich, Switzerland
- Manuela Knauft
  - Leibniz Institute of Plant Genetics and Crop Plant Research (IPK) Gatersleben, Seeland, Germany
- Ravi Koppolu
  - Leibniz Institute of Plant Genetics and Crop Plant Research (IPK) Gatersleben, Seeland, Germany
- Simon G Krattinger
  - Plant Science Program, Biological and Environmental Science and Engineering Division, King Abdullah University of Science and Technology (KAUST), Thuwal, Saudi Arabia
- Jochen Kumlehn
  - Leibniz Institute of Plant Genetics and Crop Plant Research (IPK) Gatersleben, Seeland, Germany
- Peter Langridge
  - School of Agriculture, Food and Wine, University of Adelaide, Urrbrae, South Australia, Australia
- Chengdao Li
  - Western Crop Genetics Alliance, Food Futures Institute/School of Agriculture, Murdoch University, Murdoch, Western Australia, Australia
  - Department of Primary Industry and Regional Development, Government of Western Australia, Perth, Western Australia, Australia
  - College of Agriculture, Yangtze University, Jingzhou, China
- Marina P Marone
  - Leibniz Institute of Plant Genetics and Crop Plant Research (IPK) Gatersleben, Seeland, Germany
- Andreas Maurer
  - Institute of Agricultural and Nutritional Sciences, Martin Luther University Halle-Wittenberg, Halle, Germany
- Klaus F X Mayer
  - PGSB-Plant Genome and Systems Biology, Helmholtz Center Munich-German Research Center for Environmental Health, Neuherberg, Germany
  - School of Life Sciences Weihenstephan, Technical University Munich, Freising, Germany
- Michael Melzer
  - Leibniz Institute of Plant Genetics and Crop Plant Research (IPK) Gatersleben, Seeland, Germany
- Gary J Muehlbauer
  - Department of Agronomy and Plant Genetics, University of Minnesota, St. Paul, MN, USA
- Sudharsan Padmarasu
  - Leibniz Institute of Plant Genetics and Crop Plant Research (IPK) Gatersleben, Seeland, Germany
- Dragan Perovic
  - Institute for Resistance Research and Stress Tolerance, Julius Kuehn-Institute (JKI), Federal Research Centre for Cultivated Plants, Quedlinburg, Germany
- Klaus Pillen
  - Institute of Agricultural and Nutritional Sciences, Martin Luther University Halle-Wittenberg, Halle, Germany
- Curtis J Pozniak
  - Department of Plant Sciences and Crop Development Centre, University of Saskatchewan, Saskatoon, Saskatchewan, Canada
- Twan Rutten
  - Leibniz Institute of Plant Genetics and Crop Plant Research (IPK) Gatersleben, Seeland, Germany
- Shun Sakuma
  - Faculty of Agriculture, Tottori University, Tottori, Japan
- Kazuhiro Sato
  - Kazusa DNA Research Institute, Kisarazu, Japan
  - Institute of Plant Science and Resources, Okayama University, Kurashiki, Japan
- Danuta Schüler
  - Leibniz Institute of Plant Genetics and Crop Plant Research (IPK) Gatersleben, Seeland, Germany
- Thomas Schmutzer
  - Institute of Agricultural and Nutritional Sciences, Martin Luther University Halle-Wittenberg, Halle, Germany
- Uwe Scholz
  - Leibniz Institute of Plant Genetics and Crop Plant Research (IPK) Gatersleben, Seeland, Germany
- Manuel Spannagl
  - PGSB-Plant Genome and Systems Biology, Helmholtz Center Munich-German Research Center for Environmental Health, Neuherberg, Germany
- Brian J Steffenson
  - Department of Plant Pathology, University of Minnesota, St. Paul, MN, USA
- Josquin F Tibbits
  - Agriculture Victoria, Department of Jobs, Precincts and Regions, Agribio, La Trobe University, Bundoora, Victoria, Australia
- Corinna Trautewig
  - Leibniz Institute of Plant Genetics and Crop Plant Research (IPK) Gatersleben, Seeland, Germany
- Cynthia Voss
  - Carlsberg Research Laboratory, Copenhagen, Denmark
- Penghao Wang
  - Western Crop Genetics Alliance, Food Futures Institute/School of Agriculture, Murdoch University, Murdoch, Western Australia, Australia
- Robbie Waugh
  - The James Hutton Institute, Dundee, UK
  - School of Life Sciences, University of Dundee, Dundee, UK
- Sharon Westcott
  - Western Crop Genetics Alliance, Food Futures Institute/School of Agriculture, Murdoch University, Murdoch, Western Australia, Australia
- Xiao-Qi Zhang
  - Western Crop Genetics Alliance, Food Futures Institute/School of Agriculture, Murdoch University, Murdoch, Western Australia, Australia
- Thomas Wicker
  - Department of Plant and Microbial Biology, University of Zurich, Zurich, Switzerland
- Martin Mascher
  - Leibniz Institute of Plant Genetics and Crop Plant Research (IPK) Gatersleben, Seeland, Germany
  - German Centre for Integrative Biodiversity Research (iDiv) Halle-Jena-Leipzig, Leipzig, Germany
- Nils Stein
  - Leibniz Institute of Plant Genetics and Crop Plant Research (IPK) Gatersleben, Seeland, Germany
  - Institute of Agricultural and Nutritional Sciences, Martin Luther University Halle-Wittenberg, Halle, Germany

23
Parmigiani L, Garrison E, Stoye J, Marschall T, Doerr D. Panacus: fast and exact pangenome growth and core size estimation. Bioinformatics 2024; 40:btae720. [PMID: 39626271 PMCID: PMC11665632 DOI: 10.1093/bioinformatics/btae720] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/02/2024] [Revised: 10/28/2024] [Accepted: 11/27/2024] [Indexed: 12/11/2024] Open
Abstract
MOTIVATION: Using a single linear reference genome poses a limitation to exploring the full genomic diversity of a species. The release of a draft human pangenome underscores the increasing relevance of pangenomics to overcome these limitations. Pangenomes are commonly represented as graphs, which can represent billions of base pairs of sequence. Presently, there is a lack of scalable software able to perform key tasks on pangenomes, such as quantifying universally shared sequence across genomes (the core genome) and measuring the extent of genomic variability as a function of sample size (pangenome growth). RESULTS: We introduce Panacus (pangenome-abacus), a tool designed to rapidly perform these tasks and visualize the results in interactive plots. Panacus can process GFA files, the accepted standard for pangenome graphs, and is able to analyze a human pangenome graph with 110 million nodes in <1 h. AVAILABILITY AND IMPLEMENTATION: Panacus is implemented in Rust and is published as Open Source software under the MIT license. The source code and documentation are available at https://github.com/marschall-lab/panacus. Panacus can be installed via Bioconda at https://bioconda.github.io/recipes/panacus/README.html.
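The two quantities named in this abstract, core size and pangenome growth, can be illustrated with a small sketch. The Python below is not Panacus code and does not read GFA. It assumes a hypothetical `paths` dictionary (genome name to the node ids its path visits) and a hypothetical `node_len` dictionary (node id to length in bp), and all function names are illustrative. Growth is computed with a closed-form expectation over all genome subsets of a given size, derived from a coverage histogram. Whether Panacus uses exactly this formula is an assumption.

```python
from collections import Counter
from math import comb

def coverage_histogram(paths, node_len):
    """hist[k] = total bp of nodes that occur in exactly k genomes."""
    count = Counter()
    for nodes in paths.values():
        for node in set(nodes):  # count each node once per genome
            count[node] += 1
    hist = Counter()
    for node, k in count.items():
        hist[k] += node_len[node]
    return hist

def core_size(hist, n):
    """Sequence (bp) shared by all n genomes."""
    return hist.get(n, 0)

def expected_growth(hist, n):
    """Expected pangenome size (bp) for m = 1..n genomes, averaged over all m-subsets."""
    growth = []
    for m in range(1, n + 1):
        total = 0.0
        for k, bp in hist.items():
            # probability that a node present in k of n genomes is absent from a random m-subset
            miss = comb(n - k, m) / comb(n, m)
            total += bp * (1 - miss)
        growth.append(total)
    return growth

# Toy input (hypothetical): three genomes walking over four nodes.
paths = {"g1": ["a", "b", "c"], "g2": ["a", "c", "d"], "g3": ["a", "b", "d"]}
node_len = {"a": 10, "b": 5, "c": 3, "d": 2}

hist = coverage_histogram(paths, node_len)
core = core_size(hist, len(paths))
growth = expected_growth(hist, len(paths))
```

In this toy input only node `a` is in every genome, so the core is 10 bp, and the growth curve saturates at 20 bp once any two genomes are combined.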
Affiliation(s)
- Luca Parmigiani
  - Faculty of Technology and Center for Biotechnology (CeBiTec), Bielefeld University, Bielefeld 33615, Germany
- Erik Garrison
  - Department of Genetics, Genomics and Informatics, University of Tennessee Health Science Center, Memphis, TN 38163, United States
- Jens Stoye
  - Faculty of Technology and Center for Biotechnology (CeBiTec), Bielefeld University, Bielefeld 33615, Germany
- Tobias Marschall
  - Institute for Medical Biometry and Bioinformatics, Medical Faculty and University Hospital Düsseldorf, Heinrich Heine University Düsseldorf, Düsseldorf 40225, Germany
  - Center for Digital Medicine, Heinrich Heine University Düsseldorf, Düsseldorf 40225, Germany
- Daniel Doerr
  - Center for Digital Medicine, Heinrich Heine University Düsseldorf, Düsseldorf 40225, Germany
  - Department for Endocrinology and Diabetology, Medical Faculty and University Hospital Düsseldorf, Heinrich Heine University Düsseldorf, Düsseldorf 40225, Germany
  - German Diabetes Center (DDZ), Leibniz Institute for Diabetes Research, Düsseldorf 40225, Germany

24
Fang B, Edwards SV. Fitness consequences of structural variation inferred from a House Finch pangenome. Proc Natl Acad Sci U S A 2024; 121:e2409943121. [PMID: 39531493 PMCID: PMC11588099 DOI: 10.1073/pnas.2409943121] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/17/2024] [Accepted: 10/03/2024] [Indexed: 11/16/2024] Open
Abstract
Genomic structural variants (SVs) play a crucial role in adaptive evolution, yet their average fitness effects and characterization with pangenome tools are understudied in wild animal populations. We constructed a pangenome for House Finches (Haemorhous mexicanus), a model for studies of host-pathogen coevolution, using long-read sequence data on 16 individuals (32 de novo-assembled haplotypes) and one outgroup. We identified 887,118 SVs larger than 50 base pairs, mostly (60%) involving repetitive elements, with reduced SV diversity in the eastern US as a result of its introduction by humans. The distribution of fitness effects of genome-wide SVs was estimated using maximum likelihood approaches and revealed that SVs in both coding and noncoding regions were on average more deleterious than smaller indels or single nucleotide polymorphisms. The reference-free pangenome facilitated identification of a >10-My-old, 11-megabase-long pericentric inversion on chromosome 1. We found that the genotype frequencies of the inversion, estimated from 135 birds widely sampled temporally and geographically, increased steadily over the 25 y since House Finches were first exposed to the bacterial pathogen Mycoplasma gallisepticum and showed signatures of balancing selection, capturing genes related to immunity and telomerase activity. We also observed shorter telomeres in populations with a greater number of years of exposure to Mycoplasma. Our study illustrates the utility of long-read sequencing and pangenome methods for understanding wild animal populations, estimating fitness effects of genome-wide SVs, and advancing our understanding of adaptive evolution through structural variation.
Affiliation(s)
- Bohao Fang
  - Department of Organismic and Evolutionary Biology, Harvard University, Cambridge, MA 02138
  - Museum of Comparative Zoology, Harvard University, Cambridge, MA 02138
- Scott V. Edwards
  - Department of Organismic and Evolutionary Biology, Harvard University, Cambridge, MA 02138
  - Museum of Comparative Zoology, Harvard University, Cambridge, MA 02138

25
Avila Cartes J, Bonizzoni P, Ciccolella S, Della Vedova G, Denti L. PangeBlocks: customized construction of pangenome graphs via maximal blocks. BMC Bioinformatics 2024; 25:344. [PMID: 39497039 PMCID: PMC11533710 DOI: 10.1186/s12859-024-05958-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/25/2024] [Accepted: 10/16/2024] [Indexed: 11/06/2024] Open
Abstract
BACKGROUND: The construction of a pangenome graph is a fundamental task in pangenomics. A natural theoretical question is how to formalize the computational problem of building an optimal pangenome graph, making explicit the underlying optimization criterion and the set of feasible solutions. Current approaches build a pangenome graph with some heuristics, without assuming some explicit optimization criteria. Thus it is unclear how a specific optimization criterion affects the graph topology and downstream analysis, like read mapping and variant calling. RESULTS: In this paper, by leveraging the notion of maximal block in a Multiple Sequence Alignment (MSA), we reframe the pangenome graph construction problem as an exact cover problem on blocks called Minimum Weighted Block Cover (MWBC). Then we propose an Integer Linear Programming (ILP) formulation for the MWBC problem that allows us to study the most natural objective functions for building a graph. We provide an implementation of the ILP approach for solving the MWBC and we evaluate it on SARS-CoV-2 complete genomes, showing how different objective functions lead to pangenome graphs that have different properties, hinting that the specific downstream task can drive the graph construction phase. CONCLUSION: We show that a customized construction of a pangenome graph based on selecting objective functions has a direct impact on the resulting graphs. In particular, our formalization of the MWBC problem, based on finding an optimal subset of blocks covering an MSA, paves the way to novel practical approaches to graph representations of an MSA where the user can guide the construction.
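The exact-cover view of an MSA described in this abstract can be made concrete with a toy sketch. The Python below is not the PangeBlocks implementation and does not use an ILP solver. It swaps in a brute-force search over a hand-written list of candidate blocks, with unit weight per block as the assumed objective. The block encoding `(rows, start, end)` and all function names are hypothetical.

```python
from itertools import combinations

def is_block(msa, block):
    """A block is valid if all of its rows spell the same string in columns start..end."""
    rows, start, end = block
    return len({msa[r][start:end + 1] for r in rows}) == 1

def cells(block):
    """All (row, column) cells of the MSA that a block occupies."""
    rows, start, end = block
    return [(r, c) for r in rows for c in range(start, end + 1)]

def min_block_cover(msa, blocks):
    """Fewest blocks covering every MSA cell exactly once (brute force, toy sizes only)."""
    universe = {(r, c) for r in range(len(msa)) for c in range(len(msa[0]))}
    for size in range(1, len(blocks) + 1):
        for subset in combinations(blocks, size):
            covered = [cell for b in subset for cell in cells(b)]
            # same cell count and same cell set means no gaps and no overlaps
            if len(covered) == len(universe) and set(covered) == universe:
                return list(subset)
    return None

# Toy alignment (hypothetical): rows 0 and 2 are identical, row 1 differs in column 2.
msa = ["ACGT", "ACTT", "ACGT"]
blocks = [
    ((0, 1, 2), 0, 1),  # "AC" shared by all rows
    ((0, 2), 2, 3),     # "GT" shared by rows 0 and 2
    ((1,), 2, 3),       # "TT" in row 1 only
    ((0, 1, 2), 3, 3),  # "T" shared by all rows
    ((0, 2), 2, 2),     # "G" shared by rows 0 and 2
    ((1,), 2, 2),       # "T" in row 1 only
]

assert all(is_block(msa, b) for b in blocks)
cover = min_block_cover(msa, blocks)
```

With unit weights the search selects the three blocks "AC", "GT" and "TT". A different weighting (for example, one that favors longer or more widely shared blocks) could select a different cover from the same candidates, which is the sense in which the objective function shapes the resulting graph.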
Affiliation(s)
- Jorge Avila Cartes
  - Department of Informatics, Systems, and Communications, University of Milano - Bicocca, Viale Sarca, 20126, Milano, Italy
- Paola Bonizzoni
  - Department of Informatics, Systems, and Communications, University of Milano - Bicocca, Viale Sarca, 20126, Milano, Italy
- Simone Ciccolella
  - Department of Informatics, Systems, and Communications, University of Milano - Bicocca, Viale Sarca, 20126, Milano, Italy
- Gianluca Della Vedova
  - Department of Informatics, Systems, and Communications, University of Milano - Bicocca, Viale Sarca, 20126, Milano, Italy
- Luca Denti
  - Department of Informatics, Systems, and Communications, University of Milano - Bicocca, Viale Sarca, 20126, Milano, Italy
  - Department of Applied Informatics, Faculty of Mathematics, Physics and Informatics, Comenius University in Bratislava, Mlynská dolina F1, Bratislava, 84248, Slovakia

26
Garrison E, Guarracino A, Heumos S, Villani F, Bao Z, Tattini L, Hagmann J, Vorbrugg S, Marco-Sola S, Kubica C, Ashbrook DG, Thorell K, Rusholme-Pilcher RL, Liti G, Rudbeck E, Golicz AA, Nahnsen S, Yang Z, Mwaniki MN, Nobrega FL, Wu Y, Chen H, de Ligt J, Sudmant PH, Huang S, Weigel D, Soranzo N, Colonna V, Williams RW, Prins P. Building pangenome graphs. Nat Methods 2024; 21:2008-2012. [PMID: 39433878 DOI: 10.1038/s41592-024-02430-3] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 07/30/2023] [Accepted: 08/26/2024] [Indexed: 10/23/2024]
Abstract
Pangenome graphs can represent all variation between multiple reference genomes, but current approaches to build them exclude complex sequences or are based upon a single reference. In response, we developed the PanGenome Graph Builder, a pipeline for constructing pangenome graphs without bias or exclusion. The PanGenome Graph Builder uses all-to-all alignments to build a variation graph in which we can identify variation, measure conservation, detect recombination events and infer phylogenetic relationships.
Affiliation(s)
- Erik Garrison
  - Department of Genetics, Genomics and Informatics, University of Tennessee Health Science Center, Memphis, TN, USA
- Andrea Guarracino
  - Department of Genetics, Genomics and Informatics, University of Tennessee Health Science Center, Memphis, TN, USA
  - Human Technopole, Milan, Italy
- Simon Heumos
  - Quantitative Biology Center (QBiC) Tübingen, University of Tübingen, Tübingen, Germany
  - Biomedical Data Science, Dept. of Computer Science, University of Tübingen, Tübingen, Germany
  - M3 Research Center, University Hospital Tübingen, Tübingen, Germany
- Flavia Villani
  - Department of Genetics, Genomics and Informatics, University of Tennessee Health Science Center, Memphis, TN, USA
- Zhigui Bao
  - Department of Molecular Biology, Max Planck Institute for Biology Tübingen, Tübingen, Germany
  - Shenzhen Branch, Guangdong Laboratory of Lingnan Modern Agriculture, Genome Analysis Laboratory of the Ministry of Agriculture and Rural Affairs, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Shenzhen, China
- Lorenzo Tattini
  - Université Côte d'Azur, CNRS, INSERM, IRCAN, Nice, France
  - Data Science Department, EURECOM, Biot, France
- Sebastian Vorbrugg
  - Department of Molecular Biology, Max Planck Institute for Biology Tübingen, Tübingen, Germany
- Santiago Marco-Sola
  - Computer Sciences Department, Barcelona Supercomputing Center, Barcelona, Spain
  - Department of Computer Science, Universitat Politècnica de Catalunya, Barcelona, Spain
- Christian Kubica
  - Department of Molecular Biology, Max Planck Institute for Biology Tübingen, Tübingen, Germany
- David G Ashbrook
  - Department of Genetics, Genomics and Informatics, University of Tennessee Health Science Center, Memphis, TN, USA
- Kaisa Thorell
  - Chemistry and Molecular Biology, Faculty of Science, University of Gothenburg, Gothenburg, Sweden
- Gianni Liti
  - Université Côte d'Azur, CNRS, INSERM, IRCAN, Nice, France
- Emilio Rudbeck
  - Clinical Genomics Gothenburg, Bioinformatics and Data Centre, University of Gothenburg, Gothenburg, Sweden
- Agnieszka A Golicz
  - Department of Plant Breeding, Justus Liebig University Giessen, Giessen, Germany
- Sven Nahnsen
  - Quantitative Biology Center (QBiC) Tübingen, University of Tübingen, Tübingen, Germany
  - Biomedical Data Science, Dept. of Computer Science, University of Tübingen, Tübingen, Germany
  - M3 Research Center, University Hospital Tübingen, Tübingen, Germany
  - Institute for Bioinformatics and Medical Informatics (IBMI), Eberhard-Karls University of Tübingen, Tübingen, Germany
- Zuyu Yang
  - The Institute of Environmental Science and Research, Wellington, New Zealand
- Franklin L Nobrega
  - School of Biological Sciences, Faculty of Environmental and Life Sciences, University of Southampton, Southampton, UK
- Yi Wu
  - School of Biological Sciences, Faculty of Environmental and Life Sciences, University of Southampton, Southampton, UK
- Hao Chen
  - Department of Pharmacology, Addiction Science and Toxicology, University of Tennessee Health Science Center, Memphis, TN, USA
- Joep de Ligt
  - Hartwig Medical Foundation, Amsterdam, the Netherlands
- Peter H Sudmant
  - Department of Integrative Biology, University of California Berkeley, Berkeley, CA, USA
- Sanwen Huang
  - Shenzhen Branch, Guangdong Laboratory of Lingnan Modern Agriculture, Genome Analysis Laboratory of the Ministry of Agriculture and Rural Affairs, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Shenzhen, China
- Detlef Weigel
  - Department of Molecular Biology, Max Planck Institute for Biology Tübingen, Tübingen, Germany
- Institute for Bioinformatics and Medical Informatics, University Tübingen, Tübingen, Germany
| | - Nicole Soranzo
- Human Technopole, Milan, Italy
- Wellcome Sanger Institute, Genome Campus, Hinxton, UK
- National Institute for Health Research Blood and Transplant Research Unit in Donor Health and Genomics, University of Cambridge, Cambridge, UK
- Department of Haematology, Cambridge Biomedical Campus, Cambridge, UK
- British Heart Foundation Centre of Research Excellence, University of Cambridge, Cambridge, UK
| | - Vincenza Colonna
- Department of Genetics, Genomics and Informatics, University of Tennessee Health Science Center, Memphis, TN, USA
- Institute of Genetics and Biophysics, National Research Council, Naples, Italy
| | - Robert W Williams
- Department of Genetics, Genomics and Informatics, University of Tennessee Health Science Center, Memphis, TN, USA
| | - Pjotr Prins
- Department of Genetics, Genomics and Informatics, University of Tennessee Health Science Center, Memphis, TN, USA
|
27
|
Heumos S, Heuer ML, Hanssen F, Heumos L, Guarracino A, Heringer P, Ehmele P, Prins P, Garrison E, Nahnsen S. Cluster-efficient pangenome graph construction with nf-core/pangenome. Bioinformatics 2024; 40:btae609. [PMID: 39400346 PMCID: PMC11568064 DOI: 10.1093/bioinformatics/btae609] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/14/2024] [Revised: 09/16/2024] [Accepted: 10/10/2024] [Indexed: 10/15/2024] Open
Abstract
MOTIVATION Pangenome graphs offer a comprehensive way of capturing genomic variability across multiple genomes. However, current construction methods often introduce biases, excluding complex sequences or relying on references. The PanGenome Graph Builder (PGGB) addresses these issues. To date, though, there is no state-of-the-art pipeline allowing for easy deployment, efficient and dynamic use of available resources, and scalable usage at the same time. RESULTS To overcome these limitations, we present nf-core/pangenome, a reference-unbiased approach implemented in Nextflow following nf-core's best practices. Leveraging biocontainers ensures portability and seamless deployment in High-Performance Computing (HPC) environments. Unlike PGGB, nf-core/pangenome distributes alignments across cluster nodes, enabling scalability. Demonstrating its efficiency, we constructed pangenome graphs for 1000 human chromosome 19 haplotypes and 2146 Escherichia coli sequences, achieving a two to threefold speedup compared to PGGB without increasing greenhouse gas emissions. AVAILABILITY AND IMPLEMENTATION nf-core/pangenome is released under the MIT open-source license, available on GitHub and Zenodo, with documentation accessible at https://nf-co.re/pangenome/docs/usage.
Affiliation(s)
- Simon Heumos
- Quantitative Biology Center (QBiC) Tübingen, University of Tübingen, Tübingen, 72076, Germany
- Biomedical Data Science, Department of Computer Science, University of Tübingen, Tübingen, 72076, Germany
- M3 Research Center, University Hospital Tübingen, Tübingen, 72076, Germany
- Institute for Bioinformatics and Medical Informatics (IBMI), Eberhard-Karls University of Tübingen, Tübingen, 72076, Germany
| | - Michael L Heuer
- University of California, Berkeley, Berkeley, CA 94720, United States
| | - Friederike Hanssen
- Quantitative Biology Center (QBiC) Tübingen, University of Tübingen, Tübingen, 72076, Germany
- Biomedical Data Science, Department of Computer Science, University of Tübingen, Tübingen, 72076, Germany
- M3 Research Center, University Hospital Tübingen, Tübingen, 72076, Germany
- Institute for Bioinformatics and Medical Informatics (IBMI), Eberhard-Karls University of Tübingen, Tübingen, 72076, Germany
| | - Lukas Heumos
- Department of Computational Health, Institute of Computational Biology, Helmholtz Munich, Munich, 85764, Germany
- Comprehensive Pneumology Center with the CPC-M bioArchive, Helmholtz Zentrum Munich, Member of the German Center for Lung Research (DZL), Munich, 81377, Germany
- TUM School of Life Sciences Weihenstephan, Technical University of Munich, Freising, 81377, Germany
| | - Andrea Guarracino
- Department of Genetics, Genomics and Informatics, University of Tennessee Health Science Center, Memphis, TN 38163, United States
- Human Technopole, Milan 20157, Italy
| | - Peter Heringer
- Quantitative Biology Center (QBiC) Tübingen, University of Tübingen, Tübingen, 72076, Germany
- Biomedical Data Science, Department of Computer Science, University of Tübingen, Tübingen, 72076, Germany
- M3 Research Center, University Hospital Tübingen, Tübingen, 72076, Germany
- Institute for Bioinformatics and Medical Informatics (IBMI), Eberhard-Karls University of Tübingen, Tübingen, 72076, Germany
| | - Philipp Ehmele
- Department of Computational Health, Institute of Computational Biology, Helmholtz Munich, Munich, 85764, Germany
| | - Pjotr Prins
- Department of Genetics, Genomics and Informatics, University of Tennessee Health Science Center, Memphis, TN 38163, United States
| | - Erik Garrison
- Department of Genetics, Genomics and Informatics, University of Tennessee Health Science Center, Memphis, TN 38163, United States
| | - Sven Nahnsen
- Quantitative Biology Center (QBiC) Tübingen, University of Tübingen, Tübingen, 72076, Germany
- Biomedical Data Science, Department of Computer Science, University of Tübingen, Tübingen, 72076, Germany
- M3 Research Center, University Hospital Tübingen, Tübingen, 72076, Germany
- Institute for Bioinformatics and Medical Informatics (IBMI), Eberhard-Karls University of Tübingen, Tübingen, 72076, Germany
|
28
|
Kaur H, Shannon LM, Samac DA. A stepwise guide for pangenome development in crop plants: an alfalfa (Medicago sativa) case study. BMC Genomics 2024; 25:1022. [PMID: 39482604 PMCID: PMC11526573 DOI: 10.1186/s12864-024-10931-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/13/2024] [Accepted: 10/21/2024] [Indexed: 11/03/2024] Open
Abstract
BACKGROUND The concept of pangenomics and the importance of structural variants are gaining recognition within the plant genomics community. Due to advancements in sequencing and computational technology, it has become feasible to sequence the entire genome of numerous individuals of a single species at a reasonable cost. Pangenomes have been constructed for many major diploid crops, including rice, maize, soybean, sorghum, pearl millet, peas, sunflower, grapes, and mustards. However, pangenomes for polyploid species are relatively scarce and are available in only a few crops, including wheat, cotton, rapeseed, and potatoes. MAIN BODY In this review, we explore the various methods used in crop pangenome development, discussing the challenges and implications of these techniques based on insights from published pangenome studies. We offer a systematic guide and discuss the tools available for constructing a pangenome and conducting downstream analyses. Alfalfa, a highly heterozygous, cross-pollinated and autotetraploid forage crop species, is used as an example to discuss the concerns and challenges posed by polyploid crop species. We conducted a comparative analysis using linear and graph-based methods by constructing an alfalfa graph pangenome using three publicly available genome assemblies. To illustrate the intricacies captured by pangenome graphs for a complex crop genome, we used five different gene sequences and aligned them against the three graph-based pangenomes. The comparison of the three graph pangenome methods reveals notable variations in the genomic variation captured by each pipeline. CONCLUSION Pangenome resources are proving invaluable by offering insights into core and dispensable genes, novel gene discovery, and genome-wide patterns of variation. Developing user-friendly online portals for linear pangenome visualization has made these resources accessible to the broader scientific and breeding community. However, challenges remain with graph-based pangenomes, including compatibility with other tools, extraction of sequence for regions of interest, and visualization of genetic variation captured in pangenome graphs. These issues necessitate further refinement of tools and pipelines to effectively address the complexities of polyploid, highly heterozygous, and cross-pollinated species.
Affiliation(s)
- Harpreet Kaur
- Department of Horticultural Science, University of Minnesota, St. Paul, MN, 55108, USA.
| | - Laura M Shannon
- Department of Horticultural Science, University of Minnesota, St. Paul, MN, 55108, USA
| | - Deborah A Samac
- USDA-ARS, Plant Science Research Unit, St. Paul, MN, 55108, USA
|
29
|
Jamsandekar M, Ferreira MS, Pettersson ME, Farrell ED, Davis BW, Andersson L. The origin and maintenance of supergenes contributing to ecological adaptation in Atlantic herring. Nat Commun 2024; 15:9136. [PMID: 39443489 PMCID: PMC11499932 DOI: 10.1038/s41467-024-53079-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/26/2023] [Accepted: 09/26/2024] [Indexed: 10/25/2024] Open
Abstract
Chromosomal inversions are associated with local adaptation in many species. However, questions regarding how they are formed, maintained and impact various other evolutionary processes remain elusive. Here, using a large genomic dataset of long-read and short-read sequencing, we ask these questions in one of the most abundant vertebrates on Earth, the Atlantic herring. This species has four megabase-sized inversions associated with ecological adaptation that correlate with water temperature. The S and N inversion alleles at these four loci dominate in the southern and northern parts, respectively, of the species distribution in the North Atlantic Ocean. By determining breakpoint coordinates of the four inversions and the structural variations surrounding them, we hypothesize that these inversions are formed by ectopic recombination between duplicated sequences immediately outside of the inversions. We show that these are old inversions (>1 MY), albeit formed after the split between the Atlantic herring and its sister species, the Pacific herring. There is evidence for extensive gene flux between inversion alleles at all four loci. The large Ne of herring combined with the common occurrence of opposite homozygotes across the species distribution has allowed effective purifying selection to prevent the accumulation of genetic load and repeats within the inversions.
Affiliation(s)
- Minal Jamsandekar
- Department of Veterinary Integrative Biosciences, Texas A&M University, College Station, USA
| | - Mafalda S Ferreira
- Department of Medical Biochemistry and Microbiology, Uppsala University, Uppsala, Sweden
| | - Mats E Pettersson
- Department of Medical Biochemistry and Microbiology, Uppsala University, Uppsala, Sweden
| | | | - Brian W Davis
- Department of Veterinary Integrative Biosciences, Texas A&M University, College Station, USA
| | - Leif Andersson
- Department of Veterinary Integrative Biosciences, Texas A&M University, College Station, USA.
- Department of Medical Biochemistry and Microbiology, Uppsala University, Uppsala, Sweden.
|
30
|
Zdąbłasz K, Lisiecka A, Dojer N. Sequence Flow: interactive web application for visualizing partial order alignments. BMC Genomics 2024; 25:973. [PMID: 39415087 PMCID: PMC11483981 DOI: 10.1186/s12864-024-10886-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/09/2024] [Accepted: 10/09/2024] [Indexed: 10/18/2024] Open
Abstract
BACKGROUND Multiple sequence alignment (MSA) has proven extremely useful in computational biology, especially in inferring evolutionary relationships via phylogenetic analysis and providing insight into protein structure and function. An alternative to the standard MSA model is partial order alignment (POA), in which aligned sequences are represented as paths in a graph rather than rows in a matrix. While the POA model has proven useful in several applications (e.g. sequencing reads assembly and pangenome structure exploration), we lack efficient visualization tools that could highlight its advantages. RESULTS We propose Sequence Flow - a web application designed to address the above problem. Sequence Flow presents the POA as a Sankey diagram, a kind of graph visualisation typically used for graphs representing flowcharts. Sequence Flow enables interactive alignment exploration, including fragment selection, highlighting a selected group of sequences, modification of the position of graph nodes, structure simplification etc. After adjustment, the visualization can be saved as a high-quality graphic file. Thanks to the use of SanKEY.js - a JavaScript library for creating Sankey diagrams, designed specifically to visualize POAs, Sequence Flow provides satisfactory performance even with large alignments. CONCLUSIONS We provide Sankey diagram-based POA visualization tools for both end users (Sequence Flow) and bioinformatic software developers (SanKEY.js). Sequence Flow webservice is available at https://sequenceflow.mimuw.edu.pl/ . The source code for SanKEY.js is available at https://github.com/Krzysiekzd/SanKEY.js and for Sequence Flow at https://github.com/Krzysiekzd/SequenceFlow .
Affiliation(s)
- Krzysztof Zdąbłasz
- Institute of Informatics, University of Warsaw, Banacha 2, Warszawa, 02-097, Poland
| | - Anna Lisiecka
- Institute of Informatics, University of Warsaw, Banacha 2, Warszawa, 02-097, Poland
| | - Norbert Dojer
- Institute of Informatics, University of Warsaw, Banacha 2, Warszawa, 02-097, Poland.
|
31
|
Novak AM, Chung D, Hickey G, Djebali S, Yokoyama TT, Garrison E, Narzisi G, Paten B, Monlong J. Efficient indexing and querying of annotations in a pangenome graph. bioRxiv 2024:2024.10.12.618009. [PMID: 39464141 PMCID: PMC11507721 DOI: 10.1101/2024.10.12.618009] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/29/2024]
Abstract
The current reference genome is the backbone of diverse and rich annotations. Simple text formats, like VCF or BED, have been widely adopted and helped the critical exchange of genomic information. There is a dire need for tools and formats enabling pangenomic annotation to facilitate such enrichment of pangenomic references. The Graph Alignment Format (GAF) is a text format, tab-delimited like BED/VCF files, which was proposed to represent alignments. GAF could also be used to store paths representing annotations in a pangenome graph, but there are no tools to index and query them efficiently. Here, we present extensions to vg and HTSlib that provide efficient sorting, indexing, and querying for GAF files. With this approach, annotations overlapping a subgraph can be extracted quickly. Paths are sorted based on the IDs of traversed nodes, compressed with BGZIP, and indexed with HTSlib/tabix via our extensions for the GAF format. Compared to the binary GAM format, GAF files are easier to edit or inspect because they are plain text, and we show that they are twice as fast to sort and half as large on disk. In addition, we updated vg annotate, which takes BED or GFF3 annotation files relative to linear sequences and projects them into the pangenome. It can now produce GAF files representing these annotations' paths through the pangenome. We showcase these new tools on several applications. We projected annotations for all Human Pangenome Reference Consortium Year 1 haplotypes, including genes, segmental duplications, tandem repeats and repeat annotations, into the Minigraph-Cactus pangenome (GRCh38-based v1.1). We also projected known variants from the GWAS Catalog and expression QTLs from the GTEx project into the pangenome. Finally, we reanalyzed ATAC-seq data from ENCODE to demonstrate what a coverage track could look like in a pangenome graph. These rich annotations can be quickly queried with vg and visualized using existing tools like the Sequence Tube Map or Bandage.
Affiliation(s)
- Adam M. Novak
- UC Santa Cruz Genomics Institute, University of California, Santa Cruz, Santa Cruz, CA, USA
| | - Dickson Chung
- UC Santa Cruz Genomics Institute, University of California, Santa Cruz, Santa Cruz, CA, USA
| | - Glenn Hickey
- UC Santa Cruz Genomics Institute, University of California, Santa Cruz, Santa Cruz, CA, USA
| | - Sarah Djebali
- IRSD - Digestive Health Research Institute, University of Toulouse, INSERM, INRAE, ENVT, UPS, Toulouse, France
| | - Toshiyuki T. Yokoyama
- Department of Computational Biology and Medical Sciences, Graduate School of Frontier Sciences, The University of Tokyo, Chiba, Japan
| | - Erik Garrison
- Department of Genetics, Genomics and Informatics, University of Tennessee Health Science Center, Memphis, TN, USA
| | | | - Benedict Paten
- UC Santa Cruz Genomics Institute, University of California, Santa Cruz, Santa Cruz, CA, USA
| | - Jean Monlong
- IRSD - Digestive Health Research Institute, University of Toulouse, INSERM, INRAE, ENVT, UPS, Toulouse, France
|
32
|
Bolognini D, Halgren A, Lou RN, Raveane A, Rocha JL, Guarracino A, Soranzo N, Chin CS, Garrison E, Sudmant PH. Recurrent evolution and selection shape structural diversity at the amylase locus. Nature 2024; 634:617-625. [PMID: 39232174 PMCID: PMC11485256 DOI: 10.1038/s41586-024-07911-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/29/2023] [Accepted: 08/06/2024] [Indexed: 09/06/2024]
Abstract
The adoption of agriculture triggered a rapid shift towards starch-rich diets in human populations. Amylase genes facilitate starch digestion, and increased amylase copy number has been observed in some modern human populations with high-starch intake, although evidence of recent selection is lacking. Here, using 94 long-read haplotype-resolved assemblies and short-read data from approximately 5,600 contemporary and ancient humans, we resolve the diversity and evolutionary history of structural variation at the amylase locus. We find that amylase genes have higher copy numbers in agricultural populations than in fishing, hunting and pastoral populations. We identify 28 distinct amylase structural architectures and demonstrate that nearly identical structures have arisen recurrently on different haplotype backgrounds throughout recent human history. AMY1 and AMY2A genes each underwent multiple duplication/deletion events with mutation rates up to more than 10,000-fold the single-nucleotide polymorphism mutation rate, whereas AMY2B gene duplications share a single origin. Using a pangenome-based approach, we infer structural haplotypes across thousands of humans identifying extensively duplicated haplotypes at higher frequency in modern agricultural populations. Leveraging 533 ancient human genomes, we find that duplication-containing haplotypes (with more gene copies than the ancestral haplotype) have rapidly increased in frequency over the past 12,000 years in West Eurasians, suggestive of positive selection. Together, our study highlights the potential effects of the agricultural revolution on human genomes and the importance of structural variation in human adaptation.
Affiliation(s)
| | - Alma Halgren
- Department of Integrative Biology, University of California Berkeley, Berkeley, CA, USA
| | - Runyang Nicolas Lou
- Department of Integrative Biology, University of California Berkeley, Berkeley, CA, USA
| | | | - Joana L Rocha
- Department of Integrative Biology, University of California Berkeley, Berkeley, CA, USA
| | - Andrea Guarracino
- Department of Genetics, Genomics, and Informatics, University of Tennessee Health Science Center, Memphis, TN, USA
| | - Nicole Soranzo
- Human Technopole, Milan, Italy
- Wellcome Sanger Institute, Hinxton, UK
- National Institute for Health Research Blood and Transplant Research Unit in Donor Health and Genomics, University of Cambridge, Cambridge, UK
- Department of Haematology, Cambridge Biomedical Campus, Cambridge, UK
- British Heart Foundation Centre of Research Excellence, University of Cambridge, Cambridge, UK
| | - Chen-Shan Chin
- Foundation for Biological Data Science, Belmont, CA, USA
| | - Erik Garrison
- Department of Genetics, Genomics, and Informatics, University of Tennessee Health Science Center, Memphis, TN, USA.
| | - Peter H Sudmant
- Department of Integrative Biology, University of California Berkeley, Berkeley, CA, USA.
- Center for Computational Biology, University of California Berkeley, Berkeley, CA, USA.
|
33
|
Bonnici V, Chicco D. Seven quick tips for gene-focused computational pangenomic analysis. BioData Min 2024; 17:28. [PMID: 39227987 PMCID: PMC11370085 DOI: 10.1186/s13040-024-00380-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/27/2024] [Accepted: 08/12/2024] [Indexed: 09/05/2024] Open
Abstract
Pangenomics is a relatively new scientific field which investigates the union of all the genomes of a clade. The word pan means everything in ancient Greek; the term pangenomics originally regarded genomes of bacteria and was later intended to refer to human genomes as well. Modern bioinformatics offers several tools to analyze pangenomics data, paving the way to an emerging field that we can call computational pangenomics. Current computational power available for the bioinformatics community has made computational pangenomic analyses easy to perform, but this higher accessibility to pangenomics analysis also increases the chances to make mistakes and to produce misleading or inflated results, especially by beginners. To handle this problem, we present here a few quick tips for efficient and correct computational pangenomic analyses with a focus on bacterial pangenomics, by describing common mistakes to avoid and experienced best practices to follow in this field. We believe our recommendations can help the readers perform more robust and sound pangenomic analyses and to generate more reliable results.
Affiliation(s)
- Vincenzo Bonnici
- Dipartimento di Scienze Matematiche Fisiche e Informatiche, Università di Parma, Parma, Italy.
| | - Davide Chicco
- Dipartimento di Informatica Sistemistica e Comunicazione, Università di Milano-Bicocca, Milan, Italy.
- Institute of Health Policy Management and Evaluation, University of Toronto, Toronto, Ontario, Canada.
|
34
|
Du D, Zhong F, Liu L. Enhancing recognition and interpretation of functional phenotypic sequences through fine-tuning pre-trained genomic models. J Transl Med 2024; 22:756. [PMID: 39135093 PMCID: PMC11318145 DOI: 10.1186/s12967-024-05567-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/27/2023] [Accepted: 08/03/2024] [Indexed: 08/16/2024] Open
Abstract
BACKGROUND Decoding human genomic sequences requires comprehensive analysis of DNA sequence functionality. Through computational and experimental approaches, researchers have studied the genotype-phenotype relationship and generated important datasets that help unravel complicated genetic blueprints. Thus, the recently developed artificial intelligence methods can be used to interpret the functions of those DNA sequences. METHODS This study explores the use of deep learning, particularly pre-trained genomic models like DNA_bert_6 and human_gpt2-v1, in interpreting and representing human genome sequences. Initially, we meticulously constructed multiple datasets linking genotypes and phenotypes to fine-tune those models for precise DNA sequence classification. Additionally, we evaluate the influence of sequence length on classification results and analyze the impact of feature extraction in the hidden layers of our model using the HERV dataset. To enhance our understanding of phenotype-specific patterns recognized by the model, we perform enrichment, pathogenicity and conservation analyses of specific motifs in the human endogenous retrovirus (HERV) sequence with high average local representation weight (ALRW) scores. RESULTS We have constructed multiple genotype-phenotype datasets displaying commendable classification performance in comparison with random genomic sequences, particularly in the HERV dataset, which achieved binary and multi-classification accuracies and F1 values exceeding 0.935 and 0.888, respectively. Notably, the fine-tuning of the HERV dataset not only improved our ability to identify and distinguish diverse information types within DNA sequences but also successfully identified specific motifs associated with neurological disorders and cancers in regions with high ALRW scores. Subsequent analysis of these motifs shed light on the adaptive responses of species to environmental pressures and their co-evolution with pathogens.
CONCLUSIONS These findings highlight the potential of pre-trained genomic models in learning DNA sequence representations, particularly when utilizing the HERV dataset, and provide valuable insights for future research endeavors. This study represents an innovative strategy that combines pre-trained genomic model representations with classical methods for analyzing the functionality of genome sequences, thereby promoting cross-fertilization between genomics and artificial intelligence.
Affiliation(s)
- Duo Du
- School of Basic Medical Sciences and Intelligent Medicine Institute, Fudan University, Shanghai, 200032, China
| | - Fan Zhong
- School of Basic Medical Sciences and Intelligent Medicine Institute, Fudan University, Shanghai, 200032, China.
| | - Lei Liu
- School of Basic Medical Sciences and Intelligent Medicine Institute, Fudan University, Shanghai, 200032, China.
- Shanghai Institute of Stem Cell Research and Clinical Translation, Shanghai, 200120, China.
|
35
|
van den Brandt A, Jonkheer EM, van Workum DJM, van de Wetering H, Smit S, Vilanova A. PanVA: Pangenomic Variant Analysis. IEEE Trans Vis Comput Graph 2024; 30:4895-4909. [PMID: 37267130 DOI: 10.1109/tvcg.2023.3282364] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/04/2023]
Abstract
Genomics researchers increasingly use multiple reference genomes to comprehensively explore genetic variants underlying differences in detectable characteristics between organisms. Pangenomes allow for an efficient data representation of multiple related genomes and their associated metadata. However, current visual analysis approaches for exploring these complex genotype-phenotype relationships are often based on single reference approaches or lack adequate support for interpreting the variants in the genomic context with heterogeneous (meta)data. This design study introduces PanVA, a visual analytics design for pangenomic variant analysis developed with the active participation of genomics researchers. The design uniquely combines tailored visual representations with interactions such as sorting, grouping, and aggregation, allowing users to navigate and explore different perspectives on complex genotype-phenotype relations. Through evaluation in the context of plants and pathogen research, we show that PanVA helps researchers explore variants in genes and generate hypotheses about their role in phenotypic variation.
|
36
|
Seersholm FV, Sjögren KG, Koelman J, Blank M, Svensson EM, Staring J, Fraser M, Pinotti T, McColl H, Gaunitz C, Ruiz-Bedoya T, Granehäll L, Villegas-Ramirez B, Fischer A, Price TD, Allentoft ME, Iversen AKN, Axelsson T, Ahlström T, Götherström A, Storå J, Kristiansen K, Willerslev E, Jakobsson M, Malmström H, Sikora M. Repeated plague infections across six generations of Neolithic Farmers. Nature 2024; 632:114-121. [PMID: 38987589 PMCID: PMC11291285 DOI: 10.1038/s41586-024-07651-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/01/2023] [Accepted: 06/03/2024] [Indexed: 07/12/2024]
Abstract
In the period between 5,300 and 4,900 calibrated years before present (cal. BP), populations across large parts of Europe underwent a period of demographic decline. However, the cause of this so-called Neolithic decline is still debated. Some argue for an agricultural crisis resulting in the decline, others for the spread of an early form of plague. Here we use population-scale ancient genomics to infer ancestry, social structure and pathogen infection in 108 Scandinavian Neolithic individuals from eight megalithic graves and a stone cist. We find that the Neolithic plague was widespread, detected in at least 17% of the sampled population and across large geographical distances. We demonstrate that the disease spread within the Neolithic community in three distinct infection events within a period of around 120 years. Variant graph-based pan-genomics shows that the Neolithic plague genomes retained ancestral genomic variation present in Yersinia pseudotuberculosis, including virulence factors associated with disease outcomes. In addition, we reconstruct four multigeneration pedigrees, the largest of which consists of 38 individuals spanning six generations, showing a patrilineal social organization. Lastly, we document direct genomic evidence for Neolithic female exogamy in a woman buried in a different megalithic tomb than her brothers. Taken together, our findings provide a detailed reconstruction of plague spread within a large patrilineal kinship group and identify multiple plague infections in a population dated to the beginning of the Neolithic decline.
Affiliation(s)
- Frederik Valeur Seersholm
- Lundbeck Foundation GeoGenetics Centre, Globe Institute, University of Copenhagen, Copenhagen, Denmark.
| | - Karl-Göran Sjögren
- Department of Historical Studies, University of Gothenburg, Gothenburg, Sweden
| | - Julia Koelman
- Human Evolution, Department of Organismal Biology, Uppsala University, Uppsala, Sweden
| | - Malou Blank
- Department of Historical Studies, University of Gothenburg, Gothenburg, Sweden
| | - Emma M Svensson
- Human Evolution, Department of Organismal Biology, Uppsala University, Uppsala, Sweden
| | | | - Magdalena Fraser
- Human Evolution, Department of Organismal Biology, Uppsala University, Uppsala, Sweden
| | - Thomaz Pinotti
- Lundbeck Foundation GeoGenetics Centre, Globe Institute, University of Copenhagen, Copenhagen, Denmark
- Laboratório de Biodiversidade e Evolução Molecular (LBEM), Universidade Federal de Minas Gerais, Belo Horizonte, Brazil
| | - Hugh McColl
- Lundbeck Foundation GeoGenetics Centre, Globe Institute, University of Copenhagen, Copenhagen, Denmark
| | - Charleen Gaunitz
- Lundbeck Foundation GeoGenetics Centre, Globe Institute, University of Copenhagen, Copenhagen, Denmark
| | - Tatiana Ruiz-Bedoya
- Human Evolution, Department of Organismal Biology, Uppsala University, Uppsala, Sweden
- Department of Cell and Systems Biology, University of Toronto, Toronto, Ontario, Canada
| | - Lena Granehäll
- Human Evolution, Department of Organismal Biology, Uppsala University, Uppsala, Sweden
- Institute for Mummy Studies Eurac Research, Bolzano, Italy
| | | | | | - T Douglas Price
- Department of Historical Studies, University of Gothenburg, Gothenburg, Sweden
| | - Morten E Allentoft
- Lundbeck Foundation GeoGenetics Centre, Globe Institute, University of Copenhagen, Copenhagen, Denmark
- Trace and Environmental DNA (TrEnD) Laboratory, School of Molecular and Life Sciences, Curtin University, Perth, Western Australia, Australia
| | - Astrid K N Iversen
- Nuffield Department of Clinical Neurosciences, Weatherall Institute of Molecular Medicine, University of Oxford, Oxford, UK
| | - Tony Axelsson
- Department of Historical Studies, University of Gothenburg, Gothenburg, Sweden
| | - Torbjörn Ahlström
- Department of Archaeology and Ancient History, Lund University, Lund, Sweden
| | - Anders Götherström
- Centre for Palaeogenetics, Stockholm University and the Swedish Museum of Natural History, Stockholm, Sweden
- Department of Archaeology and Classical Studies, Stockholm University, Stockholm, Sweden
| | - Jan Storå
- Department of Archaeology and Classical Studies, Stockholm University, Stockholm, Sweden
| | - Kristian Kristiansen
- Lundbeck Foundation GeoGenetics Centre, Globe Institute, University of Copenhagen, Copenhagen, Denmark
- Department of Historical Studies, University of Gothenburg, Gothenburg, Sweden
| | - Eske Willerslev
- Lundbeck Foundation GeoGenetics Centre, Globe Institute, University of Copenhagen, Copenhagen, Denmark
- Department of Zoology, University of Cambridge, Cambridge, UK
| | - Mattias Jakobsson
- Human Evolution, Department of Organismal Biology, Uppsala University, Uppsala, Sweden
- Palaeo-Research Institute, University of Johannesburg, Johannesburg, South Africa
| | - Helena Malmström
- Human Evolution, Department of Organismal Biology, Uppsala University, Uppsala, Sweden
- Palaeo-Research Institute, University of Johannesburg, Johannesburg, South Africa
| | - Martin Sikora
- Lundbeck Foundation GeoGenetics Centre, Globe Institute, University of Copenhagen, Copenhagen, Denmark.
| |
|
37
|
Taylor DJ, Eizenga JM, Li Q, Das A, Jenike KM, Kenny EE, Miga KH, Monlong J, McCoy RC, Paten B, Schatz MC. Beyond the Human Genome Project: The Age of Complete Human Genome Sequences and Pangenome References. Annu Rev Genomics Hum Genet 2024; 25:77-104. [PMID: 38663087 PMCID: PMC11451085 DOI: 10.1146/annurev-genom-021623-081639] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Indexed: 08/29/2024]
Abstract
The Human Genome Project was an enormous accomplishment, providing a foundation for countless explorations into the genetics and genomics of the human species. Yet for many years, the human genome reference sequence remained incomplete and lacked representation of human genetic diversity. Recently, two major advances have emerged to address these shortcomings: complete gap-free human genome sequences, such as the one developed by the Telomere-to-Telomere Consortium, and high-quality pangenomes, such as the one developed by the Human Pangenome Reference Consortium. Facilitated by advances in long-read DNA sequencing and genome assembly algorithms, complete human genome sequences resolve regions that have been historically difficult to sequence, including centromeres, telomeres, and segmental duplications. In parallel, pangenomes capture the extensive genetic diversity across populations worldwide. Together, these advances usher in a new era of genomics research, enhancing the accuracy of genomic analysis, paving the path for precision medicine, and contributing to deeper insights into human biology.
Affiliation(s)
- Dylan J Taylor
- Department of Biology, Johns Hopkins University, Baltimore, Maryland, USA
| | - Jordan M Eizenga
- Genomics Institute, University of California, Santa Cruz, California, USA
| | - Qiuhui Li
- Department of Computer Science, Johns Hopkins University, Baltimore, Maryland, USA
| | - Arun Das
- Department of Computer Science, Johns Hopkins University, Baltimore, Maryland, USA
| | - Katharine M Jenike
- Department of Genetic Medicine, Johns Hopkins University School of Medicine, Baltimore, Maryland, USA
| | - Eimear E Kenny
- Institute for Genomic Health, Icahn School of Medicine at Mount Sinai, New York, NY, USA
| | - Karen H Miga
- Department of Biomolecular Engineering, University of California, Santa Cruz, California, USA
- Genomics Institute, University of California, Santa Cruz, California, USA
| | - Jean Monlong
- Institut de Recherche en Santé Digestive, Université de Toulouse, INSERM, INRA, ENVT, UPS, Toulouse, France
| | - Rajiv C McCoy
- Department of Biology, Johns Hopkins University, Baltimore, Maryland, USA
| | - Benedict Paten
- Department of Biomolecular Engineering, University of California, Santa Cruz, California, USA
- Genomics Institute, University of California, Santa Cruz, California, USA
| | - Michael C Schatz
- Department of Computer Science, Johns Hopkins University, Baltimore, Maryland, USA
- Department of Biology, Johns Hopkins University, Baltimore, Maryland, USA
| |
|
38
|
Tavakoli N, Gibney D, Aluru S. GraphSlimmer: Preserving Read Mappability with the Minimum Number of Variants. J Comput Biol 2024; 31:616-637. [PMID: 38990757 DOI: 10.1089/cmb.2024.0601] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 07/13/2024]
Abstract
Modern genomic datasets, like those generated under the 1000 Genomes Project, contain millions of variants belonging to known haplotypes. Although these datasets are more representative than a single reference sequence and can alleviate issues like reference bias, they are significantly more computationally burdensome to work with, often involving large indexed genome graph data structures for tasks such as read mapping. The construction, preprocessing, and mapping algorithms can require substantial computational resources depending on the size of these variant sets. Moreover, the accuracy of mapping algorithms has been shown to decrease when working with complete variant sets. Therefore, a drastically reduced set of variants that preserves important properties of the original set is desirable. This work provides a technique for finding a minimal subset of variants S such that for given parameters α and δ, all substrings up to length α in the haplotypes are guaranteed to be still alignable to the appropriate locations with either Hamming or edit distance at most δ, using only S. Our contributions include showing the NP-hardness and inapproximability of these optimization problems and providing Integer Linear Programming (ILP) formulations. Our edit distance ILP formulation carefully decomposes the problem according to variant locations, which allows it to scale to support all of chromosome 22's variants from the 1000 Genomes Project. Our experiments also demonstrate a significant reduction in the number of variants. For example, for moderately long reads, e.g., α = 1000, over 75% of the variants can be removed while preserving read mappability with edit distance at most one.
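The mappability constraint described in this abstract can be illustrated with a small feasibility check. The sketch below is not the authors' ILP and does not search for a minimal subset; it only tests whether a given kept set S satisfies the Hamming-distance condition under a simplifying assumption that every variant is a single-nucleotide substitution, so that a haplotype substring mismatches the reduced graph exactly at the dropped variants it carries. The function names and the position-keyed input format are hypothetical.

```python
def max_dropped_in_window(positions, kept, alpha):
    """Largest number of dropped variants that fit in one window of length alpha.

    positions: variant positions carried by one haplotype (assumed SNPs only).
    kept: set of variant positions retained in the reduced variant set S.
    """
    dropped = sorted(p for p in positions if p not in kept)
    best = 0
    left = 0
    for right, pos in enumerate(dropped):
        # Two positions share a window of length alpha only if they differ by < alpha.
        while pos - dropped[left] >= alpha:
            left += 1
        best = max(best, right - left + 1)
    return best


def preserves_mappability(haplotypes, kept, alpha, delta):
    """True if every length-alpha substring of every haplotype has <= delta dropped SNPs."""
    return all(
        max_dropped_in_window(positions, kept, alpha) <= delta
        for positions in haplotypes.values()
    )


# Hypothetical toy data: haplotype name -> positions where it differs from the reference.
haplotypes = {"h1": [10, 15, 40], "h2": [15, 1000]}
ok_small_set = preserves_mappability(haplotypes, {15}, alpha=50, delta=1)
ok_larger_set = preserves_mappability(haplotypes, {15, 40}, alpha=50, delta=1)
```

In this toy case, keeping only position 15 leaves two dropped variants of `h1` within 50 bases of each other, so the check fails for δ = 1, while also keeping position 40 passes. Indels and edit distance, which the paper also covers, are outside this simplification.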
Affiliation(s)
- Neda Tavakoli
- School of Computational Science and Engineering, Georgia Institute of Technology, Atlanta, Georgia, USA
| | - Daniel Gibney
- Department of Computer Science, University of Texas at Dallas, Richardson, Texas, USA
| | - Srinivas Aluru
- School of Computational Science and Engineering, Georgia Institute of Technology, Atlanta, Georgia, USA
| |
|
39
|
Heumos S, Guarracino A, Schmelzle JNM, Li J, Zhang Z, Hagmann J, Nahnsen S, Prins P, Garrison E. Pangenome graph layout by Path-Guided Stochastic Gradient Descent. Bioinformatics 2024; 40:btae363. [PMID: 38960860 PMCID: PMC11227364 DOI: 10.1093/bioinformatics/btae363] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/10/2023] [Revised: 02/20/2024] [Accepted: 07/02/2024] [Indexed: 07/05/2024]
Abstract
MOTIVATION: The increasing availability of complete genomes demands models to study genomic variability within entire populations. Pangenome graphs capture the full genomic similarity and diversity between multiple genomes. In order to understand them, we need to see them. For visualization, we need a human-readable graph layout: a graph embedding in low (e.g. two) dimensional depictions. Due to a pangenome graph's potentially excessive size, this is a significant challenge. RESULTS: In response, we introduce a novel graph layout algorithm: the Path-Guided Stochastic Gradient Descent (PG-SGD). PG-SGD uses the genomes, represented in the pangenome graph as paths, as an embedded positional system to sample genomic distances between pairs of nodes. This avoids the quadratic cost seen in previous versions of graph drawing by SGD. We show that our implementation efficiently computes the low-dimensional layouts of gigabase-scale pangenome graphs, unveiling their biological features. AVAILABILITY AND IMPLEMENTATION: We integrated PG-SGD in ODGI, which is released as free software under the MIT open source license. Source code is available at https://github.com/pangenome/odgi.
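The idea of using paths as a positional system can be sketched in a few lines. The example below is a simplified one-dimensional illustration, not the ODGI implementation: it samples two steps from the same path, takes their distance along that path as the target separation, and nudges the two node positions toward it. The function names, the geometric learning-rate schedule, and the toy input format are all assumptions made for illustration.

```python
import random


def path_offsets(path, node_len):
    """Start offset (in bases) of each step along a path given as a list of node ids."""
    offsets, pos = [], 0
    for node in path:
        offsets.append(pos)
        pos += node_len[node]
    return offsets


def sgd_step(xi, xj, target, mu):
    """Move two 1D positions toward the target separation; mu = 1 reaches it exactly."""
    delta = xi - xj
    mag = abs(delta)
    direction = delta / mag if mag > 0 else 1.0  # arbitrary direction for coincident points
    r = (mag - target) / 2.0 * direction
    return xi - mu * r, xj + mu * r


def layout_1d(paths, node_len, iterations=30, seed=0):
    """Path-guided SGD sketch: node id -> 1D coordinate (node lengths assumed positive)."""
    rng = random.Random(seed)
    nodes = sorted({n for p in paths.values() for n in p})
    x = {n: rng.random() for n in nodes}
    usable = [name for name, p in paths.items() if len(p) >= 2]
    offsets = {name: path_offsets(paths[name], node_len) for name in usable}
    if not usable:
        return x
    steps_per_iter = 10 * sum(len(paths[name]) for name in usable)
    for it in range(iterations):
        mu = max(0.05, 0.9 ** it)  # hypothetical annealing schedule
        for _ in range(steps_per_iter):
            name = rng.choice(usable)          # pick a genome (path)
            path = paths[name]
            a, b = rng.sample(range(len(path)), 2)  # pick two steps on that path
            i, j = path[a], path[b]
            if i == j:
                continue  # the same node visited twice, nothing to separate
            target = abs(offsets[name][a] - offsets[name][b])
            x[i], x[j] = sgd_step(x[i], x[j], target, mu)
    return x


# Hypothetical toy graph: two genomes sharing nodes 1 and 4 around a bubble (2 vs 3).
node_len = {1: 5, 2: 1, 3: 1, 4: 5}
paths = {"g1": [1, 2, 4], "g2": [1, 3, 4]}
coords = layout_1d(paths, node_len)
```

Because every sampled pair comes from a single path, the work per iteration grows with total path length rather than with the number of node pairs, which is the property the abstract contrasts with earlier SGD-based graph drawing.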
Affiliation(s)
- Simon Heumos
- Quantitative Biology Center (QBiC), University of Tübingen, 72076 Tübingen, Germany
- Biomedical Data Science, Department of Computer Science, University of Tübingen, 72076 Tübingen, Germany
- M3 Research Center, University Hospital Tübingen, 72076 Tübingen, Germany
- Institute for Bioinformatics and Medical Informatics (IBMI), University of Tübingen, 72076 Tübingen, Germany
| | - Andrea Guarracino
- Department of Genetics, Genomics and Informatics, University of Tennessee Health Science Center, Memphis, TN 38163, United States
- Genomics Research Centre, Human Technopole, 20157 Milan, Italy
| | - Jan-Niklas M Schmelzle
- Department of Computer Engineering, School of Computation, Information and Technology (CIT), Technical University of Munich, 80333 Munich, Germany
- School of Electrical and Computer Engineering, Cornell University, Ithaca, NY 14853, United States
| | - Jiajie Li
- School of Electrical and Computer Engineering, Cornell University, Ithaca, NY 14853, United States
| | - Zhiru Zhang
- School of Electrical and Computer Engineering, Cornell University, Ithaca, NY 14853, United States
| | | | - Sven Nahnsen
- Quantitative Biology Center (QBiC), University of Tübingen, 72076 Tübingen, Germany
- Biomedical Data Science, Department of Computer Science, University of Tübingen, 72076 Tübingen, Germany
- M3 Research Center, University Hospital Tübingen, 72076 Tübingen, Germany
- Institute for Bioinformatics and Medical Informatics (IBMI), University of Tübingen, 72076 Tübingen, Germany
| | - Pjotr Prins
- Department of Genetics, Genomics and Informatics, University of Tennessee Health Science Center, Memphis, TN 38163, United States
| | - Erik Garrison
- Department of Genetics, Genomics and Informatics, University of Tennessee Health Science Center, Memphis, TN 38163, United States
| |
|
40
|
Bolognini D, Halgren A, Lou RN, Raveane A, Rocha JL, Guarracino A, Soranzo N, Chin J, Garrison E, Sudmant PH. Global diversity, recurrent evolution, and recent selection on amylase structural haplotypes in humans. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2024:2024.02.07.579378. [PMID: 38370750 PMCID: PMC10871346 DOI: 10.1101/2024.02.07.579378] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 02/20/2024]
Abstract
The adoption of agriculture, first documented ~12,000 years ago in the Fertile Crescent, triggered a rapid shift toward starch-rich diets in human populations. Amylase genes facilitate starch digestion and increased salivary amylase copy number has been observed in some modern human populations with high starch intake, though evidence of recent selection is lacking. Here, using 52 long-read diploid assemblies and short-read data from ~5,600 contemporary and ancient humans, we resolve the diversity, evolutionary history, and selective impact of structural variation at the amylase locus. We find that amylase genes have higher copy numbers in populations with agricultural subsistence compared to fishing, hunting, and pastoral groups. We identify 28 distinct amylase structural architectures and demonstrate that nearly identical structures have arisen recurrently on different haplotype backgrounds throughout recent human history. AMY1 and AMY2A genes each exhibit multiple duplications/deletions with mutation rates >10,000-fold the SNP mutation rate, whereas AMY2B gene duplications share a single origin. Using a pangenome graph-based approach to infer structural haplotypes across thousands of humans, we identify extensively duplicated haplotypes present at higher frequencies in modern-day populations with traditionally agricultural diets. Leveraging 533 ancient human genomes, we find that duplication-containing haplotypes (i.e. haplotypes with more amylase gene copies than the ancestral haplotype) have increased in frequency more than seven-fold over the last 12,000 years, providing evidence for recent selection in West Eurasians. Together, our study highlights the potential impacts of the agricultural revolution on human genomes and the importance of long-read sequencing in identifying signatures of selection at structurally complex loci.
Affiliation(s)
| | - Alma Halgren
- Department of Integrative Biology, University of California Berkeley, Berkeley, USA
| | - Runyang Nicolas Lou
- Department of Integrative Biology, University of California Berkeley, Berkeley, USA
| | | | - Joana L Rocha
- Department of Integrative Biology, University of California Berkeley, Berkeley, USA
| | - Andrea Guarracino
- Department of Genetics, Genomics, and Informatics, University of Tennessee Health Science Center, Memphis, USA
| | | | - Jason Chin
- Foundation for Biological Data Science, Belmont, USA
| | - Erik Garrison
- Department of Genetics, Genomics, and Informatics, University of Tennessee Health Science Center, Memphis, USA
| | - Peter H Sudmant
- Department of Integrative Biology, University of California Berkeley, Berkeley, USA
- Center for Computational Biology, University of California Berkeley, Berkeley, USA
| |
|
41
|
Parmigiani L, Garrison E, Stoye J, Marschall T, Doerr D. Panacus: fast and exact pangenome growth and core size estimation. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2024:2024.06.11.598418. [PMID: 38915671 PMCID: PMC11195249 DOI: 10.1101/2024.06.11.598418] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/26/2024]
Abstract
Motivation: Using a single linear reference genome poses a limitation to exploring the full genomic diversity of a species. The release of a draft human pangenome underscores the increasing relevance of pangenomics to overcome these limitations. Pangenomes are commonly represented as graphs, which can represent billions of base pairs of sequence. Presently, there is a lack of scalable software able to perform key tasks on pangenomes, such as quantifying universally shared sequence across genomes (the core genome) and measuring the extent of genomic variability as a function of sample size (pangenome growth). Results: We introduce Panacus (pangenome-abacus), a tool designed to rapidly perform these tasks and visualize the results in interactive plots. Panacus can process GFA files, the accepted standard for pangenome graphs, and is able to analyze a human pangenome graph with 110 million nodes in less than one hour. Availability: Panacus is implemented in Rust and is published as Open Source software under the MIT license. The source code and documentation are available at https://github.com/marschall-lab/panacus. Panacus can be installed via Bioconda at https://bioconda.github.io/recipes/panacus/README.html.
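The two quantities named in this abstract, core size and pangenome growth, can be computed exactly from a coverage histogram without enumerating genome orderings. The sketch below illustrates that general combinatorial approach in Python; it is not Panacus code, it counts nodes rather than base pairs, and the function names and the dictionary input (path name to node ids, as one might parse from GFA path lines) are assumptions. An item present in c of n genomes is missed by a random subset of m genomes with probability C(n-c, m)/C(n, m), and is shared by all m with probability C(c, m)/C(n, m).

```python
from collections import Counter
from math import comb


def coverage_histogram(paths):
    """Return {c: number of nodes traversed by exactly c paths}."""
    node_coverage = Counter()
    for nodes in paths.values():
        node_coverage.update(set(nodes))  # count each node once per path
    return Counter(node_coverage.values())


def expected_growth(hist, n):
    """Expected number of nodes seen in a random subset of m genomes, for m = 1..n."""
    return [
        sum(h * (1 - comb(n - c, m) / comb(n, m)) for c, h in hist.items())
        for m in range(1, n + 1)
    ]


def expected_core(hist, n):
    """Expected number of nodes shared by all genomes in a random subset of size m."""
    return [
        sum(h * comb(c, m) / comb(n, m) for c, h in hist.items())
        for m in range(1, n + 1)
    ]


# Hypothetical toy input: path name -> node ids visited by that genome.
paths = {"a": [1, 2, 3], "b": [1, 2, 4], "c": [1, 5]}
hist = coverage_histogram(paths)
growth = expected_growth(hist, len(paths))
core = expected_core(hist, len(paths))
```

The last entry of `growth` is the total number of nodes and the last entry of `core` is the number of nodes present in every genome. Weighting each node by its sequence length instead of counting it once would give the same curves in base pairs.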
Affiliation(s)
- Luca Parmigiani
- Faculty of Technology and Center for Biotechnology (CeBiTec), Bielefeld University, Bielefeld, 33615, Germany
| | - Erik Garrison
- Department of Genetics, Genomics and Informatics, University of Tennessee Health Science Center, Memphis, TN 38163, USA
| | - Jens Stoye
- Faculty of Technology and Center for Biotechnology (CeBiTec), Bielefeld University, Bielefeld, 33615, Germany
| | - Tobias Marschall
- Institute for Medical Biometry and Bioinformatics, Medical Faculty and University Hospital Düsseldorf, Heinrich Heine University Düsseldorf, Germany
- Center for Digital Medicine, Heinrich Heine University Düsseldorf, Germany
| | - Daniel Doerr
- Center for Digital Medicine, Heinrich Heine University Düsseldorf, Germany
- Department for Endocrinology and Diabetology, Medical Faculty and University Hospital Düsseldorf, Heinrich Heine University Düsseldorf, Germany
- German Diabetes Center (DDZ), Leibniz Institute for Diabetes Research, Düsseldorf, Germany
| |
|
42
|
Villani F, Guarracino A, Ward RR, Green T, Emms M, Pravenec M, Prins P, Garrison E, Williams RW, Chen H, Colonna V. Pangenome reconstruction in rats enhances genotype-phenotype mapping and novel variant discovery. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2024:2024.01.10.575041. [PMID: 38260597 PMCID: PMC10802574 DOI: 10.1101/2024.01.10.575041] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/24/2024]
Abstract
The HXB/BXH family of recombinant inbred rat strains is a unique genetic resource that has been extensively phenotyped over 25 years, resulting in a vast dataset of quantitative molecular and physiological phenotypes. We built a pangenome graph from 10x Genomics Linked-Read data for 31 recombinant inbred rats to study genetic variation and association mapping. The pangenome includes 0.2 Gb of sequence that is not present in the reference mRatBN7.2, confirming the capture of substantial additional variation. We validated variants in challenging regions, including complex structural variants resolving into multiple haplotypes. Phenome-wide association analysis of validated SNPs uncovered variants associated with glucose/insulin levels and hippocampal gene expression. We propose an interaction between Pirl1l1, chromogranin expression, TNF-α levels, and insulin regulation. This study demonstrates the utility of linked-read pangenomes for comprehensive variant detection and mapping phenotypic diversity in a widely used rat genetic reference panel.
Affiliation(s)
- Flavia Villani
- Department of Genetics, Genomics, and Informatics, University of Tennessee Health Science Center, Memphis, TN, USA
| | - Andrea Guarracino
- Department of Genetics, Genomics, and Informatics, University of Tennessee Health Science Center, Memphis, TN, USA
| | - Rachel R Ward
- Department of Pharmacology, Addiction Science, and Toxicology, University of Tennessee Health Science Center
| | - Tomomi Green
- Department of Pharmacology, Addiction Science, and Toxicology, University of Tennessee Health Science Center
| | - Madeleine Emms
- Institute of Genetics and Biophysics, National Research Council, Naples, 80111, Italy
| | - Michal Pravenec
- Institute of Physiology, Czech Academy of Sciences, 14200 Prague, Czech Republic
| | - Pjotr Prins
- Department of Genetics, Genomics, and Informatics, University of Tennessee Health Science Center, Memphis, TN, USA
| | - Erik Garrison
- Department of Genetics, Genomics, and Informatics, University of Tennessee Health Science Center, Memphis, TN, USA
| | - Robert W. Williams
- Department of Genetics, Genomics, and Informatics, University of Tennessee Health Science Center, Memphis, TN, USA
| | - Hao Chen
- Department of Pharmacology, Addiction Science, and Toxicology, University of Tennessee Health Science Center
| | - Vincenza Colonna
- Department of Genetics, Genomics, and Informatics, University of Tennessee Health Science Center, Memphis, TN, USA
- Institute of Genetics and Biophysics, National Research Council, Naples, 80111, Italy
| |
|
43
|
Hu H, Li R, Zhao J, Batley J, Edwards D. Technological Development and Advances for Constructing and Analyzing Plant Pangenomes. Genome Biol Evol 2024; 16:evae081. [PMID: 38669452 PMCID: PMC11058698 DOI: 10.1093/gbe/evae081] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/06/2023] [Revised: 04/09/2024] [Accepted: 04/11/2024] [Indexed: 04/28/2024]
Abstract
A pangenome captures the genomic diversity for a species, derived from a collection of genetic sequences of diverse populations. Advances in sequencing technologies have given rise to three primary methods for pangenome construction and analysis: de novo assembly and comparison, reference genome-based iterative assembly, and graph-based pangenome construction. Each method presents advantages and challenges in processing varying amounts and structures of DNA sequencing data. With the emergence of high-quality genome assemblies and advanced bioinformatic tools, the graph-based pangenome is emerging as an advanced reference for exploring the biological and functional implications of genetic variations.
Affiliation(s)
- Haifei Hu
- Rice Research Institute, Guangdong Academy of Agricultural Sciences & Key Laboratory of Genetics and Breeding of High Quality Rice in Southern China (Co-construction by Ministry and Province), Ministry of Agriculture and Rural Affairs & Guangdong Key Laboratory of New Technology in Rice Breeding & Guangdong Rice Engineering Laboratory, Guangzhou 510640, China
| | - Risheng Li
- Rice Research Institute, Guangdong Academy of Agricultural Sciences & Key Laboratory of Genetics and Breeding of High Quality Rice in Southern China (Co-construction by Ministry and Province), Ministry of Agriculture and Rural Affairs & Guangdong Key Laboratory of New Technology in Rice Breeding & Guangdong Rice Engineering Laboratory, Guangzhou 510640, China
- College of Agriculture, South China Agricultural University, Guangzhou, Guangdong 510642, China
| | - Junliang Zhao
- Rice Research Institute, Guangdong Academy of Agricultural Sciences & Key Laboratory of Genetics and Breeding of High Quality Rice in Southern China (Co-construction by Ministry and Province), Ministry of Agriculture and Rural Affairs & Guangdong Key Laboratory of New Technology in Rice Breeding & Guangdong Rice Engineering Laboratory, Guangzhou 510640, China
| | - Jacqueline Batley
- School of Biological Sciences, University of Western Australia, Perth, WA, Australia
| | - David Edwards
- School of Biological Sciences, University of Western Australia, Perth, WA, Australia
- Centre for Applied Bioinformatics, University of Western Australia, Perth, WA 6009, Australia
| |
|
44
|
Gustafson JA, Gibson SB, Damaraju N, Zalusky MPG, Hoekzema K, Twesigomwe D, Yang L, Snead AA, Richmond PA, De Coster W, Olson ND, Guarracino A, Li Q, Miller AL, Goffena J, Anderson Z, Storz SHR, Ward SA, Sinha M, Gonzaga-Jauregui C, Clarke WE, Basile AO, Corvelo A, Reeves C, Helland A, Musunuri RL, Revsine M, Patterson KE, Paschal CR, Zakarian C, Goodwin S, Jensen TD, Robb E, McCombie WR, Sedlazeck FJ, Zook JM, Montgomery SB, Garrison E, Kolmogorov M, Schatz MC, McLaughlin RN, Dashnow H, Zody MC, Loose M, Jain M, Eichler EE, Miller DE. Nanopore sequencing of 1000 Genomes Project samples to build a comprehensive catalog of human genetic variation. MEDRXIV : THE PREPRINT SERVER FOR HEALTH SCIENCES 2024:2024.03.05.24303792. [PMID: 38496498 PMCID: PMC10942501 DOI: 10.1101/2024.03.05.24303792] [Citation(s) in RCA: 7] [Impact Index Per Article: 7.0] [Indexed: 03/19/2024]
Abstract
Less than half of individuals with a suspected Mendelian condition receive a precise molecular diagnosis after comprehensive clinical genetic testing. Improvements in data quality and costs have heightened interest in using long-read sequencing (LRS) to streamline clinical genomic testing, but the absence of control datasets for variant filtering and prioritization has made tertiary analysis of LRS data challenging. To address this, the 1000 Genomes Project ONT Sequencing Consortium aims to generate LRS data from at least 800 of the 1000 Genomes Project samples. Our goal is to use LRS to identify a broader spectrum of variation so we may improve our understanding of normal patterns of human variation. Here, we present data from analysis of the first 100 samples, representing all 5 superpopulations and 19 subpopulations. These samples, sequenced to an average depth of coverage of 37x and sequence read N50 of 54 kbp, have high concordance with previous studies for identifying single nucleotide and indel variants outside of homopolymer regions. Using multiple structural variant (SV) callers, we identify an average of 24,543 high-confidence SVs per genome, including shared and private SVs likely to disrupt gene function as well as pathogenic expansions within disease-associated repeats that were not detected using short reads. Evaluation of methylation signatures revealed expected patterns at known imprinted loci, samples with skewed X-inactivation patterns, and novel differentially methylated regions. All raw sequencing data, processed data, and summary statistics are publicly available, providing a valuable resource for the clinical genetics community to discover pathogenic SVs.
Affiliation(s)
- Jonas A. Gustafson
- Division of Genetic Medicine, Department of Pediatrics, University of Washington, Seattle, WA, USA
- Molecular and Cellular Biology Program, University of Washington, Seattle, WA, USA
| | - Sophia B. Gibson
- Division of Genetic Medicine, Department of Pediatrics, University of Washington, Seattle, WA, USA
- Department of Genome Sciences, University of Washington, Seattle, WA, USA
| | - Nikhita Damaraju
- Division of Genetic Medicine, Department of Pediatrics, University of Washington, Seattle, WA, USA
- Institute for Public Health Genetics, University of Washington, Seattle, WA, USA
| | - Miranda PG Zalusky
- Division of Genetic Medicine, Department of Pediatrics, University of Washington, Seattle, WA, USA
| | - Kendra Hoekzema
- Department of Genome Sciences, University of Washington, Seattle, WA, USA
| | - David Twesigomwe
- Sydney Brenner Institute for Molecular Bioscience, Faculty of Health Sciences, University of the Witwatersrand, Johannesburg, South Africa
| | - Lei Yang
- Pacific Northwest Research Institute, Seattle, WA, USA
| | | | | | - Wouter De Coster
- Applied and Translational Neurogenomics Group, VIB Center for Molecular Neurology, VIB, Antwerp, Belgium
- Department of Biomedical Sciences, University of Antwerp, Antwerp, Belgium
| | - Nathan D. Olson
- Material Measurement Laboratory, National Institute of Standards and Technology, Gaithersburg, MD, USA
| | - Andrea Guarracino
- Department of Genetics, Genomics and Informatics, University of Tennessee Health Science Center, Memphis, TN, USA
- Human Technopole, Milan, Italy
| | - Qiuhui Li
- Department of Computer Science, Johns Hopkins University, Baltimore, MD, USA
| | - Angela L. Miller
- Division of Genetic Medicine, Department of Pediatrics, University of Washington, Seattle, WA, USA
| | - Joy Goffena
- Division of Genetic Medicine, Department of Pediatrics, University of Washington, Seattle, WA, USA
| | - Zachery Anderson
- Division of Genetic Medicine, Department of Pediatrics, University of Washington, Seattle, WA, USA
| | - Sophie HR Storz
- Division of Genetic Medicine, Department of Pediatrics, University of Washington, Seattle, WA, USA
| | - Sydney A. Ward
- Division of Genetic Medicine, Department of Pediatrics, University of Washington, Seattle, WA, USA
| | - Maisha Sinha
  - Division of Genetic Medicine, Department of Pediatrics, University of Washington, Seattle, WA, USA
- Claudia Gonzaga-Jauregui
  - International Laboratory for Human Genome Research, Laboratorio Internacional de Investigación sobre el Genoma Humano, Universidad Nacional Autónoma de México
- Wayne E. Clarke
  - New York Genome Center, New York, NY, USA
  - Outlier Informatics Inc., Saskatoon, SK, Canada
- Mahler Revsine
  - Department of Computer Science, Johns Hopkins University, Baltimore, MD, USA
- Cate R. Paschal
  - Department of Laboratories, Seattle Children’s Hospital, Seattle, WA, USA
  - Department of Laboratory Medicine and Pathology, University of Washington, Seattle, WA, USA
- Christina Zakarian
  - Department of Genome Sciences, University of Washington, Seattle, WA, USA
- Sara Goodwin
  - Cold Spring Harbor Laboratory, Cold Spring Harbor, NY, USA
- Esther Robb
  - Department of Computer Science, Stanford University, Stanford, CA, USA
- Fritz J. Sedlazeck
  - Human Genome Sequencing Center, Baylor College of Medicine, Houston, TX, USA
  - Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX, USA
  - Department of Computer Science, Rice University, Houston, TX, USA
- Justin M. Zook
  - Material Measurement Laboratory, National Institute of Standards and Technology, Gaithersburg, MD, USA
- Erik Garrison
  - Department of Genetics, Genomics and Informatics, University of Tennessee Health Science Center, Memphis, TN, USA
- Mikhail Kolmogorov
  - Cancer Data Science Laboratory, National Cancer Institute, NIH, Bethesda, MD, USA
- Michael C. Schatz
  - Department of Computer Science, Johns Hopkins University, Baltimore, MD, USA
- Richard N. McLaughlin
  - Molecular and Cellular Biology Program, University of Washington, Seattle, WA, USA
  - Pacific Northwest Research Institute, Seattle, WA, USA
- Harriet Dashnow
  - Department of Human Genetics, University of Utah, Salt Lake City, UT, USA
  - Department of Biomedical Informatics, University of Colorado School of Medicine, Aurora, CO, USA
- Matt Loose
  - Deep Seq, School of Life Sciences, University of Nottingham, Nottingham, England
- Miten Jain
  - Department of Bioengineering, Department of Physics, Khoury College of Computer Sciences, Northeastern University, Boston, MA
- Evan E. Eichler
  - Department of Genome Sciences, University of Washington, Seattle, WA, USA
  - Brotman Baty Institute for Precision Medicine, University of Washington, Seattle, WA, USA
  - Howard Hughes Medical Institute, University of Washington, Seattle, WA, USA
- Danny E. Miller
  - Division of Genetic Medicine, Department of Pediatrics, University of Washington, Seattle, WA, USA
  - Department of Laboratory Medicine and Pathology, University of Washington, Seattle, WA, USA
  - Brotman Baty Institute for Precision Medicine, University of Washington, Seattle, WA, USA
45
Carhuaricra-Huaman D, Setubal JC. Step-by-Step Bacterial Genome Comparison. Methods Mol Biol 2024; 2802:107-134. [PMID: 38819558 DOI: 10.1007/978-1-0716-3838-5_5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/01/2024]
Abstract
Thanks to advancements in genome sequencing and bioinformatics, thousands of bacterial genome sequences are available in public databases. This presents an opportunity to study bacterial diversity in unprecedented detail. This chapter describes a complete bioinformatics workflow for comparative genomics of bacterial genomes, including genome annotation, pangenome reconstruction and visualization, phylogenetic analysis, and identification of sequences of interest such as antimicrobial-resistance genes, virulence factors, and phage sequences. The workflow uses state-of-the-art, open-source tools. The workflow is presented by means of a comparative analysis of Salmonella enterica serovar Typhimurium genomes. The workflow is based on Linux commands and scripts, and result visualization relies on the R environment. The chapter provides a step-by-step protocol that researchers with basic expertise in bioinformatics can easily follow to conduct investigations on their own genome datasets.
Affiliation(s)
- Dennis Carhuaricra-Huaman
  - Programa de Pós-Graduação Interunidades em Bioinformática, Instituto de Matemática e Estatística, Universidade de São Paulo, Sao Paulo, SP, Brazil
  - Research Group in Biotechnology Applied to Animal Health, Production and Conservation (SANIGEN), Laboratory of Biology and Molecular Genetics, Faculty of Veterinary Medicine, Universidad Nacional Mayor de San Marcos, San Borja, Lima, Peru
- João Carlos Setubal
  - Departamento de Bioquímica, Instituto de Química, Universidade de São Paulo, Sao Paulo, SP, Brazil
46
Andreace F, Lechat P, Dufresne Y, Chikhi R. Comparing methods for constructing and representing human pangenome graphs. Genome Biol 2023; 24:274. [PMID: 38037131 PMCID: PMC10691155 DOI: 10.1186/s13059-023-03098-2] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Received: 01/02/2023] [Accepted: 10/26/2023] [Indexed: 12/02/2023]
Abstract
BACKGROUND: As a single reference genome cannot possibly represent all the variation present across human individuals, pangenome graphs have been introduced to incorporate population diversity within a wide range of genomic analyses. Several data structures have been proposed for representing collections of genomes as pangenomes, in particular graphs.
RESULTS: In this work, we collect all publicly available high-quality human haplotypes and construct the largest human pangenome graphs to date, incorporating 52 individuals in addition to two synthetic references (CHM13 and GRCh38). We build variation graphs and de Bruijn graphs of this collection using five of the state-of-the-art tools: Bifrost, mdbg, Minigraph, Minigraph-Cactus and pggb. We examine differences in the way each of these tools represents variations between input sequences, both in terms of overall graph structure and representation of specific genetic loci.
CONCLUSION: This work sheds light on key differences between pangenome graph representations, informing end-users on how to select the most appropriate graph type for their application.
Affiliation(s)
- Francesco Andreace
  - Department of Computational Biology, Institut Pasteur, Université Paris Cité, Paris, F-75015, France
  - Sorbonne Université, Collège doctoral, F-75005, Paris, France
- Pierre Lechat
  - Bioinformatics and Biostatistics Hub, Institut Pasteur, Université de Paris, F-75015, Paris, France
- Yoann Dufresne
  - Department of Computational Biology, Institut Pasteur, Université Paris Cité, Paris, F-75015, France
  - Bioinformatics and Biostatistics Hub, Institut Pasteur, Université de Paris, F-75015, Paris, France
- Rayan Chikhi
  - Department of Computational Biology, Institut Pasteur, Université Paris Cité, Paris, F-75015, France
47
Rice ES, Alberdi A, Alfieri J, Athrey G, Balacco JR, Bardou P, Blackmon H, Charles M, Cheng HH, Fedrigo O, Fiddaman SR, Formenti G, Frantz LAF, Gilbert MTP, Hearn CJ, Jarvis ED, Klopp C, Marcos S, Mason AS, Velez-Irizarry D, Xu L, Warren WC. A pangenome graph reference of 30 chicken genomes allows genotyping of large and complex structural variants. BMC Biol 2023; 21:267. [PMID: 37993882 PMCID: PMC10664547 DOI: 10.1186/s12915-023-01758-0] [Citation(s) in RCA: 8] [Impact Index Per Article: 4.0] [Received: 07/14/2023] [Accepted: 11/02/2023] [Indexed: 11/24/2023]
Abstract
BACKGROUND: The red junglefowl, the wild outgroup of domestic chickens, has historically served as a reference for genomic studies of domestic chickens. These studies have provided insight into the etiology of traits of commercial importance. However, the use of a single reference genome does not capture diversity present among modern breeds, many of which have accumulated molecular changes due to drift and selection. While reference-based resequencing is well-suited to cataloging simple variants such as single-nucleotide changes and short insertions and deletions, it is mostly inadequate to discover more complex structural variation in the genome.
METHODS: We present a pangenome for the domestic chicken consisting of thirty assemblies of chickens from different breeds and research lines.
RESULTS: We demonstrate how this pangenome can be used to catalog structural variants present in modern breeds and untangle complex nested variation. We show that alignment of short reads from 100 diverse wild and domestic chickens to this pangenome reduces reference bias by 38%, which affects downstream genotyping results. This approach also allows for the accurate genotyping of a large and complex pair of structural variants at the K feathering locus using short reads, which would not be possible using a linear reference.
CONCLUSIONS: We expect that this new paradigm of genomic reference will allow better pinpointing of exact mutations responsible for specific phenotypes, which will in turn be necessary for breeding chickens that meet new sustainability criteria and are resilient to quickly evolving pathogen threats.
Affiliation(s)
- Edward S Rice
  - Bond Life Sciences Center, University of Missouri, Columbia, MO, USA
  - Faculty of Veterinary Medicine, Ludwig-Maximilians-Universität, Munich, Germany
- Antton Alberdi
  - Center for Evolutionary Hologenomics, Globe Institute, University of Copenhagen (UCPH), Copenhagen, Denmark
- James Alfieri
  - Department of Ecology & Evolutionary Biology, Texas A&M University, College Station, TX, USA
- Giridhar Athrey
  - Department of Poultry Science, Texas A&M University, College Station, TX, USA
- Jennifer R Balacco
  - Vertebrate Genome Laboratory, The Rockefeller University, New York, NY, USA
- Philippe Bardou
  - Sigenae, GenPhySE, Université de Toulouse, INRAE, ENVT, Castanet Tolosan, 31326, France
- Heath Blackmon
  - Department of Biology, Texas A&M University, College Station, TX, USA
- Mathieu Charles
  - University Paris-Saclay, INRAE, AgroParisTech, GABI, Sigenae, Jouy-en-Josas, France
- Hans H Cheng
  - Avian Disease and Oncology Laboratory, USDA, ARS, USNPRC, East Lansing, MI, USA
- Olivier Fedrigo
  - Vertebrate Genome Laboratory, The Rockefeller University, New York, NY, USA
- Giulio Formenti
  - Vertebrate Genome Laboratory, The Rockefeller University, New York, NY, USA
- Laurent A F Frantz
  - Faculty of Veterinary Medicine, Ludwig-Maximilians-Universität, Munich, Germany
  - School of Biological and Behavioural Sciences, Queen Mary University of London, London, E1 4DQ, UK
- M Thomas P Gilbert
  - Center for Evolutionary Hologenomics, Globe Institute, University of Copenhagen (UCPH), Copenhagen, Denmark
- Cari J Hearn
  - Avian Disease and Oncology Laboratory, USDA, ARS, USNPRC, East Lansing, MI, USA
- Erich D Jarvis
  - Vertebrate Genome Laboratory, The Rockefeller University, New York, NY, USA
  - The Howard Hughes Medical Institute, Chevy Chase, MD, USA
- Christophe Klopp
  - Sigenae, Genotoul Bioinfo, MIAT UR875, INRAE, Castanet Tolosan, France
- Sofia Marcos
  - Center for Evolutionary Hologenomics, Globe Institute, University of Copenhagen (UCPH), Copenhagen, Denmark
  - Applied Genomics and Bioinformatics, University of the Basque Country (UPV/EHU), Leioa, Bilbao, Spain
- Luohao Xu
  - Key Laboratory of Freshwater Fish Reproduction and Development (Ministry of Education), Key Laboratory of Aquatic Science of Chongqing, School of Life Sciences, Southwest University, Chongqing, 400715, China
- Wesley C Warren
  - Department of Animal Sciences, University of Missouri, Columbia, MO, USA
48
Heumos S, Guarracino A, Schmelzle JNM, Li J, Zhang Z, Hagmann J, Nahnsen S, Prins P, Garrison E. Pangenome graph layout by Path-Guided Stochastic Gradient Descent. bioRxiv 2023:2023.09.22.558964. [PMID: 37790531 PMCID: PMC10542513 DOI: 10.1101/2023.09.22.558964] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/05/2023]
Abstract
Motivation: The increasing availability of complete genomes demands models to study genomic variability within entire populations. Pangenome graphs capture the full genomic similarity and diversity between multiple genomes. In order to understand them, we need to see them. For visualization, we need a human-readable graph layout: a graph embedding in low-dimensional (e.g. two-dimensional) depictions. Due to a pangenome graph's potentially excessive size, this is a significant challenge.
Results: In response, we introduce a novel graph layout algorithm: the Path-Guided Stochastic Gradient Descent (PG-SGD). PG-SGD uses the genomes, represented in the pangenome graph as paths, as an embedded positional system to sample genomic distances between pairs of nodes. This avoids the quadratic cost seen in previous versions of graph drawing by Stochastic Gradient Descent (SGD). We show that our implementation efficiently computes the low-dimensional layouts of gigabase-scale pangenome graphs, unveiling their biological features.
Availability: We integrated PG-SGD in ODGI, which is released as free software under the MIT open source license. Source code is available at https://github.com/pangenome/odgi.
Affiliation(s)
- Simon Heumos
  - Quantitative Biology Center (QBiC), University of Tübingen, Tübingen 72076, Germany
  - Biomedical Data Science, Department of Computer Science, University of Tübingen, Tübingen 72076, Germany
- Andrea Guarracino
  - Department of Genetics, Genomics and Informatics, University of Tennessee Health Science Center, Memphis, TN 38163, USA
  - Genomics Research Centre, Human Technopole, Milan 20157, Italy
- Jan-Niklas M. Schmelzle
  - Department of Computer Engineering, School of Computation, Information and Technology (CIT), Technical University of Munich, Munich 80333, Germany
  - School of Electrical and Computer Engineering, Cornell University, Ithaca, NY 14853, USA
- Jiajie Li
  - School of Electrical and Computer Engineering, Cornell University, Ithaca, NY 14853, USA
- Zhiru Zhang
  - School of Electrical and Computer Engineering, Cornell University, Ithaca, NY 14853, USA
- Jörg Hagmann
  - Computomics GmbH, Eisenbahnstr. 1, 72072 Tübingen, Germany
- Sven Nahnsen
  - Quantitative Biology Center (QBiC), University of Tübingen, Tübingen 72076, Germany
  - Biomedical Data Science, Department of Computer Science, University of Tübingen, Tübingen 72076, Germany
  - M3 Research Center, University Hospital Tübingen, 72076 Tübingen, Germany
- Pjotr Prins
  - Department of Genetics, Genomics and Informatics, University of Tennessee Health Science Center, Memphis, TN 38163, USA
- Erik Garrison
  - Department of Genetics, Genomics and Informatics, University of Tennessee Health Science Center, Memphis, TN 38163, USA
49
Contreras-Moreira B, Saraf S, Naamati G, Casas AM, Amberkar SS, Flicek P, Jones AR, Dyer S. GET_PANGENES: calling pangenes from plant genome alignments confirms presence-absence variation. Genome Biol 2023; 24:223. [PMID: 37798615 PMCID: PMC10552430 DOI: 10.1186/s13059-023-03071-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/07/2023] [Accepted: 09/21/2023] [Indexed: 10/07/2023]
Abstract
Crop pangenomes made from individual cultivar assemblies promise easy access to conserved genes, but genome content variability and inconsistent identifiers hamper their exploration. To address this, we define pangenes, which summarize a species' coding potential and link back to original annotations. The protocol get_pangenes performs whole genome alignments (WGA) to call syntenic gene models based on coordinate overlaps. A benchmark with small and large plant genomes shows that pangenes recapitulate phylogeny-based orthologies and produce complete soft-core gene sets. Moreover, WGAs support lift-over and help confirm gene presence-absence variation. Source code and documentation: https://github.com/Ensembl/plant-scripts
Affiliation(s)
- Bruno Contreras-Moreira
  - European Molecular Biology Laboratory, European Bioinformatics Institute, Hinxton, UK
  - Estación Experimental Aula Dei-CSIC, 50059, Zaragoza, Spain
- Shradha Saraf
  - European Molecular Biology Laboratory, European Bioinformatics Institute, Hinxton, UK
- Guy Naamati
  - European Molecular Biology Laboratory, European Bioinformatics Institute, Hinxton, UK
- Ana M Casas
  - Estación Experimental Aula Dei-CSIC, 50059, Zaragoza, Spain
- Sandeep S Amberkar
  - Institute of Systems, Molecular and Integrative Biology, University of Liverpool, Liverpool, UK
- Paul Flicek
  - European Molecular Biology Laboratory, European Bioinformatics Institute, Hinxton, UK
- Andrew R Jones
  - Institute of Systems, Molecular and Integrative Biology, University of Liverpool, Liverpool, UK
- Sarah Dyer
  - European Molecular Biology Laboratory, European Bioinformatics Institute, Hinxton, UK
50
Naithani S, Deng CH, Sahu SK, Jaiswal P. Exploring Pan-Genomes: An Overview of Resources and Tools for Unraveling Structure, Function, and Evolution of Crop Genes and Genomes. Biomolecules 2023; 13:1403. [PMID: 37759803 PMCID: PMC10527062 DOI: 10.3390/biom13091403] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/31/2023] [Revised: 08/29/2023] [Accepted: 09/12/2023] [Indexed: 09/29/2023]
Abstract
The availability of multiple sequenced genomes from a single species has made it possible to explore intra- and inter-specific genomic comparisons at higher resolution and to build clade-specific pan-genomes of several crops. The pan-genomes of crops constructed from various cultivars, accessions, landraces, and wild ancestral species represent a compendium of genes and structural variations and allow researchers to search for novel genes and alleles that were inadvertently lost in domesticated crops during the historical process of crop domestication or in the process of extensive plant breeding. Fortunately, many valuable genes and alleles associated with desirable traits like disease resistance, abiotic stress tolerance, plant architecture, and nutrition qualities exist in landraces, ancestral species, and crop wild relatives. The novel genes from the wild ancestors and landraces can be introduced back to high-yielding varieties of modern crops by implementing classical plant breeding, genomic selection, and transgenic/gene editing approaches. Thus, pan-genomics represents a great leap in plant research and offers new avenues for targeted breeding to mitigate the impact of global climate change. Here, we summarize the tools used for pan-genome assembly and annotation, web portals hosting plant pan-genomes, etc. Furthermore, we highlight a few discoveries made in crops using the pan-genomic approach and the future potential of this emerging field of study.
Affiliation(s)
- Sushma Naithani
  - Department of Botany and Plant Pathology, Oregon State University, Corvallis, OR 97331, USA
- Cecilia H. Deng
  - Molecular & Digital Breeding Group, New Cultivar Innovation, The New Zealand Institute for Plant and Food Research Limited, Private Bag 92169, Auckland 1142, New Zealand
- Sunil Kumar Sahu
  - State Key Laboratory of Agricultural Genomics, Key Laboratory of Genomics, Ministry of Agriculture, BGI Research, Shenzhen 518083, China
- Pankaj Jaiswal
  - Department of Botany and Plant Pathology, Oregon State University, Corvallis, OR 97331, USA
|