1
Sanaullah A, Villalobos S, Zhi D, Zhang S. Haplotype Matching with GBWT for Pangenome Graphs. bioRxiv: the preprint server for biology 2025:2025.02.03.634410. [PMID: 39975036 PMCID: PMC11838520 DOI: 10.1101/2025.02.03.634410] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 02/21/2025]
Abstract
Traditionally, variations from a linear reference genome were used to represent large sets of haplotypes compactly. In the linear-reference-genome-based paradigm, the positional Burrows-Wheeler transform (PBWT) has traditionally been used to perform efficient haplotype matching. Pangenome graphs have recently been proposed as an alternative to linear reference genomes for representing the full spectrum of variations in the human genome. However, haplotype matches in pangenome-graph-based haplotype sets do not generalize trivially from haplotype matches in linear-reference-genome-based haplotype sets. Work has been done to represent large sets of haplotypes as paths through a pangenome graph; the graph Burrows-Wheeler transform (GBWT) is one such work. The GBWT essentially stores the haplotype paths in a run-length-compressed BWT with compressed local alphabets. Although the original authors provided count and locate queries on the GBWT that are efficient in practice, the efficient haplotype matching capabilities of the PBWT have never been shown on the GBWT. In this paper, we formally define the notion of haplotype matches in pangenome-graph-based haplotype sets by generalizing from haplotype matches in linear-reference-genome-based haplotype sets. We also describe the relationship between set maximal matches, long matches, locally maximal matches, and text maximal matches on the GBWT, PBWT, and the BWT. We provide algorithms for outputting some of these matches by applying the data structures of the r-index (introduced by Gagie et al.) to the GBWT. We show that these structures enable set maximal match and long match queries on the GBWT in almost linear time and in space close to linear in the number of runs in the GBWT. We also provide multiple versions of the query algorithms for different combinations of the available data structures. The long match query algorithms presented here even run on the BWT with the same time complexity as on the GBWT, owing to the similarity of the two structures.
Affiliation(s)
- Ahsan Sanaullah
  - Department of Computer Science, University of Central Florida, Orlando, FL, USA
- Seba Villalobos
  - Department of Computer Science, University of Central Florida, Orlando, FL, USA
- Degui Zhi
  - McWilliams School of Biomedical Informatics, University of Texas Health Science Center at Houston, Houston, TX, USA
- Shaojie Zhang
  - Department of Computer Science, University of Central Florida, Orlando, FL, USA
2
Zhang W, Macias-Velasco J, Zhuo X, Belter EA, Tomlinson C, Garza J, Tekkey N, Li D, Wang T. methylGrapher: genome-graph-based processing of DNA methylation data from whole genome bisulfite sequencing. Nucleic Acids Res 2025; 53:gkaf028. [PMID: 39868538 PMCID: PMC11770346 DOI: 10.1093/nar/gkaf028] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/21/2024] [Revised: 12/23/2024] [Accepted: 01/20/2025] [Indexed: 01/28/2025] Open
Abstract
Genome graphs, including the recently released draft human pangenome graph, can represent the breadth of genetic diversity and thus transcend the limits of traditional linear reference genomes. However, there are no genome-graph-compatible tools for analyzing whole genome bisulfite sequencing (WGBS) data. To close this gap, we introduce methylGrapher, a tool tailored for accurate DNA methylation analysis by mapping WGBS data to a genome graph. Notably, methylGrapher can reconstruct methylation patterns along haplotype paths precisely and efficiently. To demonstrate the utility of methylGrapher, we analyzed the WGBS data derived from five individuals whose genomes were included in the first Human Pangenome draft as well as WGBS data from ENCODE (EN-TEx). Along with standard performance benchmarking, we show that methylGrapher fully recapitulates DNA methylation patterns defined by classic linear genome analysis approaches. Importantly, methylGrapher captures a substantial number of CpG sites that are missed by linear methods, and improves overall genome coverage while reducing alignment reference bias. Thus, methylGrapher is a first step toward unlocking the full potential of Human Pangenome graphs in genomic DNA methylation analysis.
Affiliation(s)
- Wenjin Zhang
  - Department of Genetics, The Edison Family Center for Genome Sciences & Systems Biology, Washington University School of Medicine, St. Louis, MO 63110, USA
- Juan F Macias-Velasco
  - Department of Genetics, The Edison Family Center for Genome Sciences & Systems Biology, Washington University School of Medicine, St. Louis, MO 63110, USA
- Xiaoyu Zhuo
  - Department of Genetics, The Edison Family Center for Genome Sciences & Systems Biology, Washington University School of Medicine, St. Louis, MO 63110, USA
- Edward A Belter
  - McDonnell Genome Institute, Washington University School of Medicine, St. Louis, MO 63108, USA
- Chad Tomlinson
  - McDonnell Genome Institute, Washington University School of Medicine, St. Louis, MO 63108, USA
- John Garza
  - McDonnell Genome Institute, Washington University School of Medicine, St. Louis, MO 63108, USA
- Nina Tekkey
  - Department of Genetics, The Edison Family Center for Genome Sciences & Systems Biology, Washington University School of Medicine, St. Louis, MO 63110, USA
- Daofeng Li
  - Department of Genetics, The Edison Family Center for Genome Sciences & Systems Biology, Washington University School of Medicine, St. Louis, MO 63110, USA
  - McDonnell Genome Institute, Washington University School of Medicine, St. Louis, MO 63108, USA
- Ting Wang
  - Department of Genetics, The Edison Family Center for Genome Sciences & Systems Biology, Washington University School of Medicine, St. Louis, MO 63110, USA
  - McDonnell Genome Institute, Washington University School of Medicine, St. Louis, MO 63108, USA
3
Ghavi Hossein-Zadeh N. An overview of recent technological developments in bovine genomics. Vet Anim Sci 2024; 25:100382. [PMID: 39166173 PMCID: PMC11334705 DOI: 10.1016/j.vas.2024.100382] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 08/22/2024] Open
Abstract
Cattle are regarded as highly valuable animals because of their milk, beef, dung, fur, and ability to draft. The scientific community has tried a number of strategies to improve the genetic makeup of bovine germplasm. To ensure higher returns for the dairy and beef industries, researchers face their greatest challenge in improving commercially important traits. One of the biggest developments in the last few decades in the creation of instruments for cattle genetic improvement is the discovery of the genome. Breeding livestock is being revolutionized by genomic selection made possible by the availability of medium- and high-density single nucleotide polymorphism (SNP) arrays coupled with sophisticated statistical techniques. It is becoming easier to access high-dimensional genomic data in cattle. Continuously declining genotyping costs and an increase in services that use genomic data to increase return on investment have both made a significant contribution to this. The field of genomics has come a long way thanks to groundbreaking discoveries such as radiation-hybrid mapping, in situ hybridization, synteny analysis, somatic cell genetics, cytogenetic maps, molecular markers, association studies for quantitative trait loci, high-throughput SNP genotyping, whole-genome shotgun sequencing to whole-genome mapping, and genome editing. These advancements have had a significant positive impact on the field of cattle genomics. This manuscript aimed to review recent advances in genomic technologies for cattle breeding and future prospects in this field.
Affiliation(s)
- Navid Ghavi Hossein-Zadeh
  - Department of Animal Science, Faculty of Agricultural Sciences, University of Guilan, Rasht, 41635-1314, Iran
4
Wu EY, Singh NP, Choi K, Zakeri M, Vincent M, Churchill GA, Ackert-Bicknell CL, Patro R, Love MI. SEESAW: detecting isoform-level allelic imbalance accounting for inferential uncertainty. Genome Biol 2023; 24:165. [PMID: 37438847 PMCID: PMC10337143 DOI: 10.1186/s13059-023-03003-x] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 09/01/2022] [Accepted: 06/29/2023] [Indexed: 07/14/2023] Open
Abstract
Detecting allelic imbalance at the isoform level requires accounting for inferential uncertainty, caused by multi-mapping of RNA-seq reads. Our proposed method, SEESAW, uses Salmon and Swish to offer analysis at various levels of resolution, including gene, isoform, and aggregating isoforms to groups by transcription start site. The aggregation strategies strengthen the signal for transcripts with high uncertainty. The SEESAW suite of methods is shown to have higher power than other allelic imbalance methods when there is isoform-level allelic imbalance. We also introduce a new test for detecting imbalance that varies across a covariate, such as time.
Affiliation(s)
- Euphy Y Wu
  - Department of Biostatistics, University of North Carolina-Chapel Hill, Chapel Hill, NC, USA
- Noor P Singh
  - Department of Computer Science, University of Maryland, College Park, MD, USA
- Mohsen Zakeri
  - Department of Computer Science, University of Maryland, College Park, MD, USA
- Cheryl L Ackert-Bicknell
  - Department of Orthopedics, School of Medicine, University of Colorado, Anschutz Campus, Aurora, CO, USA
- Rob Patro
  - Department of Computer Science, University of Maryland, College Park, MD, USA
- Michael I Love
  - Department of Biostatistics, University of North Carolina-Chapel Hill, Chapel Hill, NC, USA
  - Department of Genetics, University of North Carolina-Chapel Hill, Chapel Hill, NC, USA
5
Smith TPL, Bickhart DM, Boichard D, Chamberlain AJ, Djikeng A, Jiang Y, Low WY, Pausch H, Demyda-Peyrás S, Prendergast J, Schnabel RD, Rosen BD. The Bovine Pangenome Consortium: democratizing production and accessibility of genome assemblies for global cattle breeds and other bovine species. Genome Biol 2023; 24:139. [PMID: 37337218 DOI: 10.1186/s13059-023-02975-0] [Citation(s) in RCA: 20] [Impact Index Per Article: 10.0] [Received: 07/19/2022] [Accepted: 05/19/2023] [Indexed: 06/21/2023] Open
Abstract
The Bovine Pangenome Consortium (BPC) is an international collaboration dedicated to the assembly of cattle genomes to develop a more complete representation of cattle genomic diversity. The goal of the BPC is to provide genome assemblies and a community-agreed pangenome representation to replace breed-specific reference assemblies for cattle genomics. The BPC invites partners sharing our vision to participate in the production of these assemblies and the development of a common, community-approved, pangenome reference as a public resource for the research community (https://bovinepangenome.github.io/). This community-driven resource will provide the context for comparison between studies and the future foundation for cattle genomic selection.
Affiliation(s)
- Timothy P L Smith
  - US Meat Animal Research Center, USDA-ARS, Clay Center, NE, 68933, USA
- Didier Boichard
  - Université Paris-Saclay, INRAE, AgroParisTech, GABI, 78350, Jouy-en-Josas, France
- Amanda J Chamberlain
  - Agriculture Victoria, AgriBio, Centre for AgriBioscience, Bundoora, VIC, 3083, Australia
  - School of Applied Systems Biology, La Trobe University, Bundoora, VIC, 3083, Australia
- Appolinaire Djikeng
  - Centre for Tropical Livestock Genetics and Health, ILRI Kenya, Nairobi, 30709-00100, Kenya
  - Centre for Tropical Livestock Genetics and Health, Easter Bush, Midlothian, EH25 9RG, UK
- Yu Jiang
  - Center for Ruminant Genetics and Evolution, Northwest A&F University, Yangling, 712100, China
- Wai Y Low
  - The Davies Research Centre, School of Animal and Veterinary Sciences, University of Adelaide, Roseworthy, SA, 5371, Australia
- Hubert Pausch
  - Animal Genomics, ETH Zurich, Universitaetstrasse 2, 8092, Zurich, Switzerland
- Sebastian Demyda-Peyrás
  - Departamento de Producción Animal, Facultad de Ciencias Veterinarias, Universidad Nacional de La Plata, 1900, La Plata, Argentina
  - Consejo Superior de Investigaciones Científicas Y Tecnológicas (CONICET), CCT-La Plata, 1900, La Plata, Argentina
- James Prendergast
  - Centre for Tropical Livestock Genetics and Health, Easter Bush, Midlothian, EH25 9RG, UK
  - The Roslin Institute, University of Edinburgh, Easter Bush, Midlothian, EH25 9RG, UK
- Robert D Schnabel
  - Division of Animal Sciences, University of Missouri, Columbia, MO, 65211, USA
- Benjamin D Rosen
  - Animal Genomics and Improvement Laboratory, USDA-ARS, Beltsville, MD, 20705, USA
6
Tognon M, Giugno R, Pinello L. A survey on algorithms to characterize transcription factor binding sites. Brief Bioinform 2023; 24:bbad156. [PMID: 37099664 PMCID: PMC10422928 DOI: 10.1093/bib/bbad156] [Citation(s) in RCA: 9] [Impact Index Per Article: 4.5] [Received: 01/13/2023] [Revised: 03/27/2023] [Accepted: 04/01/2023] [Indexed: 04/28/2023] Open
Abstract
Transcription factors (TFs) are key regulatory proteins that control the transcriptional rate of cells by binding short DNA sequences called transcription factor binding sites (TFBS) or motifs. Identifying and characterizing TFBS is fundamental to understanding the regulatory mechanisms governing the transcriptional state of cells. During the last decades, several experimental methods have been developed to recover DNA sequences containing TFBS. In parallel, computational methods have been proposed to discover and identify TFBS motifs based on these DNA sequences. This is one of the most widely investigated problems in bioinformatics and is referred to as the motif discovery problem. In this manuscript, we review classical and novel experimental and computational methods developed to discover and characterize TFBS motifs in DNA sequences, highlighting their advantages and drawbacks. We also discuss open challenges and future perspectives that could fill the remaining gaps in the field.
Affiliation(s)
- Manuel Tognon
  - Computer Science Department, University of Verona, Verona, Italy
  - Molecular Pathology Unit, Center for Computational and Integrative Biology and Center for Cancer Research, Massachusetts General Hospital, Charlestown, Massachusetts, United States of America
  - Broad Institute of MIT and Harvard, Cambridge, Massachusetts, United States of America
- Rosalba Giugno
  - Computer Science Department, University of Verona, Verona, Italy
- Luca Pinello
  - Molecular Pathology Unit, Center for Computational and Integrative Biology and Center for Cancer Research, Massachusetts General Hospital, Charlestown, Massachusetts, United States of America
  - Broad Institute of MIT and Harvard, Cambridge, Massachusetts, United States of America
  - Department of Pathology, Harvard Medical School, Boston, Massachusetts, United States of America
7
Talenti A, Powell J, Hemmink JD, Cook EAJ, Wragg D, Jayaraman S, Paxton E, Ezeasor C, Obishakin ET, Agusi ER, Tijjani A, Amanyire W, Muhanguzi D, Marshall K, Fisch A, Ferreira BR, Qasim A, Chaudhry U, Wiener P, Toye P, Morrison LJ, Connelley T, Prendergast JGD. A cattle graph genome incorporating global breed diversity. Nat Commun 2022; 13:910. [PMID: 35177600 PMCID: PMC8854726 DOI: 10.1038/s41467-022-28605-0] [Citation(s) in RCA: 31] [Impact Index Per Article: 10.3] [Received: 10/29/2020] [Accepted: 01/20/2022] [Indexed: 11/28/2022] Open
Abstract
Despite only 8% of cattle being found in Europe, European breeds dominate current genetic resources. This adversely impacts cattle research in other important global cattle breeds, especially those from Africa for which genomic resources are particularly limited, despite their disproportionate importance to the continent's economies. To mitigate this issue, we have generated assemblies of African breeds, which have been integrated with genomic data for 294 diverse cattle into a graph genome that incorporates global cattle diversity. We illustrate how this more representative reference assembly contains an extra 116.1 Mb (4.2%) of sequence absent from the current Hereford sequence and consequently inaccessible to current studies. We further demonstrate how using this graph genome increases read mapping rates, reduces allelic biases and improves the agreement of structural variant calling with independent optical mapping data. Consequently, we present an improved, more representative, reference assembly that will improve global cattle research.
Affiliation(s)
- A Talenti
  - The Roslin Institute, Royal (Dick) School of Veterinary Studies, University of Edinburgh, Easter Bush Campus, Midlothian, EH25 9RG, UK
- J Powell
  - The Roslin Institute, Royal (Dick) School of Veterinary Studies, University of Edinburgh, Easter Bush Campus, Midlothian, EH25 9RG, UK
- J D Hemmink
  - The Roslin Institute, Royal (Dick) School of Veterinary Studies, University of Edinburgh, Easter Bush Campus, Midlothian, EH25 9RG, UK
  - The International Livestock Research Institute, PO Box 30709, Nairobi, Kenya
  - Centre for Tropical Livestock Genetics and Health, Easter Bush, Midlothian, EH25 9RG, UK
  - Centre for Tropical Livestock Genetics and Health, ILRI Kenya, Nairobi, 30709-00100, Kenya
- E A J Cook
  - The International Livestock Research Institute, PO Box 30709, Nairobi, Kenya
  - Centre for Tropical Livestock Genetics and Health, ILRI Kenya, Nairobi, 30709-00100, Kenya
- D Wragg
  - The Roslin Institute, Royal (Dick) School of Veterinary Studies, University of Edinburgh, Easter Bush Campus, Midlothian, EH25 9RG, UK
  - Centre for Tropical Livestock Genetics and Health, Easter Bush, Midlothian, EH25 9RG, UK
- S Jayaraman
  - The Roslin Institute, Royal (Dick) School of Veterinary Studies, University of Edinburgh, Easter Bush Campus, Midlothian, EH25 9RG, UK
- E Paxton
  - The Roslin Institute, Royal (Dick) School of Veterinary Studies, University of Edinburgh, Easter Bush Campus, Midlothian, EH25 9RG, UK
- C Ezeasor
  - Department of Veterinary Pathology and Microbiology, University of Nigeria, Nsukka, Enugu State, Nigeria
- E T Obishakin
  - Biotechnology Division, National Veterinary Research Institute, Vom, Plateau State, Nigeria
  - Biomedical Research Centre, Ghent University Global Campus, Songdo, Incheon, South Korea
- E R Agusi
  - Biotechnology Division, National Veterinary Research Institute, Vom, Plateau State, Nigeria
  - Biomedical Research Centre, Ghent University Global Campus, Songdo, Incheon, South Korea
- A Tijjani
  - International Livestock Research Institute (ILRI), PO Box 5689, Addis Ababa, Ethiopia
  - Centre for Tropical Livestock Genetics and Health (CTLGH), ILRI Ethiopia, PO Box 5689, Addis Ababa, Ethiopia
- W Amanyire
  - School of Biosecurity, Biotechnology and Laboratory Sciences (SBLS), College of Veterinary Medicine, Animal Resources and Biosecurity, Makerere University, P.O Box 7062, Kampala, Uganda
- D Muhanguzi
  - School of Biosecurity, Biotechnology and Laboratory Sciences (SBLS), College of Veterinary Medicine, Animal Resources and Biosecurity, Makerere University, P.O Box 7062, Kampala, Uganda
- K Marshall
  - The International Livestock Research Institute, PO Box 30709, Nairobi, Kenya
  - Centre for Tropical Livestock Genetics and Health, ILRI Kenya, Nairobi, 30709-00100, Kenya
- A Fisch
  - Ribeirão Preto College of Nursing, University of Sao Paulo, Ribeirão Preto, SP, Brazil
- B R Ferreira
  - Ribeirão Preto College of Nursing, University of Sao Paulo, Ribeirão Preto, SP, Brazil
- A Qasim
  - Faculty of Veterinary and Animal Sciences, Gomal University, Dera Ismail Khan, Pakistan
- U Chaudhry
  - The Roslin Institute, Royal (Dick) School of Veterinary Studies, University of Edinburgh, Easter Bush Campus, Midlothian, EH25 9RG, UK
- P Wiener
  - The Roslin Institute, Royal (Dick) School of Veterinary Studies, University of Edinburgh, Easter Bush Campus, Midlothian, EH25 9RG, UK
- P Toye
  - The International Livestock Research Institute, PO Box 30709, Nairobi, Kenya
  - Centre for Tropical Livestock Genetics and Health, ILRI Kenya, Nairobi, 30709-00100, Kenya
- L J Morrison
  - The Roslin Institute, Royal (Dick) School of Veterinary Studies, University of Edinburgh, Easter Bush Campus, Midlothian, EH25 9RG, UK
  - Centre for Tropical Livestock Genetics and Health, Easter Bush, Midlothian, EH25 9RG, UK
- T Connelley
  - The Roslin Institute, Royal (Dick) School of Veterinary Studies, University of Edinburgh, Easter Bush Campus, Midlothian, EH25 9RG, UK
  - Centre for Tropical Livestock Genetics and Health, Easter Bush, Midlothian, EH25 9RG, UK
- J G D Prendergast
  - The Roslin Institute, Royal (Dick) School of Veterinary Studies, University of Edinburgh, Easter Bush Campus, Midlothian, EH25 9RG, UK
  - Centre for Tropical Livestock Genetics and Health, Easter Bush, Midlothian, EH25 9RG, UK