1
Pacce V, Guimarães AM, Kremer FS, Ferreira GN, Vedova-Costa JMD, dos Santos AC, Dellagostin OA, Soccol CR, Thomaz-Soccol V. Integrated Bioinformatics Analysis for Target Identification and Evaluation of Recombinant Protein as an Antigen for Intradermal Skin Test in Bovine Tuberculosis Diagnosis. ACS Omega 2025; 10:9187-9196. [PMID: 40092765 PMCID: PMC11904847 DOI: 10.1021/acsomega.4c09374] [Received: 10/14/2024] [Revised: 01/13/2025] [Accepted: 01/21/2025] [Indexed: 03/19/2025]
Abstract
Bovine tuberculosis (bTB) is a respiratory disease caused by Mycobacterium bovis, posing a significant threat to animal health and the livestock industry. Current control strategies for bTB rely on diagnostic tests and slaughter policies. However, the limitations of existing diagnostic methods, which depend on PPD antigens, necessitate the exploration of alternative antigens to enhance the accuracy and reliability of bTB diagnosis. This study aimed to identify, produce, and evaluate novel antigens for use in the intradermal skin test for bTB diagnosis. A pangenome analysis of four Mycobacterium species identified 12 unique genes specific to M. bovis SP38. Further integrated bioinformatic analysis revealed 224 genomic islands associated with virulence and pathogenesis. Among these, a highly antigenic protein, termed HP28, was selected for in vivo testing. The recombinant HP28 protein (rHP28) was expressed in E. coli and assessed for its ability to induce intradermal skin reactions in guinea pigs. The rHP28 protein elicited a skin reaction of 6.6 mm at 72 h post-injection, whereas negative controls showed no reaction. This study presents a pipeline for the selection of antigens using integrated bioinformatic analysis to identify diagnostic targets that can effectively distinguish between sensitized and non-sensitized animals, offering a promising approach for improving bTB diagnostics.
Affiliation(s)
- Violetta Dias Pacce
- Laboratório de Biologia Molecular, Programa de Pós Graduação em Engenharia de Bioprocessos e Biotecnologia, Universidade Federal do Paraná, Curitiba, Paraná 81531-990, Brazil
- Amanda Munari Guimarães
- Programa de Pós Graduação em Biotecnologia, Centro de Desenvolvimento Tecnológico, Universidade Federal de Pelotas, Pelotas, Rio Grande do Sul 96160-000, Brazil
- Frederico Schmitt Kremer
- Programa de Pós Graduação em Biotecnologia, Centro de Desenvolvimento Tecnológico, Universidade Federal de Pelotas, Pelotas, Rio Grande do Sul 96160-000, Brazil
- Gabriela Nascimento Ferreira
- Laboratório de Biologia Molecular, Programa de Pós Graduação em Engenharia de Bioprocessos e Biotecnologia, Universidade Federal do Paraná, Curitiba, Paraná 81531-990, Brazil
- Jean Michel Dela Vedova-Costa
- Laboratório de Biologia Molecular, Programa de Pós Graduação em Engenharia de Bioprocessos e Biotecnologia, Universidade Federal do Paraná, Curitiba, Paraná 81531-990, Brazil
- Aline Cristina dos Santos
- Laboratório Provas Biológicas, Instituto de Tecnologia do Paraná, Curitiba, Paraná 80035-060, Brazil
- Odir Antônio Dellagostin
- Programa de Pós Graduação em Biotecnologia, Centro de Desenvolvimento Tecnológico, Universidade Federal de Pelotas, Pelotas, Rio Grande do Sul 96160-000, Brazil
- Carlos Ricardo Soccol
- Laboratório de Biologia Molecular, Programa de Pós Graduação em Engenharia de Bioprocessos e Biotecnologia, Universidade Federal do Paraná, Curitiba, Paraná 81531-990, Brazil
- Vanete Thomaz-Soccol
- Laboratório de Biologia Molecular, Programa de Pós Graduação em Engenharia de Bioprocessos e Biotecnologia, Universidade Federal do Paraná, Curitiba, Paraná 81531-990, Brazil
2
Zytnicki M. Assessing genome conservation on pangenome graphs with PanSel. Bioinformatics Advances 2025; 5:vbaf018. [PMID: 40092526 PMCID: PMC11908644 DOI: 10.1093/bioadv/vbaf018] [Received: 09/26/2024] [Revised: 12/21/2024] [Accepted: 02/03/2025] [Indexed: 03/19/2025]
Abstract
Motivation: With more and more telomere-to-telomere genomes assembled, pangenomes make it possible to capture the genomic diversity of a species. Because they introduce fewer biases, pangenomes represented as graphs tend to supplant the usual linear representation of a reference genome augmented with variations. However, this major change requires new tools adapted to this data structure. Among the numerous questions that can be asked of a pangenome graph is the search for conserved or divergent genes. Results: In this article, we present a new tool, named PanSel, which computes a conservation score for each segment of the genome and finds genomic regions that are significantly conserved or divergent. PanSel can be used on prokaryotes and eukaryotes, with a sequence identity not less than 98%. Availability and implementation: PanSel, written in C++11 with no dependency, is available at https://github.com/mzytnicki/pansel.
Affiliation(s)
- Matthias Zytnicki
- Unité de Mathématiques et Informatique Appliquées, INRAE, 31 326 Castanet-Tolosan, France
3
de Pontes FCF, Machado IP, Silveira MVDS, Lobo ALA, Sabadin F, Fritsche-Neto R, DoVale JC. Combining genotyping approaches improves resolution for association mapping: a case study in tropical maize under water stress conditions. Frontiers in Plant Science 2025; 15:1442008. [PMID: 39917602 PMCID: PMC11798985 DOI: 10.3389/fpls.2024.1442008] [Received: 06/01/2024] [Accepted: 12/31/2024] [Indexed: 02/09/2025]
Abstract
Genome-wide Association Studies (GWAS) identify genome variations related to specific phenotypes using Single Nucleotide Polymorphism (SNP) markers. Genotyping platforms like SNP-Array or sequencing-based techniques (GBS) can genotype samples with many SNPs. These approaches may bias tropical maize analyses due to reliance on the temperate line B73 as the reference genome. An alternative is a simulated genome called "Mock," adapted to the population using bioinformatics. Recent studies show SNP-Array, GBS, and Mock yield similar results for population structure, heterotic groups definition, tester selection, and genomic hybrid prediction. However, no studies have examined the results generated by these different genotyping approaches for GWAS. This study aims to test the equivalence among the three genotyping scenarios in identifying significant effect genes in GWAS. To achieve this, maize was used as the model species, where SNP-Array genotyped 360 inbred lines from a public panel via the Affymetrix platform and GBS. The GBS data were used to perform SNP calling using the temperate inbred line B73 as the reference genome (GBS-B73) and a simulated genome "Mock" obtained in-silico (GBS-Mock). The study encompassed four above-ground traits with plants grown under two levels of water supply: well-watered (WW) and water-stressed (WS). In total, 46, 34, and 31 SNP were identified in the SNP-Array, GBS-B73, and GBS-Mock scenarios, respectively, across the two water levels, associated with the evaluated traits following the comparative analysis of each genotyping method individually. Overall, the identified candidate genes varied along the various scenarios but had the same functionality. Regarding SNP-Array and GBS-B73, genes with functional similarity were identified even without coincidence in the physical position of the SNPs. These genes and regions are involved in various processes and responses with applications in plant breeding. 
In terms of accuracy, combining genotyping scenarios rather than using them in isolation is feasible and recommended, as it increased accuracy for all traits under both water conditions. The combination of the GBS-B73 and GBS-Mock scenarios is worth highlighting, not only for the increase in the resolution of GWAS results but also for the reduction in genotyping costs and the possibility of conducting genomic breeding methods.
Affiliation(s)
- Ingrid Pinheiro Machado
- Postgraduate Program of Plant Science, Federal University of Ceará, Fortaleza, Ceará, Brazil
- Felipe Sabadin
- College of Agriculture and Applied Sciences, Utah State University, Logan, UT, United States
- Júlio César DoVale
- Postgraduate Program of Plant Science, Federal University of Ceará, Fortaleza, Ceará, Brazil
4
Depuydt L, Renders L, Van de Vyver S, Veys L, Gagie T, Fostier J. b-move: Faster Lossless Approximate Pattern Matching in a Run-Length Compressed Index. Research Square 2024:rs.3.rs-5367343. [PMID: 39606487 PMCID: PMC11601852 DOI: 10.21203/rs.3.rs-5367343/v1] [Indexed: 11/29/2024]
Abstract
Background: Due to the increasing availability of high-quality genome sequences, pan-genomes are gradually replacing single consensus reference genomes in many bioinformatics pipelines to better capture genetic diversity. Traditional bioinformatics tools using the FM-index face memory limitations with such large genome collections. Recent advancements in run-length compressed indices, like Gagie et al.'s r-index and Nishimoto and Tabei's move structure, alleviate memory constraints but focus primarily on backward search for MEM-finding. Arakawa et al.'s br-index initiates complete approximate pattern matching using bidirectional search in run-length compressed space, but with significant computational overhead due to complex memory access patterns. Results: We introduce b-move, a novel bidirectional extension of the move structure, enabling fast, cache-efficient, lossless approximate pattern matching in run-length compressed space. It achieves bidirectional character extensions up to 7 times faster than the br-index, closing the performance gap with FM-index-based alternatives. For locating occurrences, b-move performs ϕ and ϕ⁻¹ operations up to 7 times faster than the br-index. At the same time, it maintains the favorable memory characteristics of the br-index: for example, all available complete E. coli genomes in NCBI's RefSeq collection can be compiled into a b-move index that fits into the RAM of a typical laptop. Conclusions: b-move proves practical and scalable for pan-genome indexing and querying. We provide a C++ implementation of b-move, supporting efficient lossless approximate pattern matching including locate functionality, available at https://github.com/biointec/b-move under the AGPL-3.0 license.
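As background for the phrase "run-length compressed space", the following is a minimal Python sketch of the quantity such indices exploit: the number of equal-character runs in the Burrows-Wheeler transform of a text. It is not the b-move, r-index, or br-index data structure. The function names `bwt` and `run_length_encode` are illustrative, and the sorted-rotations construction is only practical for toy inputs.

```python
from itertools import groupby


def bwt(text):
    """Burrows-Wheeler transform via sorted rotations (toy inputs only).

    Assumes the text ends with a unique sentinel '$' that sorts before
    every other character.
    """
    if not text.endswith("$") or text.count("$") != 1:
        raise ValueError("text must end with a single '$' sentinel")
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    # The BWT is the last column of the sorted rotation matrix.
    return "".join(rotation[-1] for rotation in rotations)


def run_length_encode(s):
    """Collapse s into (character, run length) pairs."""
    return [(ch, len(list(group))) for ch, group in groupby(s)]


transformed = bwt("banana$")
runs = run_length_encode(transformed)
print(transformed, runs)
```

Run-length compressed indices store space proportional to the number of runs rather than to the text length, which is why collections of highly similar genomes compress well.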
Affiliation(s)
- Lore Depuydt
- Ghent University - imec, Technologiepark 126, 9052 Ghent, Belgium
- Luca Renders
- Ghent University - imec, Technologiepark 126, 9052 Ghent, Belgium
- Lennart Veys
- Ghent University, Technologiepark 126, 9052 Ghent, Belgium
- Travis Gagie
- Dalhousie University, 6050 University Avenue, PO BOX 15000, Halifax, NS B3H 4R2, Canada
- Jan Fostier
- Ghent University - imec, Technologiepark 126, 9052 Ghent, Belgium
5
Avila Cartes J, Bonizzoni P, Ciccolella S, Della Vedova G, Denti L. PangeBlocks: customized construction of pangenome graphs via maximal blocks. BMC Bioinformatics 2024; 25:344. [PMID: 39497039 PMCID: PMC11533710 DOI: 10.1186/s12859-024-05958-5] [Received: 06/25/2024] [Accepted: 10/16/2024] [Indexed: 11/06/2024] Open
Abstract
BACKGROUND The construction of a pangenome graph is a fundamental task in pangenomics. A natural theoretical question is how to formalize the computational problem of building an optimal pangenome graph, making explicit the underlying optimization criterion and the set of feasible solutions. Current approaches build a pangenome graph with some heuristics, without assuming some explicit optimization criteria. Thus it is unclear how a specific optimization criterion affects the graph topology and downstream analysis, like read mapping and variant calling. RESULTS In this paper, by leveraging the notion of maximal block in a Multiple Sequence Alignment (MSA), we reframe the pangenome graph construction problem as an exact cover problem on blocks called Minimum Weighted Block Cover (MWBC). Then we propose an Integer Linear Programming (ILP) formulation for the MWBC problem that allows us to study the most natural objective functions for building a graph. We provide an implementation of the ILP approach for solving the MWBC and we evaluate it on SARS-CoV-2 complete genomes, showing how different objective functions lead to pangenome graphs that have different properties, hinting that the specific downstream task can drive the graph construction phase. CONCLUSION We show that a customized construction of a pangenome graph based on selecting objective functions has a direct impact on the resulting graphs. In particular, our formalization of the MWBC problem, based on finding an optimal subset of blocks covering an MSA, paves the way to novel practical approaches to graph representations of an MSA where the user can guide the construction.
Affiliation(s)
- Jorge Avila Cartes
- Department of Informatics, Systems, and Communications, University of Milano - Bicocca, Viale Sarca, 20126, Milano, Italy
- Paola Bonizzoni
- Department of Informatics, Systems, and Communications, University of Milano - Bicocca, Viale Sarca, 20126, Milano, Italy
- Simone Ciccolella
- Department of Informatics, Systems, and Communications, University of Milano - Bicocca, Viale Sarca, 20126, Milano, Italy
- Gianluca Della Vedova
- Department of Informatics, Systems, and Communications, University of Milano - Bicocca, Viale Sarca, 20126, Milano, Italy
- Luca Denti
- Department of Informatics, Systems, and Communications, University of Milano - Bicocca, Viale Sarca, 20126, Milano, Italy
- Department of Applied Informatics, Faculty of Mathematics, Physics and Informatics, Comenius University in Bratislava, Mlynská dolina F1, Bratislava, 84248, Slovakia
6
Chandra G, Hossen MH, Scholz S, Dilthey AT, Gibney D, Jain C. Integer programming framework for pangenome-based genome inference. bioRxiv 2024:2024.10.27.620212. [PMID: 39554168 PMCID: PMC11565907 DOI: 10.1101/2024.10.27.620212] [Indexed: 11/19/2024]
Abstract
Affordable genotyping methods are essential in genomics. Commonly used genotyping methods primarily support single nucleotide variants and short indels but neglect structural variants. Additionally, accuracy of read alignments to a reference genome is unreliable in highly polymorphic and repetitive regions, further impacting genotyping performance. Recent works highlight the advantage of haplotype-resolved pangenome graphs in addressing these challenges. Building on these developments, we propose a rigorous alignment-free genotyping framework. Our formulation seeks a path through the pangenome graph that maximizes the matches between the path and substrings of sequencing reads (e.g., k-mers) while minimizing recombination events (haplotype switches) along the path. We prove that this problem is NP-Hard and develop efficient integer-programming solutions. We benchmarked the algorithm using downsampled short-read datasets from homozygous human cell lines with coverage ranging from 0.1× to 10×. Our algorithm accurately estimates complete major histocompatibility complex (MHC) haplotype sequences with small edit distances from the ground-truth sequences, providing a significant advantage over existing methods on low-coverage inputs. Although our algorithm is designed for haploid samples, we discuss future extensions to diploid samples.
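To illustrate the trade-off described above (match score versus haplotype switches), here is a deliberately simplified Python sketch. It is not the authors' integer program, and it does not address the NP-hard general problem. It assumes the graph is a plain chain of blocks and that a hypothetical table `scores[h][b]` already holds the number of read k-mers matching haplotype `h` in block `b`; under those assumptions a Viterbi-style dynamic program suffices. The names `best_mosaic` and `switch_penalty` are illustrative.

```python
def best_mosaic(scores, switch_penalty):
    """Pick one haplotype per block, maximizing matches minus switch costs.

    scores[h][b] is an assumed, precomputed match count for haplotype h in
    block b. Returns (total score, list of chosen haplotype indices).
    """
    n_hap = len(scores)
    n_blocks = len(scores[0])
    best = [scores[h][0] for h in range(n_hap)]  # best score ending in h
    back = []  # back[b - 1][h] = haplotype used in block b - 1

    for b in range(1, n_blocks):
        top = max(range(n_hap), key=lambda h: best[h])
        new_best, pointers = [], []
        for h in range(n_hap):
            stay = best[h]
            switch = best[top] - switch_penalty
            if stay >= switch:
                prev, value = h, stay
            else:
                prev, value = top, switch
            new_best.append(value + scores[h][b])
            pointers.append(prev)
        best = new_best
        back.append(pointers)

    # Trace the optimal choice back from the last block.
    end = max(range(n_hap), key=lambda h: best[h])
    path = [end]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return best[end], path


toy_scores = [[3, 0, 0],   # haplotype 0 matches only the first block
              [0, 3, 3]]   # haplotype 1 matches the last two blocks
print(best_mosaic(toy_scores, switch_penalty=1))
print(best_mosaic(toy_scores, switch_penalty=5))
```

With a small penalty the inferred sequence switches haplotype once to collect all matches; with a large penalty it stays on a single haplotype, which is the behavior the recombination term is meant to control.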
Affiliation(s)
- Ghanshyam Chandra
- Department of Computational and Data Sciences, Indian Institute of Science, Bangalore KA 560012, India
- Md Helal Hossen
- Department of Computer Science, The University of Texas at Dallas, TX 75080, USA
- Stephan Scholz
- Institute of Medical Microbiology and Hospital Hygiene, Heinrich Heine University Düsseldorf, Düsseldorf, Germany
- Center for Digital Medicine, Heinrich Heine University Düsseldorf, Düsseldorf, Germany
- Alexander T Dilthey
- Institute of Medical Microbiology and Hospital Hygiene, Heinrich Heine University Düsseldorf, Düsseldorf, Germany
- Center for Digital Medicine, Heinrich Heine University Düsseldorf, Düsseldorf, Germany
- Daniel Gibney
- Department of Computer Science, The University of Texas at Dallas, TX 75080, USA
- Chirag Jain
- Department of Computational and Data Sciences, Indian Institute of Science, Bangalore KA 560012, India
7
Ndiaye M, Prieto-Baños S, Fitzgerald LM, Yazdizadeh Kharrazi A, Oreshkov S, Dessimoz C, Sedlazeck FJ, Glover N, Majidian S. When less is more: sketching with minimizers in genomics. Genome Biol 2024; 25:270. [PMID: 39402664 PMCID: PMC11472564 DOI: 10.1186/s13059-024-03414-4] [Received: 09/15/2023] [Accepted: 10/01/2024] [Indexed: 10/19/2024] Open
Abstract
The exponential increase in sequencing data calls for conceptual and computational advances to extract useful biological insights. One such advance, minimizers, allows for reducing the quantity of data handled while maintaining some of its key properties. We provide a basic introduction to minimizers, cover recent methodological developments, and review the diverse applications of minimizers to analyze genomic data, including de novo genome assembly, metagenomics, read alignment, read correction, and pangenomes. We also touch on alternative data sketching techniques including universal hitting sets, syncmers, or strobemers. Minimizers and their alternatives have rapidly become indispensable tools for handling vast amounts of data.
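As a concrete illustration of the idea, the sketch below computes (w, k)-minimizers in Python: from every window of w consecutive k-mers it keeps the smallest one. It assumes plain lexicographic order with leftmost tie-breaking, whereas practical tools typically order k-mers by a hash; the function name `minimizers` is illustrative.

```python
def minimizers(seq, k, w):
    """Return (position, k-mer) pairs selected as (w, k)-minimizers.

    Each window of w consecutive k-mers contributes its lexicographically
    smallest k-mer (leftmost on ties). Consecutive windows often agree,
    so far fewer k-mers are kept than the sequence contains.
    """
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    picked = []
    for start in range(len(kmers) - w + 1):
        window = kmers[start:start + w]
        offset = min(range(w), key=lambda j: (window[j], j))
        pos = start + offset
        # Selected positions never move left, so comparing with the last
        # pick is enough to avoid duplicates.
        if not picked or picked[-1][0] != pos:
            picked.append((pos, kmers[pos]))
    return picked


print(minimizers("ACGTACGT", k=3, w=2))
```

Because two sequences that share a long enough substring select the same minimizers inside it, the reduced set can stand in for the full k-mer set when seeding comparisons.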
Affiliation(s)
- Malick Ndiaye
- Department of Fundamental Microbiology, UNIL, Lausanne, Switzerland
- Silvia Prieto-Baños
- Department of Computational Biology, UNIL, Lausanne, Switzerland
- SIB Swiss Institute of Bioinformatics, Lausanne, Switzerland
- Sergey Oreshkov
- Department of Endocrinology, Diabetology, Metabolism, CHUV, Lausanne, Switzerland
- Christophe Dessimoz
- Department of Computational Biology, UNIL, Lausanne, Switzerland
- SIB Swiss Institute of Bioinformatics, Lausanne, Switzerland
- Natasha Glover
- Department of Computational Biology, UNIL, Lausanne, Switzerland
- SIB Swiss Institute of Bioinformatics, Lausanne, Switzerland
- Sina Majidian
- Department of Computational Biology, UNIL, Lausanne, Switzerland
- SIB Swiss Institute of Bioinformatics, Lausanne, Switzerland
8
Gabory E, Mwaniki MN, Pisanti N, Pissis SP, Radoszewski J, Sweering M, Zuba W. Pangenome comparison via ED strings. Frontiers in Bioinformatics 2024; 4:1397036. [PMID: 39391331 PMCID: PMC11464492 DOI: 10.3389/fbinf.2024.1397036] [Received: 03/06/2024] [Accepted: 08/23/2024] [Indexed: 10/12/2024] Open
Abstract
Introduction An elastic-degenerate (ED) string is a sequence of sets of strings. It can also be seen as a directed acyclic graph whose edges are labeled by strings. The notion of ED strings was introduced as a simple alternative to variation and sequence graphs for representing a pangenome, that is, a collection of genomic sequences to be analyzed jointly or to be used as a reference. Methods In this study, we define notions of matching statistics of two ED strings as similarity measures between pangenomes and, consequently infer a corresponding distance measure. We then show that both measures can be computed efficiently, in both theory and practice, by employing the intersection graph of two ED strings. Results We also implemented our methods as a software tool for pangenome comparison and evaluated their efficiency and effectiveness using both synthetic and real datasets. Discussion As for efficiency, we compare the runtime of the intersection graph method against the classic product automaton construction showing that the intersection graph is faster by up to one order of magnitude. For showing effectiveness, we used real SARS-CoV-2 datasets and our matching statistics similarity measure to reproduce a well-established clade classification of SARS-CoV-2, thus demonstrating that the classification obtained by our method is in accordance with the existing one.
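The definition at the start of this abstract (an ED string is a sequence of sets of strings) can be made concrete with a small Python sketch. It only tests whether a given string is spelled by an ED string, by choosing one alternative from each set in order. It does not implement the matching statistics or the intersection graph of the paper, and the name `ed_spells` is illustrative.

```python
def ed_spells(ed_string, s):
    """Return True if s can be spelled by picking one string per set.

    ed_string is a list of sets of strings; the empty string is allowed
    as an alternative and models a deletion.
    """
    reachable = {0}  # prefix lengths of s matched so far
    for segment in ed_string:
        next_reachable = set()
        for pos in reachable:
            for alternative in segment:
                if s.startswith(alternative, pos):
                    next_reachable.add(pos + len(alternative))
        reachable = next_reachable
    return len(s) in reachable


# A toy pangenome: a fixed "A", then a variable site, then a fixed "T".
toy_ed = [{"A"}, {"C", "GG", ""}, {"T"}]
print([ed_spells(toy_ed, s) for s in ("AT", "ACT", "AGGT", "AGT")])
```

Tracking the set of reachable prefix lengths is the same idea as walking the equivalent directed acyclic graph whose edges are labeled by the alternatives.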
Affiliation(s)
- Nadia Pisanti
- Department of Computer Science, University of Pisa, Pisa, Italy
- Solon P. Pissis
- Centrum Wiskunde & Informatica, Amsterdam, Netherlands
- Department of Computer Science, Vrije Universiteit, Amsterdam, Netherlands
- Wiktor Zuba
- Centrum Wiskunde & Informatica, Amsterdam, Netherlands
9
Matthews CA, Watson-Haigh NS, Burton RA, Sheppard AE. A gentle introduction to pangenomics. Brief Bioinform 2024; 25:bbae588. [PMID: 39552065 PMCID: PMC11570541 DOI: 10.1093/bib/bbae588] [Received: 06/25/2024] [Revised: 09/12/2024] [Accepted: 11/01/2024] [Indexed: 11/19/2024] Open
Abstract
Pangenomes have emerged in response to limitations associated with traditional linear reference genomes. In contrast to a traditional reference that is (usually) assembled from a single individual, pangenomes aim to represent all of the genomic variation found in a group of organisms. The term 'pangenome' is currently used to describe multiple different types of genomic information, and limited language is available to differentiate between them. This is frustrating for researchers working in the field and confusing for researchers new to the field. Here, we provide an introduction to pangenomics relevant to both prokaryotic and eukaryotic organisms and propose a formalization of the language used to describe pangenomes (see the Glossary) to improve the specificity of discussion in the field.
Affiliation(s)
- Chelsea A Matthews
- School of Agriculture, Food and Wine, Waite Campus, University of Adelaide, Urrbrae, South Australia 5064, Australia
- Nathan S Watson-Haigh
- Australian Genome Research Facility, Victorian Comprehensive Cancer Centre, Melbourne, Victoria 3000, Australia
- South Australian Genomics Centre, SAHMRI, North Terrace, Adelaide, South Australia 5000, Australia
- Alkahest Inc., San Carlos, CA 94070, United States
- Rachel A Burton
- School of Agriculture, Food and Wine, Waite Campus, University of Adelaide, Urrbrae, South Australia 5064, Australia
- Anna E Sheppard
- School of Biological Sciences, University of Adelaide, Adelaide, South Australia 5005, Australia
10
Schebera J, Zeckzer D, Wiegreffe D. A layout framework for genome-wide multiple sequence alignment graphs. Frontiers in Bioinformatics 2024; 4:1358374. [PMID: 39221004 PMCID: PMC11362851 DOI: 10.3389/fbinf.2024.1358374] [Received: 12/19/2023] [Accepted: 07/08/2024] [Indexed: 09/04/2024] Open
Abstract
Sequence alignments are often used to analyze genomic data. However, such alignments are often only calculated and compared on small sequence intervals for analysis purposes. When comparing longer sequences, these are usually divided into shorter sequence intervals for better alignment results. This usually means that the order context of the original sequence is lost. To prevent this, it is possible to use a graph structure to represent the order of the original sequence on the alignment blocks. The visualization of these graph structures can provide insights into the structural variations of genomes in a semi-global context. In this paper, we propose a new graph drawing framework for representing gMSA data. We produce a hierarchical graph layout that supports the comparative analysis of genomes. Based on a reference, the differences and similarities of the different genome orders are visualized. In this work, we present a complete graph drawing framework for gMSA graphs together with the respective algorithms for each of the steps. Additionally, we provide a prototype and an example data set for analyzing gMSA graphs. Based on this data set, we demonstrate the functionalities of the framework using two examples.
Affiliation(s)
- Jeremias Schebera
- Image and Signal Processing Group, Institute for Computer Science, Leipzig University, Leipzig, Germany
- Center for Scalable Data Analytics and Artificial Intelligence (ScaDS.AI) Dresden/Leipzig, Leipzig University, Leipzig, Germany
- Dirk Zeckzer
- Image and Signal Processing Group, Institute for Computer Science, Leipzig University, Leipzig, Germany
- Daniel Wiegreffe
- Image and Signal Processing Group, Institute for Computer Science, Leipzig University, Leipzig, Germany
11
Sarawad A, Hosagoudar S, Parvatikar P. Pan-genomics: Insight into the Functional Genome, Applications, Advancements, and Challenges. Curr Genomics 2024; 26:2-14. [PMID: 39911277 PMCID: PMC11793047 DOI: 10.2174/0113892029311541240627111506] [Received: 02/20/2024] [Revised: 04/30/2024] [Accepted: 05/29/2024] [Indexed: 02/07/2025] Open
Abstract
A pan-genome is a compilation of the common and unique genomes found in a given species. It combines the genetic information from all of the genomes sampled, producing a large and diverse set of genetic material, which gives pan-genomic analysis several advantages over conventional genomics research. Although the most recent era of pan-genomic studies has used cutting-edge sequencing technology to shed fresh light on biological variety and improvement, the potential uses of pan-genomics in improvement have not yet been fully realized. Pan-genome research in various organisms has demonstrated that missing genetic components and significant Structural Variants (SVs) can be investigated using pan-genomic methods. Many individual-specific sequences have been linked to biological adaptability, phenotypes, and key economic attributes. This review focuses on how pangenome analysis uncovers genetic differences in various organisms, including humans, and their effects on phenotypes, as well as how this might help us comprehend the diversity of species. It also covers potential problems and the prospects for future pangenome research.
Affiliation(s)
- Akansha Sarawad
- Department of Biotechnology, Applied School of Science and Technology, BLDE (DU), Vijayapura, Karnataka, India
- Spoorti Hosagoudar
- Department of Biotechnology, Applied School of Science and Technology, BLDE (DU), Vijayapura, Karnataka, India
- Prachi Parvatikar
- Department of Biotechnology, Applied School of Science and Technology, BLDE (DU), Vijayapura, Karnataka, India
12
Brejová B, Gagie T, Herencsárová E, Vinař T. Maximum-scoring path sets on pangenome graphs of constant treewidth. Frontiers in Bioinformatics 2024; 4:1391086. [PMID: 39011297 PMCID: PMC11246863 DOI: 10.3389/fbinf.2024.1391086] [Received: 02/24/2024] [Accepted: 06/03/2024] [Indexed: 07/17/2024] Open
Abstract
We generalize a problem of finding maximum-scoring segment sets, previously studied by Csűrös (IEEE/ACM Transactions on Computational Biology and Bioinformatics, 2004, 1, 139-150), from sequences to graphs. Namely, given a vertex-weighted graph G and a non-negative startup penalty c, we can find a set of vertex-disjoint paths in G with maximum total score when each path's score is its vertices' total weight minus c. We call this new problem maximum-scoring path sets (MSPS). We present an algorithm that has a linear-time complexity for graphs with a constant treewidth. Generalization from sequences to graphs allows the algorithm to be used on pangenome graphs representing several related genomes and can be seen as a common abstraction for several biological problems on pangenomes, including searching for CpG islands, ChIP-seq data analysis, analysis of region enrichment for functional elements, or simple chaining problems.
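For intuition, the sequence case that this abstract generalizes corresponds to a graph that is a single path, where vertex-disjoint paths are just disjoint segments. The Python sketch below solves only that special case with a two-state dynamic program. It is not the constant-treewidth algorithm of the paper, and the name `max_scoring_segments` is illustrative.

```python
def max_scoring_segments(weights, c):
    """Best total score of disjoint segments of a weighted sequence.

    Each chosen segment scores the sum of its weights minus the startup
    penalty c; choosing no segment at all scores 0.
    """
    best = 0                # best total over the prefix seen so far
    inside = float("-inf")  # best total with a segment ending at the last item
    for w in weights:
        # Either extend the open segment or pay c to start a new one here.
        inside = w + max(inside, best - c)
        best = max(best, inside)
    return best


print(max_scoring_segments([2, -1, 3, -5, 4], c=1))
```

Raising c merges or drops short segments, which is how the penalty controls the number of reported regions (for example, candidate CpG islands).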
Affiliation(s)
- Broňa Brejová
- Department of Computer Science, Faculty of Mathematics, Physics and Informatics, Comenius University in Bratislava, Bratislava, Slovakia
- Travis Gagie
- Faculty of Computer Science, Dalhousie University, Halifax, NS, Canada
- Eva Herencsárová
- Department of Computer Science, Faculty of Mathematics, Physics and Informatics, Comenius University in Bratislava, Bratislava, Slovakia
- Tomáš Vinař
- Department of Applied Informatics, Faculty of Mathematics, Physics and Informatics, Comenius University in Bratislava, Bratislava, Slovakia
13
Coggi M, Sgarlata A, Di Donato GW, Santambrogio MD. On the optimization of GWFA algorithm: enabling real-case applications supporting alignment backtracking. Annual International Conference of the IEEE Engineering in Medicine and Biology Society 2024; 2024:1-4. [PMID: 40039311 DOI: 10.1109/embc53108.2024.10781891] [Indexed: 03/06/2025]
Abstract
The Human Pangenome Reference Consortium (HPRC) proved that pangenome graphs represent a population's genetic variability more efficiently and accurately than linear references. Graphs can intrinsically encode variations as alternative paths inside a directed set of sequence nodes connected by edges. Despite their higher complexity, graph-based genome analysis pipelines are gaining significant interest, and the first sequence-to-graph aligners have already shown improvements in semi-global alignment. However, in pangenomics studies, the global alignment of long reads is fundamental for identifying structural variations and haplotype phasing. In this context, the Graph Wavefront Alignment (GWFA) algorithm emerged as the fastest strategy for aligning long reads to genomic graphs. However, the available GWFA implementation does not support alignment backtracking, a crucial feature in real-case studies. In this paper, we propose a new open-source implementation of the GWFA algorithm that computes and reports the complete traceback in the standard GAF format. Our work achieves a 20× speedup in execution time compared to the state-of-the-art tool GraphAligner and competitive memory usage.
|
14
|
Heumos S, Guarracino A, Schmelzle JNM, Li J, Zhang Z, Hagmann J, Nahnsen S, Prins P, Garrison E. Pangenome graph layout by Path-Guided Stochastic Gradient Descent. Bioinformatics 2024; 40:btae363. [PMID: 38960860 PMCID: PMC11227364 DOI: 10.1093/bioinformatics/btae363] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/10/2023] [Revised: 02/20/2024] [Accepted: 07/02/2024] [Indexed: 07/05/2024]
Abstract
MOTIVATION The increasing availability of complete genomes demands models to study genomic variability within entire populations. Pangenome graphs capture the full genomic similarity and diversity between multiple genomes. In order to understand them, we need to see them. For visualization, we need a human-readable graph layout: a graph embedding in low (e.g. two) dimensional depictions. Due to a pangenome graph's potentially excessive size, this is a significant challenge. RESULTS In response, we introduce a novel graph layout algorithm: the Path-Guided Stochastic Gradient Descent (PG-SGD). PG-SGD uses the genomes, represented in the pangenome graph as paths, as an embedded positional system to sample genomic distances between pairs of nodes. This avoids the quadratic cost seen in previous versions of graph drawing by SGD. We show that our implementation efficiently computes the low-dimensional layouts of gigabase-scale pangenome graphs, unveiling their biological features. AVAILABILITY AND IMPLEMENTATION We integrated PG-SGD in ODGI, which is released as free software under the MIT open source license. Source code is available at https://github.com/pangenome/odgi.
Affiliation(s)
- Simon Heumos
- Quantitative Biology Center (QBiC), University of Tübingen, 72076 Tübingen, Germany
- Biomedical Data Science, Department of Computer Science, University of Tübingen, 72076 Tübingen, Germany
- M3 Research Center, University Hospital Tübingen, 72076 Tübingen, Germany
- Institute for Bioinformatics and Medical Informatics (IBMI), University of Tübingen, 72076 Tübingen, Germany
| | - Andrea Guarracino
- Department of Genetics, Genomics and Informatics, University of Tennessee Health Science Center, Memphis, TN 38163, United States
- Genomics Research Centre, Human Technopole, 20157 Milan, Italy
| | - Jan-Niklas M Schmelzle
- Department of Computer Engineering, School of Computation, Information and Technology (CIT), Technical University of Munich, 80333 Munich, Germany
- School of Electrical and Computer Engineering, Cornell University, Ithaca, NY 14853, United States
| | - Jiajie Li
- School of Electrical and Computer Engineering, Cornell University, Ithaca, NY 14853, United States
| | - Zhiru Zhang
- School of Electrical and Computer Engineering, Cornell University, Ithaca, NY 14853, United States
| | | | - Sven Nahnsen
- Quantitative Biology Center (QBiC), University of Tübingen, 72076 Tübingen, Germany
- Biomedical Data Science, Department of Computer Science, University of Tübingen, 72076 Tübingen, Germany
- M3 Research Center, University Hospital Tübingen, 72076 Tübingen, Germany
- Institute for Bioinformatics and Medical Informatics (IBMI), University of Tübingen, 72076 Tübingen, Germany
| | - Pjotr Prins
- Department of Genetics, Genomics and Informatics, University of Tennessee Health Science Center, Memphis, TN 38163, United States
| | - Erik Garrison
- Department of Genetics, Genomics and Informatics, University of Tennessee Health Science Center, Memphis, TN 38163, United States
| |
|
15
|
Depuydt L, Renders L, de Vyver SV, Veys L, Gagie T, Fostier J. b-move: faster bidirectional character extensions in a run-length compressed index. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2024:2024.05.30.596587. [PMID: 38854079 PMCID: PMC11160816 DOI: 10.1101/2024.05.30.596587] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/11/2024]
Abstract
Due to the increasing availability of high-quality genome sequences, pan-genomes are gradually replacing single consensus reference genomes in many bioinformatics pipelines to better capture genetic diversity. Traditional bioinformatics tools using the FM-index face memory limitations with such large genome collections. Recent advancements in run-length compressed indices like Gagie et al.'s r-index and Nishimoto and Tabei's move structure alleviate memory constraints but focus primarily on backward search for MEM-finding. Arakawa et al.'s br-index initiates complete approximate pattern matching using bidirectional search in run-length compressed space, but with significant computational overhead due to complex memory access patterns. We introduce b-move, a novel bidirectional extension of the move structure, enabling fast, cache-efficient bidirectional character extensions in run-length compressed space. It achieves bidirectional character extensions up to 8 times faster than the br-index, closing the performance gap with FM-index-based alternatives, while maintaining the br-index's favorable memory characteristics. For example, all available complete E. coli genomes on NCBI's RefSeq collection can be compiled into a b-move index that fits into the RAM of a typical laptop. Thus, b-move proves practical and scalable for pan-genome indexing and querying. We provide a C++ implementation of b-move, supporting efficient lossless approximate pattern matching including locate functionality, available at https://github.com/biointec/b-move under the AGPL-3.0 license.
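For readers unfamiliar with the backward search that the abstract above refers to, the following is a minimal sketch of the classical, uncompressed and unidirectional FM-index search. It is not b-move, the r-index, or the br-index; the function names (`build_fm_index`, `backward_search`, `locate`) are hypothetical and the construction is deliberately naive.

```python
def build_fm_index(text):
    """Build a naive FM-index for text: suffix array, C array, occurrence table."""
    text += "$"  # sentinel, assumed smaller than every other character
    # Suffix array by direct sorting of suffixes (fine for a toy example only).
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    # BWT: the character preceding each sorted suffix (wraps around for suffix 0).
    bwt = "".join(text[i - 1] for i in sa)
    alphabet = sorted(set(text))
    # C[c] = number of characters in the text strictly smaller than c.
    C, total = {}, 0
    for c in alphabet:
        C[c] = total
        total += text.count(c)
    # occ[c][i] = number of occurrences of c in bwt[:i].
    occ = {c: [0] * (len(bwt) + 1) for c in alphabet}
    for i, ch in enumerate(bwt):
        for c in alphabet:
            occ[c][i + 1] = occ[c][i] + (1 if ch == c else 0)
    return sa, C, occ


def backward_search(pattern, C, occ, n):
    """Return the half-open suffix-array interval [lo, hi) matching pattern."""
    lo, hi = 0, n
    for ch in reversed(pattern):  # extend the match one character to the left
        if ch not in C:
            return 0, 0
        lo = C[ch] + occ[ch][lo]
        hi = C[ch] + occ[ch][hi]
        if lo >= hi:
            return 0, 0
    return lo, hi


def locate(pattern, text):
    """Return the sorted start positions of pattern in text."""
    sa, C, occ = build_fm_index(text)
    lo, hi = backward_search(pattern, C, occ, len(text) + 1)
    return sorted(sa[lo:hi])


print(locate("ACG", "ACGTACGTAC"))
```

Each loop iteration in `backward_search` is one character extension to the left; a bidirectional index additionally supports extensions to the right, which this sketch does not attempt.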
Affiliation(s)
- Lore Depuydt
- Ghent University - imec, Technologiepark 126, 9052 Ghent, Belgium
| | - Luca Renders
- Ghent University - imec, Technologiepark 126, 9052 Ghent, Belgium
| | | | - Lennart Veys
- Ghent University, Technologiepark 126, 9052 Ghent, Belgium
| | - Travis Gagie
- Dalhousie University, 6050 University Avenue, PO BOX 15000, Halifax, NS B3H 4R2, Canada
| | - Jan Fostier
- Ghent University - imec, Technologiepark 126, 9052 Ghent, Belgium
| |
|
16
|
Rizzo N, Cáceres M, Mäkinen V. Finding maximal exact matches in graphs. Algorithms Mol Biol 2024; 19:10. [PMID: 38468275 DOI: 10.1186/s13015-024-00255-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/08/2023] [Accepted: 01/30/2024] [Indexed: 03/13/2024]
Abstract
BACKGROUND We study the problem of finding maximal exact matches (MEMs) between a query string $Q$ and a labeled graph $G$. MEMs are an important class of seeds, often used in seed-chain-extend type of practical alignment methods because of their strong connections to classical metrics. A principled way to speed up chaining is to limit the number of MEMs by considering only MEMs of length at least $\kappa$ ($\kappa$-MEMs). However, on arbitrary input graphs, the problem of finding MEMs cannot be solved in truly sub-quadratic time under SETH (Equi et al., TALG 2023) even on acyclic graphs. RESULTS In this paper we show an $O(n \cdot L \cdot d^{L-1} + m + M_{\kappa,L})$-time algorithm finding all $\kappa$-MEMs between $Q$ and $G$ spanning exactly $L$ nodes in $G$, where $n$ is the total length of node labels, $d$ is the maximum degree of a node in $G$, $m = |Q|$, and $M_{\kappa,L}$ is the number of output MEMs. We use this algorithm to develop a $\kappa$-MEM finding solution on indexable Elastic Founder Graphs (Equi et al., Algorithmica 2022) running in time $O(nH^2 + m + M_\kappa)$, where $H$ is the maximum number of nodes in a block, and $M_\kappa$ is the total number of $\kappa$-MEMs. Our results generalize to the analysis of multiple query strings (MEMs between $G$ and any of the strings). Additionally, we provide some experimental results showing that the number of graph MEMs is an order of magnitude smaller than the number of string MEMs of the corresponding concatenated collection. CONCLUSIONS We show that seed-chain-extend type of alignment methods can be implemented on top of indexable Elastic Founder Graphs by providing an efficient way to produce the seeds between a set of queries and the graph. The code is available in https://github.com/algbio/efg-mems.
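To make the definition of a MEM concrete, here is a minimal brute-force sketch for the simpler string-to-string case. It is not the graph algorithm of the paper and runs in quadratic time or worse; the function name `find_mems` and the `(query_start, text_start, length)` output convention are assumptions made for illustration.

```python
def find_mems(query, text, kappa=1):
    """Return all maximal exact matches of length >= kappa between two strings.

    Each MEM is reported as (query_start, text_start, length). A match is
    maximal when it can be extended neither to the left nor to the right.
    """
    mems = []
    for i in range(len(query)):
        for j in range(len(text)):
            if query[i] != text[j]:
                continue
            # Skip starts that are not left-maximal: the match could be
            # extended one character to the left on both strings.
            if i > 0 and j > 0 and query[i - 1] == text[j - 1]:
                continue
            # Extend to the right as far as the characters agree.
            length = 0
            while (i + length < len(query)
                   and j + length < len(text)
                   and query[i + length] == text[j + length]):
                length += 1
            if length >= kappa:
                mems.append((i, j, length))
    return mems


print(find_mems("ACGT", "TACGA"))           # all MEMs
print(find_mems("ACGT", "TACGA", kappa=2))  # only MEMs of length >= 2
```

Raising `kappa` discards the short matches, which is the filtering effect the abstract describes as a way to reduce the number of seeds passed to chaining.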
Affiliation(s)
- Nicola Rizzo
- Department of Computer Science, University of Helsinki, Pietari Kalmin katu 5, P.O. Box 68, Helsinki, 00014, Finland
| | - Manuel Cáceres
- Department of Computer Science, University of Helsinki, Pietari Kalmin katu 5, P.O. Box 68, Helsinki, 00014, Finland
| | - Veli Mäkinen
- Department of Computer Science, University of Helsinki, Pietari Kalmin katu 5, P.O. Box 68, Helsinki, 00014, Finland
| |
|
17
|
Basharat Z, Ahmed I, Alnasser SM, Meshal A, Waheed Y. Exploring Lead-Like Molecules of Traditional Chinese Medicine for Treatment Quest against Aliarcobacter butzleri: In Silico Toxicity Assessment, Dynamics Simulation, and Pharmacokinetic Profiling. BIOMED RESEARCH INTERNATIONAL 2024; 2024:9377016. [PMID: 39282570 PMCID: PMC11401669 DOI: 10.1155/2024/9377016] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/28/2023] [Revised: 01/21/2024] [Accepted: 02/07/2024] [Indexed: 09/19/2024]
Abstract
BACKGROUND Aliarcobacter butzleri is a Gram-negative, curved or spiral-shaped, microaerophilic bacterium that causes human infections, specifically diarrhea, fever, and sepsis. The research objective of this study was to employ computer-aided drug design techniques to identify potential natural product inhibitors of a vital enzyme in this bacterium. The pyrimidine biosynthesis pathway in its core genome fraction is crucial for its survival and presents a potential target for novel therapeutics. Hence, novel small molecule inhibitors were identified against it from a traditional Chinese medicine (TCM) compound library, which may be used to curb infection by A. butzleri. METHODS A comprehensive subtractive genomics approach was utilized to identify a key enzyme (orotidine-5'-phosphate decarboxylase) cluster conserved in the core genome fraction of A. butzleri. It was selected for inhibitor screening due to its vital role in pyrimidine biosynthesis. The TCM library (n > 36,000 compounds) was screened against it using a pharmacophore model based on orotidylic acid (control), and the obtained lead-like molecules were subjected to structural docking using AutoDock Vina. The top-scoring compounds, ZINC70454134, ZINC85632684, and ZINC85632721, underwent further scrutiny via a combination of physiologically based pharmacokinetics, toxicity assessment, and atomic-scale dynamics simulations (100 ns). RESULTS Among the screened compounds, ZINC70454134 displayed the most favorable characteristics in terms of binding, stability, absorption, and safety parameters. Overall, traditional Chinese medicine (TCM) compounds exhibited high bioavailability, but in diseased states (cirrhosis, renal impairment, and steatosis), there was a significant decrease in absorption, Cmax, and AUC of the compounds compared to the healthy state.
Furthermore, MD simulation demonstrated that the ODCase-ZINC70454134 complex had a superior overall binding affinity, supported by PCA proportion of variance and eigenvalue rank analysis. These favorable characteristics underscore its potential as a promising drug candidate. CONCLUSION The computer-aided drug design approach employed for this study helped expedite the discovery of antibacterial compounds against A. butzleri, offering a cost-effective and efficient approach to address infection by it. It is recommended that ZINC70454134 should be considered for further experimental analysis due to its indication as a potential therapeutic agent for combating A. butzleri infections. This study provides valuable insights into the molecular basis of biophysical inhibition of A. butzleri through TCM compounds.
Affiliation(s)
| | - Ibrar Ahmed
- Alpha Genomics (Private) Limited, Islamabad 45710, Pakistan
- Group of Biometrology, The Korea Research Institute of Standards and Science (KRISS), Yuseong District, Daejeon 34113, Republic of Korea
| | - Sulaiman Mohammed Alnasser
- Department of Pharmacology and Toxicology, Unaizah College of Pharmacy, Qassim University, Buraydah 52571, Saudi Arabia
| | - Alotaibi Meshal
- Department of Pharmacy Practice, College of Pharmacy, University of Hafr Al Batin, Hafar Al Batin, Saudi Arabia
| | - Yasir Waheed
- Office of Research, Innovation and Commercialization (ORIC), Shaheed Zulfiqar Ali Bhutto Medical University (SZABMU), Islamabad 44000, Pakistan
- Gilbert and Rose-Marie Chagoury School of Medicine, Lebanese American University, Byblos 1401, Lebanon
| |
|
18
|
Woolley SA, Salavati M, Clark EL. Recent advances in the genomic resources for sheep. Mamm Genome 2023; 34:545-558. [PMID: 37752302 PMCID: PMC10627984 DOI: 10.1007/s00335-023-10018-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/11/2023] [Accepted: 08/30/2023] [Indexed: 09/28/2023]
Abstract
Sheep (Ovis aries) provide a vital source of protein and fibre to human populations. In coming decades, as the pressures associated with rapidly changing climates increase, breeding sheep sustainably as well as producing enough protein to feed a growing human population will pose a considerable challenge for sheep production across the globe. High quality reference genomes and other genomic resources can help to meet these challenges by: (1) informing breeding programmes by adding a priori information about the genome, (2) providing tools such as pangenomes for characterising and conserving global genetic diversity, and (3) improving our understanding of fundamental biology using the power of genomic information to link cell, tissue and whole animal scale knowledge. In this review we describe recent advances in the genomic resources available for sheep, discuss how these might help to meet future challenges for sheep production, and provide some insight into what the future might hold.
Affiliation(s)
- Shernae A Woolley
- The Roslin Institute, University of Edinburgh, Easter Bush, Midlothian, EH25 9RG, UK
| | - Mazdak Salavati
- The Roslin Institute, University of Edinburgh, Easter Bush, Midlothian, EH25 9RG, UK
- Scotland's Rural College, Parkgate, Barony Campus, Dumfries, DG1 3NE, UK
| | - Emily L Clark
- The Roslin Institute, University of Edinburgh, Easter Bush, Midlothian, EH25 9RG, UK
| |
|
19
|
Andreace F, Lechat P, Dufresne Y, Chikhi R. Comparing methods for constructing and representing human pangenome graphs. Genome Biol 2023; 24:274. [PMID: 38037131 PMCID: PMC10691155 DOI: 10.1186/s13059-023-03098-2] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Received: 01/02/2023] [Accepted: 10/26/2023] [Indexed: 12/02/2023]
Abstract
BACKGROUND As a single reference genome cannot possibly represent all the variation present across human individuals, pangenome graphs have been introduced to incorporate population diversity within a wide range of genomic analyses. Several data structures have been proposed for representing collections of genomes as pangenomes, in particular graphs. RESULTS In this work, we collect all publicly available high-quality human haplotypes and construct the largest human pangenome graphs to date, incorporating 52 individuals in addition to two synthetic references (CHM13 and GRCh38). We build variation graphs and de Bruijn graphs of this collection using five of the state-of-the-art tools: Bifrost, mdbg, Minigraph, Minigraph-Cactus and pggb. We examine differences in the way each of these tools represents variations between input sequences, both in terms of overall graph structure and representation of specific genetic loci. CONCLUSION This work sheds light on key differences between pangenome graph representations, informing end-users on how to select the most appropriate graph type for their application.
Affiliation(s)
- Francesco Andreace
- Department of Computational Biology, Institut Pasteur, Université Paris Cité, Paris, F-75015, France
- Sorbonne Université, Collège doctoral, F-75005, Paris, France
| | - Pierre Lechat
- Bioinformatics and Biostatistics Hub, Institut Pasteur, Université de Paris, F-75015, Paris, France
| | - Yoann Dufresne
- Department of Computational Biology, Institut Pasteur, Université Paris Cité, Paris, F-75015, France
- Bioinformatics and Biostatistics Hub, Institut Pasteur, Université de Paris, F-75015, Paris, France
| | - Rayan Chikhi
- Department of Computational Biology, Institut Pasteur, Université Paris Cité, Paris, F-75015, France
| |
|
20
|
Chandra G, Jain C. Gap-Sensitive Colinear Chaining Algorithms for Acyclic Pangenome Graphs. J Comput Biol 2023; 30:1182-1197. [PMID: 37902967 DOI: 10.1089/cmb.2023.0186] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/01/2023]
Abstract
A pangenome graph can serve as a better reference for genomic studies because it allows a compact representation of multiple genomes within a species. Aligning sequences to a graph is critical for pangenome-based resequencing. The seed-chain-extend heuristic works by finding short exact matches between a sequence and a graph. In this heuristic, colinear chaining helps identify a good cluster of exact matches that can be combined to form an alignment. Colinear chaining algorithms have been extensively studied for aligning two sequences with various gap costs, including linear, concave, and convex cost functions. However, extending these algorithms for sequence-to-graph alignment presents significant challenges. Recently, Mäkinen et al. introduced a sparse dynamic programming framework that exploits the small path cover property of acyclic pangenome graphs, enabling efficient chaining. However, this framework does not consider gap costs, limiting its practical effectiveness. We address this limitation by developing novel problem formulations and provably good chaining algorithms that support a variety of gap cost functions. These functions are carefully designed to enable fast chaining algorithms whose time requirements are parameterized in terms of the size of the minimum path cover. Through an empirical evaluation, we demonstrate the superior performance of our algorithm compared with existing aligners. When mapping simulated long reads to a pangenome graph comprising 95 human haplotypes, we achieved 98.7% precision while leaving <2% of reads unmapped.
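The idea of colinear chaining with a gap cost can be illustrated in the classical two-sequence setting that the abstract mentions as its starting point. The sketch below is a plain quadratic dynamic program, not the path-cover-based graph algorithm of the paper. The anchor format `(query_start, target_start, length)`, the name `chain_anchors`, and the particular linear gap cost (a weight times the sum of the two gaps) are assumptions chosen for illustration.

```python
def chain_anchors(anchors, gap_weight=0.5):
    """Pick the best-scoring colinear chain of non-overlapping anchors.

    anchors: iterable of (query_start, target_start, length), length >= 1.
    Score of a chain = sum of anchor lengths
                       - gap_weight * (query gap + target gap) per junction.
    Returns (score, chain) where chain is a list of anchors in order.
    """
    anchors = sorted(anchors)  # by query start, so predecessors come first
    n = len(anchors)
    if n == 0:
        return 0, []
    best = [a[2] for a in anchors]  # best chain score ending at each anchor
    prev = [-1] * n                 # back-pointers for the traceback
    for j in range(n):
        qj, tj, lj = anchors[j]
        for i in range(j):
            qi, ti, li = anchors[i]
            gap_q = qj - (qi + li)
            gap_t = tj - (ti + li)
            if gap_q < 0 or gap_t < 0:
                continue  # anchor i overlaps or is not colinear with j
            cand = best[i] + lj - gap_weight * (gap_q + gap_t)
            if cand > best[j]:
                best[j], prev[j] = cand, i
    # Trace back from the best-scoring end anchor.
    end = max(range(n), key=best.__getitem__)
    score, chain = best[end], []
    while end != -1:
        chain.append(anchors[end])
        end = prev[end]
    chain.reverse()
    return score, chain


score, chain = chain_anchors([(0, 0, 5), (6, 6, 4), (3, 20, 6), (12, 12, 3)])
print(score, chain)
```

In this example the off-diagonal anchor `(3, 20, 6)` is left out because it cannot be combined colinearly with the others; without a gap penalty, distant anchors would be chained as readily as adjacent ones, which is the limitation the abstract attributes to gap-insensitive chaining.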
Affiliation(s)
- Ghanshyam Chandra
- Department of Computational and Data Sciences, Indian Institute of Science Bengaluru, India
| | - Chirag Jain
- Department of Computational and Data Sciences, Indian Institute of Science Bengaluru, India
| |
|
21
|
Depuydt L, Renders L, Abeel T, Fostier J. Pan-genome de Bruijn graph using the bidirectional FM-index. BMC Bioinformatics 2023; 24:400. [PMID: 37884897 PMCID: PMC10605969 DOI: 10.1186/s12859-023-05531-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/13/2023] [Accepted: 10/12/2023] [Indexed: 10/28/2023]
Abstract
BACKGROUND Pan-genome graphs are gaining importance in the field of bioinformatics as data structures to represent and jointly analyze multiple genomes. Compacted de Bruijn graphs are inherently suited for this purpose, as their graph topology naturally reveals similarity and divergence within the pan-genome. Most state-of-the-art pan-genome graphs are represented explicitly in terms of nodes and edges. Recently, an alternative, implicit graph representation was proposed that builds directly upon the unidirectional FM-index. As such, a memory-efficient graph data structure is obtained that inherits the FM-index's backward search functionality. However, this representation suffers from a number of shortcomings in terms of functionality and algorithmic performance. RESULTS We present a data structure for a pan-genome, compacted de Bruijn graph that aims to address these shortcomings. It is built on the bidirectional FM-index, extending the ability of its unidirectional counterpart to navigate and search the graph in both directions. All basic graph navigation steps can be performed in constant time. Based on these features, we implement subgraph visualization as well as lossless approximate pattern matching to the graph using search schemes. We demonstrate that we can retrieve all occurrences corresponding to a read within a certain edit distance in a very efficient manner. Through a case study, we show the potential of exploiting the information embedded in the graph's topology through visualization and sequence alignment. CONCLUSIONS We propose a memory-efficient representation of the pan-genome graph that supports subgraph visualization and lossless approximate pattern matching of reads against the graph using search schemes. The C++ source code of our software, called Nexus, is available at https://github.com/biointec/nexus under the AGPL-3.0 license.
Affiliation(s)
- Lore Depuydt
- Department of Information Technology - IDLab, Ghent University - imec, Technologiepark 126, 9052, Ghent, Belgium
| | - Luca Renders
- Department of Information Technology - IDLab, Ghent University - imec, Technologiepark 126, 9052, Ghent, Belgium
| | - Thomas Abeel
- Delft Bioinformatics Lab, Delft University of Technology, 2628 XE, Delft, The Netherlands
- Infectious Disease and Microbiome Program, Broad Institute of MIT and Harvard, Cambridge, MA, 02142, USA
| | - Jan Fostier
- Department of Information Technology - IDLab, Ghent University - imec, Technologiepark 126, 9052, Ghent, Belgium
| |
|
22
|
Heumos S, Guarracino A, Schmelzle JNM, Li J, Zhang Z, Hagmann J, Nahnsen S, Prins P, Garrison E. Pangenome graph layout by Path-Guided Stochastic Gradient Descent. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2023:2023.09.22.558964. [PMID: 37790531 PMCID: PMC10542513 DOI: 10.1101/2023.09.22.558964] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/05/2023]
Abstract
Motivation The increasing availability of complete genomes demands models to study genomic variability within entire populations. Pangenome graphs capture the full genomic similarity and diversity between multiple genomes. In order to understand them, we need to see them. For visualization, we need a human-readable graph layout: a graph embedding in low (e.g. two) dimensional depictions. Due to a pangenome graph's potentially excessive size, this is a significant challenge. Results In response, we introduce a novel graph layout algorithm: the Path-Guided Stochastic Gradient Descent (PG-SGD). PG-SGD uses the genomes, represented in the pangenome graph as paths, as an embedded positional system to sample genomic distances between pairs of nodes. This avoids the quadratic cost seen in previous versions of graph drawing by Stochastic Gradient Descent (SGD). We show that our implementation efficiently computes the low-dimensional layouts of gigabase-scale pangenome graphs, unveiling their biological features. Availability We integrated PG-SGD in ODGI, which is released as free software under the MIT open source license. Source code is available at https://github.com/pangenome/odgi.
Affiliation(s)
- Simon Heumos
- Quantitative Biology Center (QBiC), University of Tübingen, Tübingen 72076, Germany
- Biomedical Data Science, Department of Computer Science, University of Tübingen, Tübingen 72076, Germany
| | - Andrea Guarracino
- Department of Genetics, Genomics and Informatics, University of Tennessee Health Science Center, Memphis, TN 38163, USA
- Genomics Research Centre, Human Technopole, Milan 20157, Italy
| | - Jan-Niklas M. Schmelzle
- Department of Computer Engineering, School of Computation, Information and Technology (CIT), Technical University of Munich, Munich 80333, Germany
- School of Electrical and Computer Engineering, Cornell University, Ithaca, NY 14853, USA
| | - Jiajie Li
- School of Electrical and Computer Engineering, Cornell University, Ithaca, NY 14853, USA
| | - Zhiru Zhang
- School of Electrical and Computer Engineering, Cornell University, Ithaca, NY 14853, USA
| | - Jörg Hagmann
- Computomics GmbH, Eisenbahnstr. 1, 72072 Tübingen, Germany
| | - Sven Nahnsen
- Quantitative Biology Center (QBiC), University of Tübingen, Tübingen 72076, Germany
- Biomedical Data Science, Department of Computer Science, University of Tübingen, Tübingen 72076, Germany
- M3 Research Center, University Hospital Tübingen, 72076 Tübingen, Germany
| | - Pjotr Prins
- Department of Genetics, Genomics and Informatics, University of Tennessee Health Science Center, Memphis, TN 38163, USA
| | - Erik Garrison
- Department of Genetics, Genomics and Informatics, University of Tennessee Health Science Center, Memphis, TN 38163, USA
| |
|
23
|
Xie S, Isaacs K, Becker G, Murdoch BM. A computational framework for improving genetic variants identification from 5,061 sheep sequencing data. J Anim Sci Biotechnol 2023; 14:127. [PMID: 37779189 PMCID: PMC10544426 DOI: 10.1186/s40104-023-00923-3] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 03/25/2023] [Accepted: 08/01/2023] [Indexed: 10/03/2023]
Abstract
BACKGROUND Pan-genomics is a recently emerging strategy that can be utilized to provide a more comprehensive characterization of genetic variation. Joint calling is routinely used to combine identified variants across multiple related samples. However, the improvement in variant identification gained from the mutual support information of multiple samples remains quite limited for population-scale genotyping. RESULTS In this study, we developed a computational framework for joint calling of genetic variants from 5,061 sheep by incorporating the sequencing error and optimizing mutual support information from multiple samples' data. The variants were accurately identified from multiple samples in four steps: (1) probabilities of variants from two widely used algorithms, GATK and Freebayes, were calculated by a Poisson model incorporating base sequencing error potential; (2) the variants with high mapping quality or consistently identified from at least two samples by GATK and Freebayes were used to construct the raw high-confidence identification (rHID) variants database; (3) the high-confidence variants identified in a single sample were ordered by probability value and controlled by false discovery rate (FDR) using the rHID database; (4) to avoid the elimination of potentially true variants from the rHID database, the variants that failed FDR were re-examined to rescue potentially true variants and ensure highly accurate variant identification. The results indicated that the percentage of concordant SNPs and Indels from Freebayes and GATK improved significantly, by 12%-32%, after applying our new method compared with raw variants. The method also found low-frequency variants in individual sheep involving several traits, including nipple number (GPC5), scrapie pathology (PAPSS2), seasonal reproduction and litter size (GRM1), coat color (RAB27A), and lentivirus susceptibility (TMEM154).
CONCLUSION The new method used a computational strategy to reduce the number of false positives while simultaneously improving the identification of genetic variants. This strategy did not incur any extra cost from additional samples or sequencing data, and it identified rare variants that can be important for practical applications in animal breeding.
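The abstract names two statistical ingredients, a Poisson model of sequencing error and FDR control, without giving formulas. The sketch below shows one generic way such pieces can fit together and should not be read as the authors' method: it assumes the p-value of a candidate variant is the Poisson tail probability of seeing at least the observed number of alternate reads from errors alone, and it assumes the Benjamini-Hochberg procedure for FDR control. All function names and parameters are hypothetical.

```python
import math


def poisson_tail(k, lam):
    """Return P(X >= k) for X ~ Poisson(lam), computed as 1 - P(X < k)."""
    term = math.exp(-lam)  # P(X = 0)
    cdf = 0.0
    for i in range(k):
        cdf += term
        term *= lam / (i + 1)  # P(X = i + 1) from P(X = i)
    return max(0.0, 1.0 - cdf)


def error_pvalue(alt_reads, depth, error_rate):
    """P-value that alt_reads or more arise purely from sequencing error.

    Assumption for this sketch: errors at a site follow Poisson(depth * error_rate).
    """
    return poisson_tail(alt_reads, depth * error_rate)


def benjamini_hochberg(pvalues, alpha=0.05):
    """Return a list of booleans: True where the hypothesis is rejected at FDR alpha."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    cutoff = 0
    for rank, idx in enumerate(order, start=1):
        if pvalues[idx] <= alpha * rank / m:
            cutoff = rank  # largest rank that passes its threshold
    keep = [False] * m
    for idx in order[:cutoff]:
        keep[idx] = True
    return keep


# Hypothetical sites given as (alternate reads, depth), with an assumed 1% error rate.
sites = [(12, 30), (1, 30), (5, 40), (0, 25)]
pvals = [error_pvalue(alt, depth, 0.01) for alt, depth in sites]
print(benjamini_hochberg(pvals))
```

Sites with many alternate reads relative to the expected error count get small p-values and survive the FDR filter, while sites with zero or one alternate read do not.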
Affiliation(s)
- Shangqian Xie
- Department of Animal, Veterinary & Food Sciences, University of Idaho, Moscow, ID, USA
| | | | - Gabrielle Becker
- Department of Animal, Veterinary & Food Sciences, University of Idaho, Moscow, ID, USA
| | - Brenda M Murdoch
- Department of Animal, Veterinary & Food Sciences, University of Idaho, Moscow, ID, USA
| |
|
24
|
Yang Z, Guarracino A, Biggs PJ, Black MA, Ismail N, Wold JR, Merriman TR, Prins P, Garrison E, de Ligt J. Pangenome graphs in infectious disease: a comprehensive genetic variation analysis of Neisseria meningitidis leveraging Oxford Nanopore long reads. Front Genet 2023; 14:1225248. [PMID: 37636268 PMCID: PMC10448961 DOI: 10.3389/fgene.2023.1225248] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/19/2023] [Accepted: 08/01/2023] [Indexed: 08/29/2023]
Abstract
Whole genome sequencing has revolutionized infectious disease surveillance for tracking and monitoring the spread and evolution of pathogens. However, using a linear reference genome for genomic analyses may introduce biases, especially when studies are conducted on highly variable bacterial genomes of the same species. Pangenome graphs provide an efficient model for representing and analyzing multiple genomes and their variants as a graph structure that includes all types of variations. In this study, we present a practical bioinformatics pipeline that employs the PanGenome Graph Builder and the Variation Graph toolkit to build pangenomes from assembled genomes, align whole genome sequencing data and call variants against a graph reference. The pangenome graph enables the identification of structural variants, rearrangements, and small variants (e.g., single nucleotide polymorphisms and insertions/deletions) simultaneously. We demonstrate that using a pangenome graph, instead of a single linear reference genome, improves mapping rates and variant calling for both simulated and real datasets of the pathogen Neisseria meningitidis. Overall, pangenome graphs offer a promising approach for comparative genomics and comprehensive genetic variation analysis in infectious disease. Moreover, this innovative pipeline, leveraging pangenome graphs, can bridge variant analysis, genome assembly, population genetics, and evolutionary biology, expanding the reach of genomic understanding and applications.
Affiliation(s)
- Zuyu Yang
- Institute of Environmental Science and Research, Porirua, New Zealand
| | - Andrea Guarracino
- Department of Genetics, Genomics and Informatics, University of Tennessee Health Science Center, Memphis, TN, United States
- Genomics Research Centre, Human Technopole, Milan, Italy
| | - Patrick J. Biggs
- Molecular Biosciences Group, School of Natural Sciences, Massey University, Palmerston North, New Zealand
- Molecular Epidemiology and Public Health Laboratory, School of Veterinary Science, Massey University, Palmerston North, New Zealand
| | - Michael A. Black
- Department of Biochemistry, University of Otago, Dunedin, New Zealand
| | - Nuzla Ismail
- Department of Biochemistry, University of Otago, Dunedin, New Zealand
| | - Jana Renee Wold
- School of Biological Sciences, University of Canterbury, Christchurch, New Zealand
| | - Tony R. Merriman
- Department of Biochemistry, University of Otago, Dunedin, New Zealand
- Division of Clinical Immunology and Rheumatology, University of Alabama at Birmingham, Birmingham, AL, United States
| | - Pjotr Prins
- Department of Genetics, Genomics and Informatics, University of Tennessee Health Science Center, Memphis, TN, United States
| | - Erik Garrison
- Department of Genetics, Genomics and Informatics, University of Tennessee Health Science Center, Memphis, TN, United States
| | - Joep de Ligt
- Institute of Environmental Science and Research, Porirua, New Zealand
| |
25
Ma J, Cáceres M, Salmela L, Mäkinen V, Tomescu AI. Chaining for accurate alignment of erroneous long reads to acyclic variation graphs. Bioinformatics 2023; 39:btad460. [PMID: 37494467 PMCID: PMC10423031 DOI: 10.1093/bioinformatics/btad460] [Citation(s) in RCA: 7] [Impact Index Per Article: 3.5] [Received: 05/20/2022] [Revised: 06/08/2023] [Accepted: 07/25/2023] [Indexed: 07/28/2023]
Abstract
MOTIVATION: Aligning reads to a variation graph is a standard task in pangenomics, with downstream applications such as improving variant calling. While the vg toolkit [Garrison et al. (Variation graph toolkit improves read mapping by representing genetic variation in the reference. Nat Biotechnol 2018;36:875-9)] is a popular aligner of short reads, GraphAligner [Rautiainen and Marschall (GraphAligner: rapid and versatile sequence-to-graph alignment. Genome Biol 2020;21:253-28)] is the state-of-the-art aligner of erroneous long reads. GraphAligner works by finding candidate read occurrences based on individually extending the best seeds of the read in the variation graph. However, a more principled approach recognized in the community is to co-linearly chain multiple seeds.
RESULTS: We present a new algorithm to co-linearly chain a set of seeds in a string labeled acyclic graph, together with the first efficient implementation of such a co-linear chaining algorithm into a new aligner of erroneous long reads to acyclic variation graphs, GraphChainer. We run experiments aligning real and simulated PacBio CLR reads with average error rates 15% and 5%. Compared to GraphAligner, GraphChainer aligns 12-17% more reads, and 21-28% more total read length, on real PacBio CLR reads from human chromosomes 1, 22, and the whole human pangenome. On both simulated and real data, GraphChainer aligns between 95% and 99% of all reads, and of total read length. We also show that minigraph [Li et al. (The design and construction of reference pangenome graphs with minigraph. Genome Biol 2020;21:265-19.)] and minichain [Chandra and Jain (Sequence to graph alignment using gap-sensitive co-linear chaining. In: Proceedings of the 27th Annual International Conference on Research in Computational Molecular Biology (RECOMB 2023). Springer, 2023, 58-73.)] obtain an accuracy of <60% on this setting.
AVAILABILITY AND IMPLEMENTATION: GraphChainer is freely available at https://github.com/algbio/GraphChainer. The datasets and evaluation pipeline can be reached from the previous address.
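The contrast the abstract draws between extending single best seeds and co-linearly chaining several seeds can be illustrated with a minimal sketch. This is not GraphChainer's algorithm: it assumes a plain linear reference instead of an acyclic variation graph, uses a simple quadratic dynamic program, and scores a chain by the total length of its seeds. The `Anchor` tuple layout and the function name `chain_anchors` are hypothetical choices made for this illustration.

```python
from typing import List, Tuple

# Hypothetical seed representation: (read_start, ref_start, length).
Anchor = Tuple[int, int, int]


def chain_anchors(anchors: List[Anchor]) -> List[Anchor]:
    """Return one highest-scoring co-linear chain (score = summed seed length).

    Simplified to a linear reference; a seed may follow another only if it
    starts at or after the other's end in BOTH the read and the reference.
    """
    if not anchors:
        return []
    order = sorted(anchors)           # sort by read_start, then ref_start
    n = len(order)
    best = [a[2] for a in order]      # best chain score ending at each seed
    prev = [-1] * n                   # back-pointers for chain recovery
    for j in range(n):
        rj, tj, lj = order[j]
        for i in range(j):
            ri, ti, li = order[i]
            if ri + li <= rj and ti + li <= tj and best[i] + lj > best[j]:
                best[j] = best[i] + lj
                prev[j] = i
    end = max(range(n), key=best.__getitem__)
    chain: List[Anchor] = []
    while end != -1:
        chain.append(order[end])
        end = prev[end]
    return chain[::-1]


# The single longest seed (5, 500, 20) is isolated; three shorter seeds that
# agree on one reference region together outscore it.
seeds = [(0, 100, 10), (12, 115, 8), (5, 500, 20), (25, 130, 10)]
print(chain_anchors(seeds))
```

In this toy input a best-seed strategy would start from the 20-base seed at reference position 500, whereas chaining selects the three mutually consistent seeds around position 100.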
Affiliation(s)
- Jun Ma
  - Department of Computer Science, University of Helsinki, 00014 Helsinki, Finland
- Manuel Cáceres
  - Department of Computer Science, University of Helsinki, 00014 Helsinki, Finland
- Leena Salmela
  - Department of Computer Science, University of Helsinki, 00014 Helsinki, Finland
- Veli Mäkinen
  - Department of Computer Science, University of Helsinki, 00014 Helsinki, Finland
- Alexandru I Tomescu
  - Department of Computer Science, University of Helsinki, 00014 Helsinki, Finland
26
Glick L, Mayrose I. The Effect of Methodological Considerations on the Construction of Gene-Based Plant Pan-genomes. Genome Biol Evol 2023; 15:evad121. [PMID: 37401440 PMCID: PMC10340445 DOI: 10.1093/gbe/evad121] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/29/2023] [Revised: 06/21/2023] [Accepted: 06/28/2023] [Indexed: 07/05/2023]
Abstract
Pan-genomics is an emerging approach for studying the genetic diversity within plant populations. In contrast to common resequencing studies that compare whole genome sequencing data with a single reference genome, the construction of a pan-genome (PG) involves the direct comparison of multiple genomes to one another, thereby enabling the detection of genomic sequences and genes not present in the reference, as well as the analysis of gene content diversity. Although multiple studies describing PGs of various plant species have been published in recent years, a better understanding regarding the effect of the computational procedures used for PG construction could guide researchers in making more informed methodological decisions. Here, we examine the effect of several key methodological factors on the obtained gene pool and on gene presence-absence detections by constructing and comparing multiple PGs of Arabidopsis thaliana and cultivated soybean, as well as conducting a meta-analysis on published PGs. These factors include the construction method, the sequencing depth, and the extent of input data used for gene annotation. We observe substantial differences between PGs constructed using three common procedures (de novo assembly and annotation, map-to-pan, and iterative assembly) and that results are dependent on the extent of the input data. Specifically, we report low agreement between the gene content inferred using different procedures and input data. Our results should increase the awareness of the community to the consequences of methodological decisions made during the process of PG construction and emphasize the need for further investigation of commonly applied methodologies.
Affiliation(s)
- Lior Glick
  - Department of Life Sciences, School of Plant Sciences and Food Security, Tel-Aviv University, Tel Aviv, Israel
- Itay Mayrose
  - Department of Life Sciences, School of Plant Sciences and Food Security, Tel-Aviv University, Tel Aviv, Israel
27
Luo G, Kumar H, Alridge K, Rieger S, Jiang E, Chan ER, Soliman A, Mahdi H, Letterio JJ. A core NRF2 gene set defined through comprehensive transcriptomic analysis predicts selective drug resistance and poor multi-cancer prognosis. bioRxiv 2023:2023.04.20.537691. [PMID: 37131828 PMCID: PMC10153264 DOI: 10.1101/2023.04.20.537691] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/04/2023]
Abstract
The NRF2-KEAP1 pathway plays an important role in the cellular response to oxidative stress but may also contribute to metabolic changes and drug resistance in cancer. We investigated the activation of NRF2 in human cancers and fibroblast cells through KEAP1 inhibition and cancer-associated KEAP1/NRF2 mutations. We define a core set of 14 upregulated NRF2 target genes from seven RNA-Sequencing databases that we generated and analyzed, and we validated this gene set through analyses of published databases and gene sets. An NRF2 activity score based on expression of these core target genes correlates with resistance to drugs such as PX-12 and necrosulfonamide but not to paclitaxel or bardoxolone methyl. We validated these findings and also found that NRF2 activation led to radioresistance in cancer cell lines. Finally, our NRF2 score is prognostic for cancer survival and was validated in additional independent cohorts for novel cancer types not associated with NRF2-KEAP1 mutations. These analyses define a core NRF2 gene set that is robust, versatile, and useful as an NRF2 biomarker and for predicting drug resistance and cancer prognosis.
28
Gong Y, Li Y, Liu X, Ma Y, Jiang L. A review of the pangenome: how it affects our understanding of genomic variation, selection and breeding in domestic animals? J Anim Sci Biotechnol 2023; 14:73. [PMID: 37143156 PMCID: PMC10161434 DOI: 10.1186/s40104-023-00860-1] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Received: 10/24/2022] [Accepted: 03/01/2023] [Indexed: 05/06/2023]
Abstract
As large-scale genomic studies have progressed, it has become clear that a single reference genome cannot represent genetic diversity at the species level. Domestic animals tend to have complex routes of origin and migration, which suggests that some population-specific sequences may be missing from the current reference genome. In contrast, the pangenome is a collection of all DNA sequences of a species: it contains the sequences shared by all individuals (core genome) and can also display sequence information unique to each individual (variable genome). Progress in pangenome research in humans, plants and domestic animals has shown that missing genetic components and large structural variants (SVs) can be identified through pangenomic studies. Many individual-specific sequences have been shown to be related to biological adaptability, phenotype and important economic traits. The maturation of technologies and methods such as third-generation sequencing, telomere-to-telomere genomes, graph genomes, and reference-free assembly will further promote pangenome research. In the future, pangenomes combined with long-read data and multi-omics will help to resolve large SVs and their relationship with the main economic traits of interest in domesticated animals, providing better insights into animal domestication, evolution and breeding. In this review, we mainly discuss how pangenome analysis reveals genetic variations in domestic animals (sheep, cattle, pigs, chickens) and their impacts on phenotypes, and how this can contribute to the understanding of species diversity. We also discuss potential issues and future perspectives of pangenome research in livestock and poultry.
Affiliation(s)
- Ying Gong
  - Laboratory of Animal Genetics, Breeding and Reproduction, Ministry of Agriculture, Institute of Animal Sciences, Chinese Academy of Agricultural Sciences (CAAS), Beijing, 100193, China
  - National Germplasm Center of Domestic Animal Resources, Ministry of Technology, Institute of Animal Sciences, Chinese Academy of Agricultural Sciences (CAAS), Beijing, 100193, China
- Yefang Li
  - Laboratory of Animal Genetics, Breeding and Reproduction, Ministry of Agriculture, Institute of Animal Sciences, Chinese Academy of Agricultural Sciences (CAAS), Beijing, 100193, China
  - National Germplasm Center of Domestic Animal Resources, Ministry of Technology, Institute of Animal Sciences, Chinese Academy of Agricultural Sciences (CAAS), Beijing, 100193, China
- Xuexue Liu
  - Laboratory of Animal Genetics, Breeding and Reproduction, Ministry of Agriculture, Institute of Animal Sciences, Chinese Academy of Agricultural Sciences (CAAS), Beijing, 100193, China
  - National Germplasm Center of Domestic Animal Resources, Ministry of Technology, Institute of Animal Sciences, Chinese Academy of Agricultural Sciences (CAAS), Beijing, 100193, China
  - Centre d'Anthropobiologie et de Génomique de Toulouse, Université Paul Sabatier, 37 allées Jules Guesde, Toulouse, 31000, France
- Yuehui Ma
  - Laboratory of Animal Genetics, Breeding and Reproduction, Ministry of Agriculture, Institute of Animal Sciences, Chinese Academy of Agricultural Sciences (CAAS), Beijing, 100193, China
  - National Germplasm Center of Domestic Animal Resources, Ministry of Technology, Institute of Animal Sciences, Chinese Academy of Agricultural Sciences (CAAS), Beijing, 100193, China
- Lin Jiang
  - Laboratory of Animal Genetics, Breeding and Reproduction, Ministry of Agriculture, Institute of Animal Sciences, Chinese Academy of Agricultural Sciences (CAAS), Beijing, 100193, China
  - National Germplasm Center of Domestic Animal Resources, Ministry of Technology, Institute of Animal Sciences, Chinese Academy of Agricultural Sciences (CAAS), Beijing, 100193, China
29
Sibbesen JA, Eizenga JM, Novak AM, Sirén J, Chang X, Garrison E, Paten B. Haplotype-aware pantranscriptome analyses using spliced pangenome graphs. Nat Methods 2023; 20:239-247. [PMID: 36646895 DOI: 10.1101/2021.03.26.437240] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Received: 06/18/2021] [Accepted: 11/28/2022] [Indexed: 05/24/2023]
Abstract
Pangenomics is emerging as a powerful computational paradigm in bioinformatics. This field uses population-level genome reference structures, typically consisting of a sequence graph, to mitigate reference bias and facilitate analyses that were challenging with previous reference-based methods. In this work, we extend these methods into transcriptomics to analyze sequencing data using the pantranscriptome: a population-level transcriptomic reference. Our toolchain, which consists of additions to the VG toolkit and a standalone tool, RPVG, can construct spliced pangenome graphs, map RNA sequencing data to these graphs, and perform haplotype-aware expression quantification of transcripts in a pantranscriptome. We show that this workflow improves accuracy over state-of-the-art RNA sequencing mapping methods, and that it can efficiently quantify haplotype-specific transcript expression without needing to characterize the haplotypes of a sample beforehand.
Affiliation(s)
- Adam M Novak
  - UC Santa Cruz Genomics Institute, Santa Cruz, CA, USA
- Jouni Sirén
  - UC Santa Cruz Genomics Institute, Santa Cruz, CA, USA
- Xian Chang
  - UC Santa Cruz Genomics Institute, Santa Cruz, CA, USA
- Erik Garrison
  - University of Tennessee Health Science Center, Memphis, TN, USA
30
Li T, Yin Y. Critical assessment of pan-genomic analysis of metagenome-assembled genomes. Brief Bioinform 2022; 23:6702672. [PMID: 36124775 PMCID: PMC9677465 DOI: 10.1093/bib/bbac413] [Citation(s) in RCA: 17] [Impact Index Per Article: 5.7] [Received: 06/12/2022] [Revised: 08/23/2022] [Accepted: 08/26/2022] [Indexed: 12/30/2022]
Abstract
Pan-genome analyses of metagenome-assembled genomes (MAGs) may suffer from the known issues with MAGs: fragmentation, incompleteness and contamination. Here, we conducted a critical assessment of pan-genomics of MAGs by comparing pan-genome analysis results of complete bacterial genomes and simulated MAGs. We found that incompleteness led to significant core gene (CG) loss. The CG loss remained when using different pan-genome analysis tools (Roary, BPGA, Anvi'o) and when using a mixture of MAGs and complete genomes. Contamination had little effect on core genome size (except for Roary, owing to its gene clustering issue) but had a major influence on accessory genomes. Importantly, the CG loss was partially alleviated by lowering the CG threshold and using gene prediction algorithms that consider fragmented genes, but to a lesser degree when incompleteness was higher than 5%. The CG loss also led to incorrect pan-genome functional predictions and inaccurate phylogenetic trees. Our main findings were supported by a study of real MAG-isolate genome data. We conclude that lowering the CG threshold and predicting genes in metagenome mode (as Anvi'o does with Prodigal) are necessary in pan-genome analysis of MAGs. Development of new pan-genome analysis tools specifically for MAGs is needed in future studies.
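The effect of the core gene threshold described in this abstract can be shown with a small worked example. The sketch below is an illustration under stated assumptions, not the procedure of Roary, BPGA or Anvi'o: it assumes gene families have already been clustered, represents each genome as a set of hypothetical family identifiers, and defines a core gene as a family present in at least a given fraction of genomes. The function name `core_genes` and the toy data are invented for this example.

```python
from typing import Dict, Set


def core_genes(presence: Dict[str, Set[str]], threshold: float = 1.0) -> Set[str]:
    """Gene families present in at least `threshold` (0-1) of the genomes."""
    if not presence:
        return set()
    n_genomes = len(presence)
    counts: Dict[str, int] = {}
    for families in presence.values():
        for family in families:
            counts[family] = counts.get(family, 0) + 1
    return {f for f, c in counts.items() if c >= threshold * n_genomes}


# Toy data: three complete genomes plus one incomplete MAG that lost "g2".
genomes = {
    "isolate_A": {"g1", "g2", "g3", "g4"},
    "isolate_B": {"g1", "g2", "g3"},
    "isolate_C": {"g1", "g2", "g3"},
    "mag_D": {"g1", "g3"},
}

strict = core_genes(genomes, threshold=1.0)    # "g2" drops out of the core
relaxed = core_genes(genomes, threshold=0.75)  # "g2" is recovered
print(sorted(strict), sorted(relaxed))
```

With a strict 100% threshold a single incomplete genome removes a family from the core, while a relaxed threshold recovers it, which mirrors the partial alleviation reported above.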
Affiliation(s)
- Tang Li
  - Nebraska Food for Health Center, Department of Food Science and Technology, University of Nebraska - Lincoln, Lincoln, NE, 68508, USA
- Yanbin Yin
  - Corresponding author. Yanbin Yin, Nebraska Food for Health Center, Department of Food Science and Technology, University of Nebraska - Lincoln, Lincoln, NE 68508, USA. Tel.: +1-402-472-4303; E-mail:
31
Guarracino A, Heumos S, Nahnsen S, Prins P, Garrison E. ODGI: understanding pangenome graphs. Bioinformatics 2022; 38:3319-3326. [PMID: 35552372 DOI: 10.1101/2021.11.10.467921] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 11/09/2021] [Revised: 03/18/2022] [Indexed: 05/24/2023]
Abstract
MOTIVATION: Pangenome graphs provide a complete representation of the mutual alignment of collections of genomes. These models offer the opportunity to study the entire genomic diversity of a population, including structurally complex regions. Nevertheless, analyzing hundreds of gigabase-scale genomes using pangenome graphs is difficult as it is not well-supported by existing tools. Hence, fast and versatile software is required to ask advanced questions to such data in an efficient way.
RESULTS: We wrote Optimized Dynamic Genome/Graph Implementation (ODGI), a novel suite of tools that implements scalable algorithms and has an efficient in-memory representation of DNA pangenome graphs in the form of variation graphs. ODGI supports pre-built graphs in the Graphical Fragment Assembly format. ODGI includes tools for detecting complex regions, extracting pangenomic loci, removing artifacts, exploratory analysis, manipulation, validation and visualization. Its fast parallel execution facilitates routine pangenomic tasks, as well as pipelines that can quickly answer complex biological questions of gigabase-scale pangenome graphs.
AVAILABILITY AND IMPLEMENTATION: ODGI is published as free software under the MIT open source license. Source code can be downloaded from https://github.com/pangenome/odgi and documentation is available at https://odgi.readthedocs.io. ODGI can be installed via Bioconda https://bioconda.github.io/recipes/odgi/README.html or GNU Guix https://github.com/pangenome/odgi/blob/master/guix.scm.
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Simon Heumos
  - Quantitative Biology Center (QBiC), University of Tübingen, Tübingen 72076, Germany
  - Biomedical Data Science, Department of Computer Science, University of Tübingen, Tübingen 72076, Germany
- Sven Nahnsen
  - Quantitative Biology Center (QBiC), University of Tübingen, Tübingen 72076, Germany
  - Biomedical Data Science, Department of Computer Science, University of Tübingen, Tübingen 72076, Germany
- Pjotr Prins
  - Department of Genetics, Genomics and Informatics, University of Tennessee Health Science Center, Memphis, TN 38163, USA
- Erik Garrison
  - Department of Genetics, Genomics and Informatics, University of Tennessee Health Science Center, Memphis, TN 38163, USA
32
Pangenomics in Microbial and Crop Research: Progress, Applications, and Perspectives. Genes (Basel) 2022; 13:genes13040598. [PMID: 35456404 PMCID: PMC9031676 DOI: 10.3390/genes13040598] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 01/14/2022] [Revised: 03/16/2022] [Accepted: 03/25/2022] [Indexed: 01/25/2023]
Abstract
Advances in sequencing technologies and bioinformatics tools have fueled a renewed interest in whole genome sequencing efforts in many organisms. The growing availability of multiple genome sequences has advanced our understanding of within-species diversity, in the form of a pangenome. Pangenomics has opened new avenues for future research, such as allowing dissection of complex molecular mechanisms and increased confidence in genome mapping. To comprehensively capture the genetic diversity for improving plant performance, the pangenome concept is further extended from species to genus level by the inclusion of wild species, constituting a super-pangenome. Characterization of the pangenome has implications for both basic and applied research. The concept of the pangenome has transformed the way biological questions are addressed. From understanding evolution and adaptation to elucidating host–pathogen interactions, and from finding novel genes or breeding targets to aid crop improvement to designing effective vaccines for human prophylaxis, the increasing availability of pangenomes has revolutionized several aspects of biological research. The future availability of high-resolution pangenomes based on reference-level near-complete genome assemblies would greatly improve our ability to address complex biological problems.
33
Baaijens JA, Bonizzoni P, Boucher C, Della Vedova G, Pirola Y, Rizzi R, Sirén J. Computational graph pangenomics: a tutorial on data structures and their applications. Nat Comput 2022; 21:81-108. [PMID: 36969737 PMCID: PMC10038355 DOI: 10.1007/s11047-022-09882-6] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.7] [Accepted: 02/14/2022] [Indexed: 05/08/2023]
Abstract
Computational pangenomics is an emerging research field that is changing the way computer scientists are facing challenges in biological sequence analysis. In past decades, contributions from combinatorics, stringology, graph theory and data structures were essential in the development of a plethora of software tools for the analysis of the human genome. These tools allowed computational biologists to approach ambitious projects at population scale, such as the 1000 Genomes Project. A major contribution of the 1000 Genomes Project is the characterization of a broad spectrum of genetic variations in the human genome, including the discovery of novel variations in the South Asian, African and European populations, thus enhancing the catalogue of variability within the reference genome. Currently, the need to take into account the high variability in population genomes, as well as the specificity of an individual genome in a personalized approach to medicine, is rapidly pushing the abandonment of the traditional paradigm of using a single reference genome. A graph-based representation of multiple genomes, or a graph pangenome, is replacing the linear reference genome. This means completely rethinking well-established procedures to analyze, store, and access information from genome representations. Properly addressing these challenges is crucial to face the computational tasks of ambitious healthcare projects aiming to characterize human diversity by sequencing 1M individuals (Stark et al. 2019). This tutorial aims to introduce readers to the most recent advances in the theory of data structures for the representation of graph pangenomes. We discuss efficient representations of haplotypes and the variability of genotypes in graph pangenomes, and highlight applications in solving computational problems in human and microbial (viral) pangenomes.
Affiliation(s)
- Jasmijn A. Baaijens
  - Department of Intelligent Systems, Delft University of Technology, Van Mourik Broekmanweg 6, 2628XE Delft, The Netherlands
  - Department of Biomedical Informatics, Harvard University, 10 Shattuck St, Boston, MA 02115, USA
- Paola Bonizzoni
  - Department of Informatics, Systems and Communication (DISCo), University of Milano-Bicocca, V.le Sarca, 336, 20126 Milan, Italy
- Christina Boucher
  - Department of Computer and Information Science and Engineering, University of Florida, 432 Newell Dr, Gainesville, FL 32603, USA
- Gianluca Della Vedova
  - Department of Informatics, Systems and Communication (DISCo), University of Milano-Bicocca, V.le Sarca, 336, 20126 Milan, Italy
- Yuri Pirola
  - Department of Informatics, Systems and Communication (DISCo), University of Milano-Bicocca, V.le Sarca, 336, 20126 Milan, Italy
- Raffaella Rizzi
  - Department of Informatics, Systems and Communication (DISCo), University of Milano-Bicocca, V.le Sarca, 336, 20126 Milan, Italy
- Jouni Sirén
  - Genomics Institute, University of California, 1156 High St., Santa Cruz, CA 95064, USA
34
Tay Fernandez CG, Nestor BJ, Danilevicz MF, Gill M, Petereit J, Bayer PE, Finnegan PM, Batley J, Edwards D. Pangenomes as a Resource to Accelerate Breeding of Under-Utilised Crop Species. Int J Mol Sci 2022; 23:2671. [PMID: 35269811 PMCID: PMC8910360 DOI: 10.3390/ijms23052671] [Citation(s) in RCA: 14] [Impact Index Per Article: 4.7] [Received: 02/01/2022] [Revised: 02/21/2022] [Accepted: 02/21/2022] [Indexed: 02/01/2023]
Abstract
Pangenomes are a rich resource to examine the genomic variation observed within a species or genera, supporting population genetics studies, with applications for the improvement of crop traits. Major crop species such as maize (Zea mays), rice (Oryza sativa), Brassica (Brassica spp.), and soybean (Glycine max) have had pangenomes constructed and released, and this has led to the discovery of valuable genes associated with disease resistance and yield components. However, pangenome data are not available for many less prominent crop species that are currently under-utilised. Despite many under-utilised species being important food sources in regional populations, the scarcity of genomic data for these species hinders their improvement. Here, we assess several under-utilised crops and review the pangenome approaches that could be used to build resources for their improvement. Many of these under-utilised crops are cultivated in arid or semi-arid environments, suggesting that novel genes related to drought tolerance may be identified and used for introgression into related major crop species. In addition, we discuss how previously collected data could be used to enrich pangenome functional analysis in genome-wide association studies (GWAS) based on studies in major crops. Considering the technological advances in genome sequencing, pangenome references for under-utilised species are becoming more obtainable, offering the opportunity to identify novel genes related to agro-morphological traits in these species.
Affiliation(s)
- David Edwards
  - School of Biological Sciences, The University of Western Australia, Perth, WA 6009, Australia; (C.G.T.F.); (B.J.N.); (M.F.D.); (M.G.); (J.P.); (P.E.B.); (P.M.F.); (J.B.)
35
Unveiling lignocellulolytic trait of a goat omasum inhabitant Klebsiella variicola strain HSTU-AAM51 in light of biochemical and genome analyses. Braz J Microbiol 2022; 53:99-130. [PMID: 35088248 PMCID: PMC8882562 DOI: 10.1007/s42770-021-00660-7] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Received: 03/16/2021] [Accepted: 11/19/2021] [Indexed: 01/30/2023]
Abstract
Klebsiella variicola is generally known as an endophyte as well as a lignocellulose-degrading species. However, its role in the goat omasum and its lignocellulolytic genetic repertoire have not yet been explored. In this study, five different pectin-degrading bacteria were isolated from a healthy goat omasum. Among them, a new Klebsiella variicola strain, HSTU-AAM51, was identified as a lignocellulose degrader. The genome of the HSTU-AAM51 strain comprised 5,564,045 bp with a GC content of 57.2% and 5312 coding sequences. The comparison of housekeeping genes (16S rRNA, TonB, gyrase B, RecA) and whole-genome sequence (ANI, pangenome, synteny, DNA-DNA hybridization) revealed that the strain HSTU-AAM51 clustered with Klebsiella variicola strains but diverged markedly from them. It carried seventeen cellulase (GH1, GH3, GH4, GH5, GH13), fourteen beta-glucosidase (2GH3, 7GH4, 4GH1), two glucosidase, and one pullulanase genes. The strain secreted cellulase, pectinase, and xylanase, lignin peroxidase approximately 76-78 U/mL and 57-60 U/mL, respectively, when it was cultured on banana pseudostem for 96 h. The catalytically important residues of extracellular cellulase, xylanase, mannanase, pectinase, chitinase, and tannase proteins (validated 3D model) were bound to their specific ligands. In addition, genes involved in the benzoate and phenylacetate catabolic pathways, as well as laccase and DyP-type peroxidase genes, were annotated, which indicated the strain's lignin-degrading potential. This study revealed a new K. variicola bacterium from goat omasum that harbors lignin-degrading and cellulolytic enzymes that could be utilized for the production of bioethanol from lignocelluloses.
36
Gavali S, Ross KE, Cowart J, Chen C, Wu CH. iPTMnet RESTful API for Post-translational Modification Network Analysis. Methods Mol Biol 2022; 2499:187-204. [PMID: 35696082 PMCID: PMC10082948 DOI: 10.1007/978-1-0716-2317-6_10] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/18/2022]
Abstract
iPTMnet is a resource that combines rich information about protein post-translational modifications (PTMs) from curated databases as well as text mining tools. Researchers can use the iPTMnet website to query, analyze and download the PTM data. In this chapter, we describe the iPTMnet RESTful API, which provides a way to streamline the integration of iPTMnet data into an automated data analysis workflow. In the first section, we give an overview of the architecture of the API. In the second section, we describe the various functions defined by the API and provide detailed examples of using these functions.
37
Sirén J, Monlong J, Chang X, Novak AM, Eizenga JM, Markello C, Sibbesen JA, Hickey G, Chang PC, Carroll A, Gupta N, Gabriel S, Blackwell TW, Ratan A, Taylor KD, Rich SS, Rotter JI, Haussler D, Garrison E, Paten B. Pangenomics enables genotyping of known structural variants in 5202 diverse genomes. Science 2021; 374:abg8871. [PMID: 34914532 PMCID: PMC9365333 DOI: 10.1126/science.abg8871] [Citation(s) in RCA: 162] [Impact Index Per Article: 40.5] [Indexed: 12/12/2022]
Abstract
We introduce Giraffe, a pangenome short-read mapper that can efficiently map to a collection of haplotypes threaded through a sequence graph. Giraffe maps sequencing reads to thousands of human genomes at a speed comparable to that of standard methods mapping to a single reference genome. The increased mapping accuracy enables downstream improvements in genome-wide genotyping pipelines for both small variants and larger structural variants. We used Giraffe to genotype 167,000 structural variants, discovered in long-read studies, in 5202 diverse human genomes that were sequenced using short reads. We conclude that pangenomics facilitates a more comprehensive characterization of variation and, as a result, has the potential to improve many genomic analyses.
Affiliation(s)
- Jouni Sirén
  - UC Santa Cruz Genomics Institute, Santa Cruz, CA, USA
- Jean Monlong
  - UC Santa Cruz Genomics Institute, Santa Cruz, CA, USA
- Xian Chang
  - UC Santa Cruz Genomics Institute, Santa Cruz, CA, USA
- Adam M. Novak
  - UC Santa Cruz Genomics Institute, Santa Cruz, CA, USA
- Glenn Hickey
  - UC Santa Cruz Genomics Institute, Santa Cruz, CA, USA
- Pi-Chuan Chang
  - Google Inc, 1600 Amphitheatre Pkwy, Mountain View, CA, USA
- Andrew Carroll
  - Google Inc, 1600 Amphitheatre Pkwy, Mountain View, CA, USA
- Namrata Gupta
  - Genomics Platform, Broad Institute, Cambridge, MA, USA
- Stacey Gabriel
  - Program in Medical and Population Genetics, Broad Institute, Cambridge, MA, USA
- Aakrosh Ratan
  - Center for Public Health Genomics, University of Virginia, Charlottesville, VA, USA
- Kent D. Taylor
  - The Institute for Translational Genomics and Population Sciences, Department of Pediatrics, The Lundquist Institute for Biomedical Innovation at Harbor-UCLA Medical Center, Torrance, CA, USA
- Stephen S. Rich
  - Center for Public Health Genomics, University of Virginia, Charlottesville, VA, USA
- Jerome I. Rotter
  - The Institute for Translational Genomics and Population Sciences, Department of Pediatrics, The Lundquist Institute for Biomedical Innovation at Harbor-UCLA Medical Center, Torrance, CA, USA
- David Haussler
  - UC Santa Cruz Genomics Institute, Santa Cruz, CA, USA
  - Howard Hughes Medical Institute, University of California, Santa Cruz, CA, USA
- Erik Garrison
  - Department of Genetics, Genomics and Informatics, University of Tennessee Health Science Center, Memphis, TN, USA
Collapse
|
38
Joo KA, Muszynski MG, Kantar MB, Wang ML, He X, Del Valle Echevarria AR. Utilizing CRISPR-Cas in Tropical Crop Improvement: A Decision Process for Fitting Genome Engineering to Your Species. Front Genet 2021; 12:786140. [PMID: 34868276 PMCID: PMC8633396 DOI: 10.3389/fgene.2021.786140] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/29/2021] [Accepted: 10/29/2021] [Indexed: 11/13/2022] Open
Abstract
Adopting modern gene-editing technologies for trait improvement in agriculture requires important workflow developments, yet these developments are not often discussed. Using tropical crop systems as a case study, we describe a workflow broken down into discrete processes with specific steps and decision points that allow for the practical application of the CRISPR-Cas gene editing platform in a crop of interest. While we present the steps of developing genome-edited plants as sequential, in practice some can be done in parallel, as discussed in this perspective. The main processes include 1) understanding the genetic basis of the trait along with having the crop’s genome sequence, 2) testing and optimizing the editing reagents, developing efficient 3) tissue culture and 4) transformation methods, and 5) screening methods to identify edited events with commercial potential. Our goal in this perspective is to help any lab that wishes to implement this powerful, easy-to-use tool in their pipeline, thus aiming to democratize the technology.
Affiliation(s)
- Kathleen A Joo: Department of Tropical Plant and Soil Sciences, University of Hawaii at Manoa, Honolulu, HI, United States
- Michael G Muszynski: Department of Tropical Plant and Soil Sciences, University of Hawaii at Manoa, Honolulu, HI, United States
- Michael B Kantar: Department of Tropical Plant and Soil Sciences, University of Hawaii at Manoa, Honolulu, HI, United States
- Ming-Li Wang: Hawaii Agriculture Research Center, Waipahu, HI, United States
- Xiaoling He: Hawaii Agriculture Research Center, Waipahu, HI, United States
- Angel R Del Valle Echevarria: Department of Tropical Plant and Soil Sciences, University of Hawaii at Manoa, Honolulu, HI, United States; Hawaii Agriculture Research Center, Waipahu, HI, United States
39
Jiao D, Dong X, Yu Y, Wei C. Gene Presence/Absence Variation analysis of coronavirus family displays its pan-genomic diversity. Int J Biol Sci 2021; 17:3717-3727. [PMID: 34671195 PMCID: PMC8495401 DOI: 10.7150/ijbs.58220] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Received: 01/14/2021] [Accepted: 08/07/2021] [Indexed: 11/15/2022] Open
Abstract
SARS-CoV-2 belongs to the coronavirus family. Comparing the genomic features of viral genomes across the coronavirus family can improve our understanding of SARS-CoV-2. Here we present the first pan-genome analysis of 3,932 whole genomes of 101 species out of 4 genera from the coronavirus family. We found a total of 181 genes in the pan-genome of the coronavirus family, among which only 3 genes, the S gene, M gene and N gene, are highly conserved. We also constructed a pan-genome from 23,539 whole genomes of SARS-CoV-2. There are 13 genes in total in the SARS-CoV-2 pan-genome, and all 13 are core genes for SARS-CoV-2. The pan-genome of coronaviruses shows a lower level of diversity than the pan-genomes of other RNA viruses, which contain no core gene. The three highly conserved genes in the coronavirus family, which are also core genes in the SARS-CoV-2 pan-genome, could be potential targets for developing nucleic acid diagnostic reagents with a decreased possibility of cross-reaction with other coronavirus species.
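The core versus accessory distinction that this abstract relies on can be illustrated with a small sketch. This is a hypothetical example and not the authors' pipeline: it assumes the gene presence/absence data are available as a mapping from genome name to the set of gene families found in that genome, and it calls a gene "core" when it occurs in at least a chosen fraction of genomes. The function name, the `core_fraction` parameter, and the toy data are assumptions made purely for illustration.

```python
from collections import Counter


def classify_pan_genome(presence, core_fraction=1.0):
    """Split gene families into core and accessory sets.

    presence: dict mapping genome name -> set of gene family names
              (a hypothetical input format chosen for this sketch).
    core_fraction: minimum fraction of genomes that must carry a gene
                   for it to be counted as core (1.0 = present in all).
    """
    n_genomes = len(presence)
    if n_genomes == 0:
        return set(), set()
    # Each genome contributes a set, so a gene is counted once per genome.
    counts = Counter(gene for genes in presence.values() for gene in genes)
    core = {g for g, c in counts.items() if c / n_genomes >= core_fraction}
    accessory = set(counts) - core
    return core, accessory


# Toy data, invented for illustration only.
toy = {
    "genome_A": {"S", "M", "N", "E"},
    "genome_B": {"S", "M", "N"},
    "genome_C": {"S", "M", "N", "ORF8"},
}
core, accessory = classify_pan_genome(toy)
print(sorted(core))       # ['M', 'N', 'S']
print(sorted(accessory))  # ['E', 'ORF8']
```

Lowering `core_fraction` (for example to 0.95) gives a "soft core" definition, which is a common way to tolerate incomplete assemblies; whether the study used a strict or soft threshold is not stated in the abstract.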
Affiliation(s)
- Du Jiao: Department of Bioinformatics and Biostatistics, School of Life Sciences and Biotechnology, Shanghai Jiao Tong University, 800 Dongchuan Road, Shanghai 200240, China
- Xiaorui Dong: Department of Bioinformatics and Biostatistics, School of Life Sciences and Biotechnology, Shanghai Jiao Tong University, 800 Dongchuan Road, Shanghai 200240, China
- Yingyan Yu: Department of General Surgery of Ruijin Hospital, Shanghai Institute of Digestive Surgery, and Shanghai Key Laboratory for Gastric Neoplasms, Shanghai Jiao Tong University School of Medicine, 200025, Shanghai, China
- Chaochun Wei: Department of Bioinformatics and Biostatistics, School of Life Sciences and Biotechnology, Shanghai Jiao Tong University, 800 Dongchuan Road, Shanghai 200240, China; SJTU-Yale Joint Center for Biostatistics and Data Science, Shanghai Jiao Tong University, 800 Dongchuan Road, Shanghai 200240, China
40
Liang Q, Lonardi S. Reference-agnostic representation and visualization of pan-genomes. BMC Bioinformatics 2021; 22:502. [PMID: 34656081 PMCID: PMC8520301 DOI: 10.1186/s12859-021-04424-w] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Received: 10/28/2020] [Accepted: 10/04/2021] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND: The pan-genome of a species is the union of the genes and non-coding sequences present in all individuals (cultivars, accessions, or strains) within that species. RESULTS: Here we introduce PGV, a reference-agnostic representation of the pan-genome of a species based on the notion of consensus ordering. Our experimental results demonstrate that PGV enables an intuitive, effective and interactive visualization of a pan-genome by providing a genome browser that can elucidate complex structural genomic variations. CONCLUSIONS: The PGV software can be installed via conda or downloaded from https://github.com/ucrbioinfo/PGV. The companion PGV browser at http://pgv.cs.ucr.edu can be tested using example bed tracks available from the GitHub page.
Affiliation(s)
- Qihua Liang: Department of Computer Science and Engineering, University of California, Riverside, CA, 92521, USA
- Stefano Lonardi: Department of Computer Science and Engineering, University of California, Riverside, CA, 92521, USA
41
Durant É, Sabot F, Conte M, Rouard M. Panache: a Web Browser-Based Viewer for Linearized Pangenomes. Bioinformatics 2021; 37:4556-4558. [PMID: 34601567 PMCID: PMC8652104 DOI: 10.1093/bioinformatics/btab688] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Received: 05/03/2021] [Revised: 07/28/2021] [Accepted: 09/24/2021] [Indexed: 11/15/2022] Open
Abstract
Motivation: Pangenomics has evolved since its first applications in bacteria, extending from the study of genes for a given population to the study of all of its available sequences. While multiple methods are being developed to construct pangenomes in eukaryotic species, there is still a gap for efficient and user-friendly visualization tools. Emerging graph representations come with their own challenges, and linearity remains a suitable option for user-friendliness. Results: We introduce Panache, a tool for the visualization and exploration of linear representations of gene-based and sequence-based pangenomes. It uses a layout similar to genome browsers to display presence/absence variations and additional tracks along a linear axis with a pangenomics perspective. Availability and implementation: Panache is available at github.com/SouthGreenPlatform/panache under the MIT License.
Affiliation(s)
- Éloi Durant: DIADE, Univ Montpellier, CIRAD, IRD, Montpellier, 34830, France; Syngenta Seeds SAS, Saint-Sauveur, 31790, France; Bioversity International, Parc Scientifique Agropolis II, Montpellier, 34397, France; French Institute of Bioinformatics (IFB)-South Green Bioinformatics Platform, Bioversity, CIRAD, INRAE, IRD, Montpellier, 34398, France
- François Sabot: DIADE, Univ Montpellier, CIRAD, IRD, Montpellier, 34830, France; French Institute of Bioinformatics (IFB)-South Green Bioinformatics Platform, Bioversity, CIRAD, INRAE, IRD, Montpellier, 34398, France
- Mathieu Rouard: Bioversity International, Parc Scientifique Agropolis II, Montpellier, 34397, France
42
Colquhoun RM, Hall MB, Lima L, Roberts LW, Malone KM, Hunt M, Letcher B, Hawkey J, George S, Pankhurst L, Iqbal Z. Pandora: nucleotide-resolution bacterial pan-genomics with reference graphs. Genome Biol 2021; 22:267. [PMID: 34521456 PMCID: PMC8442373 DOI: 10.1186/s13059-021-02473-1] [Citation(s) in RCA: 18] [Impact Index Per Article: 4.5] [Received: 11/26/2020] [Accepted: 08/19/2021] [Indexed: 12/21/2022] Open
Abstract
We present pandora, a novel pan-genome graph structure and algorithms for identifying variants across the full bacterial pan-genome. As much bacterial adaptability hinges on the accessory genome, methods which analyze SNPs in just the core genome have unsatisfactory limitations. Pandora approximates a sequenced genome as a recombinant of references, detects novel variation and pan-genotypes multiple samples. Using a reference graph of 578 Escherichia coli genomes, we compare 20 diverse isolates. Pandora recovers more rare SNPs than single-reference-based tools, is significantly better than picking the closest RefSeq reference, and provides a stable framework for analyzing diverse samples without reference bias.
Affiliation(s)
- Rachel M Colquhoun: European Bioinformatics Institute, Hinxton, Cambridge, CB10 1SD, UK; Wellcome Trust Centre for Human Genetics, University of Oxford, Roosevelt Drive, Oxford, UK; Institute of Evolutionary Biology, Ashworth Laboratories, University of Edinburgh, Edinburgh, UK
- Michael B Hall: European Bioinformatics Institute, Hinxton, Cambridge, CB10 1SD, UK
- Leandro Lima: European Bioinformatics Institute, Hinxton, Cambridge, CB10 1SD, UK
- Leah W Roberts: European Bioinformatics Institute, Hinxton, Cambridge, CB10 1SD, UK
- Kerri M Malone: European Bioinformatics Institute, Hinxton, Cambridge, CB10 1SD, UK
- Martin Hunt: European Bioinformatics Institute, Hinxton, Cambridge, CB10 1SD, UK; Nuffield Department of Medicine, University of Oxford, Oxford, UK
- Brice Letcher: European Bioinformatics Institute, Hinxton, Cambridge, CB10 1SD, UK
- Jane Hawkey: Department of Infectious Diseases, Central Clinical School, Monash University, Melbourne, Victoria, 3004, Australia
- Sophie George: Nuffield Department of Medicine, University of Oxford, Oxford, UK
- Louise Pankhurst: Nuffield Department of Medicine, University of Oxford, Oxford, UK; Department of Zoology, University of Oxford, Mansfield Road, Oxford, UK
- Zamin Iqbal: European Bioinformatics Institute, Hinxton, Cambridge, CB10 1SD, UK
43
Abstract
The reference human genome sequence is inarguably the most important and widely used resource in the fields of human genetics and genomics. It has transformed the conduct of biomedical sciences and brought invaluable benefits to the understanding and improvement of human health. However, the commonly used reference sequence has profound limitations, because across much of its span, it represents the sequence of just one human haplotype. This single, monoploid reference structure presents a critical barrier to representing the broad genomic diversity in the human population. In this review, we discuss the modernization of the reference human genome sequence to a more complete reference of human genomic diversity, known as a human pangenome.
Affiliation(s)
- Karen H Miga: UC Santa Cruz Genomics Institute and Department of Biomedical Engineering, University of California, Santa Cruz, California 95064, USA
- Ting Wang: Department of Genetics, Edison Family Center for Genome Sciences and Systems Biology, and McDonnell Genome Institute, Washington University School of Medicine, St. Louis, Missouri 63110, USA
44
Li Q, Tian S, Yan B, Liu CM, Lam TW, Li R, Luo R. Building a Chinese pan-genome of 486 individuals. Commun Biol 2021; 4:1016. [PMID: 34462542 PMCID: PMC8405635 DOI: 10.1038/s42003-021-02556-6] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.5] [Received: 12/12/2020] [Accepted: 08/13/2021] [Indexed: 02/07/2023] Open
Abstract
Pan-genome sequence analysis of human population ancestry is critical for expanding and better defining human genome sequence diversity. However, the amount of genetic variation missing from current human reference sequences is still unknown. Here, we used 486 deep-sequenced Han Chinese genomes to identify 276 Mbp of DNA sequences that, to our knowledge, are absent in the current human reference. We classified these sequences into individual-specific and common sequences, and propose that the common sequence size is uncapped with a growing population. The 46.646 Mbp of common sequences obtained from the 486 individuals improved the accuracy of variant calling and the mapping rate when added to the reference genome. We also analyzed the genomic positions of these common sequences and found that they came from genomic regions characterized by high mutation rate and low pathogenicity. Our study authenticates the Chinese pan-genome as representative of DNA sequences specific to the Han Chinese population missing from the GRCh38 reference genome and establishes the newly defined common sequences as candidates to supplement the current human reference.
Affiliation(s)
- Qiuhui Li: Department of Computer Science, The University of Hong Kong, Hong Kong, China
- Shilin Tian: Novogene Bioinformatics Institute, Beijing, China
- Bin Yan: Department of Computer Science, The University of Hong Kong, Hong Kong, China
- Chi Man Liu: Department of Computer Science, The University of Hong Kong, Hong Kong, China
- Tak-Wah Lam: Department of Computer Science, The University of Hong Kong, Hong Kong, China
- Ruiqiang Li: Novogene Bioinformatics Institute, Beijing, China
- Ruibang Luo: Department of Computer Science, The University of Hong Kong, Hong Kong, China
45
Abstract
Pangenomes are organized collections of the genomic information from related individuals or groups. Graphical pangenomics is the study of these pangenomes using graphical methods to identify and analyze genes, regions, and mutations of interest to an array of biological questions. This field has seen significant progress in recent years including the development of graph based models that better resolve biological phenomena, and an explosion of new tools for mapping reads, creating graphical genomes, and performing pangenome analysis. In this review, we discuss recent developments in models, algorithms associated with graphical genomes, and comparisons between similar tools. In addition we briefly discuss what these developments may mean for the future of genomics.
46
Pandey P, Gao Y, Kingsford C. VariantStore: an index for large-scale genomic variant search. Genome Biol 2021; 22:231. [PMID: 34412679 PMCID: PMC8375130 DOI: 10.1186/s13059-021-02442-8] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 11/19/2020] [Accepted: 07/27/2021] [Indexed: 11/18/2022] Open
Abstract
Efficiently scaling genomic variant search indexes to thousands of samples is computationally challenging due to the presence of multiple coordinate systems to avoid reference biases. We present VariantStore, a system that indexes genomic variants from multiple samples using a variation graph and enables variant queries across any sample-specific coordinate system. We show the scalability of VariantStore by indexing genomic variants from the TCGA project in 4 h and the 1000 Genomes project in 3 h. Querying for variants in a gene takes between 0.002 and 3 seconds, using memory of only 10% of the size of the full representation.
Affiliation(s)
- Prashant Pandey: Computational Biology Department, School of Computer Science, Carnegie Mellon University, Pittsburgh, USA
- Yinjie Gao: Computational Biology Department, School of Computer Science, Carnegie Mellon University, Pittsburgh, USA
- Carl Kingsford: Computational Biology Department, School of Computer Science, Carnegie Mellon University, Pittsburgh, USA
47
Abstract
MOTIVATION: Variation graph representations are projected to either replace or supplement conventional single genome references due to their ability to capture population genetic diversity and reduce reference bias. Vast catalogues of genetic variants for many species now exist, and it is natural to ask which among these are crucial to circumvent reference bias during read mapping. RESULTS: In this work, we propose a novel mathematical framework for variant selection, by casting it in terms of minimizing variation graph size subject to preserving paths of length α with at most δ differences. This framework leads to a rich set of problems based on the types of variants [e.g. single nucleotide polymorphisms (SNPs), indels or structural variants (SVs)], and whether the goal is to minimize the number of positions at which variants are listed or to minimize the total number of variants listed. We classify the computational complexity of these problems and provide efficient algorithms along with their software implementation when feasible. We empirically evaluate the magnitude of graph reduction achieved in human chromosome variation graphs using multiple α and δ parameter values corresponding to short and long-read resequencing characteristics. When our algorithm is run with parameter settings amenable to long-read mapping (α = 10 kbp, δ = 1000), 99.99% of SNPs and 73% of SVs can be safely excluded from the human chromosome 1 variation graph. The graph size reduction can benefit downstream pan-genome analysis. AVAILABILITY AND IMPLEMENTATION: https://github.com/AT-CG/VF. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Chirag Jain: Department of Computational and Data Sciences, Indian Institute of Science, Bangalore, KA 560012, India
- Neda Tavakoli: School of Computational Science and Engineering, Georgia Institute of Technology, Atlanta, GA 30332, USA
- Srinivas Aluru: School of Computational Science and Engineering, Georgia Institute of Technology, Atlanta, GA 30332, USA
48
Abstract
Motivation: The size of a genome graph—the space required to store the nodes, node labels and edges—affects the efficiency of operations performed on it. For example, the time complexity to align a sequence to a graph without a graph index depends on the total number of characters in the node labels and the number of edges in the graph. This raises the need for approaches to construct space-efficient genome graphs. Results: We point out similarities in the string encoding mechanisms of genome graphs and the external pointer macro (EPM) compression model. We present a pair of linear-time algorithms that transform between genome graphs and EPM-compressed forms. The algorithms result in an upper bound on the size of the genome graph constructed in terms of an optimal EPM compression. To further reduce the size of the genome graph, we propose the source assignment problem that optimizes over the equivalent choices during compression and introduce an ILP formulation that solves that problem optimally. As a proof-of-concept, we introduce RLZ-Graph, a genome graph constructed based on the relative Lempel–Ziv algorithm. Using RLZ-Graph, across all human chromosomes, we are able to reduce the disk space to store a genome graph on average by 40.7% compared to colored compacted de Bruijn graphs constructed by Bifrost under the default settings. The RLZ-Graph scales well in terms of running time and graph sizes with an increasing number of human genome sequences compared to Bifrost and variation graphs produced by VGtoolkit. Availability: The RLZ-Graph software is available at: https://github.com/Kingsford-Group/rlzgraph. Supplementary information: Supplementary data are available at Bioinformatics online.
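For readers unfamiliar with the relative Lempel–Ziv idea that the abstract mentions, the following is a minimal sketch of greedy parsing of a target string against a reference. It is not the RLZ-Graph software and does not build a genome graph; it only shows how a sequence can be encoded as pointers into a reference. The function names and the literal-character fallback are assumptions of this sketch, and the quadratic scan stands in for the suffix-array or FM-index search a practical tool would likely use.

```python
def rlz_factorize(reference, target):
    """Greedily parse `target` into factors relative to `reference`.

    Each factor is either a (position, length) pair pointing into the
    reference, or a single literal character when no match exists.
    Naive O(len(reference) * len(target)) search, for illustration only.
    """
    factors = []
    i = 0
    while i < len(target):
        best_pos, best_len = -1, 0
        for start in range(len(reference)):
            length = 0
            while (start + length < len(reference)
                   and i + length < len(target)
                   and reference[start + length] == target[i + length]):
                length += 1
            if length > best_len:  # keep the leftmost longest match
                best_pos, best_len = start, length
        if best_len == 0:
            factors.append(target[i])  # character absent from the reference
            i += 1
        else:
            factors.append((best_pos, best_len))
            i += best_len
    return factors


def rlz_decode(reference, factors):
    """Rebuild the target string from its factors."""
    parts = []
    for factor in factors:
        if isinstance(factor, str):
            parts.append(factor)
        else:
            pos, length = factor
            parts.append(reference[pos:pos + length])
    return "".join(parts)


reference = "ACGTACGT"
target = "ACGTTACGN"
factors = rlz_factorize(reference, target)
print(factors)                         # [(0, 4), (3, 4), 'N']
print(rlz_decode(reference, factors))  # ACGTTACGN
```

When a target is highly similar to the reference, it collapses into a few long factors, which is the intuition behind using this kind of parse to obtain a compact representation.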
Affiliation(s)
- Yutong Qiu: Computational Biology Department, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213, USA
- Carl Kingsford: Computational Biology Department, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213, USA
49
Jayakodi M, Schreiber M, Stein N, Mascher M. Building pan-genome infrastructures for crop plants and their use in association genetics. DNA Res 2021; 28:6117190. [PMID: 33484244 PMCID: PMC7934568 DOI: 10.1093/dnares/dsaa030] [Citation(s) in RCA: 56] [Impact Index Per Article: 14.0] [Received: 11/27/2020] [Indexed: 12/20/2022] Open
Abstract
Pan-genomic studies aim at representing the entire sequence diversity within a species to provide useful resources for evolutionary studies, functional genomics and breeding of cultivated plants. Cost reductions in high-throughput sequencing and advances in sequence assembly algorithms have made it possible to create multiple reference genomes along with a catalogue of all forms of genetic variations in plant species with large and complex or polyploid genomes. In this review, we summarize the current approaches to building pan-genomes as an in silico representation of plant sequence diversity and outline relevant methods for their effective utilization in linking structural with phenotypic variation. We propose as future research avenues (i) transcriptomic and epigenomic studies across multiple reference genomes and (ii) the development of user-friendly and feature-rich pan-genome browsers.
Affiliation(s)
- Murukarthick Jayakodi: Department of Genebank, Leibniz Institute of Plant Genetics and Crop Plant Research (IPK) Gatersleben, Seeland, Germany
- Mona Schreiber: Department of Genebank, Leibniz Institute of Plant Genetics and Crop Plant Research (IPK) Gatersleben, Seeland, Germany
- Nils Stein: Department of Genebank, Leibniz Institute of Plant Genetics and Crop Plant Research (IPK) Gatersleben, Seeland, Germany; Center for Integrated Breeding Research (CiBreed), Georg-August-University Göttingen, Göttingen, Germany
- Martin Mascher: Department of Genebank, Leibniz Institute of Plant Genetics and Crop Plant Research (IPK) Gatersleben, Seeland, Germany; German Centre for Integrative Biodiversity Research (iDiv) Halle-Jena-Leipzig, Leipzig, Saxony, Germany
50
Cao H, Xu H, Ning C, Xiang L, Ren Q, Zhang T, Zhang Y, Gao R. Multi-Omics Approach Reveals the Potential Core Vaccine Targets for the Emerging Foodborne Pathogen Campylobacter jejuni. Front Microbiol 2021; 12:665858. [PMID: 34248875 PMCID: PMC8265506 DOI: 10.3389/fmicb.2021.665858] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Received: 02/09/2021] [Accepted: 04/06/2021] [Indexed: 11/30/2022] Open
Abstract
Campylobacter jejuni is a leading cause of bacterial gastroenteritis in humans around the world. The emergence of bacterial resistance is becoming more serious; therefore, development of new vaccines is considered to be an alternative strategy against drug-resistant pathogens. In this study, we investigated the pangenome of 173 C. jejuni strains and analyzed the phylogenesis and the virulence factor genes. In order to acquire a high-quality pangenome, genomic relatedness was first assessed with average nucleotide identity (ANI) analyses, and an open pangenome of 8,041 gene families was obtained from the genomes with correct taxonomy. Subsequently, the virulence property of the core genome was analyzed and 145 core virulence factor (VF) genes were obtained. Upon functional genomics and immunological analyses, five core VF proteins with high antigenicity were selected as potential core vaccine targets for humans. Furthermore, functional annotations indicated that these proteins are involved in important molecular functions and biological processes, such as adhesion, regulation, and secretion. In addition, transcriptome analysis in human cells and pig intestinal loop indicated that these vaccine target genes are important in the virulence of C. jejuni in different hosts. Comprehensive pangenome analysis and relevant animal experiments will facilitate the discovery of potential core vaccine targets with improved efficiency in reverse vaccinology. Likewise, this study provided some insights into the genetic polymorphism and phylogeny of C. jejuni and discovered potential vaccine candidates for humans. Prospective development of new vaccines using the targets will be an alternative to the use of antibiotics and prevent the development of multidrug-resistant C. jejuni in humans and even other animals.
Affiliation(s)
- Hengchun Cao: School of Mathematics and Statistics, Shandong University, Weihai, China
- Hanxiao Xu: School of Mathematics and Statistics, Shandong University, Weihai, China
- Chunhui Ning: School of Mathematics and Statistics, Shandong University, Weihai, China
- Li Xiang: School of Mathematics and Statistics, Shandong University, Weihai, China
- Qiufang Ren: School of Mathematics and Statistics, Shandong University, Weihai, China
- Tiantian Zhang: School of Mathematics and Statistics, Shandong University, Weihai, China
- Yusen Zhang: School of Mathematics and Statistics, Shandong University, Weihai, China
- Rui Gao: School of Control Science and Engineering, Shandong University, Jinan, China