1
|
Kortsinoglou AM, Wood MJ, Myridakis AI, Andrikopoulos M, Roussis A, Eastwood D, Butt T, Kouvelis VN. Comparative genomics of Metarhizium brunneum strains V275 and ARSEF 4556: unraveling intraspecies diversity. G3 (BETHESDA, MD.) 2024; 14:jkae190. [PMID: 39210673 PMCID: PMC11457142 DOI: 10.1093/g3journal/jkae190] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/03/2024] [Accepted: 07/31/2024] [Indexed: 09/04/2024]
Abstract
Entomopathogenic fungi (EPF) belonging to the Order Hypocreales are renowned for their ability to infect and kill insect hosts, while their endophytic mode of life and their beneficial rhizosphere effects on plant hosts have only recently been recognized. Understanding the molecular mechanisms underlying their different lifestyles could optimize their potential as both biocontrol and biofertilizer agents and could broaden the appreciation of niche plasticity in fungal ecology. This study describes the comprehensive whole genome sequencing and analysis of one of the most effective entomopathogenic and endophytic EPF strains, Metarhizium brunneum V275 (commercially known as Lalguard Met52), achieved through Nanopore and Illumina reads. Comparative genomics for exploring intraspecies variability and analyses of key gene sets were conducted with a second effective EPF strain, M. brunneum ARSEF 4556. The search for strain- or species-specific genes was extended to M. brunneum strain ARSEF 3297 and other species of the genus Metarhizium, to identify molecular mechanisms and putative key genome adaptations associated with mode of life differences. Genome size differed significantly, with M. brunneum V275 having the largest genome amongst M. brunneum strains sequenced to date. Genome analyses revealed an abundance of plant-degrading enzymes, plant colonization-associated genes, and intriguing intraspecies variations regarding their predicted secondary metabolic compounds and the number and localization of Transposable Elements. The potential significance of the differences found between closely related endophytic and entomopathogenic fungi, regarding plant growth-promoting and entomopathogenic abilities, is discussed, enhancing our understanding of their diverse functionalities and putative applications in agriculture and ecology.
Affiliation(s)
- Alexandra M Kortsinoglou
- Section of Genetics and Biotechnology, Department of Biology, National and Kapodistrian University of Athens, 15771 Athens, Greece
| | - Martyn J Wood
- Department of Biosciences, Faculty of Science and Engineering, Swansea University, Singleton Park, SA2 8PP, Swansea, UK
| | - Antonis I Myridakis
- Section of Genetics and Biotechnology, Department of Biology, National and Kapodistrian University of Athens, 15771 Athens, Greece
| | - Marios Andrikopoulos
- Section of Genetics and Biotechnology, Department of Biology, National and Kapodistrian University of Athens, 15771 Athens, Greece
| | - Andreas Roussis
- Section of Botany, Department of Biology, National and Kapodistrian University of Athens, 15784 Athens, Greece
| | - Dan Eastwood
- Department of Biosciences, Faculty of Science and Engineering, Swansea University, Singleton Park, SA2 8PP, Swansea, UK
| | - Tariq Butt
- Department of Biosciences, Faculty of Science and Engineering, Swansea University, Singleton Park, SA2 8PP, Swansea, UK
| | - Vassili N Kouvelis
- Section of Genetics and Biotechnology, Department of Biology, National and Kapodistrian University of Athens, 15771 Athens, Greece
|
2
|
Cenzato D, Lipták Z. A survey of BWT variants for string collections. BIOINFORMATICS (OXFORD, ENGLAND) 2024; 40:btae333. [PMID: 38788221 DOI: 10.1093/bioinformatics/btae333] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/04/2024] [Revised: 04/13/2024] [Accepted: 05/23/2024] [Indexed: 05/26/2024]
Abstract
MOTIVATION In recent years, the focus of bioinformatics research has moved from individual sequences to collections of sequences. Given the fundamental role of the Burrows-Wheeler Transform (BWT) in string processing, a number of dedicated tools have been developed for computing the BWT of string collections. While the focus has been on improving efficiency, both in space and time, the exact definition of the BWT employed has not been at the center of attention. As we show in this paper, the different tools in use often compute non-equivalent BWT variants: the resulting transforms can differ from each other significantly, including the number r of runs, a central parameter of the BWT. Moreover, with many tools, the transform depends on the input order of the collection. In other words, on the same dataset, the same tool may output different transforms if the dataset is given in a different order. RESULTS We studied 18 dedicated tools for computing the BWT of string collections and were able to identify 6 different BWT variants computed by these tools. We review the differences between these BWT variants, both from a theoretical and from a practical point of view, comparing them on 8 real-life biological datasets with different characteristics. We find that the differences can be extensive, depending on the datasets, and are largest on collections of many similar short sequences. The parameter r, the number of runs of the BWT, also shows notable variation between the different BWT variants; on our datasets, it varied by a multiplicative factor of up to 4.2. AVAILABILITY Source code and scripts to replicate the results and download the data used in the article are available at https://github.com/davidecenzato/BWT-variants-for-string-collections. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
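The order dependence described in this abstract can be illustrated with a small sketch. It assumes one particular variant, in which every string gets its own end marker and the markers are ordered by input position; the function names `multidollar_bwt` and `count_runs` are hypothetical and are not taken from the paper or any of the surveyed tools. Suffixes are sorted naively, so this is only suitable for toy inputs.

```python
def multidollar_bwt(strings):
    """Naive BWT of a collection, using one end marker per string.

    Markers sort below every letter and are ordered by input position
    (an assumed convention; other variants order them differently).
    """
    text = []
    for i, s in enumerate(strings):
        text.extend((1, ord(c)) for c in s)  # ordinary characters
        text.append((0, i))                  # i-th end marker
    # Sort suffix start positions by comparing the suffixes themselves.
    order = sorted(range(len(text)), key=lambda p: text[p:])
    # The BWT character is the one preceding each suffix (cyclically).
    return "".join(
        "$" if text[p - 1][0] == 0 else chr(text[p - 1][1]) for p in order
    )


def count_runs(bwt):
    """Number r of maximal runs of equal characters."""
    return sum(1 for i, c in enumerate(bwt) if i == 0 or c != bwt[i - 1])


if __name__ == "__main__":
    for collection in (["TA", "TA", "CA"], ["TA", "CA", "TA"]):
        bwt = multidollar_bwt(collection)
        print(collection, bwt, count_runs(bwt))
```

On this toy input the same three strings, given in two different orders, yield different transforms and a different number of runs, which is the effect the survey reports at scale.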
Affiliation(s)
- Davide Cenzato
- Department of Environmental Sciences, Informatics and Statistics, Ca' Foscari University, Venice, Italy
| | - Zsuzsanna Lipták
- Department of Computer Science, University of Verona, Verona, Italy
|
3
|
Valderrama E, Landis JB, Skinner D, Maas PJM, Maas-van de Kramer H, André T, Grunder N, Sass C, Pinilla-Vargas M, Guan CJ, Phillips HR, de Almeida AMR, Specht CD. The genetic mechanisms underlying the convergent evolution of pollination syndromes in the Neotropical radiation of Costus L. FRONTIERS IN PLANT SCIENCE 2022; 13:874322. [PMID: 36161003 PMCID: PMC9493542 DOI: 10.3389/fpls.2022.874322] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/12/2022] [Accepted: 06/27/2022] [Indexed: 06/16/2023]
Abstract
Selection together with variation in floral traits can act to mold floral form, often driven by a plant's predominant or most effective pollinators. To investigate the evolution of traits associated with pollination, we developed a phylogenetic framework for evaluating tempo and mode of pollination shifts across the genus Costus L., known for its evolutionary toggle between traits related to bee and bird pollination. Using a target enrichment approach, we obtained 957 loci for 171 accessions to expand the phylogenetic sampling of Neotropical Costus. In addition, we performed whole genome resequencing for a subset of 20 closely related species with contrasting pollination syndromes. For each of these 20 genomes, a high-quality assembled transcriptome was used as reference for consensus calling of candidate loci hypothesized to be associated with pollination-related traits of interest. To test for the role these candidate genes may play in evolutionary shifts in pollinators, signatures of selection were estimated as dN/dS across the identified candidate loci. We obtained a well-resolved phylogeny for Neotropical Costus despite conflict among gene trees that provide evidence of incomplete lineage sorting and/or reticulation. The overall topology and the network of genome-wide single nucleotide polymorphisms (SNPs) indicate that multiple shifts in pollination strategy have occurred across Costus, while also suggesting the presence of previously undetected signatures of hybridization between distantly related taxa. Traits related to pollination syndromes are strongly correlated and have been gained and lost in concert several times throughout the evolution of the genus. The presence of bract appendages is correlated with two traits associated with defenses against herbivory. Although labellum shape is strongly correlated with overall pollination syndrome, we found no significant impact of labellum shape on diversification rates. 
Evidence suggests an interplay of pollination success with other selective pressures shaping the evolution of the Costus inflorescence. Although most of the loci used for phylogenetic inference appear to be under purifying selection, many candidate genes associated with functional traits show evidence of being under positive selection. Together these results indicate an interplay of phylogenetic history with adaptive evolution leading to the diversification of pollination-associated traits in Neotropical Costus.
Affiliation(s)
- Eugenio Valderrama
- School of Integrative Plant Science, Section of Plant Biology and the L.H. Bailey Hortorium, Cornell University, Ithaca, NY, United States
| | - Jacob B. Landis
- School of Integrative Plant Science, Section of Plant Biology and the L.H. Bailey Hortorium, Cornell University, Ithaca, NY, United States
- BTI Computational Biology Center, Boyce Thompson Institute, Ithaca, NY, United States
| | - Dave Skinner
- Le Jardin Ombragé, Tallahassee, FL, United States
| | - Paul J. M. Maas
- Section Botany, Naturalis Biodiversity Center, Leiden, Netherlands
| | | | - Thiago André
- Departamento de Botânica, Instituto de Ciências Biológicas, Universidade de Brasília, Brasília, DF, Brazil
| | - Nikolaus Grunder
- Department of Biological Sciences, California State University, East Bay, Hayward, CA, United States
| | - Chodon Sass
- University and Jepson Herbaria, University of California, Berkeley, Berkeley, CA, United States
| | - Maria Pinilla-Vargas
- School of Integrative Plant Science, Section of Plant Biology and the L.H. Bailey Hortorium, Cornell University, Ithaca, NY, United States
| | - Clarice J. Guan
- School of Integrative Plant Science, Section of Plant Biology and the L.H. Bailey Hortorium, Cornell University, Ithaca, NY, United States
| | - Heather R. Phillips
- School of Integrative Plant Science, Section of Plant Biology and the L.H. Bailey Hortorium, Cornell University, Ithaca, NY, United States
| | | | - Chelsea D. Specht
- School of Integrative Plant Science, Section of Plant Biology and the L.H. Bailey Hortorium, Cornell University, Ithaca, NY, United States
|
4
|
Ju CJT, Jiang JY, Li R, Li Z, Wang W. TahcoRoll: fast genomic signature profiling via thinned automaton and rolling hash. MEDICAL REVIEW (2021) 2021; 1:114-125. [PMID: 35881666 PMCID: PMC9027990 DOI: 10.1515/mr-2021-0016] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 07/05/2021] [Accepted: 11/11/2021] [Indexed: 12/04/2022]
Abstract
Objectives Genomic signatures like k-mers have become one of the most prominent approaches to describe genomic data. As a result, myriad real-world applications, such as the construction of de Bruijn graphs in genome assembly, have benefited from recognizing genomic signatures. An efficient approach to genomic signature profiling is therefore essential for handling high-throughput sequencing reads. However, most existing approaches recognize only fixed-size k-mers, while many studies have shown the importance of considering variable-length k-mers. Methods In this paper, we present a novel genomic signature profiling approach, TahcoRoll, which extends the Aho-Corasick algorithm (AC) to the task of profiling variable-length k-mers. We first group nucleotides into two clusters and represent each cluster with a bit. The rolling hash technique is then used to encode signatures and read patterns for efficient matching. Results In extensive experiments, TahcoRoll significantly outperforms state-of-the-art k-mer counters and is capable of processing reads from different sequencing platforms on a budget desktop computer. Conclusions The single-thread version of TahcoRoll is as efficient as the eight-thread version of the state-of-the-art tool, JellyFish, while the eight-thread TahcoRoll outperforms the eight-thread JellyFish by a factor of at least four.
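As a rough illustration of the rolling hash idea mentioned in the Methods, the sketch below counts signatures of mixed lengths in a single read. It is a plain Rabin-Karp style scan with a 2-bit nucleotide code, not the thinned Aho-Corasick automaton or the one-bit clustering of TahcoRoll; the names `kmer_hash` and `profile`, the base, and the modulus are all assumptions made for the example, and reads are assumed to contain only A, C, G, and T.

```python
from collections import Counter

CODE = {"A": 0, "C": 1, "G": 2, "T": 3}  # 2-bit nucleotide code
BASE = 4
MOD = (1 << 61) - 1  # large prime modulus (illustrative choice)


def kmer_hash(s):
    """Polynomial hash of a nucleotide string."""
    h = 0
    for c in s:
        h = (h * BASE + CODE[c]) % MOD
    return h


def profile(read, signatures):
    """Count occurrences of each signature (any length >= 1) in the read."""
    by_len = {}
    for sig in signatures:
        if sig:  # skip empty signatures
            by_len.setdefault(len(sig), {}).setdefault(kmer_hash(sig), set()).add(sig)

    counts = Counter()
    for k, table in by_len.items():
        if k > len(read):
            continue
        top = pow(BASE, k - 1, MOD)  # weight of the outgoing character
        h = kmer_hash(read[:k])
        for start in range(len(read) - k + 1):
            if start > 0:
                # Roll the window: drop read[start-1], append read[start+k-1].
                h = ((h - CODE[read[start - 1]] * top) * BASE
                     + CODE[read[start + k - 1]]) % MOD
            window = read[start:start + k]
            # Verify the actual string to rule out hash collisions.
            if window in table.get(h, ()):
                counts[window] += 1
    return counts


if __name__ == "__main__":
    print(profile("ACGTACG", ["ACG", "CGTA", "TT"]))
```

The scan makes one pass over the read per distinct signature length, which is the cost that an automaton-based method aims to avoid when many lengths are present.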
Affiliation(s)
- Chelsea J.-T. Ju
- Department of Computer Science, University of California, Los Angeles, USA
| | - Jyun-Yu Jiang
- Department of Computer Science, University of California, Los Angeles, USA
| | - Ruirui Li
- Department of Computer Science, University of California, Los Angeles, USA
| | - Zeyu Li
- Department of Computer Science, University of California, Los Angeles, USA
| | - Wei Wang
- Department of Computer Science, University of California, Los Angeles, USA
|
5
|
Sun KY, Oreper D, Schoenrock SA, McMullan R, Giusti-Rodríguez P, Zhabotynsky V, Miller DR, Tarantino LM, Pardo-Manuel de Villena F, Valdar W. Bayesian modeling of skewed X inactivation in genetically diverse mice identifies a novel Xce allele associated with copy number changes. Genetics 2021; 218:6162162. [PMID: 33693696 DOI: 10.1093/genetics/iyab034] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Received: 11/14/2020] [Accepted: 02/15/2021] [Indexed: 11/13/2022]
Abstract
Female mammals are functional mosaics of their parental X-linked gene expression due to X chromosome inactivation (XCI). This process inactivates one copy of the X chromosome in each cell during embryogenesis and that state is maintained clonally through mitosis. In mice, the choice of which parental X chromosome remains active is determined by the X chromosome controlling element (Xce), which has been mapped to a 176-kb candidate interval. A series of functional Xce alleles has been characterized or inferred for classical inbred strains based on biased, or skewed, inactivation of the parental X chromosomes in crosses between strains. To further explore the function structure basis and location of the Xce, we measured allele-specific expression of X-linked genes in a large population of F1 females generated from Collaborative Cross (CC) strains. Using published sequence data and applying a Bayesian "Pólya urn" model of XCI skew, we report two major findings. First, inter-individual variability in XCI suggests mouse epiblasts contain on average 20-30 cells contributing to brain. Second, CC founder strain NOD/ShiLtJ has a novel and unique functional allele, Xceg, that is the weakest in the Xce allelic series. Despite phylogenetic analysis confirming that NOD/ShiLtJ carries a haplotype almost identical to the well-characterized C57BL/6J (Xceb), we observed unexpected patterns of XCI skewing in females carrying the NOD/ShiLtJ haplotype within the Xce. Copy number variation is common at the Xce locus and we conclude that the observed allelic series is a product of independent and recurring duplications shared between weak Xce alleles.
Affiliation(s)
- Kathie Y Sun
- Department of Genetics, University of North Carolina at Chapel Hill, Chapel Hill, NC, USA.,Bioinformatics and Computational Biology Curriculum, University of North Carolina at Chapel Hill, Chapel Hill, NC, USA
| | - Daniel Oreper
- Department of Genetics, University of North Carolina at Chapel Hill, Chapel Hill, NC, USA.,Bioinformatics and Computational Biology Curriculum, University of North Carolina at Chapel Hill, Chapel Hill, NC, USA
| | - Sarah A Schoenrock
- Department of Genetics, University of North Carolina at Chapel Hill, Chapel Hill, NC, USA.,Neuroscience Curriculum, University of North Carolina at Chapel Hill, Chapel Hill, NC, USA
| | - Rachel McMullan
- Department of Genetics, University of North Carolina at Chapel Hill, Chapel Hill, NC, USA.,Genetics and Molecular Biology Curriculum, University of North Carolina at Chapel Hill, Chapel Hill, NC, USA
| | - Paola Giusti-Rodríguez
- Department of Genetics, University of North Carolina at Chapel Hill, Chapel Hill, NC, USA
| | - Vasyl Zhabotynsky
- Department of Biostatistics, University of North Carolina at Chapel Hill, Chapel Hill, NC, USA
| | - Darla R Miller
- Department of Genetics, University of North Carolina at Chapel Hill, Chapel Hill, NC, USA.,Lineberger Comprehensive Cancer Center, School of Medicine, University of North Carolina at Chapel Hill, Chapel Hill, NC, USA
| | - Lisa M Tarantino
- Department of Genetics, University of North Carolina at Chapel Hill, Chapel Hill, NC, USA.,Division of Pharmacotherapy and Experimental Therapeutics, Eshelman School of Pharmacy, University of North Carolina at Chapel Hill, Chapel Hill, NC, USA
| | - Fernando Pardo-Manuel de Villena
- Department of Genetics, University of North Carolina at Chapel Hill, Chapel Hill, NC, USA.,Lineberger Comprehensive Cancer Center, School of Medicine, University of North Carolina at Chapel Hill, Chapel Hill, NC, USA
| | - William Valdar
- Department of Genetics, University of North Carolina at Chapel Hill, Chapel Hill, NC, USA.,Lineberger Comprehensive Cancer Center, School of Medicine, University of North Carolina at Chapel Hill, Chapel Hill, NC, USA
|
6
|
Qian K, Xu JX, Deng Y, Peng H, Peng J, Ou CM, Liu Z, Jiang LH, Tai YH. Signaling pathways of genetic variants and miRNAs in the pathogenesis of myasthenia gravis. Gland Surg 2020; 9:1933-1944. [PMID: 33447544 PMCID: PMC7804555 DOI: 10.21037/gs-20-39] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.0] [Received: 01/03/2020] [Accepted: 09/30/2020] [Indexed: 01/06/2023]
Abstract
BACKGROUND Myasthenia gravis (MG) is a chronic autoimmune neuromuscular disorder causing muscle weakness and characterized by a defect in synaptic transmission at the neuromuscular junction. The pathogenesis of this disease remains unclear. We aimed to predict the key signaling pathways of genetic variants and miRNAs in the pathogenesis of MG, and to identify the key genes among them. METHODS We searched literature databases for published information on single nucleotide polymorphisms (SNPs) and differentially expressed miRNAs associated with MG, then used bioinformatic tools to predict the target genes of the miRNAs. Functional enrichment analysis for key genes was carried out using the Cytoscape plugin ClueGO. These key genes were mapped to the STRING database to construct a protein-protein interaction (PPI) network. A miRNA-target gene regulatory network was then established to screen key genes. RESULTS Five genes containing SNPs associated with MG risk were involved in the inflammatory bowel disease (IBD) signaling pathway, and FoxP3 was the key gene. MAPK1, SMAD4, SMAD2 and BCL2 were predicted to be targeted by the 18 miRNAs and to act as the key genes in adherens junctions, apoptosis, or cancer-related pathways, respectively. These five key genes containing SNPs or targeted by miRNAs were found to be involved in negative regulation of T cell differentiation. CONCLUSIONS We speculate that SNPs cause the genes to be defective or the miRNAs to downregulate the factors that subsequently negatively regulate regulatory T cells and trigger the onset of MG.
Affiliation(s)
- Kai Qian
- Faculty of Life and Biotechnology, Kunming University of Science and Technology, Kunming, China
- Department of Thoracic Surgery, Institute of The First People’s Hospital of Yunnan Province, Kunming, China
| | - Jia-Xin Xu
- Department of Cardiovascular surgery, Yan’ an Affiliated Hospital of Kunming Medical University, Kunming, China
| | - Yi Deng
- Department of Oncology, Institute of Surgery Research, Daping Hospital, Army Medical University, Chongqing, China
| | - Hao Peng
- Department of Thoracic Surgery, Institute of The First People’s Hospital of Yunnan Province, Kunming, China
| | - Jun Peng
- Department of Thoracic Surgery, Institute of The First People’s Hospital of Yunnan Province, Kunming, China
| | - Chun-Mei Ou
- Department of Cardiovascular surgery, Institute of the First People’s Hospital of Yunnan Province, Kunming, China
| | - Zu Liu
- Department of Cardiovascular surgery, Yan’ an Affiliated Hospital of Kunming Medical University, Kunming, China
| | - Li-Hong Jiang
- Department of Thoracic Surgery, Institute of The First People’s Hospital of Yunnan Province, Kunming, China
| | - Yong-Hang Tai
- School of Electronic Information in the Yunnan Normal University, Kunming, China
|
7
|
Rautiainen M, Marschall T. GraphAligner: rapid and versatile sequence-to-graph alignment. Genome Biol 2020; 21:253. [PMID: 32972461 PMCID: PMC7513500 DOI: 10.1186/s13059-020-02157-2] [Citation(s) in RCA: 95] [Impact Index Per Article: 19.0] [Received: 10/18/2019] [Accepted: 08/26/2020] [Indexed: 02/07/2023]
Abstract
Genome graphs can represent genetic variation and sequence uncertainty. Aligning sequences to genome graphs is key to many applications, including error correction, genome assembly, and genotyping of variants in a pangenome graph. Yet, so far, this step is often prohibitively slow. We present GraphAligner, a tool for aligning long reads to genome graphs. Compared to the state-of-the-art tools, GraphAligner is 13x faster and uses 3x less memory. When employing GraphAligner for error correction, we find it to be more than twice as accurate and over 12x faster than extant tools. Availability: Package manager: https://anaconda.org/bioconda/graphaligner and source code: https://github.com/maickrau/GraphAligner.
Affiliation(s)
- Mikko Rautiainen
- Center for Bioinformatics, Saarland University, Saarland Informatics Campus E2.1, Saarbrücken, 66123, Germany.
- Max Planck Institute for Informatics, Saarland Informatics Campus E1.4, Saarbrücken, 66123, Germany.
- Saarbrücken Graduate School for Computer Science, Saarland Informatics Campus E1.3, Saarbrücken, 66123, Germany.
| | - Tobias Marschall
- Heinrich Heine University Düsseldorf, Medical Faculty, Institute for Medical Biometry and Bioinformatics, Moorenstraße 5, Düsseldorf, 40225, Germany.
|
8
|
Landis JB, Kurti A, Lawhorn AJ, Litt A, McCarthy EW. Differential Gene Expression with an Emphasis on Floral Organ Size Differences in Natural and Synthetic Polyploids of Nicotiana tabacum (Solanaceae). Genes (Basel) 2020; 11:E1097. [PMID: 32961813 PMCID: PMC7563459 DOI: 10.3390/genes11091097] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.4] [Received: 08/25/2020] [Revised: 09/14/2020] [Accepted: 09/16/2020] [Indexed: 11/16/2022]
Abstract
Floral organ size, especially the size of the corolla, plays an important role in plant reproduction by facilitating pollination efficiency. Previous studies have outlined a hypothesized organ size pathway. However, the expression and function of many of the genes in the pathway have only been investigated in model diploid species; therefore, it is unknown how these genes interact in polyploid species. Although correlations between ploidy and cell size have been shown in many systems, it is unclear whether there is a difference in cell size between naturally occurring and synthetic polyploids. To address these questions comparing floral organ size and cell size across ploidy, we use natural and synthetic polyploids of Nicotiana tabacum (Solanaceae) as well as their known diploid progenitors. We employ a comparative transcriptomics approach to perform analyses of differential gene expression, focusing on candidate genes that may be involved in floral organ size, both across developmental stages and across accessions. We see differential expression of several known floral organ candidate genes including ARF2, BIG BROTHER, and GASA/GAST1. Results from linear models show that ploidy, cell width, and cell number positively influence corolla tube circumference; however, the effect of cell width varies by ploidy, and diploids have a significantly steeper slope than both natural and synthetic polyploids. These results demonstrate that polyploids have wider cells and that polyploidy significantly increases corolla tube circumference.
Affiliation(s)
- Jacob B. Landis
- Department of Botany and Plant Sciences, University of California Riverside, Riverside, CA 92521, USA; (A.K.); (A.J.L.); (A.L.)
- School of Integrative Plant Science, Section of Plant Biology and the L.H. Bailey Hortorium, Cornell University, Ithaca, NY 14853, USA
| | - Amelda Kurti
- Department of Botany and Plant Sciences, University of California Riverside, Riverside, CA 92521, USA; (A.K.); (A.J.L.); (A.L.)
| | - Amber J. Lawhorn
- Department of Botany and Plant Sciences, University of California Riverside, Riverside, CA 92521, USA; (A.K.); (A.J.L.); (A.L.)
| | - Amy Litt
- Department of Botany and Plant Sciences, University of California Riverside, Riverside, CA 92521, USA; (A.K.); (A.J.L.); (A.L.)
| | - Elizabeth W. McCarthy
- Department of Botany and Plant Sciences, University of California Riverside, Riverside, CA 92521, USA; (A.K.); (A.J.L.); (A.L.)
- Department of Biology, SUNY Cortland, Cortland, NY 13045, USA
|
9
|
Holley G, Melsted P. Bifrost: highly parallel construction and indexing of colored and compacted de Bruijn graphs. Genome Biol 2020; 21:249. [PMID: 32943081 PMCID: PMC7499882 DOI: 10.1186/s13059-020-02135-8] [Citation(s) in RCA: 95] [Impact Index Per Article: 19.0] [Received: 08/13/2019] [Accepted: 08/06/2020] [Indexed: 02/07/2023]
Abstract
Memory consumption of de Bruijn graphs is often prohibitive. Most de Bruijn graph-based assemblers reduce the complexity by compacting paths into single vertices, but this is challenging as it requires the uncompacted de Bruijn graph to be available in memory. We present a parallel and memory-efficient algorithm enabling the direct construction of the compacted de Bruijn graph without producing the intermediate uncompacted graph. Bifrost features a broad range of functions, such as indexing, editing, and querying the graph, and includes a graph coloring method that maps each k-mer of the graph to the genomes it occurs in. Availability: https://github.com/pmelsted/bifrost.
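To make "compacting paths into single vertices" concrete, the sketch below builds unitigs the naive way: it first holds every k-mer in memory and then merges non-branching paths. This is the memory-hungry baseline the abstract describes, not Bifrost's direct construction. The helper names `kmers_of` and `compact` are hypothetical, only the forward strand is considered (no reverse complements), and the graph is node-centric (an edge exists whenever two stored k-mers overlap by k-1 characters).

```python
def kmers_of(sequences, k):
    """Set of all k-mers occurring in the given sequences."""
    return {s[i:i + k] for s in sequences for i in range(len(s) - k + 1)}


def compact(kmers):
    """Merge non-branching paths of a node-centric de Bruijn graph into unitigs."""
    kmers = set(kmers)

    def succ(u):
        return [u[1:] + c for c in "ACGT" if u[1:] + c in kmers]

    def pred(v):
        return [c + v[:-1] for c in "ACGT" if c + v[:-1] in kmers]

    def starts_unitig(v):
        p = pred(v)
        # v can be merged into its predecessor only if the link is 1-to-1.
        return len(p) != 1 or len(succ(p[0])) != 1

    seen = set()

    def walk(v):
        path, cur = v, v
        seen.add(v)
        while True:
            s = succ(cur)
            if len(s) != 1 or len(pred(s[0])) != 1 or s[0] in seen:
                return path
            cur = s[0]
            seen.add(cur)
            path += cur[-1]  # extend the unitig by one character

    unitigs = [walk(v) for v in sorted(kmers) if starts_unitig(v)]
    # Any k-mers left over lie on isolated cycles; break each cycle arbitrarily.
    unitigs += [walk(v) for v in sorted(kmers) if v not in seen]
    return unitigs


if __name__ == "__main__":
    print(compact(kmers_of(["ACGTT", "ACGTA"], 3)))
```

In the example, the shared prefix of the two sequences collapses into one vertex and the branch at `CGT` produces two further unitigs.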
Affiliation(s)
- Guillaume Holley
- Faculty of Industrial Engineering, Mechanical Engineering and Computer Science, University of Iceland, Reykjavík, Iceland.
| | - Páll Melsted
- Faculty of Industrial Engineering, Mechanical Engineering and Computer Science, University of Iceland, Reykjavík, Iceland
|
10
|
Overlap graphs and de Bruijn graphs: data structures for de novo genome assembly in the big data era. QUANTITATIVE BIOLOGY 2019. [DOI: 10.1007/s40484-019-0181-x] [Citation(s) in RCA: 17] [Impact Index Per Article: 2.8] [Indexed: 11/25/2022]
|
11
|
Abstract
MOTIVATION There exist several large genomic and metagenomic data collection efforts, including GenomeTrakr and MetaSub, which are routinely updated with new data. To analyze such datasets, memory-efficient methods to construct and store the colored de Bruijn graph were developed. Yet, a problem that has not been considered is constructing the colored de Bruijn graph in a scalable manner that allows new data to be added without reconstruction. This problem is important for large public datasets, which require both scalability and the ability to update the construction. RESULTS We create a method for constructing the colored de Bruijn graph for large datasets that is based on partitioning the data into smaller datasets, building the colored de Bruijn graph using an FM-index-based representation, and succinctly merging these representations to build a single graph. The last step, merging succinctly, is the algorithmic challenge which we solve in this article. We refer to the resulting method as VariMerge. This construction method also allows the graph to be updated with new data. We validate our approach and show it produces a three-fold reduction in working space when constructing a colored de Bruijn graph for 8000 strains. Lastly, we compare VariMerge to other competing methods (including Vari, Rainbowfish, Mantis, Bloom Filter Trie, the method of Almodaresi et al. and Multi-BRWT) and illustrate that VariMerge is the only method that is capable of building the colored de Bruijn graph for 16 000 strains in a manner that allows it to be updated. Competing methods either did not scale to a dataset this large or do not allow for additions without reconstruction. AVAILABILITY AND IMPLEMENTATION VariMerge is available at https://github.com/cosmo-team/cosmo/tree/VARI-merge under GPLv3 license. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Martin D Muggli
- Department of Computer Science, Colorado State University, Fort Collins, CO, USA
| | - Bahar Alipanahi
- Department of Computer and Information Science and Engineering, University of Florida, Gainesville, FL, USA
| | - Christina Boucher
- Department of Computer and Information Science and Engineering, University of Florida, Gainesville, FL, USA
|
12
|
Bonizzoni P, Della Vedova G, Pirola Y, Previtali M, Rizzi R. Multithread Multistring Burrows-Wheeler Transform and Longest Common Prefix Array. J Comput Biol 2019; 26:948-961. [PMID: 31140836 DOI: 10.1089/cmb.2018.0230] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.8] [Indexed: 11/12/2022]
Abstract
Indexing huge collections of strings, such as those produced by widespread sequencing technologies, relies heavily on multistring generalizations of the Burrows-Wheeler transform (BWT) and the longest common prefix (LCP) array, since computing both efficiently is an essential ingredient of several algorithms on collections of strings, such as those for genome assembly. In this article, we explore a multithread computational strategy for building the BWT and LCP array. Our algorithm applies a divide and conquer approach that leads to parallel computation of the multistring BWT and LCP array.
Affiliation(s)
- Paola Bonizzoni
- Dipartimento di Informatica Sistemistica e Comunicazione, Università degli Studi di Milano-Bicocca, Milan, Italy
| | - Gianluca Della Vedova
- Dipartimento di Informatica Sistemistica e Comunicazione, Università degli Studi di Milano-Bicocca, Milan, Italy
| | - Yuri Pirola
- Dipartimento di Informatica Sistemistica e Comunicazione, Università degli Studi di Milano-Bicocca, Milan, Italy
| | - Marco Previtali
- Dipartimento di Informatica Sistemistica e Comunicazione, Università degli Studi di Milano-Bicocca, Milan, Italy
| | - Raffaella Rizzi
- Dipartimento di Informatica Sistemistica e Comunicazione, Università degli Studi di Milano-Bicocca, Milan, Italy
| |
|
13
|
Shorter JR, Najarian ML, Bell TA, Blanchard M, Ferris MT, Hock P, Kashfeen A, Kirchoff KE, Linnertz CL, Sigmon JS, Miller DR, McMillan L, Pardo-Manuel de Villena F. Whole Genome Sequencing and Progress Toward Full Inbreeding of the Mouse Collaborative Cross Population. G3 (BETHESDA, MD.) 2019; 9:1303-1311. [PMID: 30858237 PMCID: PMC6505143 DOI: 10.1534/g3.119.400039] [Citation(s) in RCA: 27] [Impact Index Per Article: 4.5] [Received: 01/28/2019] [Accepted: 03/08/2019] [Indexed: 12/20/2022]
Abstract
Two key features of recombinant inbred panels are well-characterized genomes and reproducibility. Here we report on the sequenced genomes of six additional Collaborative Cross (CC) strains and on inbreeding progress of 72 CC strains. We have previously reported on the sequences of 69 CC strains that were publicly available, bringing the total of CC strains with whole genome sequence up to 75. The sequencing of these six CC strains updates the efforts toward inbreeding undertaken by the UNC Systems Genetics Core. The timing reflects our competing mandates to release to the public as many CC strains as possible while achieving an acceptable level of inbreeding. The new six strains have a higher than average founder contribution from non-domesticus strains than the previously released CC strains. Five of the six strains also have high residual heterozygosity (>14%), which may be related to non-domesticus founder contributions. Finally, we report on updated estimates on residual heterozygosity across the entire CC population using a novel, simple and cost effective genotyping platform on three mice from each strain. We observe a reduction in residual heterozygosity across all previously released CC strains. We discuss the optimal use of different genetic resources available for the CC population.
Affiliation(s)
- Timothy A Bell
- Department of Genetics
- Lineberger Comprehensive Cancer Center, University of North Carolina, Chapel Hill, North Carolina 27599
| |
|
14
|
Egidi L, Louza FA, Manzini G, Telles GP. External memory BWT and LCP computation for sequence collections with applications. Algorithms Mol Biol 2019; 14:6. [PMID: 30899322 PMCID: PMC6408864 DOI: 10.1186/s13015-019-0140-0] [Citation(s) in RCA: 17] [Impact Index Per Article: 2.8] [Received: 11/06/2018] [Accepted: 02/23/2019] [Indexed: 11/10/2022]
Abstract
Background: Sequencing technologies produce larger and larger collections of biosequences that have to be stored in compressed indices supporting fast search operations. Many compressed indices are based on the Burrows–Wheeler Transform (BWT) and the longest common prefix (LCP) array. Because of the sheer size of the input, it is important to build these data structures in external memory while using the available RAM in the best possible way. Results: We propose a space-efficient algorithm to compute the BWT and LCP array for a collection of sequences in the external or semi-external memory setting. Our algorithm splits the input collection into subcollections sufficiently small that it can compute their BWT in RAM using an optimal linear time algorithm. Next, it merges the partial BWTs in external or semi-external memory and in the process it also computes the LCP values. Our algorithm can be modified to output two additional arrays that, combined with the BWT and LCP array, provide simple, scan-based, external memory algorithms for three well known problems in bioinformatics: the computation of maximal repeats, the all pairs suffix–prefix overlaps, and the construction of succinct de Bruijn graphs. Conclusions: We prove that our algorithm performs $\mathcal{O}(n\,\mathsf{maxlcp})$ sequential I/Os, where $n$ is the total length of the collection and $\mathsf{maxlcp}$ is the maximum LCP value. The experimental results show that our algorithm is only slightly slower than the state of the art for short sequences but it is up to 40 times faster for longer sequences or when the available RAM is at least equal to the size of the input.
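The split-then-merge idea in this abstract can be sketched in a few lines of Python. The sketch below works entirely in memory and produces only the BWT, so it is not the external memory algorithm of the paper and does not compute LCP values. The abstract does not name the merge procedure; the iterative refinement of an interleave vector used here (in the style of the Holt and McMillan BWT merge) is an assumption, as are the helper names `naive_bwt` and `merge_bwts` and the use of a single `$` end marker that sorts before every other symbol.

```python
from collections import Counter


def naive_bwt(strings):
    """BWT of a small subcollection by direct suffix sorting (the in-RAM step)."""
    def key(item):
        i, j = item
        # End marker of string i is (0, i): smaller than any real character.
        return [(1, c) for c in strings[i][j:]] + [(0, i)]

    suffixes = sorted(
        ((i, j) for i, s in enumerate(strings) for j in range(len(s) + 1)),
        key=key,
    )
    return "".join(strings[i][j - 1] if j > 0 else "$" for i, j in suffixes)


def merge_bwts(bwt0, bwt1):
    """Merge two partial BWTs without looking at the original strings.

    interleave[p] says whether the p-th row of the merged BWT comes from
    bwt0 (0) or bwt1 (1). Each pass is a stable bucket sort by BWT symbol,
    which orders the suffixes by one more leading character.
    """
    parts = (bwt0, bwt1)
    counts = Counter(bwt0) + Counter(bwt1)
    starts, total = {}, 0
    for c in sorted(counts):          # assumes '$' is the smallest symbol
        starts[c] = total
        total += counts[c]
    k0, k1 = bwt0.count("$"), bwt1.count("$")

    interleave = [0] * len(bwt0) + [1] * len(bwt1)
    while True:
        # By convention, end markers of collection 0 precede those of 1.
        nxt = [0] * k0 + [1] * k1 + [None] * (total - k0 - k1)
        offset = dict(starts)
        pos = [0, 0]
        for b in interleave:
            c = parts[b][pos[b]]
            pos[b] += 1
            if c != "$":
                nxt[offset[c]] = b
                offset[c] += 1
        if nxt == interleave:         # fixed point reached
            break
        interleave = nxt

    pos, out = [0, 0], []
    for b in interleave:
        out.append(parts[b][pos[b]])
        pos[b] += 1
    return "".join(out)


if __name__ == "__main__":
    print(merge_bwts(naive_bwt(["ACA"]), naive_bwt(["CA"])))
```

Each pass reads both partial BWTs sequentially from start to end, which is the access pattern that makes this style of merge a natural fit for external memory.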
|
15
|
A Perfect Match Genomic Landscape Provides a Unified Framework for the Precise Detection of Variation in Natural and Synthetic Haploid Genomes. Genetics 2018; 208:1631-1641. [PMID: 29367403 PMCID: PMC5887153 DOI: 10.1534/genetics.117.300589] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.6] [Received: 12/05/2017] [Accepted: 01/19/2018] [Indexed: 01/13/2023]
Abstract
We present a conceptually simple, sensitive, precise, and essentially nonstatistical solution for the analysis of genome variation in haploid organisms. The generation of a Perfect Match Genomic Landscape (PMGL), which computes intergenome identity with single nucleotide resolution, reveals signatures of variation wherever a query genome differs from a reference genome. Such signatures encode the precise location of different types of variants, including single nucleotide variants, deletions, insertions, and amplifications, effectively introducing the concept of a general signature of variation. The precise nature of variants is then resolved through the generation of targeted alignments between specific sets of sequence reads and known regions of the reference genome. Thus, the perfect match logic decouples the identification of the location of variants from the characterization of their nature, providing a unified framework for the detection of genome variation. We assessed the performance of the PMGL strategy via simulation experiments. We determined the variation profiles of natural genomes and of a synthetic chromosome, both in the context of haploid yeast strains. Our approach uncovered variants that have previously escaped detection. Moreover, our strategy is ideally suited for further refining high-quality reference genomes. The source codes for the automated PMGL pipeline have been deposited in a public repository.
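To make the "perfect match" idea more concrete, here is a minimal Python sketch of one possible way to compute a per-position landscape. The abstract does not specify how the PMGL is built, so everything below is an assumption: the function name `perfect_match_landscape`, the use of fixed-length k-mers, and the rule that a reference position is scored by the number of reference k-mers covering it that occur exactly in the query reads. It is not the authors' pipeline.

```python
def perfect_match_landscape(reference, reads, k):
    """Count, per reference position, the exactly matching k-mers covering it.

    Hypothetical scoring rule: a reference k-mer counts as a perfect match
    if the identical k-mer occurs anywhere in the query reads. Positions
    where the count drops (down to zero at a variant) form the signature.
    """
    # Collect every k-mer present in the query reads.
    query_kmers = set()
    for read in reads:
        for i in range(len(read) - k + 1):
            query_kmers.add(read[i:i + k])

    # Add one to every position covered by a perfectly matching k-mer.
    landscape = [0] * len(reference)
    for i in range(len(reference) - k + 1):
        if reference[i:i + k] in query_kmers:
            for j in range(i, i + k):
                landscape[j] += 1
    return landscape


if __name__ == "__main__":
    ref = "AAACCCGGGTTT"
    query = ["AAACCCTGGTTT"]   # differs from ref by one substitution at index 6
    scores = perfect_match_landscape(ref, query, k=3)
    print(scores)
    print([p for p, s in enumerate(scores) if s == 0])
```

In this toy case the landscape falls to zero only at the substituted position, which locates the variant without yet saying what kind of variant it is; the abstract describes resolving that second question with targeted alignments.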
|
16
|
Abstract
The Collaborative Cross (CC) is a multiparent panel of recombinant inbred (RI) mouse strains derived from eight founder laboratory strains. RI panels are popular because of their long-term genetic stability, which enhances reproducibility and integration of data collected across time and conditions. Characterization of their genomes can be a community effort, reducing the burden on individual users. Here we present the genomes of the CC strains using two complementary approaches as a resource to improve power and interpretation of genetic experiments. Our study also provides a cautionary tale regarding the limitations imposed by such basic biological processes as mutation and selection. A distinct advantage of inbred panels is that genotyping only needs to be performed on the panel, not on each individual mouse. The initial CC genome data were haplotype reconstructions based on dense genotyping of the most recent common ancestors (MRCAs) of each strain followed by imputation from the genome sequence of the corresponding founder inbred strain. The MRCA resource captured segregating regions in strains that were not fully inbred, but it had limited resolution in the transition regions between founder haplotypes, and there was uncertainty about founder assignment in regions of limited diversity. Here we report the whole genome sequence of 69 CC strains generated by paired-end short reads at 30× coverage of a single male per strain. Sequencing leads to a substantial improvement in the fine structure and completeness of the genomes of the CC. Both MRCAs and sequenced samples show a significant reduction in the genome-wide haplotype frequencies from two wild-derived strains, CAST/EiJ and PWK/PhJ. In addition, analysis of the evolution of the patterns of heterozygosity indicates that selection against three wild-derived founder strains played a significant role in shaping the genomes of the CC. 
The sequencing resource provides the first description of tens of thousands of new genetic variants introduced by mutation and drift in the CC genomes. We estimate that new SNP mutations are accumulating in each CC strain at a rate of 2.4 ± 0.4 per gigabase per generation. The fixation of new mutations by genetic drift has introduced thousands of new variants into the CC strains. The majority of these mutations are novel compared to currently sequenced laboratory stocks and wild mice, and some are predicted to alter gene function. Approximately one-third of the CC inbred strains have acquired large deletions (>10 kb) many of which overlap known coding genes and functional elements. The sequence of these mice is a critical resource to CC users, increases threefold the number of mouse inbred strain genomes available publicly, and provides insight into the effect of mutation and drift on common resources.
|
17
|
A Prolonged Outbreak of KPC-3-Producing Enterobacter cloacae and Klebsiella pneumoniae Driven by Multiple Mechanisms of Resistance Transmission at a Large Academic Burn Center. Antimicrob Agents Chemother 2017; 61:AAC.01516-16. [PMID: 27919898 DOI: 10.1128/aac.01516-16] [Citation(s) in RCA: 58] [Impact Index Per Article: 7.3] [Received: 07/12/2016] [Accepted: 11/25/2016] [Indexed: 12/20/2022]
Abstract
Klebsiella pneumoniae carbapenemase (KPC)-producing Enterobacter cloacae has been recently recognized in the United States. Whole-genome sequencing (WGS) has become a useful tool for analysis of outbreaks and for determining transmission networks of multidrug-resistant organisms in health care settings, including carbapenem-resistant Enterobacteriaceae (CRE). We experienced a prolonged outbreak of CRE E. cloacae and K. pneumoniae over a 3-year period at a large academic burn center despite rigorous infection control measures. To understand the molecular mechanisms that sustained this outbreak, we investigated the CRE outbreak isolates by using WGS. Twenty-two clinical isolates of CRE, including E. cloacae (n = 15) and K. pneumoniae (n = 7), were sequenced and analyzed genetically. WGS revealed that this outbreak, which seemed epidemiologically unlinked, was in fact genetically linked over a prolonged period. Multiple mechanisms were found to account for the ongoing outbreak of KPC-3-producing E. cloacae and K. pneumoniae. This outbreak was primarily maintained by a clonal expansion of E. cloacae sequence type 114 (ST114) with distribution of multiple resistance determinants. Plasmid and transposon analyses suggested that the majority of blaKPC-3 was transmitted via an identical Tn4401b element on part of a common plasmid. WGS analysis demonstrated complex transmission dynamics within the burn center at levels of the strain and/or plasmid in association with a transposon, highlighting the versatility of KPC-producing Enterobacteriaceae in their ability to utilize multiple modes of resistance gene propagation.
|
18
|
Whole Genome Sequence of Two Wild-Derived Mus musculus domesticus Inbred Strains, LEWES/EiJ and ZALENDE/EiJ, with Different Diploid Numbers. G3-GENES GENOMES GENETICS 2016; 6:4211-4216. [PMID: 27765810 PMCID: PMC5144988 DOI: 10.1534/g3.116.034751] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.1] [Indexed: 01/16/2023]
Abstract
Wild-derived mouse inbred strains are becoming increasingly popular for complex traits analysis, evolutionary studies, and systems genetics. Here, we report the whole-genome sequencing of two wild-derived mouse inbred strains, LEWES/EiJ and ZALENDE/EiJ, of Mus musculus domesticus origin. These two inbred strains were selected based on their geographic origin, karyotype, and use in ongoing research. We generated 14× and 18× coverage sequence, respectively, and discovered over 1.1 million novel variants, most of which are private to one of these strains. This report expands the number of wild-derived inbred genomes in the Mus genus from six to eight. The sequence variation can be accessed via an online query tool; variant calls (VCF format) and alignments (BAM format) are available for download from a dedicated ftp site. Finally, the sequencing data have also been stored in a lossless, compressed, and indexed format using the multi-string Burrows-Wheeler transform. All data can be used without restriction.
|
19
|
Morgan AP, Holt JM, McMullan RC, Bell TA, Clayshulte AMF, Didion JP, Yadgary L, Thybert D, Odom DT, Flicek P, McMillan L, de Villena FPM. The Evolutionary Fates of a Large Segmental Duplication in Mouse. Genetics 2016; 204:267-85. [PMID: 27371833 PMCID: PMC5012392 DOI: 10.1534/genetics.116.191007] [Citation(s) in RCA: 20] [Impact Index Per Article: 2.2] [Received: 04/29/2016] [Accepted: 06/27/2016] [Indexed: 01/21/2023]
Abstract
Gene duplication and loss are major sources of genetic polymorphism in populations, and are important forces shaping the evolution of genome content and organization. We have reconstructed the origin and history of a 127-kbp segmental duplication, R2d, in the house mouse (Mus musculus). R2d contains a single protein-coding gene, Cwc22. De novo assembly of both the ancestral (R2d1) and the derived (R2d2) copies reveals that they have been subject to nonallelic gene conversion events spanning tens of kilobases. R2d2 is also a hotspot for structural variation: its diploid copy number ranges from zero in the mouse reference genome to >80 in wild mice sampled from around the globe. Hemizygosity for high copy-number alleles of R2d2 is associated in cis with meiotic drive; suppression of meiotic crossovers; and copy-number instability, with a mutation rate in excess of 1 per 100 transmissions in some laboratory populations. Our results provide a striking example of allelic diversity generated by duplication and demonstrate the value of de novo assembly in a phylogenetic context for understanding the mutational processes affecting duplicate genes.
Affiliation(s)
- Andrew P Morgan
- Department of Genetics and Lineberger Comprehensive Cancer Center, University of North Carolina, Chapel Hill, North Carolina 27599
| | - J Matthew Holt
- Department of Computer Science, University of North Carolina, Chapel Hill, North Carolina 27599
| | - Rachel C McMullan
- Department of Genetics and Lineberger Comprehensive Cancer Center, University of North Carolina, Chapel Hill, North Carolina 27599
| | - Timothy A Bell
- Department of Genetics and Lineberger Comprehensive Cancer Center, University of North Carolina, Chapel Hill, North Carolina 27599
| | - Amelia M-F Clayshulte
- Department of Genetics and Lineberger Comprehensive Cancer Center, University of North Carolina, Chapel Hill, North Carolina 27599
| | - John P Didion
- Department of Genetics and Lineberger Comprehensive Cancer Center, University of North Carolina, Chapel Hill, North Carolina 27599
| | - Liran Yadgary
- Department of Genetics and Lineberger Comprehensive Cancer Center, University of North Carolina, Chapel Hill, North Carolina 27599
| | - David Thybert
- European Bioinformatics Institute, European Molecular Biology Laboratory, Wellcome Genome Campus, Cambridge, CB10 1SD, United Kingdom
| | - Duncan T Odom
- Cancer Research United Kingdom Cambridge Institute, University of Cambridge, CB2 0RE, United Kingdom; Wellcome Trust Sanger Institute, Wellcome Genome Campus, Cambridge, CB10 1SA, United Kingdom
| | - Paul Flicek
- European Bioinformatics Institute, European Molecular Biology Laboratory, Wellcome Genome Campus, Cambridge, CB10 1SD, United Kingdom; Wellcome Trust Sanger Institute, Wellcome Genome Campus, Cambridge, CB10 1SA, United Kingdom
| | - Leonard McMillan
- Department of Computer Science, University of North Carolina, Chapel Hill, North Carolina 27599
| | - Fernando Pardo-Manuel de Villena
- Department of Genetics and Lineberger Comprehensive Cancer Center, University of North Carolina, Chapel Hill, North Carolina 27599
| |
|
20
|
Morgan AP, Welsh CE. Informatics resources for the Collaborative Cross and related mouse populations. Mamm Genome 2015; 26:521-39. [PMID: 26135136 DOI: 10.1007/s00335-015-9581-z] [Citation(s) in RCA: 44] [Impact Index Per Article: 4.4] [Received: 03/16/2015] [Accepted: 06/23/2015] [Indexed: 02/05/2023]
Affiliation(s)
- Andrew P Morgan
- Department of Genetics, University of North Carolina, Chapel Hill, NC, USA
| | - Catherine E Welsh
- Department of Mathematics & Computer Science, Rhodes College, Memphis, TN, USA.
| |
|