1
Bahk K, Sung J. SigAlign: an alignment algorithm guided by explicit similarity criteria. Nucleic Acids Res 2024; 52:8717-8733. [PMID: 39011889 PMCID: PMC11347165 DOI: 10.1093/nar/gkae607] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/09/2023] [Revised: 06/19/2024] [Accepted: 07/01/2024] [Indexed: 07/17/2024] Open
Abstract
In biological sequence alignment, prevailing heuristic aligners achieve high throughput through several approximation techniques, but at the cost of sacrificing the clarity of output criteria and creating complex parameter spaces. To surmount these challenges, we introduce 'SigAlign', a novel alignment algorithm that employs two explicit cutoffs for the results: minimum length and maximum penalty per length, alongside three affine gap penalties. Comparative analyses of SigAlign against leading database search tools (BLASTn, MMseqs2) and read mappers (BWA-MEM, bowtie2, HISAT2, minimap2) highlight its performance in read mapping and database searches. Our research demonstrates that SigAlign not only provides high sensitivity with a non-heuristic approach, but also surpasses the throughput of existing heuristic aligners, particularly for high-accuracy reads or genomes with few repetitive regions. As an open-source library, SigAlign is poised to become a foundational component that provides a transparent and customizable alignment process for new analytical algorithms, tools and pipelines in bioinformatics.
Affiliation(s)
- Kunhyung Bahk
- Interdisciplinary Program in Bioinformatics, College of Natural Sciences, Seoul National University, 1 Gwanak-ro, Gwanak-gu, Seoul 08826, Korea
- Joohon Sung
- Interdisciplinary Program in Bioinformatics, College of Natural Sciences, Seoul National University, 1 Gwanak-ro, Gwanak-gu, Seoul 08826, Korea
- Genome and Health Big Data Laboratory, Graduate School of Public Health, Seoul National University, 1 Gwanak-ro, Gwanak-gu, Seoul 08826, Korea
2
Wilton R, Szalay AS. Short-read aligner performance in germline variant identification. Bioinformatics 2023; 39:btad480. [PMID: 37527006 PMCID: PMC10421969 DOI: 10.1093/bioinformatics/btad480] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 03/07/2023] [Revised: 06/01/2023] [Accepted: 07/31/2023] [Indexed: 08/03/2023] Open
Abstract
MOTIVATION: Read alignment is an essential first step in the characterization of DNA sequence variation. The accuracy of variant-calling results depends not only on the quality of read alignment and variant-calling software but also on the interaction between these complex software tools. RESULTS: In this review, we evaluate short-read aligner performance with the goal of optimizing germline variant-calling accuracy. We examine the performance of three general-purpose short-read aligners (BWA-MEM, Bowtie 2, and Arioc) in conjunction with three germline variant callers: DeepVariant, FreeBayes, and GATK HaplotypeCaller. We discuss the behavior of the read aligners with regard to the data elements on which the variant callers rely, and illustrate how the runtime configurations of these software tools combine to affect variant-calling performance.
Affiliation(s)
- Richard Wilton
- Department of Physics and Astronomy, Johns Hopkins University, Baltimore, MD 21218, United States
- Alexander S Szalay
- Department of Physics and Astronomy, Johns Hopkins University, Baltimore, MD 21218, United States
- Department of Computer Science, Johns Hopkins University, Baltimore, MD 21218, United States
3
Domínguez-Maqueda M, Pérez-Gómez O, Grande-Pérez A, Esteve C, Seoane P, Tapia-Paniagua ST, Balebona MC, Moriñigo MA. Pathogenic strains of Shewanella putrefaciens contain plasmids that are absent in the probiotic strain Pdp11. PeerJ 2022; 10:e14248. [PMID: 36312754 PMCID: PMC9610664 DOI: 10.7717/peerj.14248] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/21/2022] [Accepted: 09/26/2022] [Indexed: 01/24/2023] Open
Abstract
Shewanella putrefaciens Pdp11 is a strain described as a probiotic for use in aquaculture. However, S. putrefaciens includes strains reported to be pathogenic or saprophytic to fish. Although the probiotic trait has been related to the presence of a group of genes in its genome, the existence of plasmids that could determine the probiotic or pathogenic character of this bacterium is unknown. In the present work, we searched for plasmids in several strains of S. putrefaciens that differ in their pathogenic and probiotic character. Under the different conditions tested, plasmids were only found in two of the five pathogenic strains, but not in the probiotic strain nor in the two saprophytic strains tested. Using a workflow integrating Sanger and Illumina reads, the complete consensus sequences of the plasmids were obtained. Plasmids differed in one ORF and encoded a putative replication initiator protein of the repB family, as well as proteins related to plasmid stability and a toxin-antitoxin system. Phylogenetic analysis showed some similarity to functional repB proteins of other Shewanella species. The implication of these plasmids in the probiotic or pathogenic nature of S. putrefaciens is discussed.
Affiliation(s)
- Ana Grande-Pérez
- Área de Genética, Universidad de Málaga, Málaga, Spain; Instituto de Hortofruticultura Subtropical y Mediterránea “La Mayora”-Universidad de Málaga-Consejo Superior de Investigaciones Científicas (IHSM-UMA-CSIC), Universidad de Málaga, Málaga, Spain
- Consuelo Esteve
- Departamento de Microbiología y Ecología, Universidad de Valencia, Valencia, Spain
- Pedro Seoane
- Centro de Investigación Biomédica en Red de Enfermedades Raras, CIBERER, Madrid, Spain; Departamento de Biología Molecular y Bioquímica, Universidad de Málaga, Málaga, Spain
4
Darby CA, Gaddipati R, Schatz MC, Langmead B. Vargas: heuristic-free alignment for assessing linear and graph read aligners. Bioinformatics 2020; 36:3712-3718. [PMID: 32321164 PMCID: PMC7320598 DOI: 10.1093/bioinformatics/btaa265] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.6] [Received: 12/23/2019] [Revised: 03/19/2020] [Accepted: 04/15/2020] [Indexed: 12/31/2022] Open
Abstract
Motivation: Read alignment is central to many aspects of modern genomics. Most aligners use heuristics to accelerate processing, but these heuristics can fail to find the optimal alignments of reads. Alignment accuracy is typically measured through simulated reads; however, the simulated location may not be the (only) location with the optimal alignment score. Results: Vargas implements a heuristic-free algorithm guaranteed to find the highest-scoring alignment for real sequencing reads to a linear or graph genome. With semiglobal and local alignment modes and affine gap and quality-scaled mismatch penalties, it can implement the scoring functions of commonly used aligners to calculate optimal alignments. While this is computationally intensive, Vargas uses multi-core parallelization and vectorized (SIMD) instructions to make it practical to optimally align large numbers of reads, achieving a maximum speed of 456 billion cell updates per second. We demonstrate how these ‘gold standard’ Vargas alignments can be used to improve heuristic alignment accuracy by optimizing command-line parameters in Bowtie 2, BWA-MEM and vg to align more reads correctly. Availability and implementation: Source code implemented in C++ and compiled binary releases are available at https://github.com/langmead-lab/vargas under the MIT license. Supplementary information: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Michael C Schatz
- Department of Computer Science; Department of Biology, Johns Hopkins University, Baltimore, MD 21218, USA; Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, Cold Spring Harbor, NY 11724, USA
5
Methe BA, Hiltbrand D, Roach J, Xu W, Gordon SG, Goodner BW, Stapleton AE. Functional gene categories differentiate maize leaf drought-related microbial epiphytic communities. PLoS One 2020; 15:e0237493. [PMID: 32946440 PMCID: PMC7500591 DOI: 10.1371/journal.pone.0237493] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.0] [Received: 06/18/2019] [Accepted: 07/11/2020] [Indexed: 11/18/2022] Open
Abstract
The phyllosphere epiphytic microbiome is composed of microorganisms that colonize the external aerial portions of plants. Relationships of plant responses to specific microorganisms, both pathogenic and beneficial, have been examined, but the phyllosphere microbiome functional and metabolic profile responses are not well described. Changing crop growth conditions, such as increased drought, can have profound impacts on crop productivity. Also, epiphytic microbial communities provide a new target for crop yield optimization. We compared Zea mays leaf microbiomes collected under drought and well-watered conditions by examining functional gene annotation patterns across three physically disparate locations each with and without drought treatment, through the application of short read metagenomic sequencing. Drought samples exhibited different functional sequence compositions at each of the three field sites. Maize phyllosphere functional profiles revealed a wide variety of metabolic and regulatory processes that differed in drought and normal water conditions and provide key baseline information for future selective breeding.
Affiliation(s)
- Barbara A. Methe
- J Craig Venter Institute, Medical Center Drive, Rockville, MD, United States of America
- Department of Medicine, University of Pittsburgh, Pittsburgh, PA, United States of America
- David Hiltbrand
- Department of Biology and Marine Biology, University of North Carolina Wilmington, Wilmington, NC, United States of America
- Jeffrey Roach
- Research Computing, University of North Carolina Chapel Hill, Chapel Hill, NC, United States of America
- Wenwei Xu
- Agricultural and Extension Center, Texas A&M AgriLife Research, Lubbock, TX, United States of America
- Stuart G. Gordon
- Biology Department, Presbyterian College, Clinton, SC, United States of America
- Brad W. Goodner
- Department, Hiram College, Hiram, OH, United States of America
- Ann E. Stapleton
- Department of Biology and Marine Biology, University of North Carolina Wilmington, Wilmington, NC, United States of America
6
Tello D, Gil J, Loaiza CD, Riascos JJ, Cardozo N, Duitama J. NGSEP3: accurate variant calling across species and sequencing protocols. Bioinformatics 2020; 35:4716-4723. [PMID: 31099384 PMCID: PMC6853766 DOI: 10.1093/bioinformatics/btz275] [Citation(s) in RCA: 42] [Impact Index Per Article: 8.4] [Received: 12/22/2018] [Revised: 03/16/2019] [Accepted: 04/17/2019] [Indexed: 01/09/2023] Open
Abstract
MOTIVATION: Accurate detection, genotyping and downstream analysis of genomic variants from high-throughput sequencing data are fundamental features in modern production pipelines for genetic-based diagnosis in medicine or genomic selection in plant and animal breeding. Our research group maintains the Next-Generation Sequencing Experience Platform (NGSEP) as a precise, efficient and easy-to-use software solution for these features. RESULTS: Understanding that incorrect alignments around short tandem repeats are an important source of genotyping errors, we implemented in NGSEP new algorithms for realignment and haplotype clustering of reads spanning indels and short tandem repeats. We performed extensive benchmark experiments comparing NGSEP to state-of-the-art software using real data from three sequencing protocols and four species with different distributions of repetitive elements. NGSEP consistently shows comparable accuracy and better efficiency compared to the existing solutions. We expect that this work will contribute to the continuous improvement of quality in variant calling needed for modern applications in medicine and agriculture. AVAILABILITY AND IMPLEMENTATION: NGSEP is available as open source software at http://ngsep.sf.net. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Daniel Tello
- Systems and Computing Engineering Department, Universidad de los Andes, Bogotá 111711, Colombia
- Juanita Gil
- Systems and Computing Engineering Department, Universidad de los Andes, Bogotá 111711, Colombia
- Cristian D Loaiza
- Biotechnology lab, Centro de Investigación de la caña de azúcar de Colombia, CENICAÑA, Cali 760046, Colombia
- Present address: Department of Plants, Soils, and Climate, Utah State University, Logan, UT, USA
- John J Riascos
- Biotechnology lab, Centro de Investigación de la caña de azúcar de Colombia, CENICAÑA, Cali 760046, Colombia
- Nicolás Cardozo
- Systems and Computing Engineering Department, Universidad de los Andes, Bogotá 111711, Colombia
- Jorge Duitama
- Systems and Computing Engineering Department, Universidad de los Andes, Bogotá 111711, Colombia
- Agrobiodiversity Research Area, International Center for Tropical Agriculture, Cali 763537, Colombia
- To whom correspondence should be addressed.
7
Halpin JC, Jangi R, Street TO. Multimapping confounds ribosome profiling analysis: A case-study of the Hsp90 molecular chaperone. Proteins 2019; 88:57-68. [PMID: 31254414 DOI: 10.1002/prot.25766] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.7] [Received: 02/07/2019] [Revised: 06/17/2019] [Accepted: 06/25/2019] [Indexed: 11/11/2022]
Abstract
Ribosome profiling (Ribo-seq) can potentially provide detailed information about ribosome position on transcripts and estimates of protein translation levels in vivo. Hsp90 chaperones, which play a critical role in stress tolerance, have characteristic patterns of differential expression under nonstressed and heat shock conditions. By analyzing published Ribo-seq data for the Hsp90 chaperones in S. cerevisiae, we find wide-ranging artifacts originating from "multimapping" reads (reads that cannot be uniquely assigned to one position), which constitute ~25% of typical S. cerevisiae Ribo-seq datasets and ~80% of the reads from HEK293 cells. Estimates of Hsp90 protein production as determined by Ribo-seq are reproducible but not robust, with inferred expression levels that can change 10-fold depending on how multimapping reads are processed. The differential expression of Hsp90 chaperones under nonstressed and heat shock conditions creates artificial peaks and valleys in their ribosome profiles that give a false impression of regulated translational pausing. Indeed, we find that multimapping can even create an appearance of reproducibility to the shape of the Hsp90 ribosome profiles from biological replicates. Adding further complexity, this artificial reproducibility is dependent on the computational method used to construct the ribosome profile. Given the ubiquity of multimapping reads in Ribo-seq experiments and the complexity of artifacts associated with multimapping, we developed a publicly available computational tool to identify transcripts most at risk for multimapping artifacts. In doing so, we identify biological pathways that are enriched in multimapping transcripts, meaning that particular biological pathways will be highly susceptible to multimapping artifacts.
Affiliation(s)
- Jackson C Halpin
- Department of Biochemistry, Brandeis University, Waltham, Massachusetts
- Radhika Jangi
- Department of Biochemistry, Brandeis University, Waltham, Massachusetts
- Timothy O Street
- Department of Biochemistry, Brandeis University, Waltham, Massachusetts
8
Renaud G, Hanghøj K, Korneliussen TS, Willerslev E, Orlando L. Joint Estimates of Heterozygosity and Runs of Homozygosity for Modern and Ancient Samples. Genetics 2019; 212:587-614. [PMID: 31088861 PMCID: PMC6614887 DOI: 10.1534/genetics.119.302057] [Citation(s) in RCA: 40] [Impact Index Per Article: 6.7] [Received: 02/26/2019] [Accepted: 05/01/2019] [Indexed: 11/18/2022] Open
Abstract
Both the total amount and the distribution of heterozygous sites within individual genomes are informative about the genetic diversity of the population they belong to. Detecting true heterozygous sites in ancient genomes is complicated by the generally limited coverage achieved and the presence of post-mortem damage inflating sequencing errors. Additionally, large runs of homozygosity found in the genomes of particularly inbred individuals and of domestic animals can skew estimates of genome-wide heterozygosity rates. Current computational tools aimed at estimating runs of homozygosity and genome-wide heterozygosity levels are generally sensitive to such limitations. Here, we introduce ROHan, a probabilistic method which substantially improves the estimate of heterozygosity rates both genome-wide and for genomic local windows. It combines a local Bayesian model and a Hidden Markov Model at the genome-wide level and can work both on modern and ancient samples. We show that our algorithm outperforms currently available methods for predicting heterozygosity rates for ancient samples. Specifically, ROHan can delineate large runs of homozygosity (at megabase scales) and produce a reliable confidence interval for the genome-wide rate of heterozygosity outside of such regions from modern genomes with a depth of coverage as low as 5-6× and down to 7-8× for ancient samples showing moderate DNA damage. We apply ROHan to a series of modern and ancient genomes previously published and revise available estimates of heterozygosity for humans, chimpanzees and horses.
Affiliation(s)
- Gabriel Renaud
- Lundbeck Foundation GeoGenetics Center, Globe Institute, University of Copenhagen, 1350K, Denmark
- Kristian Hanghøj
- Lundbeck Foundation GeoGenetics Center, Globe Institute, University of Copenhagen, 1350K, Denmark
- Laboratoire d'Anthropobiologie Moléculaire et d'Imagerie de Synthèse, CNRS UMR 5288, Université de Toulouse, Université Paul Sabatier, 31000, France
- Eske Willerslev
- Lundbeck Foundation GeoGenetics Center, Globe Institute, University of Copenhagen, 1350K, Denmark
- Department of Zoology, University of Cambridge, CB2 3EJ, UK
- The Wellcome Trust Sanger Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SA, UK
- The Danish Institute for Advanced Study at The University of Southern Denmark, DK-5230 Odense M, Denmark
- Ludovic Orlando
- Lundbeck Foundation GeoGenetics Center, Globe Institute, University of Copenhagen, 1350K, Denmark
- Laboratoire d'Anthropobiologie Moléculaire et d'Imagerie de Synthèse, CNRS UMR 5288, Université de Toulouse, Université Paul Sabatier, 31000, France
9
Pritt J, Chen NC, Langmead B. FORGe: prioritizing variants for graph genomes. Genome Biol 2018; 19:220. [PMID: 30558649 PMCID: PMC6296055 DOI: 10.1186/s13059-018-1595-x] [Citation(s) in RCA: 51] [Impact Index Per Article: 7.3] [Received: 05/12/2018] [Accepted: 11/26/2018] [Indexed: 12/30/2022] Open
Abstract
There is growing interest in using genetic variants to augment the reference genome into a graph genome, with alternative sequences, to improve read alignment accuracy and reduce allelic bias. While adding a variant has the positive effect of removing an undesirable alignment score penalty, it also increases both the ambiguity of the reference genome and the cost of storing and querying the genome index. We introduce methods and a software tool called FORGe for modeling these effects and prioritizing variants accordingly. We show that FORGe enables a range of advantageous and measurable trade-offs between accuracy and computational overhead.
Affiliation(s)
- Jacob Pritt
- Department of Computer Science, Johns Hopkins University, Baltimore, USA; Center for Computational Biology, Johns Hopkins University, Baltimore, USA
- Nae-Chyun Chen
- Department of Computer Science, Johns Hopkins University, Baltimore, USA; Center for Computational Biology, Johns Hopkins University, Baltimore, USA
- Ben Langmead
- Department of Computer Science, Johns Hopkins University, Baltimore, USA; Center for Computational Biology, Johns Hopkins University, Baltimore, USA