1
Becker D, Champredon D, Chato C, Gugan G, Poon A. SUP: a probabilistic framework to propagate genome sequence uncertainty, with applications. NAR Genom Bioinform 2023; 5:lqad038. [PMID: 37101658 PMCID: PMC10124968 DOI: 10.1093/nargab/lqad038] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 05/04/2022] [Revised: 02/15/2023] [Accepted: 04/06/2023] [Indexed: 04/28/2023]
Abstract
Genetic sequencing is subject to many different types of errors, but most analyses treat the resultant sequences as if they are known without error. Next-generation sequencing methods rely on significantly larger numbers of reads than previous sequencing methods in exchange for a loss of accuracy in each individual read. Still, the coverage of such machines is imperfect and leaves uncertainty in many of the base calls. In this work, we demonstrate that the uncertainty in sequencing techniques will affect downstream analysis and propose a straightforward method to propagate the uncertainty. Our method (which we have dubbed Sequence Uncertainty Propagation, or SUP) uses a probabilistic matrix representation of individual sequences that incorporates base quality scores as a measure of uncertainty, which naturally leads to resampling and replication as a framework for uncertainty propagation. With the matrix representation, resampling possible base calls according to quality scores provides a bootstrap- or prior distribution-like first step towards genetic analysis. Analyses based on these resampled sequences will include a more complete evaluation of the error involved in such analyses. We demonstrate our resampling method on SARS-CoV-2 data. The resampling procedures add a linear computational cost to the analyses, but the large impact on the variance in downstream estimates makes it clear that ignoring this uncertainty may lead to overly confident conclusions. We show that SARS-CoV-2 lineage designations via Pangolin are much less certain than the bootstrap support reported by Pangolin would imply and the clock rate estimates for SARS-CoV-2 are much more variable than reported.
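The resampling idea in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names are hypothetical, the quality scores are made up, and it assumes Phred-scaled qualities with the error probability split evenly across the three alternative bases.

```python
import random

BASES = "ACGT"

def phred_to_error(q):
    """Convert a Phred quality score to an error probability (10^(-Q/10))."""
    return 10 ** (-q / 10)

def probability_matrix(seq, quals):
    """Build one row per position holding the probability of A, C, G, T.

    Assumption for illustration: the error mass is shared equally by the
    three bases that were not called.
    """
    matrix = []
    for base, q in zip(seq, quals):
        err = phred_to_error(q)
        matrix.append([1 - err if b == base else err / 3 for b in BASES])
    return matrix

def resample(matrix, rng):
    """Draw one plausible sequence from the per-position distributions."""
    return "".join(rng.choices(BASES, weights=row, k=1)[0] for row in matrix)

# Hypothetical read: the third base has a very low quality score (Q3).
rng = random.Random(0)
matrix = probability_matrix("ACGT", [40, 30, 3, 20])
replicates = [resample(matrix, rng) for _ in range(5)]
print(replicates)
```

Running a downstream analysis once per replicate, then summarizing the spread of the results, is the bootstrap-like step the abstract describes.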
Affiliation(s)
- Devan Becker
  - To whom correspondence should be addressed. Tel: +1 519 884 1970 (Ext 2464);
- Connor Chato
  - Department of Pathology and Laboratory Medicine, Schulich School of Medicine and Dentistry, Western University, London, Ontario, Canada
- Gopi Gugan
  - Department of Pathology and Laboratory Medicine, Schulich School of Medicine and Dentistry, Western University, London, Ontario, Canada
- Art Poon
  - Department of Pathology and Laboratory Medicine, Schulich School of Medicine and Dentistry, Western University, London, Ontario, Canada
2
Purushothaman S, Meola M, Egli A. Combination of Whole Genome Sequencing and Metagenomics for Microbiological Diagnostics. Int J Mol Sci 2022; 23:9834. [PMID: 36077231 PMCID: PMC9456280 DOI: 10.3390/ijms23179834] [Citation(s) in RCA: 40] [Impact Index Per Article: 13.3] [Received: 07/04/2022] [Revised: 08/24/2022] [Accepted: 08/26/2022] [Indexed: 12/21/2022]
Abstract
Whole genome sequencing (WGS) provides the highest resolution for genome-based species identification and can provide insight into the antimicrobial resistance and virulence potential of a single microbiological isolate during the diagnostic process. In contrast, metagenomic sequencing allows the analysis of DNA segments from multiple microorganisms within a community, using either an amplicon- or shotgun-based approach. However, WGS and shotgun metagenomic data are rarely combined, although such an approach may generate additive or synergistic information, critical for, e.g., patient management, infection control, and pathogen surveillance. To produce a combined workflow with actionable outputs, we need to understand the pre-to-post analytical process of both technologies. This will require specific databases storing interlinked sequencing data and metadata, as well as customized bioinformatic analytical pipelines. This review article provides an overview of the critical steps and potential clinical application of combining WGS and metagenomics for microbiological diagnosis.
Affiliation(s)
- Srinithi Purushothaman
  - Applied Microbiology Research, Department of Biomedicine, University of Basel, 4031 Basel, Switzerland
  - Institute of Medical Microbiology, University of Zurich, 8006 Zurich, Switzerland
- Marco Meola
  - Applied Microbiology Research, Department of Biomedicine, University of Basel, 4031 Basel, Switzerland
  - Institute of Medical Microbiology, University of Zurich, 8006 Zurich, Switzerland
  - Swiss Institute of Bioinformatics, University of Basel, 4031 Basel, Switzerland
- Adrian Egli
  - Applied Microbiology Research, Department of Biomedicine, University of Basel, 4031 Basel, Switzerland
  - Institute of Medical Microbiology, University of Zurich, 8006 Zurich, Switzerland
  - Clinical Bacteriology and Mycology, University Hospital Basel, 4031 Basel, Switzerland
3
Hidden Markov modeling for maximum probability neuron reconstruction. Commun Biol 2022; 5:388. [PMID: 35468989 PMCID: PMC9038756 DOI: 10.1038/s42003-022-03320-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/27/2021] [Accepted: 03/24/2022] [Indexed: 11/08/2022]
Abstract
Recent advances in brain clearing and imaging have made it possible to image entire mammalian brains at sub-micron resolution. These images offer the potential to assemble brain-wide atlases of neuron morphology, but manual neuron reconstruction remains a bottleneck. Several automatic reconstruction algorithms exist, but most focus on single neuron images. In this paper, we present a probabilistic reconstruction method, ViterBrain, which combines a hidden Markov state process that encodes neuron geometry with a random field appearance model of neuron fluorescence. ViterBrain utilizes dynamic programming to compute the global maximizer of what we call the most probable neuron path. We applied our algorithm to imperfect image segmentations, and showed that it can follow axons in the presence of noise or nearby neurons. We also provide an interactive framework where users can trace neurons by fixing start and endpoints. ViterBrain is available in our open-source Python package brainlit.
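The dynamic-programming step mentioned in this abstract belongs to the same family as the textbook Viterbi algorithm. The sketch below is a generic Viterbi decoder on a toy two-state model, not ViterBrain or any brainlit API; the states, observations, and probabilities are invented for illustration.

```python
import math

def viterbi(observations, states, log_start, log_trans, log_emit):
    """Return the most probable state path and its log-probability."""
    # best[t][s]: log-probability of the best path that ends in state s at step t
    best = [{s: log_start[s] + log_emit[s][observations[0]] for s in states}]
    back = []  # back[t][s]: predecessor of s on that best path
    for obs in observations[1:]:
        scores, pointers = {}, {}
        for s in states:
            prev = max(states, key=lambda p: best[-1][p] + log_trans[p][s])
            scores[s] = best[-1][prev] + log_trans[prev][s] + log_emit[s][obs]
            pointers[s] = prev
        best.append(scores)
        back.append(pointers)
    # Trace back from the best final state.
    last = max(states, key=lambda s: best[-1][s])
    path = [last]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1], best[-1][last]

def logs(table):
    """Convert a nested dict of probabilities to log-probabilities."""
    return {k: {j: math.log(p) for j, p in row.items()} for k, row in table.items()}

# Toy model (all numbers are assumptions): an axon is usually bright,
# background is usually dim, and states tend to persist.
states = ["axon", "background"]
log_start = {s: math.log(0.5) for s in states}
log_trans = logs({"axon": {"axon": 0.9, "background": 0.1},
                  "background": {"axon": 0.1, "background": 0.9}})
log_emit = logs({"axon": {"bright": 0.8, "dim": 0.2},
                 "background": {"bright": 0.1, "dim": 0.9}})

path, log_p = viterbi(["bright", "bright", "dim", "bright"],
                      states, log_start, log_trans, log_emit)
print(path, math.exp(log_p))
```

In this toy run the decoded path stays on "axon" through the single dim observation, which mirrors the abstract's point about following a process through noise.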
4
Müller R, Nebel M. On the use of sequence-quality information in OTU clustering. PeerJ 2021; 9:e11717. [PMID: 34458017 PMCID: PMC8375510 DOI: 10.7717/peerj.11717] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/16/2020] [Accepted: 06/11/2021] [Indexed: 11/20/2022]
Abstract
Background: High-throughput sequencing has become an essential technology in life science research. Despite continuous improvements in technology, the produced sequences are still not entirely accurate. Consequently, the sequences are usually equipped with error probabilities. The quality information is already employed to find better solutions to a number of bioinformatics problems (e.g. read mapping). Data processing pipelines benefit in particular (especially when incorporating the quality information early), since enhanced outcomes of one step can improve all subsequent ones. Preprocessing steps, thus, quite regularly consider the sequence quality to fix errors or discard low-quality data. Other steps, however, like clustering sequences into operational taxonomic units (OTUs), a common task in the analysis of microbial communities, are typically performed without making use of the available quality information.
Results: In this paper, we present quality-aware clustering methods inspired by quality-weighted alignments and model-based denoising, and explore their applicability to OTU clustering. We implemented the quality-aware methods in a revised version of our de novo clustering tool GeFaST and evaluated their clustering quality and performance on mock-community data sets. Quality-weighted alignments were able to improve the clustering quality of GeFaST by up to 10%. The examination of the model-supported methods provided a more diverse picture, hinting at a narrower applicability, but they were able to attain similar improvements. Considering the quality information enlarged both runtime and memory consumption, even though the increase of the former depended heavily on the applied method and clustering threshold.
Conclusions: The quality-aware methods expand the iterative, de novo clustering approach by new clustering and cluster refinement methods. Our results indicate that OTU clustering constitutes yet another analysis step benefiting from the integration of quality information. Beyond the shown potential, the quality-aware methods offer a range of opportunities for fine-tuning and further extensions.
Affiliation(s)
- Robert Müller
  - Faculty of Technology, Bielefeld University, Bielefeld, Germany
- Markus Nebel
  - Faculty of Technology, Bielefeld University, Bielefeld, Germany
5
Alser M, Rotman J, Deshpande D, Taraszka K, Shi H, Baykal PI, Yang HT, Xue V, Knyazev S, Singer BD, Balliu B, Koslicki D, Skums P, Zelikovsky A, Alkan C, Mutlu O, Mangul S. Technology dictates algorithms: recent developments in read alignment. Genome Biol 2021; 22:249. [PMID: 34446078 PMCID: PMC8390189 DOI: 10.1186/s13059-021-02443-7] [Citation(s) in RCA: 45] [Impact Index Per Article: 11.3] [Received: 10/15/2020] [Accepted: 07/28/2021] [Indexed: 01/08/2023]
Abstract
Aligning sequencing reads onto a reference is an essential step of the majority of genomic analysis pipelines. Computational algorithms for read alignment have evolved in accordance with technological advances, leading to today's diverse array of alignment methods. We provide a systematic survey of algorithmic foundations and methodologies across 107 alignment methods, for both short and long reads. We provide a rigorous experimental evaluation of 11 read aligners to demonstrate the effect of these underlying algorithms on speed and efficiency of read alignment. We discuss how general alignment algorithms have been tailored to the specific needs of various domains in biology.
Affiliation(s)
- Mohammed Alser
  - Computer Science Department, ETH Zürich, 8092, Zürich, Switzerland
  - Computer Engineering Department, Bilkent University, 06800 Bilkent, Ankara, Turkey
  - Information Technology and Electrical Engineering Department, ETH Zürich, Zürich, 8092, Switzerland
- Jeremy Rotman
  - Department of Computer Science, University of California Los Angeles, Los Angeles, CA, 90095, USA
- Dhrithi Deshpande
  - Department of Clinical Pharmacy, School of Pharmacy, University of Southern California, Los Angeles, CA, 90089, USA
- Kodi Taraszka
  - Department of Computer Science, University of California Los Angeles, Los Angeles, CA, 90095, USA
- Huwenbo Shi
  - Department of Epidemiology, Harvard T.H. Chan School of Public Health, Boston, MA, 02115, USA
  - Program in Medical and Population Genetics, Broad Institute of MIT and Harvard, Cambridge, MA, 02142, USA
- Pelin Icer Baykal
  - Department of Computer Science, Georgia State University, Atlanta, GA, 30302, USA
- Harry Taegyun Yang
  - Department of Computer Science, University of California Los Angeles, Los Angeles, CA, 90095, USA
  - Bioinformatics Interdepartmental Ph.D. Program, University of California Los Angeles, Los Angeles, CA, 90095, USA
- Victor Xue
  - Department of Computer Science, University of California Los Angeles, Los Angeles, CA, 90095, USA
- Sergey Knyazev
  - Department of Computer Science, Georgia State University, Atlanta, GA, 30302, USA
- Benjamin D Singer
  - Division of Pulmonary and Critical Care Medicine, Northwestern University Feinberg School of Medicine, Chicago, IL, 60611, USA
  - Department of Biochemistry & Molecular Genetics, Northwestern University Feinberg School of Medicine, Chicago, USA
  - Simpson Querrey Institute for Epigenetics, Northwestern University Feinberg School of Medicine, Chicago, IL, 60611, USA
- Brunilda Balliu
  - Department of Computational Medicine, University of California Los Angeles, Los Angeles, CA, 90095, USA
- David Koslicki
  - Computer Science and Engineering, Pennsylvania State University, University Park, PA, 16801, USA
  - Biology Department, Pennsylvania State University, University Park, PA, 16801, USA
  - The Huck Institutes of the Life Sciences, Pennsylvania State University, University Park, PA, 16801, USA
- Pavel Skums
  - Department of Computer Science, Georgia State University, Atlanta, GA, 30302, USA
- Alex Zelikovsky
  - Department of Computer Science, Georgia State University, Atlanta, GA, 30302, USA
  - The Laboratory of Bioinformatics, I.M. Sechenov First Moscow State Medical University, Moscow, 119991, Russia
- Can Alkan
  - Computer Engineering Department, Bilkent University, 06800 Bilkent, Ankara, Turkey
  - Bilkent-Hacettepe Health Sciences and Technologies Program, Ankara, Turkey
- Onur Mutlu
  - Computer Science Department, ETH Zürich, 8092, Zürich, Switzerland
  - Computer Engineering Department, Bilkent University, 06800 Bilkent, Ankara, Turkey
  - Information Technology and Electrical Engineering Department, ETH Zürich, Zürich, 8092, Switzerland
- Serghei Mangul
  - Department of Clinical Pharmacy, School of Pharmacy, University of Southern California, Los Angeles, CA, 90089, USA
6
Practical Workflow from High-Throughput Genotyping to Genomic Estimated Breeding Values (GEBVs). Methods Mol Biol 2021; 2264:119-135. [PMID: 33263907 DOI: 10.1007/978-1-0716-1201-9_9] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Indexed: 12/23/2022]
Abstract
The global climate is changing, resulting in significant economic losses worldwide. It is thus necessary to speed up the plant selection process, especially for complex traits such as biotic and abiotic stresses. Nowadays, genomic selection (GS) is paving new ways to boost plant breeding, facilitating the rapid selection of superior genotypes based on the genomic estimated breeding value (GEBV). GEBVs consider all markers positioned throughout the genome, including those with minor effects. Indeed, although the effect of each marker may be very small, a large number of genome-wide markers retrieved by high-throughput genotyping (HTG) systems (mainly genotyping-by-sequencing, GBS) have the potential to explain all the genetic variance for a particular trait under selection. Although several workflows for GBS and GS data have been described, it is still hard for researchers without a bioinformatics background to carry out these analyses. This chapter has outlined some of the recently available bioinformatics resources that enable researchers to establish GBS applications for GS analysis in laboratories. Moreover, we provide useful scripts that could be used for this purpose and a description of key factors that need to be considered in these approaches.
7
Anyansi C, Straub TJ, Manson AL, Earl AM, Abeel T. Computational Methods for Strain-Level Microbial Detection in Colony and Metagenome Sequencing Data. Front Microbiol 2020; 11:1925. [PMID: 33013732 PMCID: PMC7507117 DOI: 10.3389/fmicb.2020.01925] [Citation(s) in RCA: 59] [Impact Index Per Article: 11.8] [Received: 02/27/2020] [Accepted: 07/22/2020] [Indexed: 01/17/2023]
Abstract
Metagenomic sequencing is a powerful tool for examining the diversity and complexity of microbial communities. Most widely used tools for taxonomic profiling of metagenomic sequence data allow for a species-level overview of the composition. However, individual strains within a species can differ greatly in key genotypic and phenotypic characteristics, such as drug resistance, virulence and growth rate. Therefore, the ability to resolve microbial communities down to the level of individual strains within a species is critical to interpreting metagenomic data for clinical and environmental applications, where identifying a particular strain, or tracking a particular strain across a set of samples, can aid in clinical diagnosis and treatment, or in characterizing as yet unstudied strains across novel environmental locations. Recently published approaches have begun to tackle the problem of resolving strains within a particular species in metagenomic samples. In this review, we present an overview of these new algorithms and their uses, including methods based on assembly reconstruction and methods operating with or without a reference database. While existing metagenomic analysis methods show reasonable performance at the species and higher taxonomic levels, identifying closely related strains within a species presents a bigger challenge, due to the diversity of databases, genetic relatedness, and goals when conducting these analyses. Selection of which metagenomic tool to employ for a specific application should be performed on a case-by-case basis, as these tools have strengths and weaknesses that affect their performance on specific tasks. A comprehensive benchmark across different use case scenarios is vital to validate performance of these tools on microbial samples. Because strain-level metagenomic analysis is still in its infancy, development of more fine-grained, high-resolution algorithms will continue to be in demand for the future.
Affiliation(s)
- Christine Anyansi
  - Delft Bioinformatics Lab, Delft University of Technology, Delft, Netherlands
  - Infectious Disease and Microbiome Program, Broad Institute of MIT and Harvard, Cambridge, MA, United States
- Timothy J. Straub
  - Infectious Disease and Microbiome Program, Broad Institute of MIT and Harvard, Cambridge, MA, United States
  - Department of Immunology and Infectious Diseases, Harvard T.H. Chan School of Public Health, Boston, MA, United States
- Abigail L. Manson
  - Infectious Disease and Microbiome Program, Broad Institute of MIT and Harvard, Cambridge, MA, United States
- Ashlee M. Earl
  - Infectious Disease and Microbiome Program, Broad Institute of MIT and Harvard, Cambridge, MA, United States
- Thomas Abeel
  - Delft Bioinformatics Lab, Delft University of Technology, Delft, Netherlands
  - Infectious Disease and Microbiome Program, Broad Institute of MIT and Harvard, Cambridge, MA, United States
8
Sun S, Sha Z, Wang Y. The complete mitochondrial genomes of two vent squat lobsters, Munidopsis lauensis and M. verrilli: Novel gene arrangements and phylogenetic implications. Ecol Evol 2019; 9:12390-12407. [PMID: 31788185 PMCID: PMC6875667 DOI: 10.1002/ece3.5542] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.3] [Received: 10/31/2018] [Revised: 01/31/2019] [Accepted: 07/19/2019] [Indexed: 12/14/2022]
Abstract
Hydrothermal vents are considered one of the harshest environments on Earth. In this study, the complete mitogenomes of the hydrothermal vent squat lobsters Munidopsis lauensis and M. verrilli were determined through Illumina sequencing and compared with other available mitogenomes of anomurans. The mitogenomes of M. lauensis (17,483 bp) and M. verrilli (17,636 bp) are the largest among all Anomura mitogenomes, while the A+T contents of M. lauensis (62.40%) and M. verrilli (63.99%) are the lowest. The mitogenomes of M. lauensis and M. verrilli display novel gene arrangements, which might be the result of three tandem duplication-random loss (tdrl) events from the ancestral pancrustacean pattern. The mitochondrial gene orders of M. lauensis and M. verrilli shared the most similarities with Shinkaia crosnieri. The phylogenetic analyses based on both gene order data and nucleotide sequences (PCGs and rRNAs) revealed that the two species were closely related to S. crosnieri. Positive selection analysis revealed that eighteen residues in seven genes (atp8, Cytb, nad3, nad4, nad4l, nad5, and nad6) of the hydrothermal vent anomurans were positively selected sites.
Affiliation(s)
- Shao'e Sun
  - Deep Sea Research Center, Institute of Oceanology, Chinese Academy of Science, Qingdao, China
  - Center for Ocean Mega-Science, Chinese Academy of Sciences, Qingdao, China
- Zhongli Sha
  - Deep Sea Research Center, Institute of Oceanology, Chinese Academy of Science, Qingdao, China
  - Center for Ocean Mega-Science, Chinese Academy of Sciences, Qingdao, China
  - Laboratory for Marine Biology and Biotechnology, Qingdao National Laboratory for Marine Science and Technology, Qingdao, China
  - University of Chinese Academy of Sciences, Beijing, China
- Yanrong Wang
  - Deep Sea Research Center, Institute of Oceanology, Chinese Academy of Science, Qingdao, China
  - Center for Ocean Mega-Science, Chinese Academy of Sciences, Qingdao, China
9
Cho Y, Lee S, Hong JH, Kim BJ, Hong WY, Jung J, Lee HB, Sung J, Kim HN, Kim HL, Jung J. Development of the variant calling algorithm, ADIScan, and its use to estimate discordant sequences between monozygotic twins. Nucleic Acids Res 2019; 46:e92. [PMID: 29873758 PMCID: PMC6125643 DOI: 10.1093/nar/gky445] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.5] [Received: 06/16/2017] [Accepted: 05/15/2018] [Indexed: 12/30/2022]
Abstract
Calling variants from next-generation sequencing (NGS) data or discovering discordant sequences between two NGS data sets is challenging. We developed a computer algorithm, ADIScan1, to call variants by comparing the fractions of allelic reads in a tester to the universal reference genome. We then created ADIScan2 by modifying the algorithm to directly compare two sets of NGS data and predict discordant sequences between two testers. ADIScan1 detected >99.7% of variants called by GATK with an additional 724 393 SNVs. ADIScan2 identified ∼500 candidates of discordant sequences in each of two pairs of the monozygotic twins. About 200 of these candidates were included in the ∼2800 predicted by VarScan2. We verified 66 true discordant sequences among the candidates that ADIScan2 and VarScan2 exclusively predicted. ADIScan2 detected many discordant sequences overlooked by VarScan2 and Mutect, which specialize in detecting low frequency mutations in genetically heterogeneous cancerous tissues. Numbers of verified sequences alone were >5 times more than expected based on recently estimated mutation rates from whole genome sequences. Estimated post-zygotic mutation rates were 1.68 × 10⁻⁷ in this study. ADIScan1 and 2 would complement existing tools in screening causative mutations of diverse genetic diseases and comparing two sets of genome sequences, respectively.
Affiliation(s)
- Yangrae Cho
  - Syntekabio Incorporated, Techno-2ro B-512, Yuseong-gu, Daejeon 34025, Republic of Korea
  - DFTBA, CALS, Chonnam National University, Gwangju 61186, Republic of Korea
- Sunho Lee
  - Syntekabio Incorporated, Techno-2ro B-512, Yuseong-gu, Daejeon 34025, Republic of Korea
  - School of Computer Science and Engineering, Seoul National University, Seoul, 151-742, Republic of Korea
- Jong Hui Hong
  - Syntekabio Incorporated, Techno-2ro B-512, Yuseong-gu, Daejeon 34025, Republic of Korea
  - Research Institute of Pharmaceutical Sciences, College of Pharmacy, Seoul National University, Seoul 08826, Republic of Korea
- Byong Joon Kim
  - Syntekabio Incorporated, Techno-2ro B-512, Yuseong-gu, Daejeon 34025, Republic of Korea
- Woon-Young Hong
  - Syntekabio Incorporated, Techno-2ro B-512, Yuseong-gu, Daejeon 34025, Republic of Korea
- Jongcheol Jung
  - Syntekabio Incorporated, Techno-2ro B-512, Yuseong-gu, Daejeon 34025, Republic of Korea
- Hyang Burm Lee
  - DFTBA, CALS, Chonnam National University, Gwangju 61186, Republic of Korea
- Joohon Sung
  - Complex Disease and Genome Epidemiology Branch, Department of Epidemiology, School of Public Health, Seoul National University, Seoul 08826, Republic of Korea
- Han-Na Kim
  - Department of Biochemistry, School of Medicine, Ewha Woman's University, Seoul 07985, Republic of Korea
- Hyung-Lae Kim
  - Department of Biochemistry, School of Medicine, Ewha Woman's University, Seoul 07985, Republic of Korea
- Jongsun Jung
  - Syntekabio Incorporated, Techno-2ro B-512, Yuseong-gu, Daejeon 34025, Republic of Korea
10
Prousalis K, Konofaos N. A Quantum Pattern Recognition Method for Improving Pairwise Sequence Alignment. Sci Rep 2019; 9:7226. [PMID: 31076611 PMCID: PMC6510764 DOI: 10.1038/s41598-019-43697-3] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.3] [Received: 08/16/2018] [Accepted: 04/29/2019] [Indexed: 12/22/2022]
Abstract
Quantum pattern recognition techniques have recently attracted attention as potential candidates for analyzing vast amounts of data. Faster ways to process data are imperative where data generation is rapid. The growth of sequence databases caused by the development of high-throughput sequencing is unprecedented. Current alignment methods have proliferated rapidly, but there is still a need for more efficient methods that preserve high accuracy. In this work, a complex method is proposed to treat the alignment problem better than its classical counterparts by means of quantum computation. The basic principle of the standard dot-plot method is combined with a quantum algorithm, giving insight into the effect of quantum pattern recognition on pairwise alignment. The central feature of quantum algorithms, quantum parallelism, and the diffraction patterns of x-rays are synthesized to provide a clever array indexing structure on the growing sequence databases. A completely different approach is considered in contrast to contemporary conventional aligners, and a variety of competitive classical counterparts are classified and organized for comparison with the quantum setting. The proposed method seems to exhibit high alignment quality and to outperform the others in terms of time and space complexity.
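For readers unfamiliar with the standard dot-plot method that this abstract builds on, a minimal classical sketch follows. It shows only the classical comparison matrix, not the quantum method of the paper; the function names and example sequences are hypothetical.

```python
def dot_plot(seq_a, seq_b):
    """Return a matrix whose cell [i][j] is True when seq_a[i] == seq_b[j]."""
    return [[a == b for b in seq_b] for a in seq_a]

def render(matrix):
    """Draw the matrix as text: '*' for a match, '.' for a mismatch."""
    return "\n".join("".join("*" if cell else "." for cell in row) for row in matrix)

# Runs of '*' along a diagonal indicate stretches where the sequences agree.
plot = dot_plot("GATTACA", "GATCACA")
print(render(plot))
```

The classical matrix needs one comparison per pair of positions, so its cost grows with the product of the two sequence lengths.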
Affiliation(s)
- Nikos Konofaos
  - Department of Informatics, Aristotle University of Thessaloniki, Thessaloniki, Greece
11
Abstract
Long double-stranded RNAs (dsRNAs) are abundantly expressed in animals, in which they frequently occur in introns and 3' untranslated regions of mRNAs. Functions of long, cellular dsRNAs are poorly understood, although deficiencies in adenosine deaminases that act on RNA, or ADARs, promote their recognition as viral dsRNA and an aberrant immune response. Diverse dsRNA-binding proteins bind cellular dsRNAs, hinting at additional roles. Understanding these roles is facilitated by mapping the genomic locations that express dsRNA in various tissues and organisms. ADAR editing provides a signature of dsRNA structure in cellular transcripts. In this review, we detail approaches to map ADAR editing sites and dsRNAs genome-wide, with particular focus on high-throughput sequencing methods and considerations for their successful application to the detection of editing sites and dsRNAs.
Affiliation(s)
- Daniel P Reich
  - Department of Biochemistry, University of Utah, Salt Lake City, Utah 84112
- Brenda L Bass
  - Department of Biochemistry, University of Utah, Salt Lake City, Utah 84112
12
Abstract
BACKGROUND: Targeted gene sequencing (TGS) and whole exome sequencing (WES) are being used in clinical testing in laboratories. We compared the performances of TGS and WES using the same DNA samples.
METHODS: DNA was extracted from 10 endometrial tumor tissue specimens. Sequencing was performed with an Illumina HiSeq 2000. We randomly selected variants to confirm through Sanger sequencing or mutant-enriched PCR with Sanger sequencing.
RESULTS: We found that the variants identified in both TGS and WES were true positives (47/47), regardless of the sequencing depth. Most variants found in TGS only were true positives (34/40), and most of the variants found by WES only were false positives (8/18). From these results, we suggest that the sequencing depth may not play an important role in the accuracy of NGS-based methods. After analysis, we found that WES had a sensitivity of 72.70%, specificity of 96.27%, precision of 99.44%, and accuracy of 75.03%.
CONCLUSIONS: The results of NGS-based methods must currently be validated, especially for important reported variants, regardless of the methods used, and a higher false negative rate must be considered when WES is used in cancers. More sensitive methods should be used to confirm the NGS results in uneven cancer tissues.
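The four performance measures reported in this abstract follow the standard confusion-matrix definitions. The sketch below shows those definitions with made-up counts; the numbers are hypothetical and are not the study's data, and the function name is an assumption.

```python
def diagnostic_metrics(tp, fp, fn, tn):
    """Compute standard measures from true/false positive/negative counts."""
    return {
        "sensitivity": tp / (tp + fn),   # share of real variants that were called
        "specificity": tn / (tn + fp),   # share of non-variants correctly rejected
        "precision": tp / (tp + fp),     # share of calls that are real variants
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
    }

# Hypothetical counts chosen only to make the arithmetic easy to follow.
metrics = diagnostic_metrics(tp=80, fp=5, fn=20, tn=95)
for name, value in metrics.items():
    print(f"{name}: {value:.2%}")
```

A method can combine high precision with modest sensitivity, as reported for WES here, when few of its calls are wrong but many true variants are missed.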
13
Brumme CJ, Poon AFY. Promises and pitfalls of Illumina sequencing for HIV resistance genotyping. Virus Res 2016; 239:97-105. [PMID: 27993623 DOI: 10.1016/j.virusres.2016.12.008] [Citation(s) in RCA: 23] [Impact Index Per Article: 2.6] [Received: 09/20/2016] [Revised: 12/15/2016] [Accepted: 12/15/2016] [Indexed: 12/13/2022]
Abstract
Genetic sequencing ("genotyping") plays a critical role in the modern clinical management of HIV infection. This virus evolves rapidly within patients because of its error-prone reverse transcriptase and short generation time. Consequently, HIV variants with mutations that confer resistance to one or more antiretroviral drugs can emerge during sub-optimal treatment. There are now multiple HIV drug resistance interpretation algorithms that take the region of the HIV genome encoding the major drug targets as inputs; expert use of these algorithms can significantly improve clinical outcomes in HIV treatment. Next-generation sequencing has the potential to revolutionize HIV resistance genotyping by lowering the threshold at which rare but clinically significant HIV variants can be detected reproducibly, and by conferring improved cost-effectiveness in high-throughput scenarios. In this review, we discuss the relative merits and challenges of deploying the Illumina MiSeq instrument for clinical HIV genotyping.
Affiliation(s)
- Chanson J Brumme
  - BC Centre for Excellence in HIV/AIDS, Vancouver, British Columbia, Canada
- Art F Y Poon
  - Department of Pathology & Laboratory Medicine, Western University, London, Ontario, Canada
14
Han Y, He X. Integrating Epigenomics into the Understanding of Biomedical Insight. Bioinform Biol Insights 2016; 10:267-289. [PMID: 27980397 PMCID: PMC5138066 DOI: 10.4137/bbi.s38427] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.7] [Received: 06/17/2016] [Revised: 11/01/2016] [Accepted: 11/06/2016] [Indexed: 12/13/2022]
Abstract
Epigenetics is one of the most rapidly expanding fields in biomedical research, and the popularity of the high-throughput next-generation sequencing (NGS) highlights the accelerating speed of epigenomics discovery over the past decade. Epigenetics studies the heritable phenotypes resulting from chromatin changes but without alteration on DNA sequence. Epigenetic factors and their interactive network regulate almost all of the fundamental biological procedures, and incorrect epigenetic information may lead to complex diseases. A comprehensive understanding of epigenetic mechanisms, their interactions, and alterations in health and diseases genome widely has become a priority in biological research. Bioinformatics is expected to make a remarkable contribution for this purpose, especially in processing and interpreting the large-scale NGS datasets. In this review, we introduce the epigenetics pioneering achievements in health status and complex diseases; next, we give a systematic review of the epigenomics data generation, summarize public resources and integrative analysis approaches, and finally outline the challenges and future directions in computational epigenomics.
Affiliation(s)
- Yixing Han
  - Mouse Cancer Genetics Program, Center for Cancer Research, National Cancer Institute, National Institutes of Health, Frederick, MD, USA; Present address: Genetics and Biochemistry Branch, National Institute of Diabetes and Digestive and Kidney Diseases, National Institutes of Health, Bethesda, MD, USA
- Ximiao He
  - Laboratory of Metabolism, Center for Cancer Research, National Cancer Institute, National Institutes of Health, Bethesda, MD, USA; Present address: Department of Medical Genetics, School of Basic Medicine, Tongji Medical College, Huazhong University of Science and Technology, Wuhan, Hubei, China
15
Ziemann M, Kaspi A, El-Osta A. Evaluation of microRNA alignment techniques. RNA (New York, N.Y.) 2016; 22:1120-38. [PMID: 27284164 PMCID: PMC4931105 DOI: 10.1261/rna.055509.115] [Citation(s) in RCA: 35] [Impact Index Per Article: 3.9] [Received: 12/01/2015] [Accepted: 05/04/2016] [Indexed: 05/26/2023]
Abstract
Genomic alignment of small RNA (smRNA) sequences such as microRNAs poses considerable challenges due to their short length (∼21 nucleotides [nt]) as well as the large size and complexity of plant and animal genomes. While several tools have been developed for high-throughput mapping of longer mRNA-seq reads (>30 nt), there are few that are specifically designed for mapping of smRNA reads including microRNAs. The accuracy of these mappers has not been systematically determined in the case of smRNA-seq. In addition, it is unknown whether these aligners accurately map smRNA reads containing sequence errors and polymorphisms. By using simulated read sets, we determine the alignment sensitivity and accuracy of 16 short-read mappers and quantify their robustness to mismatches, indels, and nontemplated nucleotide additions. These were explored in the context of a plant genome (Oryza sativa, ∼500 Mbp) and a mammalian genome (Homo sapiens, ∼3.1 Gbp). Analysis of simulated and real smRNA-seq data demonstrates that mapper selection impacts differential expression results and interpretation. These results will inform on best practice for smRNA mapping and enable more accurate smRNA detection and quantification of expression and RNA editing.
Affiliation(s)
- Mark Ziemann
  - Epigenetics in Human Health and Disease Laboratory, Baker IDI Heart and Diabetes Institute, The Alfred Medical Research and Education Precinct, Melbourne, Victoria 3004, Australia
  - Epigenomics Profiling Facility, Baker IDI Heart and Diabetes Institute, The Alfred Medical Research and Education Precinct, Melbourne, Victoria 3004, Australia
- Antony Kaspi
  - Epigenetics in Human Health and Disease Laboratory, Baker IDI Heart and Diabetes Institute, The Alfred Medical Research and Education Precinct, Melbourne, Victoria 3004, Australia
  - Epigenomics Profiling Facility, Baker IDI Heart and Diabetes Institute, The Alfred Medical Research and Education Precinct, Melbourne, Victoria 3004, Australia
- Assam El-Osta
  - Epigenetics in Human Health and Disease Laboratory, Baker IDI Heart and Diabetes Institute, The Alfred Medical Research and Education Precinct, Melbourne, Victoria 3004, Australia
  - Epigenomics Profiling Facility, Baker IDI Heart and Diabetes Institute, The Alfred Medical Research and Education Precinct, Melbourne, Victoria 3004, Australia
16
Kagale S, Koh C, Clarke WE, Bollina V, Parkin IAP, Sharpe AG. Analysis of Genotyping-by-Sequencing (GBS) Data. Methods Mol Biol 2016; 1374:269-284. [PMID: 26519412 DOI: 10.1007/978-1-4939-3167-5_15] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 06/05/2023]
Abstract
The development of genotyping-by-sequencing (GBS) to rapidly detect nucleotide variation at the whole genome level, in many individuals simultaneously, has provided a transformative genetic profiling technique. GBS can be carried out in species with or without reference genome sequences and yields huge amounts of potentially informative data. One limitation of the approach is the paucity of tools to transform the raw data into a format that can be easily interrogated at the genetic level. In this chapter we describe bioinformatics tools developed to address this shortfall, together with experimental design considerations to fully leverage the power of GBS for genetic analysis.
Affiliation(s)
- Sateesh Kagale
  - National Research Council Canada, 110 Gymnasium Place, Saskatoon, SK, Canada, S7N 0W9
- Chushin Koh
  - National Research Council Canada, 110 Gymnasium Place, Saskatoon, SK, Canada, S7N 0W9
- Wayne E Clarke
  - Agriculture and Agri-Food Canada, 107 Science Place, Saskatoon, SK, Canada, S7N 0X2
- Venkatesh Bollina
  - Agriculture and Agri-Food Canada, 107 Science Place, Saskatoon, SK, Canada, S7N 0X2
- Isobel A P Parkin
  - Agriculture and Agri-Food Canada, 107 Science Place, Saskatoon, SK, Canada, S7N 0X2
- Andrew G Sharpe
  - National Research Council Canada, 110 Gymnasium Place, Saskatoon, SK, Canada, S7N 0W9
17
Ye H, Meehan J, Tong W, Hong H. Alignment of Short Reads: A Crucial Step for Application of Next-Generation Sequencing Data in Precision Medicine. Pharmaceutics 2015; 7:523-41. [PMID: 26610555 PMCID: PMC4695832 DOI: 10.3390/pharmaceutics7040523] [Citation(s) in RCA: 20] [Impact Index Per Article: 2.0] [Received: 10/21/2015] [Revised: 11/14/2015] [Accepted: 11/17/2015] [Indexed: 02/06/2023] Open
Abstract
Precision medicine or personalized medicine has been proposed as a modernized and promising medical strategy. Genetic variants of patients are the key information for implementation of precision medicine. Next-generation sequencing (NGS) is an emerging technology for deciphering genetic variants. Alignment of raw reads to a reference genome is one of the key steps in NGS data analysis. Many algorithms have been developed for alignment of short read sequences since 2008. Users have to decide which alignment algorithm to use in their studies. Making the right selection involves choosing not only the alignment algorithm but also the set of suitable parameters to be used by the algorithm. Understanding these algorithms helps in selecting the appropriate alignment algorithm for different applications in precision medicine. Here, we review currently available algorithms and their major strategies such as seed-and-extend and q-gram filter. We also discuss the challenges in current alignment algorithms, including alignment in multiple repeated regions, long-read alignment, and alignment facilitated with known genetic variants.
Affiliation(s)
- Hao Ye
  - Division of Bioinformatics and Biostatistics, National Center for Toxicological Research, U.S. Food and Drug Administration, 3900 NCTR Road, Jefferson, AR 72079, USA
- Joe Meehan
  - Division of Bioinformatics and Biostatistics, National Center for Toxicological Research, U.S. Food and Drug Administration, 3900 NCTR Road, Jefferson, AR 72079, USA
- Weida Tong
  - Division of Bioinformatics and Biostatistics, National Center for Toxicological Research, U.S. Food and Drug Administration, 3900 NCTR Road, Jefferson, AR 72079, USA
- Huixiao Hong
  - Division of Bioinformatics and Biostatistics, National Center for Toxicological Research, U.S. Food and Drug Administration, 3900 NCTR Road, Jefferson, AR 72079, USA
18
Cui Y, Liao X, Zhu X, Wang B, Peng S. B-MIC: An Ultrafast Three-Level Parallel Sequence Aligner Using MIC. Interdiscip Sci 2015; 8:28-34. [PMID: 26358141 DOI: 10.1007/s12539-015-0278-5] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Received: 02/04/2014] [Revised: 05/19/2014] [Accepted: 05/19/2014] [Indexed: 12/29/2022]
Abstract
Sequence alignment, the mapping of raw sequencing data to a reference genome, is the central process of sequence analysis. The large amount of data generated by NGS is far beyond the processing capabilities of existing alignment tools. Consequently, sequence alignment becomes the bottleneck of sequence analysis. Intensive computing power is required to address this challenge. Intel recently announced the MIC coprocessor, which can provide massive computing power. The Tianhe-2 is currently the world's fastest supercomputer and is equipped with three MIC coprocessors per compute node. A key feature of sequence alignment is that different reads are independent. Considering this property, we proposed a MIC-oriented three-level parallelization strategy to speed up BWA, a widely used sequence alignment tool, and developed our ultrafast parallel sequence aligner: B-MIC. B-MIC contains three levels of parallelization: first, parallelization of data IO and read alignment by a three-stage parallel pipeline; second, parallelization enabled by MIC coprocessor technology; third, inter-node parallelization implemented by MPI. In this paper, we demonstrate that B-MIC outperforms BWA through a combination of those techniques, using an Inspur NF5280M server and the Tianhe-2 supercomputer. To the best of our knowledge, B-MIC is the first sequence alignment tool to run on Intel MIC, and it can achieve more than fivefold speedup over the original BWA while maintaining the alignment precision.
Affiliation(s)
- Yingbo Cui
  - School of Computer Science, National University of Defense Technology, Changsha, 410073, China
- Xiangke Liao
  - School of Computer Science, National University of Defense Technology, Changsha, 410073, China
- Xiaoqian Zhu
  - School of Computer Science, National University of Defense Technology, Changsha, 410073, China
- Shaoliang Peng
  - School of Computer Science, National University of Defense Technology, Changsha, 410073, China
19
Whipple JM, Youssef OA, Aruscavage PJ, Nix DA, Hong C, Johnson WE, Bass BL. Genome-wide profiling of the C. elegans dsRNAome. RNA (New York, N.Y.) 2015; 21:786-800. [PMID: 25805852 PMCID: PMC4408787 DOI: 10.1261/rna.048801.114] [Citation(s) in RCA: 35] [Impact Index Per Article: 3.5] [Received: 11/01/2014] [Accepted: 12/23/2014] [Indexed: 06/01/2023]
Abstract
Recent studies hint that endogenous dsRNA plays an unexpected role in cellular signaling. However, a complete understanding of endogenous dsRNA signaling is hindered by an incomplete annotation of dsRNA-producing genes. To identify dsRNAs expressed in Caenorhabditis elegans, we developed a bioinformatics pipeline that identifies dsRNA by detecting clustered RNA editing sites, which are strictly limited to long dsRNA substrates of Adenosine Deaminases that act on RNA (ADAR). We compared two alignment algorithms for mapping both unique and repetitive reads and detected as many as 664 editing-enriched regions (EERs) indicative of dsRNA loci. EERs are visually enriched on the distal arms of autosomes and are predicted to possess strong internal secondary structures as well as sequence complementarity with other EERs, indicative of both intramolecular and intermolecular duplexes. Most EERs were associated with protein-coding genes, with ∼1.7% of all C. elegans mRNAs containing an EER, located primarily in very long introns and in annotated, as well as unannotated, 3' UTRs. In addition to numerous EERs associated with coding genes, we identified a population of prospective noncoding EERs that were distant from protein-coding genes and that had little or no coding potential. Finally, subsets of EERs are differentially expressed during development as well as during starvation and infection with bacterial or fungal pathogens. By combining RNA-seq with freely available bioinformatics tools, our workflow provides an easily accessible approach for the identification of dsRNAs, and more importantly, a catalog of the C. elegans dsRNAome.
Affiliation(s)
- Joseph M Whipple
  - Department of Biochemistry, University of Utah, Salt Lake City, Utah 84112-5650, USA
- Osama A Youssef
  - Department of Biochemistry, University of Utah, Salt Lake City, Utah 84112-5650, USA
- P Joseph Aruscavage
  - Department of Biochemistry, University of Utah, Salt Lake City, Utah 84112-5650, USA
- David A Nix
  - Department of Oncological Sciences, Huntsman Cancer Institute, University of Utah School of Medicine, Salt Lake City, Utah 84112-5775, USA
- Changjin Hong
  - Division of Computational Biomedicine, Boston University School of Medicine, Boston, Massachusetts 02118, USA
- W Evan Johnson
  - Division of Computational Biomedicine, Boston University School of Medicine, Boston, Massachusetts 02118, USA
- Brenda L Bass
  - Department of Biochemistry, University of Utah, Salt Lake City, Utah 84112-5650, USA
20
Abstract
Background Genome assemblers to date have predominantly targeted haploid reference reconstruction from homozygous data. When applied to diploid genome assembly, these assemblers perform poorly, owing to the violation of assumptions during both the contigging and scaffolding phases. Effective tools to overcome these problems are in growing demand. Increasing parameter stringency during contigging is an effective solution to obtaining haplotype-specific contigs; however, effective algorithms for scaffolding such contigs are lacking. Methods We present a stand-alone scaffolding algorithm, ScaffoldScaffolder, designed specifically for scaffolding diploid genomes. The algorithm identifies homologous sequences as found in "bubble" structures in scaffold graphs. Machine learning classification is used to then classify sequences in partial bubbles as homologous or non-homologous sequences prior to reconstructing haplotype-specific scaffolds. We define four new metrics for assessing diploid scaffolding accuracy: contig sequencing depth, contig homogeneity, phase group homogeneity, and heterogeneity between phase groups. Results We demonstrate the viability of using bubbles to identify heterozygous homologous contigs, which we term homolotigs. We show that machine learning classification trained on these homolotig pairs can be used effectively for identifying homologous sequences elsewhere in the data with high precision (assuming error-free reads). Conclusion More work is required to comparatively analyze this approach on real data with various parameters and classifiers against other diploid genome assembly methods. However, the initial results of ScaffoldScaffolder supply validity to the idea of employing machine learning in the difficult task of diploid genome assembly. Software is available at http://bioresearch.byu.edu/scaffoldscaffolder.
21
Zhang W, Soika V, Meehan J, Su Z, Ge W, Ng HW, Perkins R, Simonyan V, Tong W, Hong H. Quality control metrics improve repeatability and reproducibility of single-nucleotide variants derived from whole-genome sequencing. The Pharmacogenomics Journal 2014; 15:298-309. [PMID: 25384574 DOI: 10.1038/tpj.2014.70] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.7] [Received: 03/31/2014] [Revised: 07/16/2014] [Accepted: 09/19/2014] [Indexed: 12/18/2022]
Abstract
Although many quality control (QC) methods have been developed to improve the quality of single-nucleotide variants (SNVs) in SNV-calling, QC methods for use subsequent to single-nucleotide polymorphism-calling have not been reported. We developed five QC metrics to improve the quality of SNVs using the whole-genome-sequencing data of a monozygotic twin pair from the Korean Personal Genome Project. The QC metrics improved both repeatability between the monozygotic twin pair and reproducibility between SNV-calling pipelines. We demonstrated the QC metrics improve reproducibility of SNVs derived from not only whole-genome-sequencing data but also whole-exome-sequencing data. The QC metrics are calculated based on the reference genome used in the alignment without accessing the raw and intermediate data or knowing the SNV-calling details. Therefore, the QC metrics can be easily adopted in downstream association analysis.
Affiliation(s)
- W Zhang
  - Division of Bioinformatics and Biostatistics, National Center for Toxicological Research, US Food and Drug Administration, Jefferson, AR, USA
- V Soika
  - Office of The Center Director, Center for Biologics Evaluation and Research, US Food and Drug Administration, Rockville, MD, USA
- J Meehan
  - Division of Bioinformatics and Biostatistics, National Center for Toxicological Research, US Food and Drug Administration, Jefferson, AR, USA
- Z Su
  - Division of Bioinformatics and Biostatistics, National Center for Toxicological Research, US Food and Drug Administration, Jefferson, AR, USA
- W Ge
  - Division of Bioinformatics and Biostatistics, National Center for Toxicological Research, US Food and Drug Administration, Jefferson, AR, USA
- H W Ng
  - Division of Bioinformatics and Biostatistics, National Center for Toxicological Research, US Food and Drug Administration, Jefferson, AR, USA
- R Perkins
  - Division of Bioinformatics and Biostatistics, National Center for Toxicological Research, US Food and Drug Administration, Jefferson, AR, USA
- V Simonyan
  - Office of The Center Director, Center for Biologics Evaluation and Research, US Food and Drug Administration, Rockville, MD, USA
- W Tong
  - Division of Bioinformatics and Biostatistics, National Center for Toxicological Research, US Food and Drug Administration, Jefferson, AR, USA
- H Hong
  - Division of Bioinformatics and Biostatistics, National Center for Toxicological Research, US Food and Drug Administration, Jefferson, AR, USA
22
Clement NL, Thompson LP, Miranker DP. ADaM: augmenting existing approximate fast matching algorithms with efficient and exact range queries. BMC Bioinformatics 2014; 15 Suppl 7:S1. [PMID: 25079667 PMCID: PMC4110726 DOI: 10.1186/1471-2105-15-s7-s1] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Indexed: 11/10/2022] Open
Abstract
Background Drug discovery, disease detection, and personalized medicine are fast-growing areas of genomic research. With the advancement of next-generation sequencing techniques, researchers can obtain an abundance of data for many different biological assays in a short period of time. When these data are error-free, the result is a high-quality, base-pair-resolution picture of the genome. However, when the data are lossy, the heuristic algorithms currently used to align next-generation sequences cause the corresponding accuracy to drop. Results This paper describes a program, ADaM (APF DNA Mapper), which significantly increases final alignment accuracy. ADaM works by first using an existing program to align "easy" sequences, and then using an algorithm with accuracy guarantees (the APF) to align the remaining sequences. The final result is a technique that increases the mapping accuracy from only 60% to over 90% for harder-to-align sequences.
23
Validation and assessment of variant calling pipelines for next-generation sequencing. Hum Genomics 2014; 8:14. [PMID: 25078893 PMCID: PMC4129436 DOI: 10.1186/1479-7364-8-14] [Citation(s) in RCA: 83] [Impact Index Per Article: 7.5] [Received: 06/03/2014] [Accepted: 07/15/2014] [Indexed: 02/07/2023] Open
Abstract
Background The processing and analysis of the large-scale data generated by next-generation sequencing (NGS) experiments is challenging and is a burgeoning area of new methods development. Several new bioinformatics tools have been developed for calling sequence variants from NGS data. Here, we validate the variant calling of these tools and compare their relative accuracy to determine which data processing pipeline is optimal. Results We developed a unified pipeline for processing NGS data that encompasses four modules: mapping, filtering, realignment and recalibration, and variant calling. We processed 130 subjects from an ongoing whole exome sequencing study through this pipeline. To evaluate the accuracy of each module, we conducted a series of comparisons between the single nucleotide variant (SNV) calls from the NGS data and either gold-standard Sanger sequencing on a total of 700 variants or array genotyping data on a total of 9,935 single-nucleotide polymorphisms. A head-to-head comparison showed that Genome Analysis Toolkit (GATK) provided more accurate calls than SAMtools (positive predictive value of 92.55% vs. 80.35%, respectively). Realignment of mapped reads and recalibration of base quality scores before SNV calling proved to be crucial to accurate variant calling. The GATK HaplotypeCaller algorithm for variant calling outperformed the UnifiedGenotyper algorithm. We also showed a relationship between mapping quality, read depth and allele balance, and SNV call accuracy. However, if best practices are used in data processing, then additional filtering based on these metrics provides little gain, and accuracies of >99% are achievable. Conclusions Our findings will help to determine the best approach for processing NGS data to confidently call variants for downstream analyses. To enable others to implement and replicate our results, all of our code is freely available at http://metamoodics.org/wes.
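The positive predictive value quoted in this abstract is the fraction of called variants that a gold standard confirms. A minimal sketch of that calculation follows; the function name and the counts are hypothetical illustrations, not taken from the study or its published code.

```python
def positive_predictive_value(true_positives: int, false_positives: int) -> float:
    """Return PPV = TP / (TP + FP) for a set of variant calls."""
    called = true_positives + false_positives
    if called == 0:
        raise ValueError("no variant calls to evaluate")
    return true_positives / called


# Hypothetical counts: 90 calls confirmed by the gold standard, 10 not confirmed.
ppv = positive_predictive_value(true_positives=90, false_positives=10)
print(f"PPV = {ppv:.2%}")
```

PPV depends only on the calls that were made, so it says nothing about variants a pipeline missed; sensitivity would need the count of false negatives as well.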
24
Evaluation and comparison of multiple aligners for next-generation sequencing data analysis. BioMed Research International 2014; 2014:309650. [PMID: 24779008 PMCID: PMC3980841 DOI: 10.1155/2014/309650] [Citation(s) in RCA: 44] [Impact Index Per Article: 4.0] [Received: 12/17/2013] [Accepted: 02/04/2014] [Indexed: 12/23/2022]
Abstract
Next-generation sequencing (NGS) technology has advanced rapidly and generated massive data volumes. To align and map NGS data, biologists often select aligners arbitrarily, without considering their features, performance, and accuracy, or the sequence variations and polymorphisms present in the reference genome. This study aims to systematically evaluate and compare the capability of multiple aligners for NGS data analysis. To explore this capability, we first compared and classified alignment algorithms. We then used long-read and short-read datasets from both real-life and in silico NGS data for comparative analysis and evaluation of these aligners, focusing on three criteria: application-specific alignment features, computational performance, and alignment accuracy. Our study provides an overall evaluation and comparison of multiple aligners for NGS data analysis. This serves as an important guiding resource for biologists to gain further insight into suitable selection of aligners for specific and broad applications.
25
Qian X, Ba Y, Zhuang Q, Zhong G. RNA-Seq technology and its application in fish transcriptomics. OMICS: A Journal of Integrative Biology 2013; 18:98-110. [PMID: 24380445 DOI: 10.1089/omi.2013.0110] [Citation(s) in RCA: 196] [Impact Index Per Article: 16.3] [Indexed: 12/19/2022]
Abstract
High-throughput sequencing technologies, also known as next-generation sequencing (NGS) technologies, have revolutionized the way that genomic research is advancing. In addition to the static genome, these state-of-the-art technologies have recently been exploited to analyze the dynamic transcriptome, and the resulting technology is termed RNA sequencing (RNA-seq). RNA-seq is free from many limitations of other transcriptomic approaches, such as microarray and tag-based sequencing methods. Although RNA-seq has only been available for a short time, studies using this method have completely changed our perspective of the breadth and depth of eukaryotic transcriptomes. In terms of the transcriptomics of teleost fishes, both model and non-model species have benefited from the RNA-seq approach and have undergone tremendous advances in the past several years. RNA-seq has helped not only in mapping and annotating fish transcriptomes but also in our understanding of many biological processes in fish, such as development, adaptive evolution, host immune response, and stress response. In this review, we first provide an overview of each step of RNA-seq, from library construction to the bioinformatic analysis of the data. We then summarize and discuss the recent biological insights obtained from RNA-seq studies in a variety of fish species.
Affiliation(s)
- Xi Qian
  - Department of Animal Science, University of Vermont, Burlington, Vermont
26
Hong C, Clement NL, Clement S, Hammoud SS, Carrell DT, Cairns BR, Snell Q, Clement MJ, Johnson WE. Probabilistic alignment leads to improved accuracy and read coverage for bisulfite sequencing data. BMC Bioinformatics 2013; 14:337. [PMID: 24261665 PMCID: PMC3924334 DOI: 10.1186/1471-2105-14-337] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.6] [Received: 06/07/2013] [Accepted: 11/19/2013] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND DNA methylation has been linked to many important biological phenomena. Researchers have recently begun to sequence bisulfite treated DNA to determine its pattern of methylation. However, sequencing reads from bisulfite-converted DNA can vary significantly from the reference genome because of incomplete bisulfite conversion, genome variation, sequencing errors, and poor quality bases. Therefore, it is often difficult to align reads to the correct locations in the reference genome. Furthermore, bisulfite sequencing experiments have the additional complexity of having to estimate the DNA methylation levels within the sample. RESULTS Here, we present a highly accurate probabilistic algorithm, which is an extension of the Genomic Next-generation Universal MAPper to accommodate bisulfite sequencing data (GNUMAP-bs), that addresses the computational problems associated with aligning bisulfite sequencing data to a reference genome. GNUMAP-bs integrates uncertainty from read and mapping qualities to help resolve the difference between poor quality bases and the ambiguity inherent in bisulfite conversion. We tested GNUMAP-bs and other commonly-used bisulfite alignment methods using both simulated and real bisulfite reads and found that GNUMAP-bs and other dynamic programming methods were more accurate than the more heuristic methods. CONCLUSIONS The GNUMAP-bs aligner is a highly accurate alignment approach for processing the data from bisulfite sequencing experiments. The GNUMAP-bs algorithm is freely available for download at: http://dna.cs.byu.edu/gnumap. The software runs on multiple threads and multiple processors to increase the alignment speed.
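To make "uncertainty from read qualities" concrete, the sketch below converts a Phred quality score into a per-base probability vector. It uses the standard Phred definition (Q = -10 log10 of the error probability); spreading the error mass evenly over the three other bases is a simplifying assumption made here for illustration and is not claimed to be how GNUMAP-bs models errors. The function names are hypothetical.

```python
BASES = "ACGT"


def phred_to_error_prob(q: int) -> float:
    """Standard Phred definition: error probability = 10 ** (-Q / 10)."""
    return 10 ** (-q / 10)


def base_probabilities(called: str, q: int) -> dict[str, float]:
    """Probability of each true base given the called base and its quality.

    Assumption (illustrative only): the error probability is split
    uniformly across the three bases that were not called.
    """
    if called not in BASES:
        raise ValueError(f"unexpected base call: {called!r}")
    p_err = phred_to_error_prob(q)
    return {b: (1 - p_err) if b == called else p_err / 3 for b in BASES}


# A call of 'A' at Q20 carries a 1% chance of being wrong.
print(base_probabilities("A", 20))
```

A row of such vectors, one per position, gives a probabilistic representation of a read that an aligner can score against the reference instead of treating each call as certain.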
Affiliation(s)
- Changjin Hong
  - Division of Computational Biomedicine, Boston University School of Medicine, Boston, MA, USA
- Nathan L Clement
  - Department of Computer Science, University of Texas, Austin, TX, USA
- Spencer Clement
  - Department of Computer Science, Brigham Young University, Provo, UT, USA
- Saher Sue Hammoud
  - IVF and Andrology Laboratories, Departments of Surgery, Obstetrics and Gynecology, and Physiology, University of Utah School of Medicine, Salt Lake City, UT, USA
  - Department of Oncological Sciences, Huntsman Cancer Institute, Salt Lake City, UT, USA
- Douglas T Carrell
  - IVF and Andrology Laboratories, Departments of Surgery, Obstetrics and Gynecology, and Physiology, University of Utah School of Medicine, Salt Lake City, UT, USA
- Bradley R Cairns
  - Department of Oncological Sciences, Huntsman Cancer Institute, Salt Lake City, UT, USA
- Quinn Snell
  - Department of Computer Science, Brigham Young University, Provo, UT, USA
- Mark J Clement
  - Department of Computer Science, Brigham Young University, Provo, UT, USA
- William Evan Johnson
  - Division of Computational Biomedicine, Boston University School of Medicine, Boston, MA, USA
27
Multiplatform single-sample estimates of transcriptional activation. Proc Natl Acad Sci U S A 2013; 110:17778-83. [PMID: 24128763 DOI: 10.1073/pnas.1305823110] [Citation(s) in RCA: 70] [Impact Index Per Article: 5.8] [Indexed: 11/18/2022] Open
Abstract
Over the past two decades, many biotechnology platforms have been developed for high-throughput gene expression profiling. However, because each platform is subject to technology-specific biases and produces distinct raw-data distributions, researchers have experienced difficulty in integrating data across platforms. Data integration is crucial to data-generating consortiums, researchers transitioning to newer profiling technologies, and individuals seeking to aggregate data across experiments. We address this need with our Universal exPression Code (UPC) approach, which corrects for platform-specific background noise using models that account for the genomic base composition and length of target regions; this approach also uses a mixture model to estimate whether a gene is active in a particular profiling sample. The latter produces standardized UPC values on a zero-to-one scale, so that they can be interpreted consistently, irrespective of profiling technology, thus enabling downstream analysis pipelines to be developed in a platform-agnostic manner. The UPC method can be applied to one- and two-channel expression microarrays and to next-generation sequencing data (RNA sequencing). Furthermore, UPCs are derived using information from within a given sample only--no ancillary samples are required at processing time. Thus, UPCs are suitable for personalized-medicine workflows where samples must be processed individually rather than in batches. In a variety of analyses and comparisons, UPCs perform comparably to other methods designed specifically for microarrays or RNA sequencing in most settings. Software for calculating UPCs is freely available at www.bioconductor.org/packages/release/bioc/html/SCAN.UPC.html.
|
28
|
Francis OE, Bendall M, Manimaran S, Hong C, Clement NL, Castro-Nallar E, Snell Q, Schaalje GB, Clement MJ, Crandall KA, Johnson WE. Pathoscope: species identification and strain attribution with unassembled sequencing data. Genome Res 2013; 23:1721-9. [PMID: 23843222 PMCID: PMC3787268 DOI: 10.1101/gr.150151.112] [Citation(s) in RCA: 100] [Impact Index Per Article: 8.3] [Indexed: 11/24/2022]
Abstract
Emerging next-generation sequencing technologies have revolutionized the collection of genomic data for applications in bioforensics, biosurveillance, and for use in clinical settings. However, to make the most of these new data, new methodology needs to be developed that can accommodate large volumes of genetic data in a computationally efficient manner. We present a statistical framework to analyze raw next-generation sequence reads from purified or mixed environmental or targeted infected tissue samples for rapid species identification and strain attribution against a robust database of known biological agents. Our method, Pathoscope, capitalizes on a Bayesian statistical framework that accommodates information on sequence quality and mapping quality and provides posterior probabilities of matches to a known database of target genomes. Importantly, our approach also incorporates the possibility that multiple species can be present in the sample and considers cases when the sample species/strain is not in the reference database. Furthermore, our approach can accurately discriminate between very closely related strains of the same species with very little coverage of the genome and without the need for multiple alignment steps, extensive homology searches, or genome assembly, which are time-consuming and labor-intensive steps. We demonstrate the utility of our approach on genomic data from purified and in silico "environmental" samples from known bacterial agents impacting human health for accuracy assessment and comparison with other approaches.
Affiliation(s)
- Owen E Francis
- Department of Statistics, Brigham Young University, Provo, Utah 84602, USA
|
29
|
O'Rawe J, Jiang T, Sun G, Wu Y, Wang W, Hu J, Bodily P, Tian L, Hakonarson H, Johnson WE, Wei Z, Wang K, Lyon GJ. Low concordance of multiple variant-calling pipelines: practical implications for exome and genome sequencing. Genome Med 2013; 5:28. [PMID: 23537139 PMCID: PMC3706896 DOI: 10.1186/gm432] [Citation(s) in RCA: 307] [Impact Index Per Article: 25.6] [Received: 12/04/2012] [Revised: 03/23/2013] [Accepted: 03/27/2013] [Indexed: 12/18/2022] Open
Abstract
Background: To facilitate the clinical implementation of genomic medicine by next-generation sequencing, it will be critically important to obtain accurate and consistent variant calls on personal genomes. Multiple software tools for variant calling are available, but it is unclear how comparable these tools are or what their relative merits in real-world scenarios might be. Methods: We sequenced 15 exomes from four families using commercial kits (Illumina HiSeq 2000 platform and Agilent SureSelect version 2 capture kit), with approximately 120X mean coverage. We analyzed the raw data using near-default parameters with five different alignment and variant-calling pipelines (SOAP, BWA-GATK, BWA-SNVer, GNUMAP, and BWA-SAMtools). We additionally sequenced a single whole genome using the sequencing and analysis pipeline from Complete Genomics (CG), with 95% of the exome region being covered by 20 or more reads per base. Finally, we validated 919 single-nucleotide variations (SNVs) and 841 insertions and deletions (indels), including similar fractions of GATK-only, SOAP-only, and shared calls, on the MiSeq platform by amplicon sequencing with approximately 5000X mean coverage. Results: SNV concordance between five Illumina pipelines across all 15 exomes was 57.4%, while 0.5 to 5.1% of variants were called as unique to each pipeline. Indel concordance was only 26.8% between three indel-calling pipelines, even after left-normalizing and intervalizing genomic coordinates by 20 base pairs. There were 11% of CG variants falling within targeted regions in exome sequencing that were not called by any of the Illumina-based exome analysis pipelines. Based on targeted amplicon sequencing on the MiSeq platform, 97.1%, 60.2%, and 99.1% of the GATK-only, SOAP-only and shared SNVs could be validated, but only 54.0%, 44.6%, and 78.1% of the GATK-only, SOAP-only and shared indels could be validated. Additionally, our analysis of two families (one with four individuals and the other with seven) demonstrated additional accuracy gained in variant discovery by having access to genetic data from a multi-generational family. Conclusions: Our results suggest that more caution should be exercised in genomic medicine settings when analyzing individual genomes, including interpreting positive and negative findings with scrutiny, especially for indels. We advocate for renewed collection and sequencing of multi-generational families to increase the overall accuracy of whole genomes.
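As an illustration of the kind of concordance figure reported above, the following is a minimal Python sketch. It assumes concordance means the share of all distinct variants that every pipeline called; the abstract does not state the exact definition, so this is an assumption. The variant tuples and pipeline names are hypothetical toy data, not values from the study.

```python
from typing import Dict, Set, Tuple

# A variant is identified here by (chromosome, position, ref allele, alt allele).
# This representation is an assumption made for illustration.
Variant = Tuple[str, int, str, str]


def concordance(call_sets: Dict[str, Set[Variant]]) -> float:
    """Fraction of all distinct variants that every pipeline called."""
    sets = list(call_sets.values())
    if not sets:
        raise ValueError("need at least one call set")
    shared = set.intersection(*sets)
    union = set.union(*sets)
    return len(shared) / len(union) if union else 0.0


def unique_fraction(call_sets: Dict[str, Set[Variant]], name: str) -> float:
    """Fraction of all distinct variants called only by the named pipeline."""
    others: Set[Variant] = set()
    for other_name, calls in call_sets.items():
        if other_name != name:
            others |= calls
    union = others | call_sets[name]
    return len(call_sets[name] - others) / len(union) if union else 0.0


# Hypothetical toy call sets.
calls = {
    "pipeline_a": {("chr1", 100, "A", "G"), ("chr1", 200, "C", "T"), ("chr2", 50, "G", "A")},
    "pipeline_b": {("chr1", 100, "A", "G"), ("chr1", 200, "C", "T")},
    "pipeline_c": {("chr1", 100, "A", "G"), ("chr3", 10, "T", "C")},
}

print(concordance(calls))                    # 1 shared variant out of 4 distinct
print(unique_fraction(calls, "pipeline_a"))  # 1 pipeline-specific variant out of 4
```

Exact-match comparison like this is stricter than what the abstract describes for indels, where coordinates were left-normalized and compared within 20 base-pair intervals.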
Affiliation(s)
- Jason O'Rawe
- Stanley Institute for Cognitive Genomics, Cold Spring Harbor Laboratory, One Bungtown Rd, Cold Spring Harbor, 11724, USA ; Stony Brook University, 100 Nicolls Rd, Stony Brook, 11794, USA
- Tao Jiang
- BGI-Shenzhen, Shenzhen 518000, China
- Yiyang Wu
- Stanley Institute for Cognitive Genomics, Cold Spring Harbor Laboratory, One Bungtown Rd, Cold Spring Harbor, 11724, USA ; Stony Brook University, 100 Nicolls Rd, Stony Brook, 11794, USA
- Wei Wang
- New Jersey Institute of Technology, Martin Luther King Jr. Blvd, Newark, 07103, USA
- Paul Bodily
- Brigham Young University, N University Ave, Provo, 84606, USA
- Lifeng Tian
- Children's Hospital of Philadelphia, Civic Center Blvd, Philadelphia, 19104, USA
- Hakon Hakonarson
- Children's Hospital of Philadelphia, Civic Center Blvd, Philadelphia, 19104, USA
- W Evan Johnson
- Boston University School of Medicine, E Concord St, Boston, 02118, USA
- Zhi Wei
- New Jersey Institute of Technology, Martin Luther King Jr. Blvd, Newark, 07103, USA
- Kai Wang
- University of Southern California, 1501 San Pablo Street, Los Angeles, 90089, USA ; Utah Foundation for Biomedical Research, E 3300 S, Salt Lake City, 84106, USA
- Gholson J Lyon
- Stanley Institute for Cognitive Genomics, Cold Spring Harbor Laboratory, One Bungtown Rd, Cold Spring Harbor, 11724, USA ; Stony Brook University, 100 Nicolls Rd, Stony Brook, 11794, USA ; Utah Foundation for Biomedical Research, E 3300 S, Salt Lake City, 84106, USA
|
30
|
Hong H, Zhang W, Shen J, Su Z, Ning B, Han T, Perkins R, Shi L, Tong W. Critical role of bioinformatics in translating huge amounts of next-generation sequencing data into personalized medicine. SCIENCE CHINA-LIFE SCIENCES 2013; 56:110-8. [PMID: 23393026 DOI: 10.1007/s11427-013-4439-7] [Citation(s) in RCA: 25] [Impact Index Per Article: 2.1] [Received: 10/08/2012] [Accepted: 11/29/2012] [Indexed: 01/12/2023]
Abstract
Realizing personalized medicine requires integrating diverse data types with bioinformatics. The most vital data are genomic information for individuals that are from advanced next-generation sequencing (NGS) technologies at present. The technologies continue to advance in terms of both decreasing cost and sequencing speed with concomitant increase in the amount and complexity of the data. The prodigious data together with the requisite computational pipelines for data analysis and interpretation are stressors to IT infrastructure and the scientists conducting the work alike. Bioinformatics is increasingly becoming the rate-limiting step with numerous challenges to be overcome for translating NGS data for personalized medicine. We review some key bioinformatics tasks, issues, and challenges in contexts of IT requirements, data quality, analysis tools and pipelines, and validation of biomarkers.
Affiliation(s)
- Huixiao Hong
- Division of Bioinformatics and Biostatistics, National Center for Toxicological Research, US Food and Drug Administration, Jefferson, AR 72079, USA.
|
31
|
Pérez-Losada M, Cabezas P, Castro-Nallar E, Crandall KA. Pathogen typing in the genomics era: MLST and the future of molecular epidemiology. INFECTION GENETICS AND EVOLUTION 2013; 16:38-53. [PMID: 23357583 DOI: 10.1016/j.meegid.2013.01.009] [Citation(s) in RCA: 128] [Impact Index Per Article: 10.7] [Received: 09/25/2012] [Revised: 01/11/2013] [Accepted: 01/15/2013] [Indexed: 10/27/2022]
Abstract
Multi-locus sequence typing (MLST) is a high-resolution genetic typing approach to identify species and strains of pathogens impacting human health, agriculture (animals and plants), and biosafety. In this review, we outline the general concepts behind MLST, molecular approaches for obtaining MLST data, analytical approaches for MLST data, and the contributions MLST studies have made in a wide variety of areas. We then look at the future of MLST and its relative strengths and weaknesses with respect to whole genome sequence typing approaches, which are moving into the research arena at an ever-increasing pace. Throughout the paper, we provide exemplar references of these various aspects of MLST. The literature is simply too vast to make this review comprehensive; nevertheless, we have attempted to include enough references in a variety of key areas to introduce the reader to the broad applications and complications of MLST data.
Affiliation(s)
- Marcos Pérez-Losada
- CIBIO, Centro de Investigação em Biodiversidade e Recursos Genéticos, Universidade do Porto, Campus Agrário de Vairão, 4485-661 Vairão, Portugal.
|
32
|
Lindner R, Friedel CC. A comprehensive evaluation of alignment algorithms in the context of RNA-seq. PLoS One 2012; 7:e52403. [PMID: 23300661 PMCID: PMC3530550 DOI: 10.1371/journal.pone.0052403] [Citation(s) in RCA: 57] [Impact Index Per Article: 4.4] [Received: 08/30/2012] [Accepted: 11/16/2012] [Indexed: 11/25/2022] Open
Abstract
Transcriptome sequencing (RNA-Seq) overcomes limitations of previously used RNA quantification methods and provides one experimental framework for both high-throughput characterization and quantification of transcripts at the nucleotide level. The first step and a major challenge in the analysis of such experiments is the mapping of sequencing reads to a transcriptomic origin including the identification of splicing events. In recent years, a large number of such mapping algorithms have been developed, all of which have in common that they require algorithms for aligning a vast number of reads to genomic or transcriptomic sequences. Although the FM-index based aligner Bowtie has become a de facto standard within mapping pipelines, a much larger number of possible alignment algorithms have been developed also including other variants of FM-index based aligners. Accordingly, developers and users of RNA-seq mapping pipelines have the choice among a large number of available alignment algorithms. To provide guidance in the choice of alignment algorithms for these purposes, we evaluated the performance of 14 widely used alignment programs from three different algorithmic classes: algorithms using either hashing of the reference transcriptome, hashing of reads, or a compressed FM-index representation of the genome. Here, special emphasis was placed on both precision and recall and the performance for different read lengths and numbers of mismatches and indels in a read. Our results clearly showed the significant reduction in memory footprint and runtime provided by FM-index based aligners at a precision and recall comparable to the best hash table based aligners. Furthermore, the recently developed Bowtie 2 alignment algorithm shows a remarkable tolerance to both sequencing errors and indels, thus, essentially making hash-based aligners obsolete.
Affiliation(s)
- Robert Lindner
- Institute of Pharmacy and Molecular Biotechnology, Heidelberg University, Heidelberg, Germany
- Caroline C. Friedel
- Institute for Informatics, Ludwig-Maximilians-Universität München, Munich, Germany
|
33
|
Fonseca NA, Rung J, Brazma A, Marioni JC. Tools for mapping high-throughput sequencing data. Bioinformatics 2012; 28:3169-77. [DOI: 10.1093/bioinformatics/bts605] [Citation(s) in RCA: 207] [Impact Index Per Article: 15.9] [Indexed: 12/18/2022] Open
|
34
|
Crook MB, Lindsay DP, Biggs MB, Bentley JS, Price JC, Clement SC, Clement MJ, Long SR, Griffitts JS. Rhizobial plasmids that cause impaired symbiotic nitrogen fixation and enhanced host invasion. MOLECULAR PLANT-MICROBE INTERACTIONS : MPMI 2012; 25:1026-33. [PMID: 22746823 PMCID: PMC4406224 DOI: 10.1094/mpmi-02-12-0052-r] [Citation(s) in RCA: 44] [Impact Index Per Article: 3.4] [Indexed: 05/06/2023]
Abstract
The genetic rules that dictate legume-rhizobium compatibility have been investigated for decades, but the causes of incompatibility occurring at late stages of the nodulation process are not well understood. An evaluation of naturally diverse legume (genus Medicago) and rhizobium (genus Sinorhizobium) isolates has revealed numerous instances in which Sinorhizobium strains induce and occupy nodules that are only minimally beneficial to certain Medicago hosts. Using these ineffective strain-host pairs, we identified gain-of-compatibility (GOC) rhizobial variants. We show that GOC variants arise by loss of specific large accessory plasmids, which we call HR plasmids due to their effect on symbiotic host range. Transfer of HR plasmids to a symbiotically effective rhizobium strain can convert it to incompatibility, indicating that HR plasmids can act autonomously in diverse strain backgrounds. We provide evidence that HR plasmids may encode machinery for their horizontal transfer. On hosts in which HR plasmids impair N fixation, the plasmids also enhance competitiveness for nodule occupancy, showing that naturally occurring, transferrable accessory genes can convert beneficial rhizobia to a more exploitative lifestyle. This observation raises important questions about agricultural management, the ecological stability of mutualisms, and the genetic factors that distinguish beneficial symbionts from parasites.
Affiliation(s)
- Matthew B Crook
- Department of Microbiology and Molecular Biology, Brigham Young University, Provo, UT, USA
|
35
|
Warf MB, Shepherd BA, Johnson WE, Bass BL. Effects of ADARs on small RNA processing pathways in C. elegans. Genome Res 2012; 22:1488-98. [PMID: 22673872 PMCID: PMC3409262 DOI: 10.1101/gr.134841.111] [Citation(s) in RCA: 43] [Impact Index Per Article: 3.3] [Received: 11/14/2011] [Accepted: 04/02/2012] [Indexed: 11/24/2022]
Abstract
Adenosine deaminases that act on RNA (ADARs) are RNA editing enzymes that convert adenosine to inosine in double-stranded RNA (dsRNA). To evaluate effects of ADARs on small RNAs that derive from dsRNA precursors, we performed deep-sequencing, comparing small RNAs from wild-type and ADAR mutant Caenorhabditis elegans. While editing in small RNAs was rare, at least 40% of microRNAs had altered levels in at least one ADAR mutant strain, and miRNAs with significantly altered levels had mRNA targets with correspondingly affected levels. About 40% of siRNAs derived from endogenous genes (endo-siRNAs) also had altered levels in at least one mutant strain, including 63% of Dicer-dependent endo-siRNAs. The 26G class of endo-siRNAs was significantly affected by ADARs, and many altered 26G loci had intronic reads and histone modifications associated with transcriptional silencing. Our data indicate that ADARs, through both direct and indirect mechanisms, are important for maintaining wild-type levels of many small RNAs in C. elegans.
Affiliation(s)
- M. Bryan Warf
- Department of Biochemistry, University of Utah, Salt Lake City, Utah 84112, USA
- Brent A. Shepherd
- Department of Statistics, Brigham Young University, Provo, Utah 84602, USA
- W. Evan Johnson
- Department of Medicine, Boston University, Boston, Massachusetts 02118, USA
- Brenda L. Bass
- Department of Biochemistry, University of Utah, Salt Lake City, Utah 84112, USA
|
36
|
Iorizzo M, Senalik D, Szklarczyk M, Grzebelus D, Spooner D, Simon P. De novo assembly of the carrot mitochondrial genome using next generation sequencing of whole genomic DNA provides first evidence of DNA transfer into an angiosperm plastid genome. BMC PLANT BIOLOGY 2012; 12:61. [PMID: 22548759 PMCID: PMC3413510 DOI: 10.1186/1471-2229-12-61] [Citation(s) in RCA: 95] [Impact Index Per Article: 7.3] [Received: 12/23/2011] [Accepted: 05/01/2012] [Indexed: 05/02/2023]
Abstract
BACKGROUND: Sequence analysis of organelle genomes has revealed important aspects of plant cell evolution. The scope of this study was to develop an approach for de novo assembly of the carrot mitochondrial genome using next generation sequence data from total genomic DNA. RESULTS: Sequencing data from a carrot 454 whole genome library were used to develop a de novo assembly of the mitochondrial genome. Development of a new bioinformatic tool allowed visualizing contig connections and elucidation of the de novo assembly. Southern hybridization demonstrated recombination across two large repeats. Genome annotation allowed identification of 44 protein coding genes, three rRNA and 17 tRNA. Identification of the plastid genome sequence allowed organelle genome comparison. Mitochondrial intergenic sequence analysis allowed detection of a fragment of DNA specific to the carrot plastid genome. PCR amplification and sequence analysis across different Apiaceae species revealed consistent conservation of this fragment in the mitochondrial genomes and an insertion in Daucus plastid genomes, giving evidence of a mitochondrial to plastid transfer of DNA. Sequence similarity with a retrotransposon element suggests a possibility that a transposon-like event transferred this sequence into the plastid genome. CONCLUSIONS: This study confirmed that whole genome sequencing is a practical approach for de novo assembly of higher plant mitochondrial genomes. In addition, a new aspect of intercompartmental genome interaction was reported providing the first evidence for DNA transfer into an angiosperm plastid genome. The approach used here could be used more broadly to sequence and assemble mitochondrial genomes of diverse species. This information will allow us to better understand intercompartmental interactions and cell evolution.
Affiliation(s)
- Massimo Iorizzo
- Department of Horticulture, University of Wisconsin-Madison, 1575 Linden Drive, Madison, WI 53706, USA
- Douglas Senalik
- Department of Horticulture, University of Wisconsin-Madison, 1575 Linden Drive, Madison, WI 53706, USA
- USDA-Agricultural Research Service, Vegetable Crops Research Unit, University of Wisconsin, 1575 Linden Drive, Madison, WI 53706, USA
- Marek Szklarczyk
- Department of Genetics, Plant Breeding and Seed Science, University of Agriculture Krakow, Al. 29 Listopada 54, 31-425, Krakow, Poland
- Dariusz Grzebelus
- Department of Genetics, Plant Breeding and Seed Science, University of Agriculture Krakow, Al. 29 Listopada 54, 31-425, Krakow, Poland
- David Spooner
- Department of Horticulture, University of Wisconsin-Madison, 1575 Linden Drive, Madison, WI 53706, USA
- USDA-Agricultural Research Service, Vegetable Crops Research Unit, University of Wisconsin, 1575 Linden Drive, Madison, WI 53706, USA
- Philipp Simon
- Department of Horticulture, University of Wisconsin-Madison, 1575 Linden Drive, Madison, WI 53706, USA
- USDA-Agricultural Research Service, Vegetable Crops Research Unit, University of Wisconsin, 1575 Linden Drive, Madison, WI 53706, USA
|
37
|
Pinto AC, Melo-Barbosa HP, Miyoshi A, Silva A, Azevedo V. Application of RNA-seq to reveal the transcript profile in bacteria. GENETICS AND MOLECULAR RESEARCH 2012; 10:1707-18. [PMID: 21863565 DOI: 10.4238/vol10-3gmr1554] [Citation(s) in RCA: 53] [Impact Index Per Article: 4.1] [Indexed: 01/13/2023]
Abstract
The large number of microbial genomes deposited in databanks has opened the door for in-depth studies of organisms, including post-genomics investigations. Thanks to new generation sequencing technology, these studies have made advances that have led to extraordinary discoveries in bacterial transcriptomics. In this review, we describe bacterial RNA sequencing studies that use these new techniques. We also examined the advantages and biases of these new generation technologies; advances in bioinformatics make it possible to overcome the biases, providing interesting and surprising results.
Affiliation(s)
- A C Pinto
- Instituto de Ciências Biológicas, Universidade Federal de Minas Gerais, Belo Horizonte, MG, Brasil
|
38
|
Abstract
The access of transcription factors and the replication machinery to DNA is regulated by the epigenetic state of chromatin. In eukaryotes, this complex layer of regulatory processes includes the direct methylation of DNA, as well as covalent modifications to histones. Using next-generation sequencers, it is now possible to obtain profiles of epigenetic modifications across a genome using chromatin immunoprecipitation followed by sequencing (ChIP-seq). This technique permits the detection of the binding of proteins to specific regions of the genome with high resolution. It can be used to determine the target sequences of transcription factors, as well as the positions of histones with specific modification of their N-terminal tails. Antibodies that selectively bind methylated DNA may also be used to determine the position of methylated cytosines. Here, we present a data analysis pipeline for processing ChIP-seq data, and discuss the limitations and idiosyncrasies of these approaches.
Affiliation(s)
- Matteo Pellegrini
- Department of Molecular, Cell and Developmental Biology, University of California, Los Angeles, CA, USA.
|
39
|
Pelleymounter LL, Moon I, Johnson JA, Laederach A, Halvorsen M, Eckloff B, Abo R, Rossetti S. A novel application of pattern recognition for accurate SNP and indel discovery from high-throughput data: targeted resequencing of the glucocorticoid receptor co-chaperone FKBP5 in a Caucasian population. Mol Genet Metab 2011; 104:457-69. [PMID: 21917492 PMCID: PMC3224211 DOI: 10.1016/j.ymgme.2011.08.019] [Citation(s) in RCA: 12] [Impact Index Per Article: 0.9] [Received: 06/24/2011] [Revised: 08/18/2011] [Accepted: 08/18/2011] [Indexed: 11/28/2022]
Abstract
The detection of single nucleotide polymorphisms (SNPs) and insertion/deletions (indels) with precision from high-throughput data remains a significant bioinformatics challenge. Accurate detection is necessary before next-generation sequencing can routinely be used in the clinic. In research, scientific advances are inhibited by gaps in data, exemplified by the underrepresented discovery of rare variants, variants in non-coding regions and indels. The continued presence of false positives and false negatives prevents full automation and requires additional manual verification steps. Our methodology presents applications of both pattern recognition and sensitivity analysis to eliminate false positives and aid in the detection of SNP/indel loci and genotypes from high-throughput data. We chose FK506-binding protein 51 (FKBP5) (6p21.31) for our clinical target because of its role in modulating pharmacological responses to physiological and synthetic glucocorticoids and because of the complexity of the genomic region. We detected genetic variation across a 160 kb region encompassing FKBP5: 613 SNPs and 57 indels, including a 3.3 kb deletion, were discovered. We validated our method using three independent data sets and, with Sanger sequencing and Affymetrix and Illumina microarrays, achieved 99% concordance. Furthermore, we were able to detect 267 novel rare variants and assess linkage disequilibrium. Our results showed both a sensitivity and specificity of 98%, indicating near perfect classification between true and false variants. The process is scalable and amenable to automation, with the downstream filters taking only 1.5 h to analyze 96 individuals simultaneously. We provide examples of how our level of precision uncovered the interactions of multiple loci, their predicted influences on mRNA stability, perturbations of the hsp90 binding site, and individual variation in FKBP5 expression. Finally, we show how our discovery of rare variants may change current conceptions of evolution at this locus.
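The sensitivity and specificity figures quoted above follow the standard confusion-matrix definitions. The sketch below shows that arithmetic in Python. The counts are invented for illustration and are not taken from the study, and the function name is hypothetical.

```python
from typing import Tuple


def sensitivity_specificity(tp: int, fp: int, tn: int, fn: int) -> Tuple[float, float]:
    """Return (sensitivity, specificity) from confusion-matrix counts.

    sensitivity = TP / (TP + FN): share of true variants that were called.
    specificity = TN / (TN + FP): share of non-variant sites correctly rejected.
    """
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one true positive site and one true negative site")
    return tp / (tp + fn), tn / (tn + fp)


# Invented counts, chosen only so that the arithmetic is easy to follow.
sens, spec = sensitivity_specificity(tp=98, fp=2, tn=98, fn=2)
print(f"sensitivity={sens:.2f} specificity={spec:.2f}")
```

Both values need a validated truth set, which in the study came from Sanger sequencing and microarray genotypes.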
Affiliation(s)
- Linda L Pelleymounter
- Department of Pharmacology, Department of Pharmacology and Experimental Therapeutics, Mayo Clinic, Rochester, MN 55905, USA.
|
40
|
Bybee SM, Bracken-Grissom HD, Hermansen RA, Clement MJ, Crandall KA, Felder DL. Directed next generation sequencing for phylogenetics: An example using Decapoda (Crustacea). ZOOL ANZ 2011. [DOI: 10.1016/j.jcz.2011.05.010] [Citation(s) in RCA: 18] [Impact Index Per Article: 1.3] [Indexed: 11/25/2022]
|
41
|
Using VAAST to identify an X-linked disorder resulting in lethality in male infants due to N-terminal acetyltransferase deficiency. Am J Hum Genet 2011; 89:28-43. [PMID: 21700266 DOI: 10.1016/j.ajhg.2011.05.017] [Citation(s) in RCA: 202] [Impact Index Per Article: 14.4] [Received: 04/30/2011] [Revised: 05/18/2011] [Accepted: 05/19/2011] [Indexed: 01/26/2023] Open
Abstract
We have identified two families with a previously undescribed lethal X-linked disorder of infancy; the disorder comprises a distinct combination of an aged appearance, craniofacial anomalies, hypotonia, global developmental delays, cryptorchidism, and cardiac arrhythmias. Using X chromosome exon sequencing and a recently developed probabilistic algorithm aimed at discovering disease-causing variants, we identified in one family a c.109T>C (p.Ser37Pro) variant in NAA10, a gene encoding the catalytic subunit of the major human N-terminal acetyltransferase (NAT). A parallel effort on a second unrelated family converged on the same variant. The absence of this variant in controls, the amino acid conservation of this region of the protein, the predicted disruptive change, and the co-occurrence in two unrelated families with the same rare disorder suggest that this is the pathogenic mutation. We confirmed this by demonstrating a significantly impaired biochemical activity of the mutant hNaa10p, and from this we conclude that a reduction in acetylation by hNaa10p causes this disease. Here we provide evidence of a human genetic disorder resulting from direct impairment of N-terminal acetylation, one of the most common protein modifications in humans.
|
42
|
Lyon GJ, Jiang T, Van Wijk R, Wang W, Bodily PM, Xing J, Tian L, Robison RJ, Clement M, Lin Y, Zhang P, Liu Y, Moore B, Glessner JT, Elia J, Reimherr F, van Solinge WW, Yandell M, Hakonarson H, Wang J, Johnson WE, Wei Z, Wang K. Exome sequencing and unrelated findings in the context of complex disease research: ethical and clinical implications. DISCOVERY MEDICINE 2011; 12:41-55. [PMID: 21794208 PMCID: PMC3544941] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/31/2023]
Abstract
Exome sequencing has identified the causes of several Mendelian diseases, although it has rarely been used in a clinical setting to diagnose the genetic cause of an idiopathic disorder in a single patient. We performed exome sequencing on a pedigree with several members affected with attention deficit/hyperactivity disorder (ADHD), in an effort to identify candidate variants predisposing to this complex disease. While we did identify some rare variants that might predispose to ADHD, we have not yet proven the causality for any of them. However, over the course of the study, one subject was discovered to have idiopathic hemolytic anemia (IHA), which was suspected to be genetic in origin. Analysis of this subject's exome readily identified two rare non-synonymous mutations in PKLR gene as the most likely cause of the IHA, although these two mutations had not been documented before in a single individual. We further confirmed the deficiency by functional biochemical testing, consistent with a diagnosis of red blood cell pyruvate kinase deficiency. Our study implies that exome and genome sequencing will certainly reveal additional rare variation causative for even well-studied classical Mendelian diseases, while also revealing variants that might play a role in complex diseases. Furthermore, our study has clinical and ethical implications for exome and genome sequencing in a research setting; how to handle unrelated findings of clinical significance, in the context of originally planned complex disease research, remains a largely uncharted area for clinicians and researchers.
Affiliation(s)
- Gholson J Lyon
- Department of Psychiatry, University of Utah, Salt Lake City, USA.
|
43
|
Bao S, Jiang R, Kwan W, Wang B, Ma X, Song YQ. WITHDRAWN: Evaluation of next-generation sequencing software in mapping and assembly. J Hum Genet 2011:jhg201162. [PMID: 21677664 DOI: 10.1038/jhg.2011.62] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.1] [Indexed: 11/09/2022]
Abstract
Next-generation high-throughput DNA sequencing technologies have advanced progressively in sequence-based genomic research and novel biological applications with the promise of sequencing DNA at unprecedented speed. These new non-Sanger-based technologies feature several advantages, when compared with traditional sequencing methods in terms of higher sequencing speed, lower per run cost and higher accuracy. However, reads from next-generation sequencing (NGS) platforms, such as 454/Roche, ABI/SOLiD and Illumina/Solexa, are usually short, thereby restricting the applications of NGS platforms in genome assembly and annotation. We presented an overview of the challenges that these novel technologies meet and particularly illustrated various bioinformatics attempts on mapping and assembly for problem solving. We then compared the performance of several programs in these two fields and further provided advices on selecting suitable tools for specific biological applications.Journal of Human Genetics advance online publication, 16 June 2011; doi:10.1038/jhg.2011.62.
Affiliation(s)
- Suying Bao
- Department of Biochemistry, Center for Reproduction, Development and Growth, The University of Hong Kong, Hong Kong
|
44
|
Abstract
Next-generation high-throughput DNA sequencing technologies have advanced progressively in sequence-based genomic research and novel biological applications with the promise of sequencing DNA at unprecedented speed. These new non-Sanger-based technologies feature several advantages when compared with traditional sequencing methods in terms of higher sequencing speed, lower per-run cost and higher accuracy. However, reads from next-generation sequencing (NGS) platforms, such as 454/Roche, ABI/SOLiD and Illumina/Solexa, are usually short, thereby restricting the applications of NGS platforms in genome assembly and annotation. We presented an overview of the challenges that these novel technologies face and, in particular, illustrated various bioinformatics approaches to mapping and assembly. We then compared the performance of several programs in these two fields and provided advice on selecting suitable tools for specific biological applications.
|
45
|
Johnson WE, Welker NC, Bass BL. Dynamic linear model for the identification of miRNAs in next-generation sequencing data. Biometrics 2011; 67:1206-14. [PMID: 21385162 DOI: 10.1111/j.1541-0420.2010.01570.x] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.1] [Indexed: 12/16/2022]
Abstract
Next-generation sequencing technologies are poised to revolutionize the field of biomedical research. The increased resolution of these data promises to provide a greater understanding of the molecular processes that control the morphology and behavior of a cell. However, the increased amounts of data require innovative statistical procedures that are powerful while still being computationally feasible. In this article, we present a method for identifying small RNA molecules, called miRNAs, which regulate genes by targeting their mRNAs for degradation or translational repression. In the first step of our modeling procedure, we apply an innovative dynamic linear model that identifies candidate miRNA genes in high-throughput sequencing data. The model is flexible and can accurately identify interesting biological features while accounting for read count, read spacing, and sequencing depth. Additionally, miRNA candidates are processed using a modified Smith-Waterman sequence alignment that scores the regions for potential RNA hairpins, one of the defining features of miRNAs. We illustrate our method on simulated datasets as well as on a small RNA Caenorhabditis elegans dataset from the Illumina sequencing platform. These examples show that our method is highly sensitive for identifying known and novel miRNA genes.
|
46
|
Clement NL, Clement MJ, Snell Q, Johnson WE. Parallel Mapping Approaches for GNUMAP. PROCEEDINGS. IPDPS (CONFERENCE) 2011; 2011:435-443. [PMID: 23396612 DOI: 10.1109/ipdps.2011.184] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.6] [Indexed: 11/09/2022]
Abstract
Mapping short next-generation reads to reference genomes is an important element in SNP calling and expression studies. A major limitation to large-scale whole-genome mapping is the large memory requirement of the algorithm and the long run-time necessary for accurate studies. Several parallel implementations have been developed to distribute memory across different processors and to share the processing requirements equally. These approaches are compared with respect to their memory footprint, load balancing, and accuracy. When using MPI with multi-threading, linear speedup can be achieved for up to 256 processors.
Affiliation(s)
- Nathan L Clement
- Department of Computer Science, University of Texas, Austin, TX 78712
|
47
|
Abstract
Many methods and tools are available for preprocessing high-throughput RNA sequencing data and detecting differential expression.
Affiliation(s)
- Alicia Oshlack
- Bioinformatics Division, Walter and Eliza Hall Institute, 1G Royal Parade, Parkville 3052, Australia.
|
48
|
Li H, Homer N. A survey of sequence alignment algorithms for next-generation sequencing. Brief Bioinform 2010; 11:473-83. [PMID: 20460430 DOI: 10.1093/bib/bbq015] [Citation(s) in RCA: 418] [Impact Index Per Article: 27.9] [Indexed: 12/18/2022] Open
Abstract
Rapidly evolving sequencing technologies produce data on an unparalleled scale. A central challenge to the analysis of this data is sequence alignment, whereby sequence reads must be compared to a reference. A wide variety of alignment algorithms and software have been subsequently developed over the past two years. In this article, we will systematically review the current development of these algorithms and introduce their practical applications on different types of experimental data. We come to the conclusion that short-read alignment is no longer the bottleneck of data analyses. We also consider future development of alignment algorithms with respect to emerging long sequence reads and the prospect of cloud computing.
Affiliation(s)
- Heng Li
- Broad Institute, Cambridge, MA 02142, USA.
|