1
|
Abstract
Background Single nucleotide polymorphisms (SNPs) have been applied as important molecular markers in genetics and breeding studies. The rapid advance of next generation sequencing (NGS) provides a high-throughput means of SNP discovery. However, SNP development is limited by the availability of reliable SNP discovery methods. In particular, the optimum assembler and SNP caller for accurate SNP prediction from next generation sequencing data are not known. Results Herein we performed SNP prediction based on RNA-seq data of peach and mandarin peel tissue in a comprehensive comparison of two paired-end read lengths (125 bp and 150 bp), five assemblers (Trinity, IDBA, oases, SOAPdenovo, Trans-abyss) and two SNP callers (GATK and GBS). The predicted SNPs were compared with the authentic SNPs identified via PCR amplification followed by gene cloning and sequencing procedures. A total of 40 and 240 authentic SNPs were present in five anthocyanin biosynthesis related genes in peach and in nine carotenogenic genes in mandarin, respectively. Putative SNPs predicted from the same RNA-seq data with different strategies led to quite divergent results. The rate of false positive SNPs was significantly lower when the paired-end read length was 150 bp compared with 125 bp. Trinity was superior to the other four assemblers and GATK was substantially superior to GBS due to a low rate of missing authentic SNPs. The combination of assembler Trinity, SNP caller GATK, and the paired-end read length 150 bp had the best performance in SNP discovery with 100% accuracy both in peach and in mandarin cases. This strategy was applied to the characterization of SNPs in peach and mandarin transcriptomes.
Conclusions Through comparison of authentic SNPs obtained by a PCR cloning strategy and putative SNPs predicted from different combinations of five assemblers, two SNP callers, and two paired-end read lengths, we provided a reliable and efficient strategy, Trinity-GATK with 150 bp paired-end read length, for SNP discovery from RNA-seq data. This strategy discovered SNPs at 100% accuracy in peach and mandarin cases and might be applicable to a wide range of plants and other organisms. Electronic supplementary material The online version of this article (10.1186/s12864-019-5533-4) contains supplementary material, which is available to authorized users.
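The comparison described in this abstract (putative SNPs from a pipeline versus PCR-confirmed authentic SNPs) can be sketched as a set comparison. This is a minimal illustration and not the authors' pipeline: the `(gene, position, alt_base)` record, the name `compare_snps`, and the exact definitions of the two rates are assumptions made for the example.

```python
from typing import NamedTuple, Set, Tuple

# Hypothetical SNP record: (gene or contig name, 1-based position, alternate base).
Snp = Tuple[str, int, str]


class Comparison(NamedTuple):
    true_positives: int
    false_positives: int
    missed: int
    false_positive_rate: float  # assumed definition: share of predicted SNPs that are not authentic
    missing_rate: float         # assumed definition: share of authentic SNPs that were not predicted


def compare_snps(predicted: Set[Snp], authentic: Set[Snp]) -> Comparison:
    """Compare predicted SNPs against a validated (authentic) set."""
    tp = len(predicted & authentic)
    fp = len(predicted - authentic)
    fn = len(authentic - predicted)
    fp_rate = fp / len(predicted) if predicted else 0.0
    miss_rate = fn / len(authentic) if authentic else 0.0
    return Comparison(tp, fp, fn, fp_rate, miss_rate)


# Made-up records, for illustration only.
authentic = {("geneA", 10, "T"), ("geneA", 55, "G"), ("geneB", 7, "C")}
predicted = {("geneA", 10, "T"), ("geneB", 7, "C"), ("geneB", 99, "A")}
result = compare_snps(predicted, authentic)
```

Under these assumed definitions, a strategy reported as "100% accuracy" would correspond to zero false positives and zero missed authentic SNPs.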
|
2
|
Guo F, Wang D, Wang L. Progressive approach for SNP calling and haplotype assembly using single molecular sequencing data. Bioinformatics 2018; 34:2012-2018. [DOI: 10.1093/bioinformatics/bty059] [Citation(s) in RCA: 22] [Impact Index Per Article: 3.1] [Received: 11/15/2017] [Accepted: 02/17/2018] [Indexed: 12/30/2022]
Affiliation(s)
- Fei Guo
- School of Computer Science and Technology, Tianjin University, Tianjin Haihe Education Park, Tianjin, China
- Dan Wang
- Department of Computer Science, City University of Hong Kong, Kowloon Tong, Hong Kong
- Lusheng Wang
- Department of Computer Science, City University of Hong Kong, Kowloon Tong, Hong Kong
- University of Hong Kong Shenzhen Research Institute, Shenzhen Hi-Tech Industrial Park, Shenzhen, Guangdong, China
|
3
|
SNP Discovery Using a Pangenome: Has the Single Reference Approach Become Obsolete? BIOLOGY 2017; 6:biology6010021. [PMID: 28287462 PMCID: PMC5372014 DOI: 10.3390/biology6010021] [Citation(s) in RCA: 57] [Impact Index Per Article: 7.1] [Received: 02/13/2017] [Revised: 03/07/2017] [Accepted: 03/08/2017] [Indexed: 12/22/2022]
Abstract
Increasing evidence suggests that a single individual is insufficient to capture the genetic diversity within a species due to gene presence/absence variation. In order to understand the extent to which genomic variation occurs in a species, the construction of its pangenome is necessary. The pangenome represents the complete set of genes of a species; it is composed of core genes, which are present in all individuals, and variable genes, which are present only in some individuals. Aside from variations at the gene level, single nucleotide polymorphisms (SNPs) are also an important form of genetic variation. The advent of next-generation sequencing (NGS), coupled with the heritability of SNPs, makes them ideal markers for genetic analysis of human, animal, and microbial data. SNPs have also been extensively used in crop genetics for association mapping, quantitative trait loci (QTL) analysis, analysis of genetic diversity, and phylogenetic analysis. This review focuses on the use of pangenomes for SNP discovery. It highlights the advantages of using a pangenome rather than a single reference for this purpose. This review also demonstrates how extra information not captured in a single reference alone can be used to provide additional support for linking genotypic data to phenotypic data.
|
4
|
Single-cell SNP analyses and interpretations based on RNA-Seq data for colon cancer research. Sci Rep 2016; 6:34420. [PMID: 27677461 PMCID: PMC5039670 DOI: 10.1038/srep34420] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.3] [Received: 06/18/2015] [Accepted: 09/13/2016] [Indexed: 01/26/2023]
Abstract
Single-cell sequencing is useful for illustrating the cellular heterogeneities inherent in many intricate biological systems, particularly in human cancer. However, owing to the difficulties in acquiring, amplifying and analyzing single-cell genetic material, obstacles remain for single-cell diversity assessments such as single nucleotide polymorphism (SNP) analyses, rendering biological interpretations of single-cell omics data elusive. We used RNA-Seq data from single-cell and bulk colon cancer samples to analyze the SNP profiles for both structural and functional comparisons. Colon cancer-related pathways with single-cell level SNP enrichment, including the TGF-β and p53 signaling pathways, were also investigated based on both their SNP enrichment patterns and gene expression. We also detected a certain number of fusion transcripts, which may promote tumorigenesis, at the single-cell level. Based on these results, single-cell analyses not only recapitulated the SNP analysis results from the bulk samples but also detected cell-to-cell and cell-to-bulk variations, thereby aiding in early diagnosis and in identifying the precise mechanisms underlying cancers at the single-cell level.
|
5
|
Liu Y, Loewer M, Aluru S, Schmidt B. SNVSniffer: an integrated caller for germline and somatic single-nucleotide and indel mutations. BMC SYSTEMS BIOLOGY 2016; 10 Suppl 2:47. [PMID: 27489955 PMCID: PMC4977481 DOI: 10.1186/s12918-016-0300-5] [Citation(s) in RCA: 25] [Impact Index Per Article: 2.8] [Indexed: 12/30/2022]
Abstract
BACKGROUND Various approaches to calling single-nucleotide variants (SNVs) or insertion-or-deletion (indel) mutations have been developed based on next-generation sequencing (NGS). However, most of them are dedicated to a particular type of mutation, e.g. germline SNVs in normal cells, somatic SNVs in cancer/tumor cells, or indels only. In the literature, efficient and integrated callers for both germline and somatic SNVs/indels have not yet been extensively investigated. RESULTS We present SNVSniffer, an efficient and integrated caller identifying both germline and somatic SNVs/indels from NGS data. In this algorithm, we propose the use of Bayesian probabilistic models to identify SNVs and investigate a multiple ungapped alignment approach to call indels. For germline variant calling, we model allele counts per site to follow a multinomial conditional distribution. For somatic variant calling, we rely on tumor-normal pairs from identical individuals and introduce a hybrid subtraction and joint sample analysis approach by modeling tumor-normal allele counts per site to follow a joint multinomial conditional distribution. A comprehensive performance evaluation has been conducted using a diversity of variant calling benchmarks. For germline variant calling, SNVSniffer demonstrates highly competitive accuracy with superior speed in comparison with the state-of-the-art FaSD, GATK and SAMtools. For somatic variant calling, our algorithm achieves comparable or even better accuracy, at fast speed, than the leading VarScan2, SomaticSniper, JointSNVMix2 and MuTect. CONCLUSIONS SNVSniffer demonstrates the feasibility of developing integrated solutions for fast and efficient identification of germline and somatic variants. Nonetheless, accurate discovery of genetic variations is critical yet challenging, and still requires substantially more research effort. SNVSniffer and synthetic samples are publicly available at http://snvsniffer.sourceforge.net.
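The idea of modeling per-site allele counts with a multinomial distribution conditional on the genotype can be illustrated with a small sketch. This is not SNVSniffer's actual model: the uniform genotype prior, the single symmetric `error_rate`, and the function names are all assumptions chosen to keep the example short.

```python
import math
from itertools import combinations_with_replacement

BASES = "ACGT"


def base_probs(genotype, error_rate):
    """Probability of observing each base in a read, given a diploid genotype.

    Assumes each allele is sampled with probability 0.5 and that a sequencing
    error turns the true base into any of the other three with equal chance.
    """
    probs = {}
    for base in BASES:
        p = 0.0
        for allele in genotype:
            p += 0.5 * ((1.0 - error_rate) if base == allele else error_rate / 3.0)
        probs[base] = p
    return probs


def genotype_posteriors(counts, error_rate=0.01):
    """Posterior over the 10 diploid genotypes under a uniform prior.

    `counts` maps base -> number of reads showing that base at the site.
    The multinomial coefficient is identical for every genotype, so it is
    omitted from the log-likelihood. `error_rate` must be > 0.
    """
    log_like = {}
    for genotype in combinations_with_replacement(BASES, 2):
        probs = base_probs(genotype, error_rate)
        log_like["".join(genotype)] = sum(
            counts.get(b, 0) * math.log(probs[b]) for b in BASES
        )
    peak = max(log_like.values())  # subtract the maximum for numerical stability
    weights = {g: math.exp(v - peak) for g, v in log_like.items()}
    total = sum(weights.values())
    return {g: w / total for g, w in weights.items()}


het = genotype_posteriors({"A": 10, "G": 9})
hom = genotype_posteriors({"A": 20})
best_het = max(het, key=het.get)
best_hom = max(hom, key=hom.get)
```

A production caller would typically use per-base quality scores and an informative prior rather than the fixed error rate and uniform prior assumed here.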
Affiliation(s)
- Yongchao Liu
- School of Computational Science & Engineering, Georgia Institute of Technology, Atlanta, 30332, Georgia, USA.
- Martin Loewer
- Translational Oncology, Johannes Gutenberg University Medical Center gGmbH Mainz, Mainz, 55131, Germany
- Srinivas Aluru
- School of Computational Science & Engineering, Georgia Institute of Technology, Atlanta, 30332, Georgia, USA
- Bertil Schmidt
- Institute of Computer Science, Johannes Gutenberg University Mainz, Mainz, 55128, Germany
|
6
|
Huang G, Wang S, Wang X, You N. An empirical Bayes method for genotyping and SNP detection using multi-sample next-generation sequencing data. Bioinformatics 2016; 32:3240-3245. [DOI: 10.1093/bioinformatics/btw409] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Received: 05/04/2016] [Accepted: 06/20/2016] [Indexed: 12/30/2022]
|
7
|
Murillo GH, You N, Su X, Cui W, Reilly MP, Li M, Ning K, Cui X. MultiGeMS: detection of SNVs from multiple samples using model selection on high-throughput sequencing data. Bioinformatics 2016; 32:1486-92. [PMID: 26787661 DOI: 10.1093/bioinformatics/btv753] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Received: 07/16/2015] [Accepted: 12/21/2015] [Indexed: 11/15/2022]
Abstract
MOTIVATION Single nucleotide variant (SNV) detection procedures are being utilized as never before to analyze the recent abundance of high-throughput DNA sequencing data, both on single and multiple sample datasets. Building on previously published work with the single sample SNV caller genotype model selection (GeMS), a multiple sample version of GeMS (MultiGeMS) is introduced. Unlike other popular multiple sample SNV callers, the MultiGeMS statistical model accounts for enzymatic substitution sequencing errors. It also addresses the multiple testing problem endemic to multiple sample SNV calling and utilizes high performance computing (HPC) techniques. RESULTS A simulation study demonstrates that MultiGeMS ranks highest in precision among a selection of popular multiple sample SNV callers, while showing exceptional recall in calling common SNVs. Further, both simulation studies and real data analyses indicate that MultiGeMS is robust to low-quality data. We also demonstrate that accounting for enzymatic substitution sequencing errors not only improves SNV call precision at low mapping quality regions, but also improves recall at reference allele-dominated sites with high mapping quality. AVAILABILITY AND IMPLEMENTATION The MultiGeMS package can be downloaded from https://github.com/cui-lab/multigems CONTACT xinping.cui@ucr.edu SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Gabriel H Murillo
- Department of Statistics, University of California, Riverside, CA 92521, USA
- Na You
- Department of Statistical Science, School of Mathematics and Computational Science, Sun Yat-Sen University, Guangzhou, Guangdong 510275, China
- Xiaoquan Su
- Qingdao Institute of BioEnergy and Bioprocess Technology, Chinese Academy of Sciences, Qingdao, Shandong 266101, China
- Wei Cui
- Department of Statistics, University of California, Riverside, CA 92521, USA
- Mingyao Li
- Department of Biostatistics and Epidemiology, University of Pennsylvania Perelman School of Medicine, Philadelphia, PA 19104, USA
- Kang Ning
- Key Laboratory of Molecular Biophysics of the Ministry of Education, College of Life Science and Technology, Huazhong University of Science and Technology, Wuhan, Hubei 430074, China
- Xinping Cui
- Department of Statistics, University of California, Riverside, CA 92521, USA, Center for Plant Cell Biology, Institute for Integrative Genome Biology, University of California, Riverside, CA 92521, USA
|
8
|
Monovar: single-nucleotide variant detection in single cells. Nat Methods 2016; 13:505-7. [PMID: 27088313 PMCID: PMC4887298 DOI: 10.1038/nmeth.3835] [Citation(s) in RCA: 105] [Impact Index Per Article: 11.7] [Received: 10/13/2015] [Accepted: 03/18/2016] [Indexed: 12/31/2022]
Abstract
Current variant callers are not suitable for single-cell DNA sequencing, as they do not account for allelic dropout, false-positive errors and coverage nonuniformity. We developed Monovar (https://bitbucket.org/hamimzafar/monovar), a statistical method for detecting and genotyping single-nucleotide variants in single-cell data. Monovar exhibited superior performance over standard algorithms on benchmarks and in identifying driver mutations and delineating clonal substructure in three different human tumor data sets.
|
9
|
Ribeiro A, Golicz A, Hackett CA, Milne I, Stephen G, Marshall D, Flavell AJ, Bayer M. An investigation of causes of false positive single nucleotide polymorphisms using simulated reads from a small eukaryote genome. BMC Bioinformatics 2015; 16:382. [PMID: 26558718 PMCID: PMC4642669 DOI: 10.1186/s12859-015-0801-z] [Citation(s) in RCA: 31] [Impact Index Per Article: 3.1] [Received: 07/08/2015] [Accepted: 10/29/2015] [Indexed: 12/30/2022]
Abstract
Background Single Nucleotide Polymorphisms (SNPs) are widely used molecular markers, and their use has increased massively since the inception of Next Generation Sequencing (NGS) technologies, which allow detection of large numbers of SNPs at low cost. However, both NGS data and their analysis are error-prone, which can lead to the generation of false positive (FP) SNPs. We explored the relationship between FP SNPs and seven factors involved in mapping-based variant calling — quality of the reference sequence, read length, choice of mapper and variant caller, mapping stringency and filtering of SNPs by read mapping quality and read depth. This resulted in 576 possible factor level combinations. We used error- and variant-free simulated reads to ensure that every SNP found was indeed a false positive. Results The variation in the number of FP SNPs generated ranged from 0 to 36,621 for the 120 million base pairs (Mbp) genome. All of the experimental factors tested had statistically significant effects on the number of FP SNPs generated and there was a considerable amount of interaction between the different factors. Using a fragmented reference sequence led to a dramatic increase in the number of FP SNPs generated, as did relaxed read mapping and a lack of SNP filtering. The choice of reference assembler, mapper and variant caller also significantly affected the outcome. The effect of read length was more complex and suggests a possible interaction between mapping specificity and the potential for contributing more false positives as read length increases. Conclusions The choice of tools and parameters involved in variant calling can have a dramatic effect on the number of FP SNPs produced, with particularly poor combinations of software and/or parameter settings yielding tens of thousands in this experiment. 
Between-factor interactions make simple recommendations difficult for a SNP discovery pipeline but the quality of the reference sequence is clearly of paramount importance. Our findings are also a stark reminder that it can be unwise to use the relaxed mismatch settings provided as defaults by some read mappers when reads are being mapped to a relatively unfinished reference sequence from e.g. a non-model organism in its early stages of genomic exploration. Electronic supplementary material The online version of this article (doi:10.1186/s12859-015-0801-z) contains supplementary material, which is available to authorized users.
Affiliation(s)
- Antonio Ribeiro
- The James Hutton Institute, Invergowrie, Dundee, DD2 5DA, Scotland, UK; Division of Plant Sciences, University of Dundee at JHI, Invergowrie, Dundee, DD2 5DA, Scotland, UK.
- Agnieszka Golicz
- School of Agriculture and Food Sciences, University of Queensland, Brisbane, Queensland, 4072, Australia; Australian Centre for Plant Functional Genomics and School of Agriculture and Food Sciences, University of Queensland, Brisbane, Queensland, 4072, Australia.
- Iain Milne
- The James Hutton Institute, Invergowrie, Dundee, DD2 5DA, Scotland, UK.
- Gordon Stephen
- The James Hutton Institute, Invergowrie, Dundee, DD2 5DA, Scotland, UK.
- David Marshall
- The James Hutton Institute, Invergowrie, Dundee, DD2 5DA, Scotland, UK.
- Andrew J Flavell
- Division of Plant Sciences, University of Dundee at JHI, Invergowrie, Dundee, DD2 5DA, Scotland, UK.
- Micha Bayer
- The James Hutton Institute, Invergowrie, Dundee, DD2 5DA, Scotland, UK.
|
10
|
A hidden Markov approach for ascertaining cSNP genotypes from RNA sequence data in the presence of allelic imbalance by exploiting linkage disequilibrium. BMC Bioinformatics 2015; 16:61. [PMID: 25887316 PMCID: PMC4351697 DOI: 10.1186/s12859-015-0479-2] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 02/03/2014] [Accepted: 01/27/2015] [Indexed: 12/30/2022]
Abstract
BACKGROUND Allelic specific expression (ASE) increases our understanding of the genetic control of gene expression and its links to phenotypic variation. ASE testing is implemented through binomial or beta-binomial tests of sequence read counts of alternative alleles at a cSNP of interest in heterozygous individuals. This requires prior ascertainment of the cSNP genotypes for all individuals. To meet this need, we propose hidden Markov methods to call SNPs from next generation RNA sequence data when ASE possibly exists. RESULTS We propose two hidden Markov models (HMMs), HMM-ASE and HMM-NASE, that consider or do not consider ASE, respectively, in order to improve genotyping accuracy. Both HMMs call the genotypes of several SNPs simultaneously, which exploits the dependence among SNPs, and allow for mapping error, which corrects the bias due to mapping error. In addition, HMM-ASE exploits ASE information to further improve genotype accuracy when ASE is likely to be present. Simulation results indicate that the proposed HMMs demonstrate very good prediction accuracy in terms of controlling both the false discovery rate (FDR) and the false negative rate (FNR). When ASE is present, HMM-ASE had a lower FNR than HMM-NASE, while both can control the FDR at a similar level. By exploiting linkage disequilibrium (LD), a real data application demonstrates that the proposed methods have better sensitivity than, and similar FDR to, the VarScan method in calling heterozygous SNPs. Sensitivity and FDR are similar to those of the BCFtools and Beagle methods. The resulting genotypes show good properties for the estimation of the genetic parameters and ASE ratios. CONCLUSIONS We introduce HMMs, which are able to exploit LD and account for ASE and mapping errors, to simultaneously call SNPs from next generation RNA sequence data. The method introduced can reliably call cSNP genotypes even in the presence of ASE and under low sequencing coverage. As a byproduct, the proposed method is able to provide predictions of ASE ratios for the heterozygous genotypes, which can then be used for ASE testing.
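The binomial test of allele read counts mentioned in this abstract can be sketched with the standard library alone. This illustrates only the generic exact two-sided binomial test under the null of balanced expression; it is not the HMM-ASE or HMM-NASE method, and the function name and the default `p_ref=0.5` are assumptions.

```python
import math


def binomial_two_sided_p(ref_count, alt_count, p_ref=0.5):
    """Exact two-sided binomial p-value for allelic imbalance at one heterozygous cSNP.

    Sums the probabilities of all outcomes that are no more likely than the
    observed reference-allele count under the null proportion `p_ref`.
    Intended for moderate read depths (a few hundred reads at most).
    """
    n = ref_count + alt_count
    if n == 0:
        return 1.0  # no reads, no evidence of imbalance

    def pmf(k):
        return math.comb(n, k) * p_ref ** k * (1.0 - p_ref) ** (n - k)

    observed = pmf(ref_count)
    # Small relative tolerance so that ties are counted despite rounding.
    p_value = sum(pmf(k) for k in range(n + 1) if pmf(k) <= observed * (1.0 + 1e-9))
    return min(1.0, p_value)


balanced = binomial_two_sided_p(5, 5)
skewed = binomial_two_sided_p(10, 0)
```

A beta-binomial test, also named in the abstract, would additionally model overdispersion between samples; that extension is not shown here.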
|
11
|
Lindgreen S, Krogh A, Pedersen JS. SNPest: a probabilistic graphical model for estimating genotypes. BMC Res Notes 2014; 7:698. [PMID: 25294605 PMCID: PMC4203901 DOI: 10.1186/1756-0500-7-698] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.5] [Received: 01/29/2014] [Accepted: 10/02/2014] [Indexed: 12/30/2022]
Abstract
Background As the use of next-generation sequencing technologies is becoming more widespread, the need for robust software to help with the analysis is growing as well. A key challenge when analyzing sequencing data is the prediction of genotypes from the reads, i.e. correct inference of the underlying DNA sequences that gave rise to the sequenced fragments. For diploid organisms, the genotyper should be able to predict both alleles in the individual. Variations between the individual and the population can then be analyzed by looking for SNPs (single nucleotide polymorphisms) in order to investigate diseases or phenotypic features. To perform robust and high confidence genotyping and SNP calling, methods are needed that take the technology specific limitations into account and can model different sources of error. As an example, ancient DNA poses special challenges as the data is often shallow and subject to errors induced by post mortem damage. Findings We present a novel approach to the genotyping problem where a probabilistic framework describing the process from sampling to sequencing is implemented as a graphical model. This makes it possible to model technology specific errors and other sources of variation that can affect the result. The inferred genotype is given a posterior probability to signify the confidence in the result. SNPest has already been used to genotype large scale projects such as the first ancient human genome published in 2010. Conclusions We compare the performance of SNPest to a number of other widely used genotypers on both real and simulated data, covering both haploid and diploid genomes. We investigate the effects of read depth, of removing adapters before mapping and genotyping, of using different mapping tools, and of using the correct model in the genotyping process. We show that the performance of SNPest is comparable to existing methods, and we also illustrate cases where SNPest has an advantage over other methods, e.g. when dealing with simulated ancient DNA. Electronic supplementary material The online version of this article (doi:10.1186/1756-0500-7-698) contains supplementary material, which is available to authorized users.
Affiliation(s)
- Stinus Lindgreen
- Section for Computational and RNA Biology, Department of Biology, University of Copenhagen, Ole Maaloes Vej, 2200 Copenhagen, Denmark.
|
12
|
Manwar Hussain MR, Khan A, Ali Mohamoud HS. From genes to health - challenges and opportunities. Front Pediatr 2014; 2:12. [PMID: 24624370 PMCID: PMC3939617 DOI: 10.3389/fped.2014.00012] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Received: 06/04/2013] [Accepted: 02/10/2014] [Indexed: 11/13/2022]
Abstract
In genome science, advances in high-throughput sequencing technologies and bioinformatics analysis are facilitating a better understanding of Mendelian and complex trait inheritance. Charting the genetic basis of complex diseases, including pediatric cancer, and interpreting the huge amount of next-generation sequencing data are among the major technical challenges to be overcome in order to understand the molecular basis of various diseases and genetic disorders. In this review, we provide insights into some major challenges currently hindering a better understanding of Mendelian and complex trait inheritance, and thus impeding medical benefits to patients.
Affiliation(s)
- Muhammad Ramzan Manwar Hussain
- Princess Al-Jawhara Al-Brahim Center of Excellence in Research of Hereditary Diseases (PACER-HD), Department of Genetic Medicine, King Abdulaziz University, Jeddah, Saudi Arabia
- Asifullah Khan
- Department of Biochemistry, Abdul Wali Khan University, Mardan, Pakistan
- Hussein Sheikh Ali Mohamoud
- Princess Al-Jawhara Al-Brahim Center of Excellence in Research of Hereditary Diseases (PACER-HD), Department of Genetic Medicine, King Abdulaziz University, Jeddah, Saudi Arabia
|
13
|
Wang S, Xing J. A primer for disease gene prioritization using next-generation sequencing data. Genomics Inform 2013; 11:191-9. [PMID: 24465230 PMCID: PMC3897846 DOI: 10.5808/gi.2013.11.4.191] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.9] [Received: 10/18/2013] [Revised: 11/18/2013] [Accepted: 11/21/2013] [Indexed: 01/21/2023]
Abstract
High-throughput next-generation sequencing (NGS) technology produces a tremendous amount of raw sequence data. The challenges for researchers are to process the raw data, to map the sequences to the genome, to discover variants that are different from the reference genome, and to prioritize/rank the variants for the question of interest. The recent development of many computational algorithms and programs has vastly improved the ability to translate sequence data into valuable information for disease gene identification. However, NGS data analysis is complex and could be overwhelming for researchers who are not familiar with the process. Here, we outline the analysis pipeline and describe some of the most commonly used principles and tools for analyzing NGS data for disease gene identification.
Affiliation(s)
- Shuoguo Wang
- Department of Genetics, The State University of New Jersey, Piscataway, NJ 08854, USA; Human Genetics Institute of New Jersey, Rutgers, The State University of New Jersey, Piscataway, NJ 08854, USA
- Jinchuan Xing
- Department of Genetics, The State University of New Jersey, Piscataway, NJ 08854, USA; Human Genetics Institute of New Jersey, Rutgers, The State University of New Jersey, Piscataway, NJ 08854, USA
|
14
|
Barturen G, Rueda A, Oliver JL, Hackenberg M. MethylExtract: High-Quality methylation maps and SNV calling from whole genome bisulfite sequencing data. F1000Res 2013; 2:217. [PMID: 24627790 PMCID: PMC3938178 DOI: 10.12688/f1000research.2-217.v2] [Citation(s) in RCA: 23] [Impact Index Per Article: 1.9] [Accepted: 02/19/2014] [Indexed: 01/10/2023]
Abstract
Whole genome methylation profiling at a single cytosine resolution is now feasible due to the advent of high-throughput sequencing techniques together with bisulfite treatment of the DNA. To obtain the methylation value of each individual cytosine, the bisulfite-treated sequence reads are first aligned to a reference genome, and then the profiling of the methylation levels is done from the alignments. A huge effort has been made to quickly and correctly align the reads and many different algorithms and programs to do this have been created. However, the second step is just as crucial and non-trivial, but much less attention has been paid to the final inference of the methylation states. Important error sources do exist, such as sequencing errors, bisulfite failure, clonal reads, and single nucleotide variants. We developed MethylExtract, a user friendly tool to: i) generate high quality, whole genome methylation maps and ii) detect sequence variation within the same sample preparation. The program is implemented into a single script and takes into account all major error sources. MethylExtract detects variation (SNVs - Single Nucleotide Variants) in a similar way to VarScan, a very sensitive method extensively used in SNV and genotype calling based on non-bisulfite-treated reads. The usefulness of MethylExtract is shown by means of extensive benchmarking based on artificial bisulfite-treated reads and a comparison to a recently published method, called Bis-SNP. MethylExtract is able to detect SNVs within High-Throughput Sequencing experiments of bisulfite treated DNA at the same time as it generates high quality methylation maps. This simultaneous detection of DNA methylation and sequence variation is crucial for many downstream analyses, for example when deciphering the impact of SNVs on differential methylation. An exclusive feature of MethylExtract, in comparison with existing software, is the possibility to assess the bisulfite failure in a statistical way. The source code, tutorial and artificial bisulfite datasets are available at http://bioinfo2.ugr.es/MethylExtract/ and http://sourceforge.net/projects/methylextract/, and also permanently accessible from 10.5281/zenodo.7144.
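The second step described in this abstract, deriving a methylation value per cytosine from aligned bisulfite reads, is commonly computed as the share of reads that still show C (unconverted, hence methylated) among reads showing C or T. The sketch below shows only that ratio; it is not MethylExtract's code, it ignores the error sources the abstract lists, and the function name and the `min_coverage` parameter are assumptions.

```python
from typing import Optional


def methylation_level(c_reads: int, t_reads: int, min_coverage: int = 1) -> Optional[float]:
    """Fraction of reads supporting methylation at one cytosine position.

    c_reads: reads showing C (cytosine protected from bisulfite conversion).
    t_reads: reads showing T (unmethylated cytosine converted by bisulfite).
    Returns None when coverage is zero or below `min_coverage`.
    """
    coverage = c_reads + t_reads
    if coverage == 0 or coverage < min_coverage:
        return None
    return c_reads / coverage


level = methylation_level(8, 2)
uncovered = methylation_level(0, 0)
too_shallow = methylation_level(3, 1, min_coverage=10)
```

A T observed at a reference C can also come from a genuine C-to-T variant, which is one reason simultaneous SNV detection matters for this calculation.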
Affiliation(s)
- Guillermo Barturen
- Dpto. de Genética, Facultad de Ciencias, Universidad de Granada, Granada, 18071, Spain; Lab. de Bioinformática, Inst. de Biotecnología, Centro de Investigación Biomédica, Granada, 18016, Spain
- Antonio Rueda
- Dpto. de Genética, Facultad de Ciencias, Universidad de Granada, Granada, 18071, Spain; Lab. de Bioinformática, Inst. de Biotecnología, Centro de Investigación Biomédica, Granada, 18016, Spain
- José L Oliver
- Dpto. de Genética, Facultad de Ciencias, Universidad de Granada, Granada, 18071, Spain; Lab. de Bioinformática, Inst. de Biotecnología, Centro de Investigación Biomédica, Granada, 18016, Spain
- Michael Hackenberg
- Dpto. de Genética, Facultad de Ciencias, Universidad de Granada, Granada, 18071, Spain; Lab. de Bioinformática, Inst. de Biotecnología, Centro de Investigación Biomédica, Granada, 18016, Spain
|
15
|
Barturen G, Rueda A, Oliver JL, Hackenberg M. MethylExtract: High-Quality methylation maps and SNV calling from whole genome bisulfite sequencing data. F1000Res 2013; 2:217. [PMID: 24627790 DOI: 10.12688/f1000research.2-217.v1] [Citation(s) in RCA: 25] [Impact Index Per Article: 2.1] [Accepted: 10/09/2013] [Indexed: 01/30/2023]
Abstract
Whole genome methylation profiling at a single cytosine resolution is now feasible due to the advent of high-throughput sequencing techniques together with bisulfite treatment of the DNA. To obtain the methylation value of each individual cytosine, the bisulfite-treated sequence reads are first aligned to a reference genome, and then the profiling of the methylation levels is done from the alignments. A huge effort has been made to quickly and correctly align the reads and many different algorithms and programs to do this have been created. However, the second step is just as crucial and non-trivial, but much less attention has been paid to the final inference of the methylation states. Important error sources do exist, such as sequencing errors, bisulfite failure, clonal reads, and single nucleotide variants. We developed MethylExtract, a user friendly tool to: i) generate high quality, whole genome methylation maps and ii) detect sequence variation within the same sample preparation. The program is implemented into a single script and takes into account all major error sources. MethylExtract detects variation (SNVs - Single Nucleotide Variants) in a similar way to VarScan, a very sensitive method extensively used in SNV and genotype calling based on non-bisulfite-treated reads. The usefulness of MethylExtract is shown by means of extensive benchmarking based on artificial bisulfite-treated reads and a comparison to a recently published method, called Bis-SNP. MethylExtract is able to detect SNVs within High-Throughput Sequencing experiments of bisulfite treated DNA at the same time as it generates high quality methylation maps. This simultaneous detection of DNA methylation and sequence variation is crucial for many downstream analyses, for example when deciphering the impact of SNVs on differential methylation. An exclusive feature of MethylExtract, in comparison with existing software, is the possibility to assess the bisulfite failure in a statistical way. 
The source code, tutorial and artificial bisulfite datasets are available at http://bioinfo2.ugr.es/MethylExtract/ and http://sourceforge.net/projects/methylextract/, and are permanently accessible via DOI 10.5281/zenodo.7144.
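As a rough illustration of the inference step described in this abstract, the sketch below computes a per-cytosine methylation level from read counts and uses a binomial tail probability to ask whether the unconverted reads could be explained by bisulfite failure alone. It is not MethylExtract's implementation: the function names, the 1% failure rate, and the 0.05 significance threshold are assumptions chosen for the example.

```python
from math import comb


def methylation_level(c_count: int, t_count: int) -> float:
    """Fraction of informative reads that still show C (i.e. C / (C + T))."""
    depth = c_count + t_count
    if depth == 0:
        raise ValueError("no informative reads at this position")
    return c_count / depth


def binomial_tail(k: int, n: int, p: float) -> float:
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def is_methylated(c_count: int, t_count: int,
                  failure_rate: float = 0.01, alpha: float = 0.05) -> bool:
    """True if the C reads are unlikely to arise from bisulfite failure alone.

    failure_rate (assumed non-conversion rate) and alpha are hypothetical
    defaults, not values taken from the cited tool.
    """
    depth = c_count + t_count
    return binomial_tail(c_count, depth, failure_rate) < alpha
```

For example, a cytosine covered by 8 C reads and 2 T reads has a methylation level of 0.8 and is called methylated under these assumptions, whereas a single C read among 30 is compatible with a 1% failure rate and is not.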
Affiliation(s)
- Guillermo Barturen
- Dpto. de Genética, Facultad de Ciencias, Universidad de Granada, Granada, 18071, Spain ; Lab. de Bioinformática, Inst. de Biotecnología, Centro de Investigación Biomédica, Granada, 18016, Spain
- Antonio Rueda
- Dpto. de Genética, Facultad de Ciencias, Universidad de Granada, Granada, 18071, Spain ; Lab. de Bioinformática, Inst. de Biotecnología, Centro de Investigación Biomédica, Granada, 18016, Spain
- José L Oliver
- Dpto. de Genética, Facultad de Ciencias, Universidad de Granada, Granada, 18071, Spain ; Lab. de Bioinformática, Inst. de Biotecnología, Centro de Investigación Biomédica, Granada, 18016, Spain
- Michael Hackenberg
- Dpto. de Genética, Facultad de Ciencias, Universidad de Granada, Granada, 18071, Spain ; Lab. de Bioinformática, Inst. de Biotecnología, Centro de Investigación Biomédica, Granada, 18016, Spain
|
16
|
Kosugi S, Natsume S, Yoshida K, MacLean D, Cano L, Kamoun S, Terauchi R. Coval: improving alignment quality and variant calling accuracy for next-generation sequencing data. PLoS One 2013; 8:e75402. [PMID: 24116042 PMCID: PMC3792961 DOI: 10.1371/journal.pone.0075402] [Citation(s) in RCA: 46] [Impact Index Per Article: 3.8] [Received: 06/13/2013] [Accepted: 08/14/2013] [Indexed: 11/26/2022]
Abstract
Accurate identification of DNA polymorphisms using next-generation sequencing technology is challenging because of the high rate of sequencing errors and the incorrect mapping of reads to reference genomes. Currently available short-read aligners and DNA variant callers suffer from these problems. We developed the Coval software to improve the quality of short-read alignments. Coval is designed to minimize the incidence of spurious short-read alignments by filtering out mismatched reads that remain in the alignments after local realignment and error correction of mismatched reads. The error correction is based on the base quality and allele frequency at the non-reference positions for an individual or pooled sample. We demonstrated the utility of Coval by applying it to simulated genomes and experimentally obtained short-read data from rice, nematode, and mouse. Moreover, we found an unexpectedly large number of incorrectly mapped reads in ‘targeted’ alignments, where whole-genome sequencing reads had been aligned to a local genomic segment, and showed that Coval effectively eliminated such spurious alignments. We conclude that Coval significantly improves the quality of short-read sequence alignments, thereby increasing the calling accuracy of currently available tools for SNP and indel identification. Coval is available at http://sourceforge.net/projects/coval105/.
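The mismatch-based filtering idea in this abstract can be sketched in a few lines. This is only a simplified illustration, not Coval's algorithm: it assumes gap-free alignments, skips realignment and error correction entirely, and the `AlignedRead` type, the function names, and the `max_mismatches` threshold are hypothetical.

```python
from typing import List, NamedTuple


class AlignedRead(NamedTuple):
    name: str
    start: int      # 0-based reference position of the first aligned base
    sequence: str   # aligned bases, assumed gap-free for simplicity


def count_mismatches(read: AlignedRead, reference: str) -> int:
    """Count aligned positions where the read differs from the reference."""
    ref_segment = reference[read.start:read.start + len(read.sequence)]
    return sum(1 for ref_base, read_base in zip(ref_segment, read.sequence)
               if ref_base != read_base)


def filter_reads(reads: List[AlignedRead], reference: str,
                 max_mismatches: int = 2) -> List[AlignedRead]:
    """Keep only reads with at most max_mismatches mismatches (assumed cutoff)."""
    return [read for read in reads
            if count_mismatches(read, reference) <= max_mismatches]
```

A read that differs from the reference at many positions is more likely to be mismapped than to carry that many true variants, which is the intuition behind discarding it before variant calling.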
Affiliation(s)
- Shunichi Kosugi
- Iwate Biotechnology Research Center, Kitakami, Iwate, Japan
- Kazusa DNA Research Institute, Kisarazu, Chiba, Japan
- * E-mail: (SK); (RT)
- Daniel MacLean
- The Sainsbury Laboratory, Norwich Research Park, Norwich, United Kingdom
- Liliana Cano
- The Sainsbury Laboratory, Norwich Research Park, Norwich, United Kingdom
- Sophien Kamoun
- The Sainsbury Laboratory, Norwich Research Park, Norwich, United Kingdom
- Ryohei Terauchi
- Iwate Biotechnology Research Center, Kitakami, Iwate, Japan
- * E-mail: (SK); (RT)
|
17
|
Kojima K, Nariai N, Mimori T, Takahashi M, Yamaguchi-Kabata Y, Sato Y, Nagasaki M. A statistical variant calling approach from pedigree information and local haplotyping with phase informative reads. Bioinformatics 2013; 29:2835-43. [PMID: 24002111 DOI: 10.1093/bioinformatics/btt503] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.9] [Indexed: 11/13/2022]
Abstract
MOTIVATION Variant calling from genome-wide sequencing data is essential for the analysis of disease-causing mutations and the elucidation of disease mechanisms. However, variant calling in low-coverage regions is difficult because of sequence read errors and mapping errors. Hence, variant calling approaches that are robust to low-coverage data are needed. RESULTS We propose a new variant calling approach that considers pedigree information and haplotyping based on sequence reads spanning two or more heterozygous positions, termed phase informative reads. In our approach, genotyping and haplotyping are performed simultaneously by assigning each read to a haplotype based on phase informative reads. Therefore, positions with low evidence for heterozygosity are rescued by phase informative reads, and such rescued positions contribute to haplotyping in a synergistic way. In addition, pedigree information supports more accurate haplotyping as well as genotyping, especially in low-coverage regions. Although heterozygous positions are useful for haplotyping, homozygous positions are not informative and weaken the information from heterozygous positions, as the majority of positions are homozygous. Thus, we introduce latent variables that determine zygosity at each position to filter out homozygous positions for haplotyping. In a performance evaluation on parent-offspring trio sequencing data, our approach outperforms existing approaches in accuracy, measured as agreement with single nucleotide polymorphism array genotyping results. Also, a performance analysis considering the distance between variants showed that the use of phase informative reads is effective for accurate variant calling, and further performance improvement is expected with longer sequencing reads. CONTACT kojima@megabank.tohoku.ac.jp.
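The definition of a phase informative read given in this abstract (a read spanning two or more heterozygous positions) can be made concrete with a short sketch. It covers only the selection of such reads, not the statistical model of the paper; the function names and the representation of reads as `(name, start, end)` tuples with half-open intervals are assumptions for illustration.

```python
from bisect import bisect_left
from typing import List, Sequence, Tuple


def het_sites_spanned(read_start: int, read_end: int,
                      het_positions: Sequence[int]) -> List[int]:
    """Heterozygous positions inside the half-open interval [read_start, read_end).

    het_positions must be sorted in ascending order.
    """
    lo = bisect_left(het_positions, read_start)
    hi = bisect_left(het_positions, read_end)
    return list(het_positions[lo:hi])


def phase_informative_reads(reads: Sequence[Tuple[str, int, int]],
                            het_positions: Sequence[int]) -> List[str]:
    """Names of reads that cover at least two heterozygous positions."""
    return [name for name, start, end in reads
            if len(het_sites_spanned(start, end, het_positions)) >= 2]
```

Because a read must physically cover both sites, the number of phase informative reads grows with read length, which is consistent with the abstract's expectation of better performance on longer reads.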
Affiliation(s)
- Kaname Kojima
- Department of Integrative Genomics, Tohoku Medical Megabank Organization, Tohoku University, 2-1 Seiryo-machi, Aoba-ku, Sendai, Miyagi 980-8573, Japan
|