51
Qin M, Liu B, Conroy JM, Morrison CD, Hu Q, Cheng Y, Murakami M, Odunsi AO, Johnson CS, Wei L, Liu S, Wang J. SCNVSim: somatic copy number variation and structure variation simulator. BMC Bioinformatics 2015; 16:66. [PMID: 25886838 PMCID: PMC4349766 DOI: 10.1186/s12859-015-0502-7] [Citation(s) in RCA: 23] [Impact Index Per Article: 2.3] [Received: 12/01/2014] [Accepted: 02/20/2015] [Indexed: 12/31/2022] Open
Abstract
Background Somatically acquired structure variations (SVs) and copy number variations (CNVs) can induce genetic changes that are directly related to tumorigenesis. Somatic SV/CNV detection using next-generation sequencing (NGS) data still faces major challenges introduced by tumor sample characteristics, such as ploidy, heterogeneity, and purity. A simulated cancer genome with known SVs and CNVs can serve as a benchmark for evaluating the performance of existing somatic SV/CNV detection tools and developing new methods. Results SCNVSim is a tool for simulating somatic CNVs and structure variations (SVs). In addition to multiple types of SV and CNV events, the tool is capable of simulating important features related to tumor samples, including aneuploidy, heterogeneity and purity. Conclusions SCNVSim generates the genomes of a cancer cell population with detailed information on copy number status, loss of heterozygosity (LOH), and event break points, which is essential for developing and evaluating somatic CNV and SV detection methods in cancer genomics studies.
Affiliation(s)
- Maochun Qin
  - Department of Biostatistics and Bioinformatics, Roswell Park Cancer Institute, Buffalo, NY, 14263, USA.
- Biao Liu
  - Center for Personalized Medicine, Roswell Park Cancer Institute, Buffalo, NY, 14263, USA.
- Jeffrey M Conroy
  - Center for Personalized Medicine, Roswell Park Cancer Institute, Buffalo, NY, 14263, USA.
- Carl D Morrison
  - Center for Personalized Medicine, Roswell Park Cancer Institute, Buffalo, NY, 14263, USA.
- Qiang Hu
  - Department of Biostatistics and Bioinformatics, Roswell Park Cancer Institute, Buffalo, NY, 14263, USA.
- Yubo Cheng
  - Department of Biostatistics and Bioinformatics, Roswell Park Cancer Institute, Buffalo, NY, 14263, USA.
- Mitsuko Murakami
  - Center for Personalized Medicine, Roswell Park Cancer Institute, Buffalo, NY, 14263, USA.
- Adekunle O Odunsi
  - Department of Gynecologic Oncology, Roswell Park Cancer Institute, Buffalo, NY, 14263, USA.
- Candace S Johnson
  - Department of Pharmacology and Therapeutics, Roswell Park Cancer Institute, Buffalo, NY, 14263, USA.
- Lei Wei
  - Department of Biostatistics and Bioinformatics, Roswell Park Cancer Institute, Buffalo, NY, 14263, USA.
- Song Liu
  - Department of Biostatistics and Bioinformatics, Roswell Park Cancer Institute, Buffalo, NY, 14263, USA.
- Jianmin Wang
  - Department of Biostatistics and Bioinformatics, Roswell Park Cancer Institute, Buffalo, NY, 14263, USA.
52
Supek F, Lehner B. Differential DNA mismatch repair underlies mutation rate variation across the human genome. Nature 2015; 521:81-4. [PMID: 25707793 PMCID: PMC4425546 DOI: 10.1038/nature14173] [Citation(s) in RCA: 246] [Impact Index Per Article: 24.6] [Received: 09/30/2014] [Accepted: 12/19/2014] [Indexed: 12/26/2022]
Abstract
Cancer genome sequencing has revealed considerable variation in somatic mutation rates across the human genome, with mutation rates elevated in heterochromatic late replicating regions and reduced in early replicating euchromatin. Multiple mechanisms have been suggested to underlie this, but the actual cause is unknown. Here we identify variable DNA mismatch repair (MMR) as the basis of this variation. Analysing ∼17 million single-nucleotide variants from the genomes of 652 tumours, we show that regional autosomal mutation rates at megabase resolution are largely stable across cancer types, with differences related to changes in replication timing and gene expression. However, mutations arising after the inactivation of MMR are no longer enriched in late replicating heterochromatin relative to early replicating euchromatin. Thus, differential DNA repair and not differential mutation supply is the primary cause of the large-scale regional mutation rate variation across the human genome.
Affiliation(s)
- Fran Supek
  - [1] EMBL-CRG Systems Biology Unit, Centre for Genomic Regulation (CRG), 08003 Barcelona, Spain [2] Universitat Pompeu Fabra (UPF), 08003 Barcelona, Spain [3] Division of Electronics, Rudjer Boskovic Institute, 10000 Zagreb, Croatia
- Ben Lehner
  - [1] EMBL-CRG Systems Biology Unit, Centre for Genomic Regulation (CRG), 08003 Barcelona, Spain [2] Universitat Pompeu Fabra (UPF), 08003 Barcelona, Spain [3] Institució Catalana de Recerca i Estudis Avançats (ICREA), 08010 Barcelona, Spain
53
Tan IB, Malik S, Ramnarayanan K, McPherson JR, Ho DL, Suzuki Y, Ng SB, Yan S, Lim KH, Koh D, Hoe CM, Chan CY, Ten R, Goh BK, Chung AY, Tan J, Chan CX, Tay ST, Alexander L, Nagarajan N, Hillmer AM, Tang CL, Chua C, Teh BT, Rozen S, Tan P. High-depth sequencing of over 750 genes supports linear progression of primary tumors and metastases in most patients with liver-limited metastatic colorectal cancer. Genome Biol 2015; 16:32. [PMID: 25808843 PMCID: PMC4365969 DOI: 10.1186/s13059-015-0589-1] [Citation(s) in RCA: 39] [Impact Index Per Article: 3.9] [Received: 12/08/2014] [Accepted: 01/20/2015] [Indexed: 12/12/2022] Open
Abstract
Background Colorectal cancer with metastases limited to the liver (liver-limited mCRC) is a distinct clinical subset characterized by possible cure with surgery. We performed high-depth sequencing of over 750 cancer-associated genes and copy number profiling in matched primary, metastasis and normal tissues to characterize genomic progression in 18 patients with liver-limited mCRC. Results High depth Illumina sequencing and use of three different variant callers enable comprehensive and accurate identification of somatic variants down to 2.5% variant allele frequency. We identify a median of 11 somatic single nucleotide variants (SNVs) per tumor. Across patients, a median of 79.3% of somatic SNVs present in the primary are present in the metastasis and 81.7% of all alterations present in the metastasis are present in the primary. Private alterations are found at lower allele frequencies; a different mutational signature characterized shared and private variants, suggesting distinct mutational processes. Using B-allele frequencies of heterozygous germline SNPs and copy number profiling, we find that broad regions of allelic imbalance and focal copy number changes, respectively, are generally shared between the primary tumor and metastasis. Conclusions Our analyses point to high genomic concordance of primary tumor and metastasis, with a thick common trunk and smaller genomic branches in general support of the linear progression model in most patients with liver-limited mCRC. More extensive studies are warranted to further characterize genomic progression in this important clinical population. Electronic supplementary material The online version of this article (doi:10.1186/s13059-015-0589-1) contains supplementary material, which is available to authorized users.
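The concordance figures in this abstract (the share of primary-tumor SNVs also found in the metastasis, and the reverse) reduce to set arithmetic once calls are filtered at a minimum allele frequency. The following is a minimal sketch under stated assumptions: the variant IDs and allele frequencies are made up for illustration, the function names are hypothetical, and only the 2.5% floor comes from the abstract.

```python
MIN_VAF = 0.025  # detection floor quoted in the abstract (2.5% variant allele frequency)


def callable_variants(vafs, min_vaf=MIN_VAF):
    """Return the set of variant IDs whose allele frequency reaches the floor."""
    return {vid for vid, vaf in vafs.items() if vaf >= min_vaf}


def fraction_shared(source, other):
    """Fraction of variants in `source` that are also present in `other`."""
    if not source:
        return 0.0
    return len(source & other) / len(source)


# Toy data (illustrative only, not taken from the study).
primary_vafs = {"v1": 0.30, "v2": 0.12, "v3": 0.04, "v4": 0.01}
metastasis_vafs = {"v1": 0.25, "v2": 0.10, "v5": 0.03}

primary = callable_variants(primary_vafs)
metastasis = callable_variants(metastasis_vafs)

primary_in_met = fraction_shared(primary, metastasis)   # shared "trunk", seen from the primary
met_in_primary = fraction_shared(metastasis, primary)   # shared "trunk", seen from the metastasis
private_to_primary = primary - metastasis               # "branch" variants of the primary

print(f"primary variants also in metastasis: {primary_in_met:.1%}")
print(f"metastasis variants also in primary: {met_in_primary:.1%}")
print(f"private to primary: {sorted(private_to_primary)}")
```

In this toy case the variant below the floor (`v4`) is dropped before comparison, so it counts neither as shared nor as private.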
54
Abstract
BACKGROUND Tumor genomes are often highly heterogeneous, consisting of genomes from multiple subclonal types. Complete characterization of all subclonal types is a fundamental need in tumor genome analysis. With the advancement of next-generation sequencing, computational methods have recently been developed to infer tumor subclonal populations directly from cancer genome sequencing data. Most of these methods are based on sequence information from somatic point mutations. However, the accuracy of these algorithms depends crucially on the quality of the somatic mutations returned by variant calling algorithms, and usually requires deep coverage to achieve a reasonable level of accuracy. RESULTS We describe a novel probabilistic mixture model, MixClone, for inferring the cellular prevalences of subclonal populations directly from whole genome sequencing of paired normal-tumor samples. MixClone integrates sequence information of somatic copy number alterations and allele frequencies within a unified probabilistic framework. We demonstrate the utility of the method using both simulated and real cancer sequencing datasets, and show that it significantly outperforms existing methods for inferring tumor subclonal populations. The MixClone package is written in Python and is publicly available at https://github.com/uci-cbcl/MixClone. CONCLUSIONS The probabilistic mixture model proposed here provides a new framework for subclonal analysis based on cancer genome sequencing data. By applying the method to both simulated and real cancer sequencing data, we show that integrating sequence information from both somatic copy number alterations and allele frequencies can significantly improve the accuracy of inferring tumor subclonal populations.
55
The hidden genomic landscape of acute myeloid leukemia: subclonal structure revealed by undetected mutations. Blood 2014; 125:600-5. [PMID: 25499761 DOI: 10.1182/blood-2014-05-576157] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.3] [Indexed: 12/18/2022] Open
Abstract
Analyses carried out with 2 different bioinformatics pipelines (SomaticSniper and MuTect) on the same set of genomic data from 133 acute myeloid leukemia (AML) patients, sequenced within The Cancer Genome Atlas project, gave discrepant results. We subsequently tested these 2 variant-calling pipelines on 20 leukemia samples from our series (19 primary AMLs and 1 secondary AML). By validating many of the predicted somatic variants (variant allele frequencies ranging from 100% to 5%), we observed significantly different calling efficiencies. In particular, despite relatively high specificity, sensitivity was poor in both pipelines, resulting in a high rate of false negatives. Our findings raise the possibility that landscapes of AML genomes might be more complex than previously reported and characterized by the presence of hundreds of genes mutated at low variant allele frequency, suggesting that the application of genome sequencing to the clinic requires a careful and critical evaluation. We think that improvements in technology and workflow standardization, through the generation of clear experimental and bioinformatics guidelines, are fundamental to translate the use of next-generation sequencing from research to the clinic and to transform genomic information into better diagnosis and outcomes for the patient.
56
Abstract
BACKGROUND Next generation sequencing (NGS)-based assays continue to redefine the field of genetic testing. Owing to the complexity of the data, bioinformatics has become a necessary component in any laboratory implementing a clinical NGS test. CONTENT The computational components of an NGS-based work flow can be conceptualized as primary, secondary, and tertiary analytics. Each of these components addresses a necessary step in the transformation of raw data into clinically actionable knowledge. Understanding the basic concepts of these analysis steps is important in assessing and addressing the informatics needs of a molecular diagnostics laboratory. Equally critical is a familiarity with the regulatory requirements addressing the bioinformatics analyses. These and other topics are covered in this review article. SUMMARY Bioinformatics has become an important component in clinical laboratories generating, analyzing, maintaining, and interpreting data from molecular genetics testing. Given the rapid adoption of NGS-based clinical testing, service providers must develop informatics work flows that adhere to the rigor of clinical laboratory standards, yet are flexible to changes as the chemistry and software for analyzing sequencing data mature.
Affiliation(s)
- Gavin R Oliver
  - Department of Health Sciences Research, Mayo Clinic, Rochester, MN
- Steven N Hart
  - Department of Health Sciences Research, Mayo Clinic, Rochester, MN
- Eric W Klee
  - Department of Health Sciences Research, Mayo Clinic, Rochester, MN
57
Radenbaugh AJ, Ma S, Ewing A, Stuart JM, Collisson EA, Zhu J, Haussler D. RADIA: RNA and DNA integrated analysis for somatic mutation detection. PLoS One 2014; 9:e111516. [PMID: 25405470 PMCID: PMC4236012 DOI: 10.1371/journal.pone.0111516] [Citation(s) in RCA: 63] [Impact Index Per Article: 5.7] [Received: 06/05/2014] [Accepted: 09/30/2014] [Indexed: 01/30/2023] Open
Abstract
The detection of somatic single nucleotide variants is a crucial component to the characterization of the cancer genome. Mutation calling algorithms thus far have focused on comparing the normal and tumor genomes from the same individual. In recent years, it has become routine for projects like The Cancer Genome Atlas (TCGA) to also sequence the tumor RNA. Here we present RADIA (RNA and DNA Integrated Analysis), a novel computational method combining the patient-matched normal and tumor DNA with the tumor RNA to detect somatic mutations. The inclusion of the RNA increases the power to detect somatic mutations, especially at low DNA allelic frequencies. By integrating an individual's DNA and RNA, we are able to detect mutations that would otherwise be missed by traditional algorithms that examine only the DNA. We demonstrate high sensitivity (84%) and very high precision (98% and 99%) for RADIA in patient data from endometrial carcinoma and lung adenocarcinoma from TCGA. Mutations with both high DNA and RNA read support have the highest validation rate of over 99%. We also introduce a simulation package that spikes in artificial mutations to patient data, rather than simulating sequencing data from a reference genome. We evaluate sensitivity on the simulation data and demonstrate our ability to rescue back mutations at low DNA allelic frequencies by including the RNA. Finally, we highlight mutations in important cancer genes that were rescued due to the incorporation of the RNA.
Affiliation(s)
- Amie J. Radenbaugh
  - University of California Santa Cruz Genomics Institute, Department of Biomolecular Engineering, University of California Santa Cruz, Santa Cruz, California, United States of America
- Singer Ma
  - University of California Santa Cruz Genomics Institute, Department of Biomolecular Engineering, University of California Santa Cruz, Santa Cruz, California, United States of America
- Adam Ewing
  - University of California Santa Cruz Genomics Institute, Department of Biomolecular Engineering, University of California Santa Cruz, Santa Cruz, California, United States of America
- Joshua M. Stuart
  - University of California Santa Cruz Genomics Institute, Department of Biomolecular Engineering, University of California Santa Cruz, Santa Cruz, California, United States of America
- Eric A. Collisson
  - Division of Hematology/Oncology, University of California San Francisco, San Francisco, California, United States of America
- Jingchun Zhu
  - University of California Santa Cruz Genomics Institute, Department of Biomolecular Engineering, University of California Santa Cruz, Santa Cruz, California, United States of America
- David Haussler
  - University of California Santa Cruz Genomics Institute, Department of Biomolecular Engineering, University of California Santa Cruz, Santa Cruz, California, United States of America
  - Howard Hughes Medical Institute, Chevy Chase, Maryland, United States of America
58
Warden CD, Adamson AW, Neuhausen SL, Wu X. Detailed comparison of two popular variant calling packages for exome and targeted exon studies. PeerJ 2014; 2:e600. [PMID: 25289185 PMCID: PMC4184249 DOI: 10.7717/peerj.600] [Citation(s) in RCA: 27] [Impact Index Per Article: 2.5] [Received: 06/13/2014] [Accepted: 09/09/2014] [Indexed: 12/22/2022] Open
Abstract
The Genome Analysis Toolkit (GATK) is commonly used for variant calling of single nucleotide polymorphisms (SNPs) and small insertions and deletions (indels) from short-read sequencing data aligned against a reference genome. There have been a number of variant calling comparisons against GATK, but an equally comprehensive comparison for VarScan has not yet been performed. More specifically, we compare (1) the effects of different pre-processing steps prior to variant calling with both GATK and VarScan, (2) VarScan variants called with increasingly conservative parameters, and (3) filtered and unfiltered GATK variant calls (for both the UnifiedGenotyper and the HaplotypeCaller). Variant calling was performed on three datasets (1 targeted exon dataset and 2 exome datasets), each with approximately a dozen subjects. In most cases, pre-processing steps (e.g., indel realignment and quality score base recalibration using GATK) had only a modest impact on the variant calls, but the importance of the pre-processing steps varied between datasets and variant callers. Based upon concordance statistics presented in this study, we recommend GATK users focus on “high-quality” GATK variants by filtering out variants flagged as low-quality. We also found that running VarScan with a conservative set of parameters (referred to as “VarScan-Cons”) resulted in a reproducible list of variants, with high concordance (>97%) to high-quality variants called by the GATK UnifiedGenotyper and HaplotypeCaller. These conservative parameters result in decreased sensitivity, but the VarScan-Cons variant list could still recover 84–88% of the high-quality GATK SNPs in the exome datasets. This study also provides limited evidence that VarScan-Cons has a decreased false positive rate among novel variants (relative to high-quality GATK SNPs) and that the GATK HaplotypeCaller has an increased false positive rate for indels (relative to VarScan-Cons and high-quality GATK UnifiedGenotyper indels).
More broadly, we believe the metrics used for comparison in this study can be useful in assessing the quality of variant calls in the context of a specific experimental design. As an example, a limited number of variant calling comparisons are also performed on two additional variant callers.
Affiliation(s)
- Charles D Warden
  - Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI, USA
- Aaron W Adamson
  - Department of Population Sciences, City of Hope National Medical Center, Duarte, CA, USA
- Susan L Neuhausen
  - Department of Population Sciences, City of Hope National Medical Center, Duarte, CA, USA
- Xiwei Wu
  - Integrative Genomics Core, Department of Molecular and Cellular Biology, City of Hope National Medical Center, Duarte, CA, USA
59
Bao R, Huang L, Andrade J, Tan W, Kibbe WA, Jiang H, Feng G. Review of current methods, applications, and data management for the bioinformatics analysis of whole exome sequencing. Cancer Inform 2014; 13:67-82. [PMID: 25288881 PMCID: PMC4179624 DOI: 10.4137/cin.s13779] [Citation(s) in RCA: 85] [Impact Index Per Article: 7.7] [Received: 04/22/2014] [Revised: 07/06/2014] [Accepted: 07/07/2014] [Indexed: 12/21/2022] Open
Abstract
The advent of next-generation sequencing technologies has greatly promoted advances in the study of human diseases at the genomic, transcriptomic, and epigenetic levels. Exome sequencing, where the coding region of the genome is captured and sequenced at a deep level, has proven to be a cost-effective method to detect disease-causing variants and discover gene targets. In this review, we outline the general framework of whole exome sequence data analysis. We focus on established bioinformatics tools and applications that support five analytical steps: raw data quality assessment, pre-processing, alignment, post-processing, and variant analysis (detection, annotation, and prioritization). We evaluate the performance of open-source alignment programs and variant calling tools using simulated and benchmark datasets, and highlight the challenges posed by the lack of concordance among variant detection tools. Based on these results, we recommend adopting multiple tools and resources to reduce false positives and increase the sensitivity of variant calling. In addition, we briefly discuss the current status and solutions for big data management, analysis, and summarization in the field of bioinformatics.
Affiliation(s)
- Riyue Bao
  - Center for Research Informatics, The University of Chicago, Chicago, IL, USA
- Lei Huang
  - Center for Research Informatics, The University of Chicago, Chicago, IL, USA
- Jorge Andrade
  - Center for Research Informatics, The University of Chicago, Chicago, IL, USA
- Wei Tan
  - IBM Thomas J. Watson Research Center, Yorktown Heights, New York, USA
- Warren A Kibbe
  - Biomedical Informatics Center (NUBIC), Clinical and Translational Sciences Institute (NUCATS), Northwestern University, Chicago, IL, USA
- Hongmei Jiang
  - Department of Statistics, Northwestern University, Evanston, IL, USA
- Gang Feng
  - Biomedical Informatics Center (NUBIC), Clinical and Translational Sciences Institute (NUCATS), Northwestern University, Chicago, IL, USA
60
Boutros PC, Margolin AA, Stuart JM, Califano A, Stolovitzky G. Toward better benchmarking: challenge-based methods assessment in cancer genomics. Genome Biol 2014; 15:462. [PMID: 25314947 PMCID: PMC4318527 DOI: 10.1186/s13059-014-0462-7] [Citation(s) in RCA: 36] [Impact Index Per Article: 3.3] [Indexed: 01/08/2023] Open
Abstract
Rapid technological development has created an urgent need for improved evaluation of algorithms for the analysis of cancer genomics data. We outline how challenge-based assessment may help fill this gap by leveraging crowd-sourcing to distribute effort and reduce bias.
61
Li H. Toward better understanding of artifacts in variant calling from high-coverage samples. Bioinformatics 2014; 30:2843-51. [PMID: 24974202 DOI: 10.1093/bioinformatics/btu356] [Citation(s) in RCA: 576] [Impact Index Per Article: 52.4] [Indexed: 12/22/2022]
Abstract
MOTIVATION Whole-genome high-coverage sequencing has been widely used for personal and cancer genomics as well as in various research areas. However, in the absence of an unbiased whole-genome truth set, the global error rate of variant calls and the leading causal artifacts still remain unclear, even given the great efforts in the evaluation of variant calling methods. RESULTS We made 10 single nucleotide polymorphism and INDEL call sets with two read mappers and five variant callers, both on a haploid human genome and a diploid genome at a similar coverage. By investigating false heterozygous calls in the haploid genome, we identified the erroneous realignment in low-complexity regions and the incomplete reference genome with respect to the sample as the two major sources of errors, which press for continued improvements in these two areas. We estimated that the error rate of raw genotype calls is as high as 1 in 10-15 kb, but the error rate of post-filtered calls is reduced to 1 in 100-200 kb without significant compromise on the sensitivity. AVAILABILITY AND IMPLEMENTATION BWA-MEM alignment and raw variant calls are available at http://bit.ly/1g8XqRt; scripts and miscellaneous data at https://github.com/lh3/varcmp. CONTACT hengli@broadinstitute.org SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
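To give a feel for the per-kilobase error rates quoted in this abstract, the sketch below converts "1 error in N kb" into an expected count of erroneous calls per genome. The 3 Gb genome size is an assumed round number for illustration (the abstract does not state the size of the region evaluated), and the function name is hypothetical.

```python
GENOME_BP = 3_000_000_000  # assumption: roughly 3 Gb, a round figure for a human genome


def expected_errors(kb_per_error, genome_bp=GENOME_BP):
    """Expected number of erroneous calls if one error occurs every `kb_per_error` kb."""
    return genome_bp / (kb_per_error * 1_000)


# Ranges quoted in the abstract: raw calls 1 in 10-15 kb, filtered calls 1 in 100-200 kb.
raw_high = expected_errors(10)
raw_low = expected_errors(15)
filtered_high = expected_errors(100)
filtered_low = expected_errors(200)

print(f"raw calls:      {raw_low:,.0f} to {raw_high:,.0f} errors")
print(f"filtered calls: {filtered_low:,.0f} to {filtered_high:,.0f} errors")
print(f"reduction:      about {raw_high / filtered_high:.0f}-fold")
```

Under that assumed genome size, filtering moves the expected error count from the hundreds of thousands to the tens of thousands, which is the roughly tenfold reduction implied by the two quoted ranges.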
Affiliation(s)
- Heng Li
  - Medical Population Genetics Program, Broad Institute of Harvard and MIT, Cambridge, MA 02142, USA
62
Wilkerson MD, Cabanski CR, Sun W, Hoadley KA, Walter V, Mose LE, Troester MA, Hammerman PS, Parker JS, Perou CM, Hayes DN. Integrated RNA and DNA sequencing improves mutation detection in low purity tumors. Nucleic Acids Res 2014; 42:e107. [PMID: 24970867 PMCID: PMC4117748 DOI: 10.1093/nar/gku489] [Citation(s) in RCA: 66] [Impact Index Per Article: 6.0] [Indexed: 12/11/2022] Open
Abstract
Identifying somatic mutations is critical for cancer genome characterization and for prioritizing patient treatment. DNA whole exome sequencing (DNA-WES) is currently the most popular technology; however, this yields low sensitivity in low purity tumors. RNA sequencing (RNA-seq) covers the expressed exome with depth proportional to expression. We hypothesized that integrating DNA-WES and RNA-seq would enable superior mutation detection versus DNA-WES alone. We developed a first-of-its-kind method, called UNCeqR, that detects somatic mutations by integrating patient-matched RNA-seq and DNA-WES. In simulation, the integrated DNA and RNA model outperformed the DNA-WES only model. Validation by patient-matched whole genome sequencing demonstrated superior performance of the integrated model over DNA-WES only models, including a published method and published mutation profiles. Genome-wide mutational analysis of breast and lung cancer cohorts (n = 871) revealed remarkable tumor genomics properties. Low purity tumors experienced the largest gains in mutation detection by integrating RNA-seq and DNA-WES. RNA provided greater mutation signal than DNA in expressed mutations. Compared to earlier studies on this cohort, UNCeqR increased mutation rates of driver and therapeutically targeted genes (e.g. PIK3CA, ERBB2 and FGFR2). In summary, integrating RNA-seq with DNA-WES increases mutation detection performance, especially for low purity tumors.
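Why low purity hurts sensitivity, and why extra reads help, can be illustrated with a simple binomial read-count model. This is a sketch under stated assumptions, not UNCeqR's statistical model: it assumes a clonal heterozygous mutation in a diploid region (expected allele fraction of purity/2), a fixed minimum number of variant-supporting reads, and RNA reads that carry the variant at the same allele fraction as DNA. All names and numbers are hypothetical.

```python
from math import comb


def detection_power(depth, vaf, min_alt_reads):
    """P(at least `min_alt_reads` variant reads) when reads ~ Binomial(depth, vaf)."""
    p_below = sum(
        comb(depth, k) * vaf**k * (1 - vaf) ** (depth - k)
        for k in range(min_alt_reads)
    )
    return 1.0 - p_below


def expected_vaf(purity):
    """Clonal heterozygous mutation in a diploid region: half of the tumor fraction."""
    return purity / 2


purity = 0.2                   # hypothetical low-purity tumor
vaf = expected_vaf(purity)     # 0.1

# Hypothetical depths: 40 DNA reads alone, or 40 DNA reads plus 60 RNA reads at the site.
dna_only = detection_power(depth=40, vaf=vaf, min_alt_reads=3)
dna_plus_rna = detection_power(depth=40 + 60, vaf=vaf, min_alt_reads=3)

print(f"power with DNA only:    {dna_only:.3f}")
print(f"power with DNA and RNA: {dna_plus_rna:.3f}")
```

Pooling reads is the crudest possible way to combine the two data types; it is used here only to show that, at a fixed allele fraction, the chance of seeing enough variant reads rises with total depth. In practice RNA depth depends on expression, so the gain applies only to expressed mutations.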
Affiliation(s)
- Matthew D Wilkerson
  - Lineberger Comprehensive Cancer Center, University of North Carolina at Chapel Hill, Chapel Hill, NC 27599, USA
  - Department of Genetics, University of North Carolina at Chapel Hill, Chapel Hill, NC 27599, USA
- Christopher R Cabanski
  - Lineberger Comprehensive Cancer Center, University of North Carolina at Chapel Hill, Chapel Hill, NC 27599, USA
  - The Genome Institute at Washington University, St. Louis, MO 63108, USA
- Wei Sun
  - Department of Genetics, University of North Carolina at Chapel Hill, Chapel Hill, NC 27599, USA
  - Department of Biostatistics, University of North Carolina at Chapel Hill, Chapel Hill, NC 27599, USA
- Katherine A Hoadley
  - Lineberger Comprehensive Cancer Center, University of North Carolina at Chapel Hill, Chapel Hill, NC 27599, USA
  - Department of Genetics, University of North Carolina at Chapel Hill, Chapel Hill, NC 27599, USA
- Vonn Walter
  - Lineberger Comprehensive Cancer Center, University of North Carolina at Chapel Hill, Chapel Hill, NC 27599, USA
- Lisle E Mose
  - Lineberger Comprehensive Cancer Center, University of North Carolina at Chapel Hill, Chapel Hill, NC 27599, USA
- Melissa A Troester
  - Lineberger Comprehensive Cancer Center, University of North Carolina at Chapel Hill, Chapel Hill, NC 27599, USA
  - Department of Epidemiology, University of North Carolina at Chapel Hill, Chapel Hill, NC 27599, USA
- Peter S Hammerman
  - Department of Medical Oncology, Dana-Farber Cancer Institute, Boston, MA 02215, USA
  - Broad Institute of Harvard and MIT, Cambridge, MA 02142, USA
- Joel S Parker
  - Lineberger Comprehensive Cancer Center, University of North Carolina at Chapel Hill, Chapel Hill, NC 27599, USA
  - Department of Genetics, University of North Carolina at Chapel Hill, Chapel Hill, NC 27599, USA
- Charles M Perou
  - Lineberger Comprehensive Cancer Center, University of North Carolina at Chapel Hill, Chapel Hill, NC 27599, USA
  - Department of Genetics, University of North Carolina at Chapel Hill, Chapel Hill, NC 27599, USA
- D Neil Hayes
  - Lineberger Comprehensive Cancer Center, University of North Carolina at Chapel Hill, Chapel Hill, NC 27599, USA
  - Department of Internal Medicine, Division of Medical Oncology, Multidisciplinary Thoracic Oncology Program, University of North Carolina at Chapel Hill, Chapel Hill, NC 27599, USA
63
Zhu P, He L, Li Y, Huang W, Xi F, Lin L, Zhi Q, Zhang W, Tang YT, Geng C, Lu Z, Xu X. OTG-snpcaller: an optimized pipeline based on TMAP and GATK for SNP calling from ion torrent data. PLoS One 2014; 9:e97507. [PMID: 24824529 PMCID: PMC4019570 DOI: 10.1371/journal.pone.0097507] [Citation(s) in RCA: 32] [Impact Index Per Article: 2.9] [Received: 12/04/2013] [Accepted: 04/16/2014] [Indexed: 12/30/2022] Open
Abstract
Because the new Proton platform from Life Technologies produced markedly different data from those of the Illumina platform, the conventional Illumina data analysis pipeline could not be used directly. We developed an optimized SNP calling method using TMAP and GATK (OTG-snpcaller). This method combined our own optimized processes, Remove Duplicates According to AS Tag (RDAST) and Alignment Optimize Structure (AOS), together with TMAP and GATK, to call SNPs from Proton data. We sequenced four sets of exomes captured by Agilent SureSelect and NimbleGen SeqCap EZ Kit, using Life Technology’s Ion Proton sequencer. Then we applied OTG-snpcaller and compared our results with the results from Torrent Variants Caller. The results indicated that OTG-snpcaller can reduce both false positive and false negative rates. Moreover, we compared our results with Illumina results generated by GATK best practices, and we found that the results of these two platforms were comparable. The good performance in variant calling using GATK best practices can be primarily attributed to the high quality of the Illumina sequences.
Affiliation(s)
- Feng Xi
  - BGI-Shenzhen, Shenzhen, China
- Lin Lin
  - BGI-Shenzhen, Shenzhen, China
- Y. Tom Tang
  - Complete Genomics, Inc., Mountain View, California, United States of America
- Zhiyuan Lu
  - BGI-Shenzhen, Shenzhen, China
  - * E-mail: (ZL); (XX)
- Xun Xu
  - BGI-Shenzhen, Shenzhen, China
  - * E-mail: (ZL); (XX)
64
Cantarel BL, Weaver D, McNeill N, Zhang J, Mackey AJ, Reese J. BAYSIC: a Bayesian method for combining sets of genome variants with improved specificity and sensitivity. BMC Bioinformatics 2014; 15:104. [PMID: 24725768 PMCID: PMC3999887 DOI: 10.1186/1471-2105-15-104] [Citation(s) in RCA: 44] [Impact Index Per Article: 4.0] [Received: 10/10/2013] [Accepted: 03/31/2014] [Indexed: 12/30/2022] Open
Abstract
BACKGROUND Accurate genomic variant detection is an essential step in gleaning medically useful information from genome data. However, low concordance among variant-calling methods reduces confidence in the clinical validity of whole genome and exome sequence data, and confounds downstream analysis for applications in genome medicine. Here we describe BAYSIC (BAYeSian Integrated Caller), which combines SNP variant calls produced by different methods (e.g. GATK, FreeBayes, Atlas, SamTools, etc.) into a more accurate set of variant calls. BAYSIC differs from majority voting, consensus or other ad hoc intersection-based schemes for combining sets of genome variant calls. Unlike other classification methods, the underlying BAYSIC model does not require training using a "gold standard" of true positives. Rather, with each new dataset, BAYSIC performs an unsupervised, fully Bayesian latent class analysis to estimate false positive and false negative error rates for each input method. The user specifies a posterior probability threshold according to the user's tolerance for false positive and false negative errors; lowering the posterior probability threshold allows the user to trade specificity for sensitivity, while raising the threshold increases specificity in exchange for sensitivity. RESULTS We assessed the performance of BAYSIC in comparison to other variant detection methods using ten low coverage (~5X) samples from The 1000 Genomes Project, a tumor/normal exome pair (40X), and exome sequences (40X) from positive control samples previously identified to contain clinically relevant SNPs. We demonstrated BAYSIC's superior variant-calling accuracy, both for somatic mutation detection and germline variant detection. CONCLUSIONS BAYSIC provides a method for combining sets of SNP variant calls produced by different variant calling programs.
The integrated set of SNP variant calls produced by BAYSIC improves the sensitivity and specificity of the variant calls used as input. In addition to combining sets of germline variants, BAYSIC can also be used to combine sets of somatic mutations detected in the context of tumor/normal sequencing experiments.
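The threshold trade-off described in this abstract can be illustrated with a small sketch. It is not BAYSIC's implementation: BAYSIC estimates each caller's error rates by unsupervised latent class analysis, whereas the sketch below simply assumes those rates are already known and that callers err independently of one another. The function names, the prior, the site labels and all numeric rates are hypothetical.

```python
from math import prod


def posterior_true_variant(calls, sensitivity, false_positive_rate, prior):
    """Posterior probability that a site is a real variant.

    calls: tuple of 0/1 flags, one per caller (1 = caller reported the site).
    sensitivity[i]: assumed P(caller i reports | site is a real variant).
    false_positive_rate[i]: assumed P(caller i reports | site is not a variant).
    prior: assumed prior probability that a candidate site is real.
    """
    # Likelihood of the observed call pattern under each hypothesis,
    # assuming callers are conditionally independent.
    like_true = prod(s if c else 1 - s for c, s in zip(calls, sensitivity))
    like_false = prod(f if c else 1 - f for c, f in zip(calls, false_positive_rate))
    weighted_true = prior * like_true
    return weighted_true / (weighted_true + (1 - prior) * like_false)


def accepted_sites(call_patterns, threshold, **model):
    """Keep the sites whose posterior reaches the user-chosen threshold."""
    return [site for site, calls in call_patterns.items()
            if posterior_true_variant(calls, **model) >= threshold]


# Hypothetical error rates for three callers (illustrative values only).
MODEL = dict(sensitivity=(0.9, 0.8, 0.7),
             false_positive_rate=(0.05, 0.1, 0.2),
             prior=0.01)

# Hypothetical call patterns at three candidate sites.
SITES = {"site_a": (1, 1, 1), "site_b": (1, 1, 0), "site_c": (1, 0, 0)}

if __name__ == "__main__":
    for threshold in (0.9, 0.5, 0.1):
        print(threshold, accepted_sites(SITES, threshold, **MODEL))
```

Lowering the threshold can only add sites to the accepted set, which is the sense in which specificity is traded for sensitivity.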
|
65
|
Li Y, Xie X. Deconvolving tumor purity and ploidy by integrating copy number alterations and loss of heterozygosity. Bioinformatics 2014; 30:2121-9. [PMID: 24695406 DOI: 10.1093/bioinformatics/btu174] [Citation(s) in RCA: 29] [Impact Index Per Article: 2.6] [Indexed: 12/18/2022]
Abstract
MOTIVATION Next-generation sequencing (NGS) has revolutionized the study of cancer genomes. However, the reads obtained from NGS of tumor samples often consist of a mixture of normal and tumor cells, which themselves can be of multiple clonal types. A prominent problem in the analysis of cancer genome sequencing data is deconvolving the mixture to identify the reads associated with tumor cells or a particular subclone of tumor cells. Solving the problem is, however, challenging because of the so-called 'identifiability problem', where different combinations of tumor purity and ploidy often explain the sequencing data equally well. RESULTS We propose a new model to resolve the identifiability problem by integrating two types of sequencing information, somatic copy number alterations and loss of heterozygosity, within a unified probabilistic framework. We derive algorithms to solve our model, and implement them in a software package called PyLOH. We benchmark the performance of PyLOH using both simulated data and 12 breast cancer sequencing datasets and show that PyLOH outperforms existing methods in disambiguating the identifiability problem and estimating tumor purity. AVAILABILITY AND IMPLEMENTATION The PyLOH package is written in Python and is publicly available at https://github.com/uci-cbcl/PyLOH. CONTACT xhx@ics.uci.edu SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
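A toy calculation can make the identifiability problem concrete. The sketch below is not PyLOH's probabilistic model; it uses a simplified, commonly used mixture formula for a single genomic segment, and the function names, the two candidate explanations and the assumption that normal cells are diploid and heterozygous at the locus are all illustrative. It shows two explanations that predict the same read-depth ratio but different allele fractions, which is why allelic (LOH) information adds a constraint that depth alone lacks.

```python
NORMAL_COPIES = 2  # assumed copy number of the segment in normal cells


def expected_depth_ratio(purity, tumor_copies, tumor_ploidy):
    """Expected tumor-sample/normal-sample depth ratio for one segment."""
    sample_copies = purity * tumor_copies + (1 - purity) * NORMAL_COPIES
    sample_ploidy = purity * tumor_ploidy + (1 - purity) * NORMAL_COPIES
    return sample_copies / sample_ploidy


def expected_minor_allele_fraction(purity, tumor_major, tumor_minor):
    """Expected fraction of reads carrying the minor allele at a
    heterozygous germline site inside the segment."""
    # Normal cells are assumed to carry one copy of each allele.
    minor = purity * tumor_minor + (1 - purity) * 1
    total = purity * (tumor_major + tumor_minor) + (1 - purity) * NORMAL_COPIES
    return minor / total


# Two hypothetical explanations of the same segment (tumor ploidy 2 in both):
# A: half the cells are tumor and have lost both copies;
# B: every cell is tumor and has lost one copy.
candidate_a = dict(purity=0.5, major=0, minor=0)
candidate_b = dict(purity=1.0, major=1, minor=0)

if __name__ == "__main__":
    for name, c in (("A", candidate_a), ("B", candidate_b)):
        depth = expected_depth_ratio(c["purity"], c["major"] + c["minor"], 2)
        maf = expected_minor_allele_fraction(c["purity"], c["major"], c["minor"])
        print(name, "depth ratio:", depth, "minor allele fraction:", maf)
```

Both candidates halve the expected depth, but candidate A leaves the allele fraction balanced while candidate B removes the minor allele entirely.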
Affiliation(s)
- Yi Li
- Department of Computer Science, Institute for Genomics and Bioinformatics and Center for Machine Learning and Intelligent Systems, University of California, Irvine, CA 92697, USA
- Xiaohui Xie
- Department of Computer Science, Institute for Genomics and Bioinformatics and Center for Machine Learning and Intelligent Systems, University of California, Irvine, CA 92697, USA
|
66
|
Xu H, DiCarlo J, Satya RV, Peng Q, Wang Y. Comparison of somatic mutation calling methods in amplicon and whole exome sequence data. BMC Genomics 2014; 15:244. [PMID: 24678773 PMCID: PMC3986649 DOI: 10.1186/1471-2164-15-244] [Citation(s) in RCA: 113] [Impact Index Per Article: 10.3] [Received: 12/20/2013] [Accepted: 03/20/2014] [Indexed: 12/30/2022] Open
Abstract
BACKGROUND High-throughput sequencing is rapidly becoming common practice in clinical diagnosis and cancer research. Many algorithms have been developed for somatic single nucleotide variant (SNV) detection in matched tumor-normal DNA sequencing. Although numerous studies have compared the performance of various algorithms on exome data, there has not yet been a systematic evaluation using PCR-enriched amplicon data with a range of variant allele fractions. The recently developed gold standard variant set for the reference individual NA12878 by the NIST-led "Genome in a Bottle" Consortium (NIST-GIAB) provides a good resource to evaluate admixtures with various SNV fractions. RESULTS Using the NIST-GIAB gold standard, we compared the performance of five popular somatic SNV calling algorithms (GATK UnifiedGenotyper followed by simple subtraction, MuTect, Strelka, SomaticSniper and VarScan2) for matched tumor-normal amplicon and exome sequencing data. CONCLUSIONS We demonstrated that the five commonly used somatic SNV calling methods are applicable to both targeted amplicon and exome sequencing data. However, the sensitivities of these methods vary based on the allelic fraction of the mutation in the tumor sample. Our analysis can assist researchers in choosing a somatic SNV calling method suitable for their specific needs.
Affiliation(s)
- Yexun Wang
- Research and Foundation Department, QIAGEN Sciences, Inc., Frederick, MD, USA.
|
67
|
Supek F, Miñana B, Valcárcel J, Gabaldón T, Lehner B. Synonymous Mutations Frequently Act as Driver Mutations in Human Cancers. Cell 2014; 156:1324-1335. [DOI: 10.1016/j.cell.2014.01.051] [Citation(s) in RCA: 331] [Impact Index Per Article: 30.1] [Received: 05/01/2013] [Revised: 11/20/2013] [Accepted: 01/15/2014] [Indexed: 01/05/2023]
|
68
|
Abstract
Cancer is a complex disease driven by multiple mutations acquired over the lifetime of the cancer cells. These alterations, termed somatic mutations to distinguish them from inherited germline mutations, can include single-nucleotide substitutions, insertions, deletions, copy number alterations, and structural rearrangements. A patient's cancer can contain a combination of these aberrations, and the ability to generate a comprehensive genetic profile should greatly improve patient diagnosis and treatment. Next-generation sequencing has become the tool of choice to uncover multiple cancer mutations from a single tumor source, and the falling costs of this rapid high-throughput technology are encouraging its transition from basic research into a clinical setting. However, the detection of mutations in sequencing data is still an evolving area and cancer genomic data requires some special considerations. This chapter discusses these aspects and gives an overview of current bioinformatics methods for the detection of somatic mutations in cancer sequencing data.
|
69
|
Berglund EC, Lindqvist CM, Hayat S, Övernäs E, Henriksson N, Nordlund J, Wahlberg P, Forestier E, Lönnerholm G, Syvänen AC. Accurate detection of subclonal single nucleotide variants in whole genome amplified and pooled cancer samples using HaloPlex target enrichment. BMC Genomics 2013; 14:856. [PMID: 24314227 PMCID: PMC4046713 DOI: 10.1186/1471-2164-14-856] [Citation(s) in RCA: 19] [Impact Index Per Article: 1.6] [Received: 05/08/2013] [Accepted: 11/25/2013] [Indexed: 01/21/2023] Open
Abstract
BACKGROUND Target enrichment and resequencing is a widely used approach for identification of cancer genes and genetic variants associated with diseases. Although cost effective compared to whole genome sequencing, analysis of many samples constitutes a significant cost, which could be reduced by pooling samples before capture. Another limitation to the number of cancer samples that can be analyzed is often the amount of available tumor DNA. We evaluated the performance of whole genome amplified DNA and the power to detect subclonal somatic single nucleotide variants in non-indexed pools of cancer samples using the HaloPlex technology for target enrichment and next generation sequencing. RESULTS We captured a set of 1528 putative somatic single nucleotide variants and germline SNPs, which were identified by whole genome sequencing, with the HaloPlex technology and sequenced to a depth of 792-1752. We found that the allele fractions of the analyzed variants are well preserved during whole genome amplification and that capture specificity or variant calling is not affected. We detected a large majority of the known single nucleotide variants present uniquely in one sample with allele fractions as low as 0.1 in non-indexed pools of up to ten samples. We also identified and experimentally validated six novel variants in the samples included in the pools. CONCLUSION Our work demonstrates that whole genome amplified DNA can be used for target enrichment as effectively as genomic DNA and that accurate variant detection is possible in non-indexed pools of cancer samples. These findings show that analysis of a large number of samples is feasible at low cost, even when only small amounts of DNA are available, and thereby significantly increases the chances of identifying recurrent mutations in cancer samples.
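Why deep sequencing matters for pooled samples can be seen from a back-of-the-envelope binomial calculation. This is not the authors' analysis. The depths 792 and 1752 come from the abstract; the pooled allele fraction of 0.01 (a variant at 0.1 in one sample, diluted across ten equally represented samples) and the requirement of at least 5 variant reads are illustrative assumptions, and sequencing error is ignored.

```python
from math import comb


def detection_probability(depth, allele_fraction, min_variant_reads):
    """Probability of seeing at least min_variant_reads variant reads when
    the variant read count follows Binomial(depth, allele_fraction)."""
    # Sum the probabilities of observing too few variant reads, then take
    # the complement; this keeps the binomial coefficients small.
    p_too_few = sum(
        comb(depth, k) * allele_fraction**k * (1 - allele_fraction)**(depth - k)
        for k in range(min_variant_reads)
    )
    return 1 - p_too_few


if __name__ == "__main__":
    pooled_fraction = 0.01   # assumed: 0.1 in one of ten pooled samples
    min_reads = 5            # assumed calling requirement
    for depth in (792, 1752):
        print(depth, detection_probability(depth, pooled_fraction, min_reads))
```

Under these assumptions the chance of collecting enough variant reads rises noticeably between the lower and upper ends of the reported depth range.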
Affiliation(s)
- Eva C Berglund
- Department of Medical Sciences, Molecular Medicine and Science for Life Laboratory, Uppsala University, Uppsala, Sweden.
|
70
|
Wang Q, Jia P, Li F, Chen H, Ji H, Hucks D, Dahlman KB, Pao W, Zhao Z. Detecting somatic point mutations in cancer genome sequencing data: a comparison of mutation callers. Genome Med 2013; 5:91. [PMID: 24112718 PMCID: PMC3971343 DOI: 10.1186/gm495] [Citation(s) in RCA: 128] [Impact Index Per Article: 10.7] [Received: 06/11/2013] [Accepted: 10/02/2013] [Indexed: 12/28/2022] Open
Abstract
BACKGROUND Driven by high throughput next generation sequencing technologies and the pressing need to decipher cancer genomes, computational approaches for detecting somatic single nucleotide variants (sSNVs) have undergone dramatic improvements during the past 2 years. The recently developed tools typically compare a tumor sample directly with a matched normal sample at each variant locus in order to increase the accuracy of sSNV calling. These programs also address the detection of sSNVs at low allele frequencies, allowing for the study of tumor heterogeneity, cancer subclones, and mutation evolution in cancer development. METHODS We used whole genome sequencing (Illumina Genome Analyzer IIx platform) of a melanoma sample and matched blood, whole exome sequencing (Illumina HiSeq 2000 platform) of 18 lung tumor-normal pairs and seven lung cancer cell lines to evaluate six tools for sSNV detection: EBCall, JointSNVMix, MuTect, SomaticSniper, Strelka, and VarScan 2, with a focus on MuTect and VarScan 2, two widely used publicly available software tools. Default/suggested parameters were used to run these tools. The missense sSNVs detected in these samples were validated through PCR and direct sequencing of genomic DNA from the samples. We also simulated 10 tumor-normal pairs to explore the ability of these programs to detect low allelic-frequency sSNVs. RESULTS Out of the 237 sSNVs successfully validated in our cancer samples, VarScan 2 and MuTect detected the most of any tools (that is, 204 and 192, respectively). MuTect identified 11 more low-coverage validated sSNVs than VarScan 2, but missed 11 more sSNVs with alternate alleles in normal samples than VarScan 2. When examining the false calls of each tool using 169 invalidated sSNVs, we observed >63% false calls detected in the lung cancer cell lines had alternate alleles in normal samples. 
Additionally, from our simulation data, VarScan 2 identified more sSNVs than other tools, while MuTect characterized most low allelic-fraction sSNVs. CONCLUSIONS Our study explored the typical false-positive and false-negative detections that arise from the use of sSNV-calling tools. Our results suggest that despite recent progress, these tools have significant room for improvement, especially in the discrimination of low coverage/allelic-frequency sSNVs and sSNVs with alternate alleles in normal samples.
Affiliation(s)
- Qingguo Wang
- Department of Biomedical Informatics, Vanderbilt University School of Medicine, Nashville, TN, USA
- Peilin Jia
- Department of Biomedical Informatics, Vanderbilt University School of Medicine, Nashville, TN, USA; Center for Quantitative Sciences, Vanderbilt University Medical Center, Nashville, TN, USA
- Fei Li
- State Key Laboratory of Cell Biology, Institute of Biochemistry and Cell Biology, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, Shanghai, China
- Haiquan Chen
- Department of Thoracic Surgery, Fudan University Shanghai Cancer Center, Shanghai, China; Department of Oncology, Shanghai Medical College, Shanghai, China
- Hongbin Ji
- State Key Laboratory of Cell Biology, Institute of Biochemistry and Cell Biology, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, Shanghai, China
- Donald Hucks
- Vanderbilt-Ingram Cancer Center, Vanderbilt University Medical Center, Nashville, TN, USA
- Kimberly Brown Dahlman
- Vanderbilt-Ingram Cancer Center, Vanderbilt University Medical Center, Nashville, TN, USA; Department of Cancer Biology, Vanderbilt University School of Medicine, Nashville, TN, USA
- William Pao
- Vanderbilt-Ingram Cancer Center, Vanderbilt University Medical Center, Nashville, TN, USA; Department of Medicine/Division of Hematology-Oncology, Vanderbilt University School of Medicine, Nashville, TN, USA
- Zhongming Zhao
- Department of Biomedical Informatics, Vanderbilt University School of Medicine, Nashville, TN, USA; Center for Quantitative Sciences, Vanderbilt University Medical Center, Nashville, TN, USA; Department of Cancer Biology, Vanderbilt University School of Medicine, Nashville, TN, USA; Department of Psychiatry, Vanderbilt University School of Medicine, Nashville, TN, USA
|