1
|
Ngoot-Chin T, Zulkifli MA, van de Weg E, Zaki NM, Serdari NM, Mustaffa S, Zainol Abidin MI, Sanusi NSNM, Smulders MJM, Low ETL, Ithnin M, Singh R. Detection of ploidy and chromosomal aberrations in commercial oil palm using high-throughput SNP markers. PLANTA 2021; 253:63. [PMID: 33544231 DOI: 10.1007/s00425-021-03567-7] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 10/05/2020] [Accepted: 01/04/2021] [Indexed: 05/14/2023]
Abstract
Karyotyping using high-density genome-wide SNP markers identified various chromosomal aberrations in oil palm (Elaeis guineensis Jacq.) with supporting evidence from the 2C DNA content measurements (determined using FCM) and chromosome counts. Oil palm produces a quarter of the world's total vegetable oil. In line with its global importance, an initiative to sequence the oil palm genome was carried out successfully, producing huge amounts of sequence information, allowing SNP discovery. High-capacity SNP genotyping platforms have been widely used for marker-trait association studies in oil palm. Besides genotyping, a SNP array is also an attractive tool for understanding aberrations in chromosome inheritance. Exploiting this, the present study utilized chromosome-wide SNP allelic distributions to determine the ploidy composition of over 1,000 oil palms from a commercial F1 family, including 197 derived from twin-embryo seeds. Our method consisted of an inspection of the allelic intensity ratio using SNP markers. For palms with a shifted or abnormal distribution ratio, the SNP allelic frequencies were plotted along the pseudo-chromosomes. This method proved to be efficient in identifying whole genome duplication (triploids) and aneuploidy. We also detected several loss of heterozygosity regions which may indicate small chromosomal deletions and/or inheritance of identical by descent regions from both parents. The SNP analysis was validated by flow cytometry and chromosome counts. The triploids were all derived from twin-embryo seeds. This is the first report on the efficiency and reliability of SNP array data for karyotyping oil palm chromosomes, as an alternative to the conventional cytogenetic technique. Information on the ploidy composition and chromosomal structural variation can help to better understand the genetic makeup of samples and lead to a more robust interpretation of the genomic data in marker-trait association analyses.
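The idea of reading ploidy from SNP allelic ratios can be illustrated with a minimal sketch. It is not the authors' pipeline: the function names, the candidate ploidies and the nearest-cluster rule are assumptions made for illustration. In a diploid, the B-allele fraction of a SNP is expected near 0, 1/2 or 1; in a triploid it is expected near 0, 1/3, 2/3 or 1.

```python
def expected_baf_clusters(ploidy):
    """Expected B-allele fractions when 0..ploidy copies carry the B allele."""
    return [k / ploidy for k in range(ploidy + 1)]


def mean_distance_to_clusters(bafs, ploidy):
    """Average distance from each observed fraction to its nearest expected cluster."""
    centers = expected_baf_clusters(ploidy)
    return sum(min(abs(b - c) for c in centers) for b in bafs) / len(bafs)


def guess_ploidy(bafs, candidates=(2, 3)):
    """Pick the candidate ploidy whose expected clusters fit the observations best.

    Hypothetical helper: the candidate set and the distance rule are assumptions.
    """
    return min(candidates, key=lambda p: mean_distance_to_clusters(bafs, p))


diploid_like = [0.02, 0.48, 0.52, 0.97]
triploid_like = [0.01, 0.31, 0.35, 0.66, 0.69, 0.99]
print(guess_ploidy(diploid_like), guess_ploidy(triploid_like))
```

Restricting the same comparison to the SNPs of one pseudo-chromosome would, under the same assumptions, flag a chromosome whose ratios shift away from the rest of the genome.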
Affiliation(s)
- Ting Ngoot-Chin
  - Malaysian Palm Oil Board (MPOB), 6, Persiaran Institusi, Bandar Baru Bangi, 43000, Kajang, Selangor, Malaysia
- Muhammad Azwan Zulkifli
  - Malaysian Palm Oil Board (MPOB), 6, Persiaran Institusi, Bandar Baru Bangi, 43000, Kajang, Selangor, Malaysia
- Eric van de Weg
  - Plant Breeding, Wageningen University and Research, Wageningen, The Netherlands
- Noorhariza Mohd Zaki
  - Malaysian Palm Oil Board (MPOB), 6, Persiaran Institusi, Bandar Baru Bangi, 43000, Kajang, Selangor, Malaysia
- Norhalida Mohamed Serdari
  - Malaysian Palm Oil Board (MPOB), 6, Persiaran Institusi, Bandar Baru Bangi, 43000, Kajang, Selangor, Malaysia
- Suzana Mustaffa
  - Malaysian Palm Oil Board (MPOB), 6, Persiaran Institusi, Bandar Baru Bangi, 43000, Kajang, Selangor, Malaysia
- Mohd Isa Zainol Abidin
  - Plant Breeding and Services Department, KULIM Plantations Berhad, 81900, Kota Tinggi, Johor, Malaysia
- Nik Shazana Nik Mohd Sanusi
  - Malaysian Palm Oil Board (MPOB), 6, Persiaran Institusi, Bandar Baru Bangi, 43000, Kajang, Selangor, Malaysia
- Eng Ti Leslie Low
  - Malaysian Palm Oil Board (MPOB), 6, Persiaran Institusi, Bandar Baru Bangi, 43000, Kajang, Selangor, Malaysia
- Maizura Ithnin
  - Malaysian Palm Oil Board (MPOB), 6, Persiaran Institusi, Bandar Baru Bangi, 43000, Kajang, Selangor, Malaysia
- Rajinder Singh
  - Malaysian Palm Oil Board (MPOB), 6, Persiaran Institusi, Bandar Baru Bangi, 43000, Kajang, Selangor, Malaysia
|
2
|
Ruan J, Liu Z, Sun M, Wang Y, Yue J, Yu G. DBS: a fast and informative segmentation algorithm for DNA copy number analysis. BMC Bioinformatics 2019; 20:1. [PMID: 30606105 PMCID: PMC6318921 DOI: 10.1186/s12859-018-2565-8] [Citation(s) in RCA: 115] [Impact Index Per Article: 19.2] [Received: 07/10/2017] [Accepted: 12/07/2018] [Indexed: 12/02/2022] Open
Abstract
Background Genome-wide DNA copy number changes are the hallmark events in the initiation and progression of cancers. Quantitative analysis of somatic copy number alterations (CNAs) has broad applications in cancer research. With the increasing capacity of high-throughput sequencing technologies, fast and efficient segmentation algorithms are required when characterizing high density CNAs data. Results A fast and informative segmentation algorithm, DBS (Deviation Binary Segmentation), is developed and discussed. The DBS method is based on the least absolute error principles and is inspired by the segmentation method rooted in the circular binary segmentation procedure. DBS uses point-by-point model calculation to ensure the accuracy of segmentation and combines a binary search algorithm with heuristics derived from the Central Limit Theorem. The DBS algorithm is very efficient requiring a computational complexity of O(n*log n), and is faster than its predecessors. Moreover, DBS measures the change-point amplitude of mean values of two adjacent segments at a breakpoint, where the significant degree of change-point amplitude is determined by the weighted average deviation at breakpoints. Accordingly, using the constructed binary tree of significant degree, DBS informs whether the results of segmentation are over- or under-segmented. Conclusion DBS is implemented in a platform-independent and open-source Java application (ToolSeg), including a graphical user interface and simulation data generation, as well as various segmentation methods in the native Java language.
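For orientation, the general family of methods the abstract refers to can be sketched as plain recursive binary segmentation. This is not the DBS algorithm and not the ToolSeg code: the function names, the weighting used to rank candidate splits and the `min_amplitude` stopping rule are assumptions. It only shows the shared idea of splitting a profile where the means of two adjacent segments differ most, and reporting the change-point amplitude at each breakpoint.

```python
import math


def best_split(values):
    """Return (index, amplitude) of the split with the largest weighted mean shift."""
    n = len(values)
    total = sum(values)
    left_sum = 0.0
    best_index, best_score, best_amplitude = None, 0.0, 0.0
    for i in range(1, n):
        left_sum += values[i - 1]
        # Change-point amplitude: absolute difference of the two segment means.
        amplitude = abs(left_sum / i - (total - left_sum) / (n - i))
        # Weighting (an assumption) that discourages splits hugging the edges.
        score = amplitude * math.sqrt(i * (n - i) / n)
        if score > best_score:
            best_index, best_score, best_amplitude = i, score, amplitude
    return best_index, best_amplitude


def segment(values, min_amplitude, offset=0):
    """Recursively split while the best split's amplitude reaches min_amplitude."""
    index, amplitude = best_split(values)
    if index is None or amplitude < min_amplitude:
        return []
    left = segment(values[:index], min_amplitude, offset)
    right = segment(values[index:], min_amplitude, offset + index)
    return left + [offset + index] + right


profile = [0] * 5 + [1] * 5 + [0] * 5
print(segment(profile, min_amplitude=0.3))
```

Raising `min_amplitude` yields fewer segments and lowering it yields more, which is the over- versus under-segmentation trade-off the abstract mentions.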
Affiliation(s)
- Jun Ruan
  - School of Information Engineering, Wuhan University of Technology, Wuhan, Hubei, 430070, China
- Zhen Liu
  - School of Information Engineering, Wuhan University of Technology, Wuhan, Hubei, 430070, China
- Ming Sun
  - School of Information Engineering, Wuhan University of Technology, Wuhan, Hubei, 430070, China
- Yue Wang
  - Department of Electrical and Computer Engineering, Virginia Polytechnic Institute and State University, Arlington, VA, 22203, USA
- Junqiu Yue
  - Department of Pathology, Hubei Cancer Hospital, Wuhan, Hubei, 430079, China
- Guoqiang Yu
  - Department of Electrical and Computer Engineering, Virginia Polytechnic Institute and State University, Arlington, VA, 22203, USA
|
3
|
Tran HV, Kiemer AK, Helms V. Copy Number Alterations in Tumor Genomes Deleting Antineoplastic Drug Targets Partially Compensated by Complementary Amplifications. Cancer Genomics Proteomics 2018; 15:365-378. [PMID: 30194077 PMCID: PMC6199575 DOI: 10.21873/cgp.20095] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 06/29/2018] [Revised: 07/14/2018] [Accepted: 07/17/2018] [Indexed: 01/06/2023] Open
Abstract
BACKGROUND/AIM Genomic DNA copy number alterations (CNAs) are frequent in tumors and have been catalogued by The Cancer Genome Atlas project. Emergence of chemoresistance frequently renders drug therapies ineffective. MATERIALS AND METHODS We analyzed how CNAs recurrently found in the genomes of TCGA patients of thirty-one tumor types affect protein targets of antineoplastic (AN) agents. RESULTS CNA deletions more frequently affected the targets of AN agents than CNA amplifications. Interestingly, in seven tumors we observed signs of compensatory CNAs. For example, in glioblastoma multiforme, two target genes (FLT1, FLT3) of the experimental drug sorafenib were recurrently deleted, whereas another target (KDR) of sorafenib was recurrently amplified. In renal clear cell carcinoma, the target FLT1 of pazopanib, sunitinib, sorafenib, and axitinib was recurrently deleted, whereas FLT4 bound by the same drugs, was recurrently amplified. CONCLUSION Deletions of AN target proteins can be compensated by amplification of alternative targets.
Affiliation(s)
- Ha Vu Tran
  - Saarland University, Center for Bioinformatics, Saarbruecken, Germany
  - Department of Computer Science, Faculty of Information Technology, Vietnam National University of Agriculture, Hanoi, Vietnam
- Alexandra K Kiemer
  - Saarland University, Department of Pharmacy, Pharmaceutical Biology, Saarbruecken, Germany
- Volkhard Helms
  - Saarland University, Center for Bioinformatics, Saarbruecken, Germany
|
4
|
|
5
|
Chen H, Jiang Y, Maxwell KN, Nathanson KL, Zhang N. Allele-specific copy number estimation by whole exome sequencing. Ann Appl Stat 2017; 11:1169-1192. [PMID: 28989557 DOI: 10.1214/17-aoas1043] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.9] [Indexed: 12/25/2022]
Abstract
Whole exome sequencing is currently a technology of choice in large-scale cancer genomics studies, where the priority is to identify cancer-associated variants in coding regions. We describe a method for estimating allele-specific copy number using whole exome sequencing data from tumor and matched normal.
|
6
|
Titsias MK, Holmes CC, Yau C. Statistical Inference in Hidden Markov Models Using k-Segment Constraints. J Am Stat Assoc 2016; 111:200-215. [PMID: 27226674 PMCID: PMC4867884 DOI: 10.1080/01621459.2014.998762] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Received: 11/01/2013] [Revised: 11/01/2014] [Indexed: 11/24/2022]
Abstract
Hidden Markov models (HMMs) are one of the most widely used statistical methods for analyzing sequence data. However, the reporting of output from HMMs has largely been restricted to the presentation of the most-probable (MAP) hidden state sequence, found via the Viterbi algorithm, or the sequence of most probable marginals using the forward–backward algorithm. In this article, we expand the amount of information we could obtain from the posterior distribution of an HMM by introducing linear-time dynamic programming recursions that, conditional on a user-specified constraint in the number of segments, allow us to (i) find MAP sequences, (ii) compute posterior probabilities, and (iii) simulate sample paths. We collectively call these recursions k-segment algorithms and illustrate their utility using simulated and real examples. We also highlight the prospective and retrospective use of k-segment constraints for fitting HMMs or exploring existing model fits. Supplementary materials for this article are available online.
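As background for the MAP sequence mentioned above, here is a minimal sketch of the standard Viterbi recursion in log space. It does not implement the k-segment algorithms of the paper; the two-state model in the usage lines is a made-up toy, and the code assumes every start, transition and emission probability it touches is strictly positive.

```python
import math


def viterbi(obs, start, trans, emit):
    """Most probable hidden state path for a discrete HMM.

    start[s], trans[r][s] and emit[s][o] are probabilities (assumed > 0).
    """
    n_states = len(start)
    score = [math.log(start[s]) + math.log(emit[s][obs[0]]) for s in range(n_states)]
    back = []  # back[t][s]: best predecessor of state s at step t + 1
    for o in obs[1:]:
        prev = score
        pointers, score = [], []
        for s in range(n_states):
            best = max(range(n_states), key=lambda r: prev[r] + math.log(trans[r][s]))
            pointers.append(best)
            score.append(prev[best] + math.log(trans[best][s]) + math.log(emit[s][o]))
        back.append(pointers)
    # Trace back from the best final state.
    state = max(range(n_states), key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]


# Toy two-state model (illustrative values only).
start = [0.5, 0.5]
trans = [[0.9, 0.1], [0.1, 0.9]]
emit = [[0.8, 0.2], [0.2, 0.8]]
print(viterbi([0, 0, 1, 1, 1], start, trans, emit))
```

With sticky transitions like these, a single discordant observation does not trigger a state change, which is why the number of segments in the MAP path is a natural quantity to constrain.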
|
7
|
Chen C, Zhang Y, Loomis MM, Upton MP, Lohavanichbutr P, Houck JR, Doody DR, Mendez E, Futran N, Schwartz SM, Wang P. Genome-Wide Loss of Heterozygosity and DNA Copy Number Aberration in HPV-Negative Oral Squamous Cell Carcinoma and Their Associations with Disease-Specific Survival. PLoS One 2015; 10:e0135074. [PMID: 26247464 PMCID: PMC4527746 DOI: 10.1371/journal.pone.0135074] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.4] [Received: 12/23/2014] [Accepted: 07/17/2015] [Indexed: 01/15/2023] Open
Abstract
Oral squamous cell cancer of the oral cavity and oropharynx (OSCC) is associated with high case-fatality. For reasons that are largely unknown, patients with the same clinical and pathologic staging have heterogeneous response to treatment and different probability of recurrence and survival, with patients with Human Papillomavirus (HPV)-positive oropharyngeal tumors having the most favorable survival. To gain insight into the complexity of OSCC and to identify potential chromosomal changes that may be associated with OSCC mortality, we used Affymetrix 6.0 SNP arrays to examine paired DNA from peripheral blood and tumor cell populations isolated by laser capture microdissection to assess genome-wide loss of heterozygosity (LOH) and DNA copy number aberration (CNA) and their associations with risk factors, tumor characteristics, and oral cancer-specific mortality among 75 patients with HPV-negative OSCC. We found a highly heterogeneous and complex genomic landscape of HPV-negative tumors, and identified regions in 4q, 8p, 9p and 11q that seem to play an important role in oral cancer biology and survival from this disease. If confirmed, these findings could assist in designing personalized treatment or in the creation of models to predict survival in patients with HPV-negative OSCC.
Affiliation(s)
- Chu Chen
  - Program in Epidemiology, Division of Public Health Sciences, Fred Hutchinson Cancer Research Center, Seattle, Washington, United States of America
  - Department of Otolaryngology–Head and Neck Surgery, University of Washington, Seattle, Washington, United States of America
  - Department of Epidemiology, University of Washington, Seattle, Washington, United States of America
- Yuzheng Zhang
  - Program in Biostatistics and Biomathematics, Division of Public Health Sciences, Fred Hutchinson Cancer Research Center, Seattle, Washington, United States of America
- Melissa M. Loomis
  - Program in Epidemiology, Division of Public Health Sciences, Fred Hutchinson Cancer Research Center, Seattle, Washington, United States of America
- Melissa P. Upton
  - Department of Pathology, University of Washington, Seattle, Washington, United States of America
- Pawadee Lohavanichbutr
  - Program in Epidemiology, Division of Public Health Sciences, Fred Hutchinson Cancer Research Center, Seattle, Washington, United States of America
- John R. Houck
  - Program in Epidemiology, Division of Public Health Sciences, Fred Hutchinson Cancer Research Center, Seattle, Washington, United States of America
- David R. Doody
  - Program in Epidemiology, Division of Public Health Sciences, Fred Hutchinson Cancer Research Center, Seattle, Washington, United States of America
- Eduardo Mendez
  - Program in Epidemiology, Division of Public Health Sciences, Fred Hutchinson Cancer Research Center, Seattle, Washington, United States of America
  - Department of Otolaryngology–Head and Neck Surgery, University of Washington, Seattle, Washington, United States of America
  - Clinical Research Division, Fred Hutchinson Cancer Research Center, Seattle, Washington, United States of America
- Neal Futran
  - Department of Otolaryngology–Head and Neck Surgery, University of Washington, Seattle, Washington, United States of America
- Stephen M. Schwartz
  - Program in Epidemiology, Division of Public Health Sciences, Fred Hutchinson Cancer Research Center, Seattle, Washington, United States of America
  - Department of Epidemiology, University of Washington, Seattle, Washington, United States of America
- Pei Wang
  - Program in Biostatistics and Biomathematics, Division of Public Health Sciences, Fred Hutchinson Cancer Research Center, Seattle, Washington, United States of America
  - Department of Genetics and Genomics Sciences, Mt. Sinai School of Medicine, New York, New York, United States of America
|
8
|
Lai Y, Gastwirth JL. Outlier reset CUSUM for the exploration of copy number alteration data. Stat Appl Genet Mol Biol 2015; 14:333-45. [PMID: 26087068 DOI: 10.1515/sagmb-2014-0027] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/15/2022]
Abstract
Copy number alteration (CNA) data have been collected to study disease related chromosomal amplifications and deletions. The CUSUM procedure and related plots have been used to explore CNA data. In practice, it is possible to observe outliers. Then, modifications of the CUSUM procedure may be required. An outlier reset modification of the CUSUM (ORCUSUM) procedure is developed in this paper. The threshold value for detecting outliers or significant CUSUMs can be derived using results for sums of independent truncated normal random variables. Bartels' non-parametric test for autocorrelation is also introduced to the analysis of copy number variation data. Our simulation results indicate that the ORCUSUM procedure can still be used even in the situation where the level of autocorrelation is low. Furthermore, the results show the outlier's impact on the traditional CUSUM's performance and illustrate the advantage of the ORCUSUM's outlier reset feature. Additionally, we discuss how the ORCUSUM can be applied to examine CNA data with a simulated data set. To illustrate the procedure, recently collected single nucleotide polymorphism (SNP) based CNA data from The Cancer Genome Atlas (TCGA) Research Network is analyzed. The method is applied to a data set collected in an ovarian cancer study. Three cytogenetic bands (cytobands) are considered to illustrate the method. The cytobands 11q13 and 9p21 have been shown to be related to ovarian cancer. They are presented as positive examples. The cytoband 3q22, which is less likely to be disease related, is presented as a negative example. These results illustrate the usefulness of the ORCUSUM procedure as an exploratory tool for the analysis of SNP based CNA data.
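A minimal sketch of the traditional CUSUM that the abstract takes as its starting point may help. It deliberately omits the outlier reset and the truncated-normal threshold of ORCUSUM, which the abstract does not specify; the function names and the toy signal are assumptions.

```python
from itertools import accumulate


def cusum(values):
    """Cumulative sum of deviations from the overall mean."""
    mean = sum(values) / len(values)
    return list(accumulate(v - mean for v in values))


def estimate_change_point(values):
    """Index of the first observation after the largest absolute CUSUM excursion."""
    sums = cusum(values)
    k = max(range(len(sums)), key=lambda i: abs(sums[i]))
    return k + 1


signal = [0, 0, 0, 0, 2, 2, 2, 2]
print(cusum(signal))
print(estimate_change_point(signal))
```

Because every deviation is accumulated, a single extreme value shifts the whole curve from that point on, which is the sensitivity that motivates an outlier reset.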
|
9
|
Xia H, Liu Y, Wang M, Li A. Identification of Genomic Aberrations in Cancer Subclones from Heterogeneous Tumor Samples. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2015; 12:679-685. [PMID: 26357278 DOI: 10.1109/tcbb.2014.2366114] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/05/2023]
Abstract
Tumor samples are usually heterogeneous, containing an admixture of more than one kind of tumor subclone. Studies of genomic aberrations from heterogeneous tumor data are hindered by the mixed signal of tumor subclone cells. Most of the existing algorithms cannot distinguish contributions of different subclones from the measured single nucleotide polymorphism (SNP) array signals, which may cause erroneous estimation of genomic aberrations. Here, we have introduced a computational method, Cancer Heterogeneity Analysis from SNP-array Experiments (CHASE), to automatically detect subclone proportions and genomic aberrations from heterogeneous tumor samples. Our method is based on an HMM, and incorporates the EM algorithm to build a statistical model of the mixed signal of multiple tumor subclones. We tested the proposed approach on simulated datasets and two real datasets, and the results show that the proposed method can efficiently estimate tumor subclone proportions and recover the genomic aberrations.
|
10
|
A hidden Markov approach for ascertaining cSNP genotypes from RNA sequence data in the presence of allelic imbalance by exploiting linkage disequilibrium. BMC Bioinformatics 2015; 16:61. [PMID: 25887316 PMCID: PMC4351697 DOI: 10.1186/s12859-015-0479-2] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 02/03/2014] [Accepted: 01/27/2015] [Indexed: 12/30/2022] Open
Abstract
BACKGROUND Allelic specific expression (ASE) increases our understanding of the genetic control of gene expression and its links to phenotypic variation. ASE testing is implemented through binomial or beta-binomial tests of sequence read counts of alternative alleles at a cSNP of interest in heterozygous individuals. This requires prior ascertainment of the cSNP genotypes for all individuals. To meet this need, we propose hidden Markov methods to call SNPs from next generation RNA sequence data when ASE possibly exists. RESULTS We propose two hidden Markov models (HMMs), HMM-ASE and HMM-NASE, that consider or do not consider ASE, respectively, in order to improve genotyping accuracy. Both HMMs call the genotypes of several SNPs simultaneously, which exploits the dependence among SNPs, and allow for mapping error, which corrects the bias it introduces. In addition, HMM-ASE exploits ASE information to further improve genotype accuracy when ASE is likely to be present. Simulation results indicate that the proposed HMMs demonstrate very good prediction accuracy in terms of controlling both the false discovery rate (FDR) and the false negative rate (FNR). When ASE is present, HMM-ASE had a lower FNR than HMM-NASE, while both can control the FDR at a similar level. By exploiting linkage disequilibrium (LD), a real data application demonstrates that the proposed methods have better sensitivity and similar FDR in calling heterozygous SNPs than the VarScan method. Sensitivity and FDR are similar to those of the BCFtools and Beagle methods. The resulting genotypes show good properties for the estimation of the genetic parameters and ASE ratios. CONCLUSIONS We introduce HMMs, which are able to exploit LD and account for the ASE and mapping errors, to simultaneously call SNPs from the next generation RNA sequence data. The method introduced can reliably call cSNP genotypes even in the presence of ASE and under low sequencing coverage. As a byproduct, the proposed method is able to provide predictions of ASE ratios for the heterozygous genotypes, which can then be used for ASE testing.
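The binomial ASE test mentioned in the background can be sketched as an exact two-sided test on the reference-allele read count at a heterozygous cSNP. This is a generic illustration, not the paper's code: the function name and the rule of summing all outcomes no more likely than the observed one are assumptions, and the beta-binomial variant is not shown.

```python
from math import comb


def binomial_two_sided_p(k, n, p=0.5):
    """Exact two-sided binomial p-value for k reference reads out of n.

    Sums the probabilities of all outcomes that are no more likely than the
    observed count under the null of balanced expression (p = 0.5 by default).
    """
    pmf = [comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(n + 1)]
    cutoff = pmf[k] * (1 + 1e-9)  # small tolerance for floating-point ties
    return min(1.0, sum(x for x in pmf if x <= cutoff))


print(binomial_two_sided_p(5, 10))  # balanced counts
print(binomial_two_sided_p(0, 10))  # all reads from one allele
```

The test presupposes that the individual really is heterozygous at the site, which is why genotype calling has to come first.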
|
11
|
Broderick T, Mackey L, Paisley J, Jordan MI. Combinatorial Clustering and the Beta Negative Binomial Process. IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE 2015; 37:290-306. [PMID: 26353242 DOI: 10.1109/tpami.2014.2318721] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.2] [Indexed: 06/05/2023]
Abstract
We develop a Bayesian nonparametric approach to a general family of latent class problems in which individuals can belong simultaneously to multiple classes and where each class can be exhibited multiple times by an individual. We introduce a combinatorial stochastic process known as the negative binomial process (NBP) as an infinite-dimensional prior appropriate for such problems. We show that the NBP is conjugate to the beta process, and we characterize the posterior distribution under the beta-negative binomial process (BNBP) and hierarchical models based on the BNBP (the HBNBP). We study the asymptotic properties of the BNBP and develop a three-parameter extension of the BNBP that exhibits power-law behavior. We derive MCMC algorithms for posterior inference under the HBNBP, and we present experiments using these algorithms in the domains of image segmentation, object recognition, and document analysis.
|
12
|
Chen H, Bell JM, Zavala NA, Ji HP, Zhang NR. Allele-specific copy number profiling by next-generation DNA sequencing. Nucleic Acids Res 2014; 43:e23. [PMID: 25477383 PMCID: PMC4344483 DOI: 10.1093/nar/gku1252] [Citation(s) in RCA: 38] [Impact Index Per Article: 3.5] [Indexed: 12/20/2022] Open
Abstract
The progression and clonal development of tumors often involve amplifications and deletions of genomic DNA. Estimation of allele-specific copy number, which quantifies the number of copies of each allele at each variant locus rather than the total number of chromosome copies, is an important step in the characterization of tumor genomes and the inference of their clonal history. We describe a new method, falcon, for finding somatic allele-specific copy number changes by next generation sequencing of tumors with matched normals. falcon is based on a change-point model on a bivariate mixed Binomial process, which explicitly models the copy numbers of the two chromosome haplotypes and corrects for local allele-specific coverage biases. By using the Binomial distribution rather than a normal approximation, falcon more effectively pools evidence from sites with low coverage. A modified Bayesian information criterion is used to guide model selection for determining the number of copy number events. Falcon is evaluated on in silico spike-in data and applied to the analysis of a pre-malignant colon tumor sample and late-stage colorectal adenocarcinoma from the same individual. The allele-specific copy number estimates obtained by falcon allow us to draw detailed conclusions regarding the clonal history of the individual's colon cancer.
Affiliation(s)
- Hao Chen
  - Department of Statistics, University of California, One Shields Avenue, Davis, CA 95616, USA
- John M Bell
  - Division of Oncology, School of Medicine, Stanford University, 291 Campus Dr, Stanford, CA 94305, USA
- Nicolas A Zavala
  - Division of Oncology, School of Medicine, Stanford University, 291 Campus Dr, Stanford, CA 94305, USA
- Hanlee P Ji
  - Division of Oncology, School of Medicine, Stanford University, 291 Campus Dr, Stanford, CA 94305, USA
- Nancy R Zhang
  - Department of Statistics, The Wharton School, University of Pennsylvania, 3730 Walnut Street, Philadelphia, PA 19104, USA
|
13
|
Pierre-Jean M, Rigaill G, Neuvial P. Performance evaluation of DNA copy number segmentation methods. Brief Bioinform 2014; 16:600-15. [PMID: 25202135 PMCID: PMC4501247 DOI: 10.1093/bib/bbu026] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.2] [Received: 03/12/2014] [Accepted: 06/10/2014] [Indexed: 11/13/2022] Open
Abstract
A number of bioinformatic or biostatistical methods are available for analyzing DNA copy number profiles measured from microarray or sequencing technologies. In the absence of rich enough gold standard data sets, the performance of these methods is generally assessed using unrealistic simulation studies, or based on small real data analyses. To make an objective and reproducible performance assessment, we have designed and implemented a framework to generate realistic DNA copy number profiles of cancer samples with known truth. These profiles are generated by resampling publicly available SNP microarray data from genomic regions with known copy-number state. The original data have been extracted from dilution series of tumor cell lines with matched blood samples at several concentrations. Therefore, the signal-to-noise ratio of the generated profiles can be controlled through the (known) percentage of tumor cells in the sample. This article describes this framework and its application to a comparison study between methods for segmenting DNA copy number profiles from SNP microarrays. This study indicates that no single method is uniformly better than all others. It also helps identify pros and cons of the compared methods as a function of biologically informative parameters, such as the fraction of tumor cells in the sample and the proportion of heterozygous markers. This comparison study may be reproduced using the open source and cross-platform R package jointseg, which implements the proposed data generation and evaluation framework: http://r-forge.r-project.org/R/?group_id=1562.
|
14
|
Nadauld LD, Garcia S, Natsoulis G, Bell JM, Miotke L, Hopmans ES, Xu H, Pai RK, Palm C, Regan JF, Chen H, Flaherty P, Ootani A, Zhang NR, Ford JM, Kuo CJ, Ji HP. Metastatic tumor evolution and organoid modeling implicate TGFBR2 as a cancer driver in diffuse gastric cancer. Genome Biol 2014; 15:428. [PMID: 25315765 PMCID: PMC4145231 DOI: 10.1186/s13059-014-0428-9] [Citation(s) in RCA: 97] [Impact Index Per Article: 8.8] [Received: 04/29/2014] [Accepted: 08/27/2014] [Indexed: 12/30/2022] Open
Abstract
Background Gastric cancer is the second-leading cause of global cancer deaths, with metastatic disease representing the primary cause of mortality. To identify candidate drivers involved in oncogenesis and tumor evolution, we conduct an extensive genome sequencing analysis of metastatic progression in a diffuse gastric cancer. This involves a comparison between a primary tumor from a hereditary diffuse gastric cancer syndrome proband and its recurrence as an ovarian metastasis. Results Both the primary tumor and ovarian metastasis have common biallelic loss-of-function of both the CDH1 and TP53 tumor suppressors, indicating a common genetic origin. While the primary tumor exhibits amplification of the Fibroblast growth factor receptor 2 (FGFR2) gene, the metastasis notably lacks FGFR2 amplification but rather possesses unique biallelic alterations of Transforming growth factor-beta receptor 2 (TGFBR2), indicating the divergent in vivo evolution of a TGFBR2-mutant metastatic clonal population in this patient. As TGFBR2 mutations have not previously been functionally validated in gastric cancer, we modeled the metastatic potential of TGFBR2 loss in a murine three-dimensional primary gastric organoid culture. The Tgfbr2 shRNA knockdown within Cdh1-/-; Tp53-/- organoids generates invasion in vitro and robust metastatic tumorigenicity in vivo, confirming Tgfbr2 metastasis suppressor activity. Conclusions We document the metastatic differentiation and genetic heterogeneity of diffuse gastric cancer and reveal the potential metastatic role of TGFBR2 loss-of-function. In support of this study, we apply a murine primary organoid culture method capable of recapitulating in vivo metastatic gastric cancer. Overall, we describe an integrated approach to identify and functionally validate putative cancer drivers involved in metastasis. 
Electronic supplementary material The online version of this article (doi:10.1186/s13059-014-0428-9) contains supplementary material, which is available to authorized users.
|
15
|
Xia R, Vattathil S, Scheet P. Identification of allelic imbalance with a statistical model for subtle genomic mosaicism. PLoS Comput Biol 2014; 10:e1003765. [PMID: 25166618 PMCID: PMC4148184 DOI: 10.1371/journal.pcbi.1003765] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.5] [Received: 04/07/2014] [Accepted: 05/22/2014] [Indexed: 11/18/2022] Open
Abstract
Genetic heterogeneity in a mixed sample of tumor and normal DNA can confound characterization of the tumor genome. Numerous computational methods have been proposed to detect aberrations in DNA samples from tumor and normal tissue mixtures. Most of these require tumor purities to be at least 10-15%. Here, we present a statistical model to capture information, contained in the individual's germline haplotypes, about expected patterns in the B allele frequencies from SNP microarrays while fully modeling their magnitude, the first such model for SNP microarray data. Our model consists of a pair of hidden Markov models--one for the germline and one for the tumor genome--which, conditional on the observed array data and patterns of population haplotype variation, have a dependence structure induced by the relative imbalance of an individual's inherited haplotypes. Together, these hidden Markov models offer a powerful approach for dealing with mixtures of DNA where the main component represents the germline, thus suggesting natural applications for the characterization of primary clones when stromal contamination is extremely high, and for identifying lesions in rare subclones of a tumor when tumor purity is sufficient to characterize the primary lesions. Our joint model for germline haplotypes and acquired DNA aberration is flexible, allowing a large number of chromosomal alterations, including balanced and imbalanced losses and gains, copy-neutral loss-of-heterozygosity (LOH) and tetraploidy. We found our model (which we term J-LOH) to be superior for localizing rare aberrations in a simulated 3% mixture sample. More generally, our model provides a framework for full integration of the germline and tumor genomes to deal more effectively with missing or uncertain features, and thus extract maximal information from difficult scenarios where existing methods fail.
Affiliation(s)
- Rui Xia
  - Department of Epidemiology, The University of Texas MD Anderson Cancer Center, Houston, Texas, United States of America
  - Division of Biostatistics, The University of Texas School of Public Health, Houston, Texas, United States of America
- Selina Vattathil
  - Department of Epidemiology, The University of Texas MD Anderson Cancer Center, Houston, Texas, United States of America
  - Human & Molecular Genetics Program, The University of Texas Graduate School of Biomedical Sciences, Houston, Texas, United States of America
- Paul Scheet
  - Department of Epidemiology, The University of Texas MD Anderson Cancer Center, Houston, Texas, United States of America
  - Division of Biostatistics, The University of Texas School of Public Health, Houston, Texas, United States of America
  - Human & Molecular Genetics Program, The University of Texas Graduate School of Biomedical Sciences, Houston, Texas, United States of America
|
16
|
Lin YJ, Chen YT, Hsu SN, Peng CH, Tang CY, Yen TC, Hsieh WP. HaplotypeCN: copy number haplotype inference with Hidden Markov Model and localized haplotype clustering. PLoS One 2014; 9:e96841. [PMID: 24849202 PMCID: PMC4029584 DOI: 10.1371/journal.pone.0096841] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Received: 04/29/2013] [Accepted: 04/11/2014] [Indexed: 11/18/2022]
Abstract
Copy number variation (CNV) has been reported to be associated with disease and various cancers. Hence, accurately identifying the position and type of CNV is currently a critical issue. Many tools aim at detecting CNV regions, constructing haplotype phases on CNV regions, or estimating numerical copy numbers. However, none of them can do all three tasks at the same time. This paper presents a method based on a hidden Markov model to detect parent-specific copy number change on both chromosomes with signals from SNP arrays. A haplotype tree is constructed with dynamic branch merging to model the transition of the copy number status of the two alleles assessed at each SNP locus. The emission models are constructed for the genotypes formed with the two haplotypes. The proposed method can provide the segmentation points of the CNV regions as well as the haplotype phasing for the allelic status on each chromosome. The estimated copy numbers are provided as fractional numbers, which can accommodate the somatic mutation in cancer specimens that usually consist of heterogeneous cell populations. The algorithm is evaluated on simulated data and the previously published CNV regions of the 270 HapMap individuals. The results were compared with five popular methods: PennCNV, genoCN, COKGEN, QuantiSNP and cnvHap. The application to oral cancer samples demonstrates how the proposed method can facilitate clinical association studies. The proposed algorithm exhibits sensitivity for CNV regions comparable to that of the best algorithm in our genome-wide study and demonstrates the highest detection rate in SNP-dense regions. In addition, we provide better haplotype phasing accuracy than similar approaches. The clinical association carried out with our fractional estimate of copy numbers in the cancer samples provides better detection power than that with integer copy number states.
Affiliation(s)
- Yen-Jen Lin
  - Department of Computer Science, National Tsing Hua University, Hsinchu, Taiwan
- Yu-Tin Chen
  - Department of Computer Science, National Tsing Hua University, Hsinchu, Taiwan
- Shu-Ni Hsu
  - Institute of Statistics, National Tsing Hua University, Hsinchu, Taiwan
- Chien-Hua Peng
  - Department of Resource Center for Clinical Research, Chang Gung Memorial Hospital, Taoyuan, Taiwan
- Chuan-Yi Tang
  - Department of Computer Science, National Tsing Hua University, Hsinchu, Taiwan
  - Department of Computer Science and Information Engineering, Providence University, Taichung, Taiwan
- Tzu-Chen Yen
  - Head and Neck Oncology Group, Chang Gung Memorial Hospital, Taoyuan, Taiwan
  - Nuclear Medicine and Molecular Imaging Center, Chang Gung Memorial Hospital, Taoyuan, Taiwan
- Wen-Ping Hsieh
  - Institute of Statistics, National Tsing Hua University, Hsinchu, Taiwan
|
17
|
Genome-wide identification of somatic aberrations from paired normal-tumor samples. PLoS One 2014; 9:e87212. [PMID: 24498045 PMCID: PMC3907544 DOI: 10.1371/journal.pone.0087212] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.6] [Received: 08/30/2013] [Accepted: 12/26/2013] [Indexed: 12/13/2022]
Abstract
Genomic copy number alteration and allelic imbalance are distinct features of cancer cells, and recent advances in genotyping technology have greatly boosted research on the cancer genome. However, the complicated nature of tumors usually hampers the analysis of SNP array data. In this study, we describe a bioinformatic tool, named GIANT, for genome-wide identification of somatic aberrations from paired normal-tumor samples measured with SNP arrays. By efficiently incorporating genotype information from the matched normal sample, it accurately detects different types of aberrations in the cancer genome, even for aneuploid tumor samples with severe normal cell contamination. Furthermore, it allows for discovery of recurrent aberrations with critical biological properties in tumorigenesis by using a statistical significance test. We demonstrate the superior performance of the proposed method on various datasets including tumor replicate pairs, simulated SNP arrays and dilution series of normal-cancer cell lines. Results show that GIANT has the potential to detect genomic aberrations even when the cancer cell proportion is as low as 5-10%. Application to a large number of paired tumor samples delivers a genome-wide profile of the statistical significance of the various aberrations, including amplification, deletion and LOH. We believe that GIANT represents a powerful bioinformatic tool for interpreting complex genomic aberrations, thus assisting both academic study and the clinical treatment of cancer.
|
18
|
Baugher JD, Baugher BD, Shirley MD, Pevsner J. Sensitive and specific detection of mosaic chromosomal abnormalities using the Parent-of-Origin-based Detection (POD) method. BMC Genomics 2013; 14:367. [PMID: 23724825 PMCID: PMC3680018 DOI: 10.1186/1471-2164-14-367] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.2] [Received: 09/03/2012] [Accepted: 05/14/2013] [Indexed: 11/25/2022]
Abstract
Background: Mosaic somatic alterations are present in all multi-cellular organisms, but the physiological effects of low-level mosaicism are largely unknown. Most mosaic alterations remain undetectable with current analytical approaches, although the presence of such alterations is increasingly implicated as causative for disease. Results: Here, we present the Parent-of-Origin-based Detection (POD) method for chromosomal abnormality detection in trio-based SNP microarray data. Our software implementation, triPOD, was benchmarked using a simulated dataset, outperformed comparable software for sensitivity of abnormality detection, and displayed substantial improvement in the detection of low-level mosaicism while maintaining comparable specificity. Examples of low-level mosaic abnormalities from a large autism dataset demonstrate the benefits of the increased sensitivity provided by triPOD. The triPOD analyses showed robustness across multiple types of Illumina microarray chips. Two large, clinically-relevant datasets were characterized and compared. Conclusions: Our method and software provide a significant advancement in the ability to detect low-level mosaic abnormalities, thereby opening new avenues for research into the implications of mosaicism in pathogenic and non-pathogenic processes.
Affiliation(s)
- Joseph D Baugher
  - Program in Biochemistry, Cellular and Molecular Biology, Johns Hopkins School of Medicine, Baltimore, MD 21205, USA
|
19
|
Shadravan F. Sex bias in copy number variation of olfactory receptor gene family depends on ethnicity. Front Genet 2013; 4:32. [PMID: 23503716 PMCID: PMC3596775 DOI: 10.3389/fgene.2013.00032] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.5] [Received: 08/30/2012] [Accepted: 02/26/2013] [Indexed: 12/22/2022]
Abstract
Gender plays a pivotal role in the human genetic identity and is also manifested in many genetic disorders, particularly mental retardation. In this study, its effect on copy number variation (CNV), which is known to cause genetic disorders, was explored. As the olfactory receptor (OR) repertoire comprises the largest human gene family, it was selected for this study, which was carried out within and between three populations, derived from 150 individuals from the 1000 Genomes Project. Analysis of 3872 CNVs detected among 791 OR loci, of which 307 loci showed CNV, revealed the following novel findings: sex bias in CNV was significantly more prevalent in uncommon than in common CNV variants of OR pseudogenes, in which the male genome showed more CNVs, and in one-copy number loss compared to complete deletion of OR pseudogenes; both findings imply a more recent evolutionary role for gender. Sex bias in copy number gain was also detected. Another novel finding was that the observed sex bias was largely dependent on ethnicity and was in general absent in East Asians. Using a public CNV database for sick children (International Standard Cytogenomic Array Consortium), the application of these findings for improving clinical molecular diagnostics is discussed by showing an example of sex bias in CNV among children with autism. Additional clinical relevance is discussed, as the most polymorphic CNV-enriched OR cluster in the human genome, located on chr 15q11.2, is found near the Prader–Willi syndrome/Angelman syndrome bi-directionally imprinted region associated with two well-known mental retardation syndromes. As olfaction represents the primitive cognition in most mammals, arguably in competition with the development of a larger brain, the extensive retention of OR pseudogenes in the females of this study might point to a parent-of-origin indirect regulatory role for OR pseudogenes in the embryonic development of the human brain. Thus, any perturbation in the temporal regulation of the olfactory system could lead to developmental delay disorders, including mental retardation.
Affiliation(s)
- Farideh Shadravan
  - *Correspondence: Farideh Shadravan, 2584 San Jose Ave, San Francisco, CA 94112, USA.
|
20
|
Shen R, Wang S, Mo Q. Sparse integrative clustering of multiple omics data sets. Ann Appl Stat 2013; 7:269-294. [PMID: 24587839 DOI: 10.1214/12-aoas578] [Citation(s) in RCA: 72] [Impact Index Per Article: 6.0] [Indexed: 01/04/2023]
Abstract
High-resolution microarrays and second-generation sequencing platforms are powerful tools to investigate genome-wide alterations in DNA copy number, methylation, and gene expression associated with a disease. An integrated genomic profiling approach measuring multiple omics data types simultaneously in the same set of biological samples would render an integrated data resolution that would not be available with any single data type. In this study, we use penalized latent variable regression methods for joint modeling of multiple omics data types to identify common latent variables that can be used to cluster patient samples into biologically and clinically relevant disease subtypes. We consider lasso (Tibshirani, 1996), elastic net (Zou and Hastie, 2005), and fused lasso (Tibshirani et al., 2005) methods to induce sparsity in the coefficient vectors, revealing important genomic features that have significant contributions to the latent variables. An iterative ridge regression is used to compute the sparse coefficient vectors. In model selection, a uniform design (Fang and Wang, 1994) is used to seek "experimental" points scattered uniformly across the search domain for efficient sampling of tuning parameter combinations. We compared our method to sparse singular value decomposition (SVD) and penalized Gaussian mixture model (GMM) using both real and simulated data sets. The proposed method is applied to integrate genomic, epigenomic, and transcriptomic data for subtype analysis in breast and lung cancer data sets.
|
21
|
Vattathil S, Scheet P. Haplotype-based profiling of subtle allelic imbalance with SNP arrays. Genome Res 2013; 23:152-8. [PMID: 23028187 PMCID: PMC3530675 DOI: 10.1101/gr.141374.112] [Citation(s) in RCA: 39] [Impact Index Per Article: 3.3] [Received: 04/04/2012] [Accepted: 09/14/2012] [Indexed: 01/19/2023]
Abstract
Due to limitations of surgical dissection and tumor heterogeneity, tumor samples collected for cancer genomics studies are often heavily diluted with normal tissue or contain subpopulations of cells harboring important aberrations. Methods for profiling tumor-associated allelic imbalance in such scenarios break down at aberrant cell proportions of 10%-15% and below. Here, we present an approach that offers a vast improvement for detection of subtle allelic imbalance, or low proportions of cells harboring aberrant allelic ratio among nonaberrant cells, in unpaired tumor samples using SNP microarrays. We leverage the expected pattern of allele-specific intensity ratios determined by an individual's germline haplotypes, information that has been ignored in existing approaches. We demonstrate our method on real and simulated data from the CRL-2324 breast cancer cell line genotyped on the Illumina 370K array. Assuming a 5 million SNP array, we can detect the presence of aberrant cells in proportions lower than 0.25% in the breast cancer sample, approaching the sensitivity of some minimal residual disease assays. Further, we apply a hidden Markov model to identify copy-neutral LOH (loss of heterozygosity) events as short as 11 Mb in mixtures of only 4% tumor using 370K data. We anticipate our approach will offer a new paradigm for genomic profiling of heterogeneous samples.
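To see why aberrant cell proportions of a few percent are hard to detect, it helps to work out the expected B allele frequency (BAF) at a germline-heterozygous SNP in a mixed sample. The sketch below is only an illustration under simple assumptions (a diploid AB normal component, a single aberrant clone, no array noise or bias); the function name and parameters are hypothetical and are not taken from the paper.

```python
def expected_baf(aberrant_fraction, a_copies, b_copies):
    """Expected BAF at a germline-heterozygous (AB) SNP in a cell mixture.

    aberrant_fraction: proportion of cells carrying the aberration (0..1).
    a_copies, b_copies: copies of the A and B allele in the aberrant cells.
    Assumption: the remaining cells are normal diploid with one A and one B.
    """
    normal_fraction = 1.0 - aberrant_fraction
    # Allele copies averaged over the two cell populations.
    b_total = normal_fraction * 1 + aberrant_fraction * b_copies
    all_total = normal_fraction * 2 + aberrant_fraction * (a_copies + b_copies)
    return b_total / all_total


def baf_shift(aberrant_fraction, a_copies, b_copies):
    """Deviation of the expected BAF from the balanced value of 0.5."""
    return expected_baf(aberrant_fraction, a_copies, b_copies) - 0.5
```

Under these assumptions, copy-neutral LOH (`a_copies=0`, `b_copies=2`) in 4% of cells gives an expected BAF of 0.52, a shift of only 0.02 from the balanced value. A shift that small is difficult to see at any single SNP, which is consistent with the abstract's emphasis on pooling evidence across SNPs using germline haplotype patterns.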
Affiliation(s)
- Selina Vattathil
  - Human & Molecular Genetics Program, The University of Texas Graduate School of Biomedical Sciences, Houston, Texas 77030, USA.
|
22
|
Lai Y. Change-point analysis of paired allele-specific copy number variation data. J Comput Biol 2012; 19:679-93. [PMID: 22697241 DOI: 10.1089/cmb.2012.0031] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.3] [Indexed: 12/15/2022]
Abstract
The recent genome-wide allele-specific copy number variation data enable us to explore two types of genomic information including chromosomal genotype variations as well as DNA copy number variations. For a cancer study, it is common to collect data for paired normal and tumor samples. Then, two types of paired data can be obtained to study a disease subject. However, there is a lack of methods for a simultaneous analysis of these four sequences of data. In this study, we propose a statistical framework based on the change-point analysis approach. The validity and usefulness of our proposed statistical framework are demonstrated through the simulation studies and applications based on an experimental data set.
Affiliation(s)
- Yinglei Lai
  - Department of Statistics and Biostatistics Center, The George Washington University, Washington, DC, USA
|
23
|
Ortiz-Estevez M, Aramburu A, Rubio A. Getting DNA copy numbers without control samples. Algorithms Mol Biol 2012; 7:19. [PMID: 22898240 PMCID: PMC3512512 DOI: 10.1186/1748-7188-7-19] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 12/02/2011] [Accepted: 06/15/2012] [Indexed: 01/30/2023]
Abstract
Background: The selection of the reference used to scale the data in a copy number analysis is of paramount importance for achieving accurate estimates. Usually this reference is generated using control samples included in the study. However, these control samples are not always available, and in these cases an artificial reference must be created. Proper generation of this signal is crucial in terms of both noise and bias. We propose NSA (Normality Search Algorithm), a scaling method that works with and without control samples. It is based on the assumption that genomic regions enriched in SNPs with identical copy numbers in both alleles are likely to be normal. These normal regions are predicted for each sample individually and used to calculate the final reference signal. NSA can be applied to any CN data regardless of the microarray technology and preprocessing method. It also finds an optimal weighting of the samples, minimizing possible batch effects. Results: Five human datasets (a subset of HapMap samples, Glioblastoma Multiforme (GBM), Ovarian, Prostate and Lung Cancer experiments) have been analyzed. It is shown that, using only tumoral samples, NSA is able to remove the bias in the copy number estimation, to reduce the noise and therefore to increase the ability to detect copy number aberrations (CNAs). These improvements allow NSA to also detect recurrent aberrations more accurately than other state-of-the-art methods. Conclusions: NSA provides a robust and accurate reference for scaling probe signal data to CN values without the need for control samples. It minimizes the problems of bias, noise and batch effects in the estimation of CNs. Therefore, the NSA scaling approach helps to detect recurrent CNAs better than current methods. The automatic selection of references makes it useful for bulk analysis of many GEO or ArrayExpress experiments without the need to develop a parser to find the normal samples or possible batches within the data. The method is available in the open-source R package NSA, which is an add-on to the aroma.cn framework (http://www.aroma-project.org/addons).
|
24
|
Zhang Z, Lange K, Sabatti C. Reconstructing DNA copy number by joint segmentation of multiple sequences. BMC Bioinformatics 2012; 13:205. [PMID: 22897923 PMCID: PMC3534631 DOI: 10.1186/1471-2105-13-205] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.2] [Received: 03/20/2012] [Accepted: 07/27/2012] [Indexed: 12/19/2022]
Abstract
Background: Variations in DNA copy number carry information on the modalities of genome evolution and mis-regulation of DNA replication in cancer cells. Their study can help localize tumor suppressor genes, distinguish different populations of cancerous cells, and identify genomic variations responsible for disease phenotypes. A number of different high-throughput technologies can be used to identify copy number variable sites, and the literature documents multiple effective algorithms. We focus here on the specific problem of detecting regions where variation in copy number is relatively common in the sample at hand. This problem encompasses the cases of copy number polymorphisms, related samples, technical replicates, and cancerous sub-populations from the same individual. Results: We present a segmentation method named generalized fused lasso (GFL) to reconstruct copy number variant regions. GFL is based on penalized estimation and is capable of processing multiple signals jointly. Our approach is computationally very attractive and leads to sensitivity and specificity levels comparable to those of state-of-the-art specialized methodologies. We illustrate its applicability with simulated and real data sets. Conclusions: The flexibility of our framework makes it applicable to data obtained with a wide range of technology. Its versatility and speed make GFL particularly useful in the initial screening stages of large data sets.
Affiliation(s)
- Zhongyang Zhang
  - Department of Statistics, University of California, Los Angeles, CA, USA
- Kenneth Lange
  - Department of Human Genetics, Biomathematics and Statistics, University of California, Los Angeles, CA, USA
- Chiara Sabatti
  - Department of Health Research and Policy and Statistics, Stanford University, Stanford, CA, USA
|
25
|
Mosén-Ansorena D, Aransay AM, Rodríguez-Ezpeleta N. Comparison of methods to detect copy number alterations in cancer using simulated and real genotyping data. BMC Bioinformatics 2012; 13:192. [PMID: 22870940 PMCID: PMC3472297 DOI: 10.1186/1471-2105-13-192] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.0] [Received: 11/28/2011] [Accepted: 06/30/2012] [Indexed: 01/29/2023]
Abstract
Background: The detection of genomic copy number alterations (CNA) in cancer based on SNP arrays requires methods that take into account tumour-specific factors such as normal cell contamination and tumour heterogeneity. A number of tools have been recently developed, but their performance has yet to be thoroughly assessed. To this end, a comprehensive model that integrates the factors of normal cell contamination and intra-tumour heterogeneity, and that can be translated to synthetic data on which to perform benchmarks, is indispensable. Results: We propose such a model and implement it in an R package called CnaGen to synthetically generate a wide range of alterations under different normal cell contamination levels. Six recently published methods for CNA and loss of heterozygosity (LOH) detection on tumour samples were assessed on this synthetic data and on a dilution series of a breast cancer cell line: ASCAT, GAP, GenoCNA, GPHMM, MixHMM and OncoSNP. We report the recall rates in terms of normal cell contamination levels and alteration characteristics (length, copy number and LOH state), as well as the false discovery rate distribution for each copy number under different normal cell contamination levels. The assessed methods are in general better at detecting alterations with low copy number and under low normal cell contamination levels. All methods except GPHMM, which failed to recognize the alteration pattern in the cell-line samples, provided similar results for the synthetic and cell-line sample sets. MixHMM and GenoCNA are the poorest-performing methods, while GAP generally performed better. This supports the viability of approaches other than the common hidden Markov model (HMM)-based ones. Conclusions: We devised and implemented a comprehensive model to generate data that simulate tumoural samples genotyped using SNP arrays. The validity of the model is supported by the similarity of the results obtained with synthetic and real data. Based on these results and on the software implementation of the methods, we recommend GAP for advanced users and GPHMM for a fully driven analysis.
Affiliation(s)
- David Mosén-Ansorena
  - Genome Analysis Platform, CIC bioGUNE-CIBERehd, Technologic Park of Bizkaia, Building 502, 48160 Derio, Spain.
|
26
|
Shen JJ, Zhang NR. Change-point model on nonhomogeneous Poisson processes with application in copy number profiling by next-generation DNA sequencing. Ann Appl Stat 2012. [DOI: 10.1214/11-aoas517] [Citation(s) in RCA: 40] [Impact Index Per Article: 3.1] [Indexed: 11/19/2022]
|
27
|
Ortiz-Estevez M, Aramburu A, Bengtsson H, Neuvial P, Rubio A. CalMaTe: a method and software to improve allele-specific copy number of SNP arrays for downstream segmentation. Bioinformatics 2012; 28:1793-4. [PMID: 22576175 DOI: 10.1093/bioinformatics/bts248] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.2] [Indexed: 02/06/2023]
Abstract
Summary: CalMaTe calibrates preprocessed allele-specific copy number estimates (ASCNs) from DNA microarrays by controlling for single-nucleotide polymorphism-specific allelic crosstalk. The resulting ASCNs are on average more accurate, which increases the power of segmentation methods for detecting changes between copy number states in tumor studies, including copy-neutral loss of heterozygosity. CalMaTe applies to any ASCNs regardless of preprocessing method and microarray technology, e.g. Affymetrix and Illumina. Availability: The method is available on CRAN (http://cran.r-project.org/) in the open-source R package calmate, which also includes an add-on to the Aroma Project framework (http://www.aroma-project.org/).
|
28
|
Rasmussen M, Sundström M, Göransson Kultima H, Botling J, Micke P, Birgisson H, Glimelius B, Isaksson A. Allele-specific copy number analysis of tumor samples with aneuploidy and tumor heterogeneity. Genome Biol 2011; 12:R108. [PMID: 22023820 PMCID: PMC3333778 DOI: 10.1186/gb-2011-12-10-r108] [Citation(s) in RCA: 75] [Impact Index Per Article: 5.4] [Received: 05/23/2011] [Revised: 09/08/2011] [Accepted: 10/24/2011] [Indexed: 12/15/2022]
Abstract
We describe a bioinformatic tool, Tumor Aberration Prediction Suite (TAPS), for the identification of allele-specific copy numbers in tumor samples using data from Affymetrix SNP arrays. It includes detailed visualization of genomic segment characteristics and iterative pattern recognition for copy number identification, and does not require patient-matched normal samples. TAPS can be used to identify chromosomal aberrations with high sensitivity even when the proportion of tumor cells is as low as 30%. Analysis of cancer samples indicates that TAPS is well suited to investigate samples with aneuploidy and tumor heterogeneity, which is commonly found in many types of solid tumors.
Affiliation(s)
- Markus Rasmussen
  - Science for Life Laboratory, Department of Medical Sciences, Uppsala University, Akademiska sjukhuset, SE-751 85 Uppsala, Sweden
|
29
|
Olshen AB, Bengtsson H, Neuvial P, Spellman PT, Olshen RA, Seshan VE. Parent-specific copy number in paired tumor-normal studies using circular binary segmentation. Bioinformatics 2011; 27:2038-46. [PMID: 21666266 DOI: 10.1093/bioinformatics/btr329] [Citation(s) in RCA: 94] [Impact Index Per Article: 6.7] [Indexed: 01/23/2023]
Abstract
Motivation: High-throughput techniques facilitate the simultaneous measurement of DNA copy number at hundreds of thousands of sites on a genome. Older techniques allow measurement only of total copy number, the sum of the copy number contributions from the two parental chromosomes. Newer single nucleotide polymorphism (SNP) techniques can in addition enable quantifying parent-specific copy number (PSCN). The raw data from such experiments are two-dimensional, but are unphased. Consequently, inference based on them necessitates development of new analytic methods. Methods: We have adapted and enhanced the circular binary segmentation (CBS) algorithm for this purpose with focus on paired test and reference samples. The essence of paired parent-specific CBS (Paired PSCBS) is to utilize the original CBS algorithm to identify regions of equal total copy number and then to further segment these regions where there have been changes in PSCN. For the final set of regions, calls are made of equal parental copy number and loss of heterozygosity (LOH). PSCN estimates are computed both before and after calling. Results: The methodology is evaluated by simulation and on glioblastoma data. In the simulation, PSCBS compares favorably to established methods. On the glioblastoma data, PSCBS identifies interesting genomic regions, such as copy-neutral LOH. Availability: The Paired PSCBS method is implemented in an open-source R package named PSCBS, available on CRAN (http://cran.r-project.org/).
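The two-step idea described in the abstract (segment on total copy number first, then re-segment each region on an allelic signal) can be sketched as follows. This is a toy illustration, not the PSCBS implementation: a single best mean-shift split stands in for CBS, the allelic signal `dh` is assumed to be a per-SNP allelic imbalance measure such as 2·|BAF − 0.5|, and the function names and thresholds are hypothetical.

```python
from statistics import mean


def best_split(values, min_size=2):
    """Return (index, gap) of the split maximizing |mean(left) - mean(right)|.

    Returns (None, 0.0) if the sequence is too short or perfectly flat.
    """
    best_idx, best_gap = None, 0.0
    for i in range(min_size, len(values) - min_size + 1):
        gap = abs(mean(values[:i]) - mean(values[i:]))
        if gap > best_gap:
            best_idx, best_gap = i, gap
    return best_idx, best_gap


def two_step_segments(tcn, dh, tcn_thresh=0.5, dh_thresh=0.2):
    """Step 1: split on total copy number (tcn).
    Step 2: split each resulting region on the allelic signal (dh).

    Returns half-open (start, end) index pairs. Thresholds are arbitrary
    illustration values, not calibrated settings.
    """
    def split(values, start, end, thresh):
        idx, gap = best_split(values[start:end])
        if idx is not None and gap >= thresh:
            return [(start, start + idx), (start + idx, end)]
        return [(start, end)]

    segments = []
    for start, end in split(tcn, 0, len(tcn), tcn_thresh):
        segments.extend(split(dh, start, end, dh_thresh))
    return segments
```

For example, with `tcn = [2]*6 + [3]*6` and `dh = [0.0]*3 + [0.9]*3 + [0.33]*6`, the first step separates the two total-copy-number levels, and the second step further splits the copy-number-2 region where the allelic signal jumps. That second split is the kind of copy-neutral change that total copy number alone cannot reveal.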
Affiliation(s)
- Adam B Olshen
  - Department of Epidemiology and Biostatistics, University of California, San Francisco, CA, USA.
|
30
|
Bengtsson H, Neuvial P, Speed TP. TumorBoost: normalization of allele-specific tumor copy numbers from a single pair of tumor-normal genotyping microarrays. BMC Bioinformatics 2010; 11:245. [PMID: 20462408 PMCID: PMC2894037 DOI: 10.1186/1471-2105-11-245] [Citation(s) in RCA: 41] [Impact Index Per Article: 2.7] [Received: 12/08/2009] [Accepted: 05/12/2010] [Indexed: 12/15/2022]
Abstract
Background: High-throughput genotyping microarrays assess both total DNA copy number and allelic composition, which makes them a tool of choice for copy number studies in cancer, including total copy number and loss of heterozygosity (LOH) analyses. Even after state-of-the-art preprocessing methods, allelic signal estimates from genotyping arrays still suffer from systematic effects that make them difficult to use effectively for such downstream analyses. Results: We propose a method, TumorBoost, for normalizing allelic estimates of one tumor sample based on estimates from a single matched normal. The method applies to any paired tumor-normal estimates from any microarray-based technology, combined with any preprocessing method. We demonstrate that it increases the signal-to-noise ratio of allelic signals, making it significantly easier to detect allelic imbalances. Conclusions: TumorBoost increases the power to detect somatic copy-number events (including copy-neutral LOH) in the tumor from allelic signals of Affymetrix or Illumina origin. We also conclude that high-precision allelic estimates can be obtained from a single pair of tumor-normal hybridizations, if TumorBoost is combined with single-array preprocessing methods such as (allele-specific) CRMA v2 for Affymetrix or BeadStudio's (proprietary) XY-normalization method for Illumina. A bounded-memory implementation is available in the open-source and cross-platform R package aroma.cn, which is part of the Aroma Project (http://www.aroma-project.org/).
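The abstract does not spell out the normalization rule, so the following is a hedged sketch of one way a matched normal could serve as a per-SNP reference for the tumor's allelic signal, not a reproduction of the aroma.cn implementation. It assumes allele B fractions in [0, 1], a crude genotype call from the normal using hypothetical cut-offs at 1/3 and 2/3, and that the normal's deviation from its ideal value (0, 0.5 or 1) reflects SNP-specific systematic bias shared with the tumor.

```python
def normalize_tumor_baf(beta_t, beta_n, het_band=(1 / 3, 2 / 3)):
    """Correct a tumor allele B fraction using the matched normal's value.

    beta_t: tumor allele B fraction at one SNP.
    beta_n: matched-normal allele B fraction at the same SNP.
    het_band: hypothetical cut-offs for calling the normal heterozygous.
    """
    low, high = het_band
    if beta_n < low:
        # Normal looks homozygous AA (ideal value 0): subtract its offset.
        return beta_t - beta_n
    if beta_n > high:
        # Normal looks homozygous BB (ideal value 1): subtract its offset.
        return beta_t - (beta_n - 1.0)
    # Normal looks heterozygous (ideal value 0.5): rescale piecewise so
    # that a tumor value equal to the normal value maps to exactly 0.5.
    if beta_t < beta_n:
        return 0.5 * beta_t / beta_n
    return 1.0 - 0.5 * (1.0 - beta_t) / (1.0 - beta_n)
```

In this sketch, a heterozygous SNP whose normal reads 0.6 because of probe bias maps a tumor reading of 0.6 back to 0.5 (no imbalance), while a tumor reading of 0.3 maps to 0.25. Removing the per-SNP offset in this way is what tightens the allelic signal around its true value and makes imbalances easier to detect.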
Affiliation(s)
- Henrik Bengtsson
  - Department of Statistics, University of California, Berkeley, USA
- Pierre Neuvial
  - Department of Statistics, University of California, Berkeley, USA
- Terence P Speed
  - Department of Statistics, University of California, Berkeley, USA
  - Bioinformatics Division, Walter & Eliza Hall Institute of Medical Research, Parkville, Australia
|