101
|
Clustering-Based Method for Developing a Genomic Copy Number Alteration Signature for Predicting the Metastatic Potential of Prostate Cancer. Journal of Probability and Statistics 2012; 2012:873570. [PMID: 25419216 DOI: 10.1155/2012/873570] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.2] [Indexed: 11/17/2022]
Abstract
The transition of cancer from a localized tumor to a distant metastasis is not well understood for prostate and many other cancers, partly because of the scarcity of tumor samples, especially metastases, from cancer patients with long-term clinical follow-up. To overcome this limitation, we developed a semi-supervised clustering method using the tumor genomic DNA copy number alterations to classify each patient into inferred clinical outcome groups of metastatic potential. Our data set comprised 294 primary tumors and 49 metastases from 5 independent cohorts of prostate cancer patients. The alterations were modeled based on Darwin's evolutionary selection theory, and the genes overlapping these altered genomic regions were used to develop a metastatic potential score for a prostate cancer primary tumor. The proteins encoded by some of the predictor genes promote escape from anoikis, a pathway of apoptosis that is deregulated in metastases. We evaluated the metastatic potential score with other clinical predictors available at diagnosis using a Cox proportional hazards model and show that our proposed score was the only significant predictor of metastasis-free survival. The metastasis gene signature and associated score could be applied directly to copy number alteration profiles from patient biopsies positive for prostate cancer.
|
102
|
Rippe RCA, Meulman JJ, Eilers PHC. Visualization of genomic changes by segmented smoothing using an L0 penalty. PLoS One 2012; 7:e38230. [PMID: 22679492 PMCID: PMC3367998 DOI: 10.1371/journal.pone.0038230] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.2] [Received: 12/19/2011] [Accepted: 05/05/2012] [Indexed: 11/22/2022]
Abstract
Copy number variations (CNV) and allelic imbalance in tumor tissue can show strong segmentation. Their graphical presentation can be enhanced by appropriate smoothing. Existing signal and scatterplot smoothers do not respect segmentation well. We present novel algorithms that use a penalty on the L(0) norm of differences of neighboring values. Visualization is our main goal, but we compare classification performance to that of VEGA.
Affiliation(s)
- Ralph C A Rippe
- Department of Clinical Epidemiology, Leiden University Medical Center, Leiden, The Netherlands.
|
103
|
Abstract
Genomic alterations have been linked to the development and progression of cancer. The technique of comparative genomic hybridization (CGH) yields data consisting of fluorescence intensity ratios of test and reference DNA samples. The intensity ratios provide information about the number of copies in DNA. Practical issues such as the contamination of tumor cells in tissue specimens and normalization errors necessitate the use of statistics for learning about the genomic alterations from array CGH data. As increasing amounts of array CGH data become available, there is a growing need for automated algorithms for characterizing genomic profiles. Specifically, there is a need for algorithms that can identify gains and losses in the number of copies based on statistical considerations, rather than merely detect trends in the data. We adopt a Bayesian approach, relying on the hidden Markov model to account for the inherent dependence in the intensity ratios. Posterior inferences are made about gains and losses in copy number. Localized amplifications (associated with oncogene mutations) and deletions (associated with mutations of tumor suppressors) are identified using posterior probabilities. Global trends such as extended regions of altered copy number are detected. Because the posterior distribution is analytically intractable, we implement a Metropolis-within-Gibbs algorithm for efficient simulation-based inference. Publicly available data on pancreatic adenocarcinoma, glioblastoma multiforme, and breast cancer are analyzed, and comparisons are made with some widely used algorithms to illustrate the reliability and success of the technique.
Affiliation(s)
- Subharup Guha
- Department of Statistics, University of Missouri-Columbia, Columbia, MO 65211
|
104
|
Magi A, Tattini L, Pippucci T, Torricelli F, Benelli M. Read count approach for DNA copy number variants detection. Bioinformatics 2011; 28:470-8. [DOI: 10.1093/bioinformatics/btr707] [Citation(s) in RCA: 61] [Impact Index Per Article: 4.4] [Indexed: 12/13/2022]
|
105
|
|
106
|
Jiang H, Zhu ZZ, Yu Y, Lin S, Hou L. Improved Statistical Analysis for Array CGH-Based DNA Copy Number Aberrations. Cancer Inform 2011; 10:249-58. [PMID: 22084565 PMCID: PMC3212864 DOI: 10.4137/cin.s8019] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/12/2022]
Abstract
Array-based comparative genomic hybridization (aCGH) allows measuring DNA copy number at the whole genome scale. In cancer studies, one may be interested in identifying DNA copy number aberrations (CNAs) associated with certain clinicopathological characteristics such as cancer metastasis. We proposed to define test regions based on copy number pattern profiles across multiple samples, using either smoothed log2-ratio or discrete data of copy number gain/loss calls. Association tests performed on the refined test regions instead of the probes have improved power because fewer tests are performed. We also compared three types of measurement of copy number levels (normalized log2-ratio, smoothed log2-ratio, and copy number gain or loss calls) in statistical hypothesis testing. The relative strengths and weaknesses of the proposed method were demonstrated using both simulation studies and real data analysis of a liver cancer study.
Affiliation(s)
- Hongmei Jiang
- Department of Statistics, Northwestern University, 2006 Sheridan Road, Evanston, IL 60208, USA
|
107
|
Mathiesen RR, Fjelldal R, Liestøl K, Due EU, Geigl JB, Riethdorf S, Borgen E, Rye IH, Schneider IJ, Obenauf AC, Mauermann O, Nilsen G, Christian Lingjaerde O, Børresen-Dale AL, Pantel K, Speicher MR, Naume B, Baumbusch LO. High-resolution analyses of copy number changes in disseminated tumor cells of patients with breast cancer. Int J Cancer 2011; 131:E405-15. [PMID: 21935921 DOI: 10.1002/ijc.26444] [Citation(s) in RCA: 46] [Impact Index Per Article: 3.3] [Received: 04/22/2011] [Accepted: 09/02/2011] [Indexed: 12/13/2022]
Abstract
The presence of disseminated tumor cells (DTCs) in bone marrow (BM) identifies breast cancer patients with less favorable outcome. Furthermore, molecular characterization is required to investigate the malignant potential of these cells. This study presents a single-cell array comparative genomic hybridization (SCaCGH) method providing molecular analysis of immunomorphologically detected DTCs. The resolution limit of the method was estimated using the cancer cell line SK-BR-3 on 44 and 244k arrays. The technique was further tested on 28 circulating tumor cells and four hematopoietic cells (HCs) from peripheral blood (n = 8 patients). The SCaCGH method was finally applied to 24 DTCs, three immunopositive cells morphologically classified as probable HCs from breast cancer patients and five HC controls from BM (n = 7 patients plus n = 1 healthy donor). The frequency of copy number changes of the DTCs revealed similarities with primary breast tumor samples. Three of the patients had available profiles for DTCs and the corresponding tumor tissue from primary surgery. More than two-thirds of the analyzed DTCs disclosed equivalent changes, both to each other and to the corresponding primary disease, whereas the rest of the cells showed balanced profiles. The probable HCs revealed either balanced profiles (n = 2) or changes comparable to the tumor tissue and DTCs (n = 1), indicating morphological overlap between HCs and DTCs. Similar aberration patterns were visible in DTCs collected at diagnosis and at 3 years relapse-free follow-up. SCaCGH may be a powerful tool for the molecular characterization of DTCs.
Affiliation(s)
- Randi R Mathiesen
- Department of Genetics, Oslo University Hospital Radiumhospitalet, Oslo, Norway
|
108
|
Copy number variation detection in whole-genome sequencing data using the Bayesian information criterion. Proc Natl Acad Sci U S A 2011; 108:E1128-36. [PMID: 22065754 DOI: 10.1073/pnas.1110574108] [Citation(s) in RCA: 172] [Impact Index Per Article: 12.3] [Indexed: 01/19/2023]
Abstract
DNA copy number variations (CNVs) play an important role in the pathogenesis and progression of cancer and confer susceptibility to a variety of human disorders. Array comparative genomic hybridization has been used widely to identify CNVs genome wide, but the next-generation sequencing technology provides an opportunity to characterize CNVs genome wide with unprecedented resolution. In this study, we developed an algorithm to detect CNVs from whole-genome sequencing data and applied it to a newly sequenced glioblastoma genome with a matched control. This read-depth algorithm, called BIC-seq, can accurately and efficiently identify CNVs via minimizing the Bayesian information criterion. Using BIC-seq, we identified hundreds of CNVs as small as 40 bp in the cancer genome sequenced at 10× coverage, whereas we could only detect large CNVs (> 15 kb) in the array comparative genomic hybridization profiles for the same genome. Eighty percent (14/16) of the small variants tested (110 bp to 14 kb) were experimentally validated by quantitative PCR, demonstrating high sensitivity and true positive rate of the algorithm. We also extended the algorithm to detect recurrent CNVs in multiple samples as well as deriving error bars for breakpoints using a Gibbs sampling approach. We propose this statistical approach as a principled yet practical and efficient method to estimate CNVs in whole-genome sequencing data.
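The idea of segmenting by minimizing the Bayesian information criterion can be illustrated with a toy sketch. It is not the BIC-seq implementation: the Bernoulli tumor-versus-control read model, the greedy neighbor-merging strategy, the tuning parameter `lam`, and the function names `log_lik` and `merge_by_bic` are all assumptions made for illustration.

```python
from math import log


def log_lik(tumor, control):
    """Bernoulli log-likelihood of one segment at its own tumor-read fraction (assumed model)."""
    n = tumor + control
    ll = 0.0
    if tumor:
        ll += tumor * log(tumor / n)
    if control:
        ll += control * log(control / n)
    return ll


def merge_by_bic(bins, lam=1.0):
    """Greedily merge neighboring (tumor, control) bins while a merge lowers the BIC.

    BIC = -2 * log-likelihood + lam * (number of segments) * log(total reads).
    """
    segs = [tuple(b) for b in bins]
    total = sum(t + c for t, c in segs)
    if total == 0:
        return segs
    penalty = lam * log(total)
    while len(segs) > 1:
        best_delta, best_i = 0.0, None
        for i in range(len(segs) - 1):
            (t1, c1), (t2, c2) = segs[i], segs[i + 1]
            # Change in BIC if segments i and i+1 are merged into one.
            fit_loss = 2.0 * (log_lik(t1, c1) + log_lik(t2, c2)
                              - log_lik(t1 + t2, c1 + c2))
            delta = fit_loss - penalty
            if delta < best_delta:
                best_delta, best_i = delta, i
        if best_i is None:  # no merge reduces the BIC any further
            break
        (t1, c1), (t2, c2) = segs[best_i], segs[best_i + 1]
        segs[best_i:best_i + 2] = [(t1 + t2, c1 + c2)]
    return segs


if __name__ == "__main__":
    # Hypothetical read counts per bin: bins 4 and 5 carry a gain in the tumor.
    bins = [(50, 50), (52, 48), (49, 51), (90, 10), (88, 12), (51, 49)]
    print(merge_by_bic(bins))
```

Each merge removes one segment from the penalty term but can only worsen the fit, so neighbors are merged only when their tumor-read fractions are close; a larger `lam` gives coarser segmentations.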
|
109
|
Mahmud MP, Schliep A. Fast MCMC sampling for hidden Markov Models to determine copy number variations. BMC Bioinformatics 2011; 12:428. [PMID: 22047014 PMCID: PMC3371636 DOI: 10.1186/1471-2105-12-428] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.5] [Received: 02/17/2011] [Accepted: 11/02/2011] [Indexed: 11/10/2022]
Abstract
BACKGROUND Hidden Markov Models (HMM) are often used for analyzing Comparative Genomic Hybridization (CGH) data to identify chromosomal aberrations or copy number variations by segmenting observation sequences. For efficiency reasons the parameters of a HMM are often estimated with maximum likelihood and a segmentation is obtained with the Viterbi algorithm. This introduces considerable uncertainty in the segmentation, which can be avoided with Bayesian approaches integrating out parameters using Markov Chain Monte Carlo (MCMC) sampling. While the advantages of Bayesian approaches have been clearly demonstrated, the likelihood based approaches are still preferred in practice for their lower running times; datasets coming from high-density arrays and next generation sequencing amplify these problems. RESULTS We propose an approximate sampling technique, inspired by compression of discrete sequences in HMM computations and by kd-trees to leverage spatial relations between data points in typical data sets, to speed up the MCMC sampling. CONCLUSIONS We test our approximate sampling method on simulated and biological ArrayCGH datasets and high-density SNP arrays, and demonstrate speed-ups of 10 to 60 and 90, respectively, while achieving competitive results with the state-of-the-art Bayesian approaches. AVAILABILITY An implementation of our method will be made available as part of the open source GHMM library from http://ghmm.org.
Affiliation(s)
- Md Pavel Mahmud
- Department of Computer Science, Rutgers University, 110 Frelinghuysen Road, Piscataway, NJ 08854, USA.
|
110
|
Holcomb IN, Trask BJ. Comparative genomic hybridization to detect variation in the copy number of large DNA segments. Cold Spring Harb Protoc 2011; 2011:1323-1333. [PMID: 22046040 DOI: 10.1101/pdb.top066589] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 05/31/2023]
Abstract
Array comparative genomic hybridization (CGH) is an excellent tool to scan the genome for copy number variations (CNVs) when used conscientiously. This article is intended to provide an understanding of the basic principles of array CGH and the different options available to the user to design their array CGH experiments. Specifically, the six subsections discuss the different array platforms available, test and reference DNA preparation, reference DNA choice, the basics of hybridization, data processing, and our current understanding of CNVs in the human genome.
|
111
|
Park C, Ahn J, Yoon Y, Park S. A multi-sample based method for identifying common CNVs in normal human genomic structure using high-resolution aCGH data. PLoS One 2011; 6:e26975. [PMID: 22073121 PMCID: PMC3205051 DOI: 10.1371/journal.pone.0026975] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.3] [Received: 05/21/2011] [Accepted: 10/07/2011] [Indexed: 01/08/2023]
Abstract
BACKGROUND It is difficult to identify copy number variations (CNV) in normal human genomic data due to noise and non-linear relationships between different genomic regions and signal intensity. A high-resolution array comparative genomic hybridization (aCGH) containing 42 million probes, which is very large compared to previous arrays, was recently published. Most existing CNV detection algorithms do not work well because of noise associated with the large amount of input data and because most of the current methods were not designed to analyze normal human samples. Normal human genome analysis often requires a joint approach across multiple samples. However, the majority of existing methods can only identify CNVs from a single sample. METHODOLOGY AND PRINCIPAL FINDINGS We developed a multi-sample-based genomic variations detector (MGVD) that uses segmentation to identify common breakpoints across multiple samples and a k-means-based clustering strategy. Unlike previous methods, MGVD simultaneously considers multiple samples with different genomic intensities and identifies CNVs and CNV zones (CNVZs); CNVZ is a more precise measure of the location of a genomic variant than the CNV region (CNVR). CONCLUSIONS AND SIGNIFICANCE We designed a specialized algorithm to detect common CNVs from extremely high-resolution multi-sample aCGH data. MGVD showed high sensitivity and a low false discovery rate for a simulated data set, and outperformed most current methods when real, high-resolution HapMap datasets were analyzed. MGVD also had the fastest runtime compared to the other algorithms evaluated when actual, high-resolution aCGH data were analyzed. The CNVZs identified by MGVD can be used in association studies for revealing relationships between phenotypes and genomic aberrations. Our algorithm was developed with standard C++ and is available in Linux and MS Windows format in the STL library. It is freely available at: http://embio.yonsei.ac.kr/~Park/mgvd.php.
Affiliation(s)
- Chihyun Park
- Department of Computer Science, Yonsei University, Seoul, South Korea
- Jaegyoon Ahn
- Department of Computer Science, Yonsei University, Seoul, South Korea
- Youngmi Yoon
- Division of Information Engineering, Gachon University of Medicine and Science, Incheon, South Korea
- Sanghyun Park
- Department of Computer Science, Yonsei University, Seoul, South Korea
|
112
|
Hsu FH, Chen HIH, Tsai MH, Lai LC, Huang CC, Tu SH, Chuang EY, Chen Y. A model-based circular binary segmentation algorithm for the analysis of array CGH data. BMC Res Notes 2011; 4:394. [PMID: 21985277 PMCID: PMC3224564 DOI: 10.1186/1756-0500-4-394] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.6] [Received: 06/02/2011] [Accepted: 10/10/2011] [Indexed: 12/22/2022]
Abstract
Background Circular Binary Segmentation (CBS) is a permutation-based algorithm for array Comparative Genomic Hybridization (aCGH) data analysis. CBS accurately segments data by detecting change-points using a maximal-t test; but extensive computational burden is involved for evaluating the significance of change-points using permutations. A recent implementation utilizing a hybrid method and early stopping rules (hybrid CBS) to improve the performance in speed was subsequently proposed. However, a time analysis revealed that a major portion of computation time of the hybrid CBS was still spent on permutation. In addition, what the hybrid method provides is an approximation of the significance upper bound or lower bound, not an approximation of the significance of change-points itself. Results We developed a novel model-based algorithm, extreme-value based CBS (eCBS), which limits permutations and provides robust results without loss of accuracy. Thousands of aCGH data under null hypothesis were simulated in advance based on a variety of non-normal assumptions, and the corresponding maximal-t distribution was modeled by the Generalized Extreme Value (GEV) distribution. The modeling results, which associate characteristics of aCGH data to the GEV parameters, constitute lookup tables (eXtreme model). Using the eXtreme model, the significance of change-points could be evaluated in a constant time complexity through a table lookup process. Conclusions A novel algorithm, eCBS, was developed in this study. The current implementation of eCBS consistently outperforms the hybrid CBS 4× to 20× in computation time without loss of accuracy. Source codes, supplementary materials, supplementary figures, and supplementary tables can be found at http://ntumaps.cgm.ntu.edu.tw/eCBSsupplementary.
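The maximal-t statistic that CBS scans for can be sketched as follows. This is a simplified illustration, not the DNAcopy or eCBS code: it uses a brute-force O(n²) search, does not wrap arcs around the end of the array, uses a pooled-variance two-sample t statistic, and omits the significance step entirely (permutations in CBS, the GEV lookup in eCBS). The function name `max_t_arc` and the example data are hypothetical.

```python
from itertools import accumulate
from math import sqrt


def max_t_arc(x):
    """Return (t, i, j): the largest |t| comparing x[i:j] against all remaining values."""
    n = len(x)
    s = [0.0] + list(accumulate(x))                   # prefix sums
    sq = [0.0] + list(accumulate(v * v for v in x))   # prefix sums of squares
    best = (0.0, 0, 0)
    if n < 3:
        return best
    for i in range(n):
        for j in range(i + 1, n + 1):
            k = j - i          # points inside the candidate segment
            m = n - k          # points outside it
            if m == 0:
                continue
            sum_in = s[j] - s[i]
            mean_in = sum_in / k
            mean_out = (s[n] - sum_in) / m
            ss_in = (sq[j] - sq[i]) - k * mean_in ** 2
            ss_out = (sq[n] - (sq[j] - sq[i])) - m * mean_out ** 2
            pooled_var = (ss_in + ss_out) / (n - 2)
            if pooled_var <= 0.0:
                continue       # degenerate (noise-free) split, skipped in this sketch
            t = abs(mean_in - mean_out) / sqrt(pooled_var * (1.0 / k + 1.0 / m))
            if t > best[0]:
                best = (t, i, j)
    return best


if __name__ == "__main__":
    # Hypothetical log2-ratios with a gained segment at indices 4..7.
    ratios = [0.1, -0.1, 0.0, 0.1, 2.0, 2.1, 1.9, 2.0, -0.1, 0.0, 0.1, -0.1]
    print(max_t_arc(ratios))
```

The returned statistic would still have to be judged against a null distribution; that evaluation is the step the abstract describes as the computational bottleneck.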
Affiliation(s)
- Fang-Han Hsu
- Graduate Institute of Biomedical Electronics and Bioinformatics, Department of Electrical Engineering, National Taiwan University, Taipei 106, Taiwan.
|
113
|
Presson AP, Kim N, Xiaofei Y, Chen IS, Kim S. Methodology and software to detect viral integration site hot-spots. BMC Bioinformatics 2011; 12:367. [PMID: 21914224 PMCID: PMC3203353 DOI: 10.1186/1471-2105-12-367] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.7] [Received: 04/05/2011] [Accepted: 09/14/2011] [Indexed: 11/17/2022]
Abstract
Background Modern gene therapy methods have limited control over where a therapeutic viral vector inserts into the host genome. Vector integration can activate local gene expression, which can cause cancer if the vector inserts near an oncogene. Viral integration hot-spots or 'common insertion sites' (CIS) are scrutinized to evaluate and predict patient safety. CIS are typically defined by a minimum density of insertions (such as 2-4 within a 30-100 kb region), which unfortunately depends on the total number of observed viral integration sites (VIS). This is problematic for comparing hot-spot distributions across data sets and patients, where the VIS numbers may vary. Results We develop two new methods for defining hot-spots that are relatively independent of data set size. Both methods operate on distributions of VIS across consecutive 1 Mb 'bins' of the genome. The first method 'z-threshold' tallies the number of VIS per bin, converts these counts to z-scores, and applies a threshold to define high density bins. The second method 'BCP' applies a Bayesian change-point model to the z-scores to define hot-spots. The novel hot-spot methods are compared with a conventional CIS method using simulated data sets and data sets from five published human studies, including the X-linked ALD (adrenoleukodystrophy), CGD (chronic granulomatous disease) and SCID-X1 (X-linked severe combined immunodeficiency) trials. The BCP analysis of the human X-linked ALD data for two patients separately (774 and 1627 VIS) and combined (2401 VIS) resulted in 5-6 hot-spots covering 0.17-0.251% of the genome and containing 5.56-7.74% of the total VIS. In comparison, the CIS analysis resulted in 12-110 hot-spots covering 0.018-0.246% of the genome and containing 5.81-22.7% of the VIS, corresponding to a greater number of hot-spots as the data set size increased. Our hot-spot methods enable one to evaluate the extent of VIS clustering, and formally compare data sets in terms of hot-spot overlap.
Finally, we show that the BCP hot-spots from the repopulating samples coincide with greater gene and CpG island density than the median genome density. Conclusions The z-threshold and BCP methods are useful for comparing hot-spot patterns across data sets of disparate sizes. The methodology and software provided here should enable one to study hot-spot conservation across a variety of VIS data sets and evaluate vector safety for gene therapy trials.
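The 'z-threshold' steps described in the abstract (tally VIS per 1 Mb bin, convert counts to z-scores, apply a threshold) can be sketched in a few lines. This is a minimal illustration under assumptions, not the authors' software: it treats the genome as a single coordinate axis, standardizes with the population standard deviation over all bins, and the default cutoff of 3.0 and the function name `z_threshold_hotspots` are hypothetical.

```python
from collections import Counter
from statistics import mean, pstdev

BIN_SIZE = 1_000_000  # 1 Mb bins, as stated in the abstract


def z_threshold_hotspots(positions, genome_length, z_cutoff=3.0):
    """Return indices of bins whose integration-site count has a z-score above z_cutoff."""
    n_bins = -(-genome_length // BIN_SIZE)            # ceiling division
    counts = Counter(pos // BIN_SIZE for pos in positions)
    per_bin = [counts.get(i, 0) for i in range(n_bins)]
    mu = mean(per_bin)
    sigma = pstdev(per_bin)
    if sigma == 0:
        return []                                     # all bins identical: no hot-spots
    return [i for i, c in enumerate(per_bin) if (c - mu) / sigma > z_cutoff]


if __name__ == "__main__":
    # Hypothetical data: one site per bin across 10 Mb, plus 20 extra sites in bin 4.
    sites = list(range(500_000, 10_000_000, 1_000_000))
    sites += [4_200_000 + k for k in range(20)]
    print(z_threshold_hotspots(sites, 10_000_000, z_cutoff=2.0))
```

Because the counts are standardized before thresholding, the cutoff is expressed relative to the spread of the data set at hand rather than as an absolute number of insertions, which is the property the abstract highlights for comparing data sets of different sizes.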
Affiliation(s)
- Angela P Presson
- Department of Biostatistics, University of California Los Angeles, School of Public Health, USA.
|
114
|
Single-cell copy number variation detection. Genome Biol 2011; 12:R80. [PMID: 21854607 PMCID: PMC3245619 DOI: 10.1186/gb-2011-12-8-r80] [Citation(s) in RCA: 32] [Impact Index Per Article: 2.3] [Received: 06/06/2011] [Revised: 08/09/2011] [Accepted: 08/19/2011] [Indexed: 12/15/2022]
Abstract
Detection of chromosomal aberrations from a single cell by array comparative genomic hybridization (single-cell array CGH), instead of from a population of cells, is an emerging technique. However, such detection is challenging because of the genome artifacts and the DNA amplification process inherent to the single cell approach. Current normalization algorithms result in inaccurate aberration detection for single-cell data. We propose a normalization method based on channel, genome composition and recurrent genome artifact corrections. We demonstrate that the proposed channel clone normalization significantly improves the copy number variation detection in both simulated and real single-cell array CGH data.
|
115
|
Stamoulis C, Betensky RA. A novel signal processing approach for the detection of copy number variations in the human genome. Bioinformatics 2011; 27:2338-45. [PMID: 21752800 DOI: 10.1093/bioinformatics/btr402] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.2] [Indexed: 11/12/2022]
Abstract
MOTIVATION Human genomic variability occurs at different scales, from single nucleotide polymorphisms (SNPs) to large DNA segments. Copy number variations (CNVs) represent a significant part of our genetic heterogeneity and have also been associated with many diseases and disorders. Short, localized CNVs, which may play an important role in human disease, may be undetectable in noisy genomic data. Therefore, robust methodologies are needed for their detection. Furthermore, for meaningful identification of pathological CNVs, estimation of normal allelic aberrations is necessary. RESULTS We developed a signal processing-based methodology for sequence denoising followed by pattern matching, to increase SNR in genomic data and improve CNV detection. We applied this signal-decomposition-matched filtering (SDMF) methodology to 429 normal genomic sequences, and compared detected CNVs to those in the Database of Genomic Variants. SDMF successfully detected a significant number of previously identified CNVs with frequencies of occurrence ≥10%, as well as unreported short CNVs. Its performance was also compared to circular binary segmentation (CBS) through simulations. SDMF had a significantly lower false detection rate and was significantly faster than CBS, an important advantage for handling large datasets generated with high-resolution arrays. By focusing on improving SNR (instead of the robustness of the detection algorithm), SDMF is a very promising methodology for identifying CNVs at all genomic spatial scales. AVAILABILITY The data are available at http://tcga-data.nci.nih.gov/tcga/ The software and list of analyzed sequence IDs are available at http://www.hsph.harvard.edu/~betensky/ A Matlab code for Empirical Mode Decomposition may be found at: http://www.clear.rice.edu/elec301/Projects02/empiricalMode/code.html CONTACT caterina@mit.edu.
Affiliation(s)
- Catherine Stamoulis
- Department of Radiology, Harvard School of Public Health, Boston, MA 02115, USA.
|
116
|
Dalmasso C, Broët P. Detection of chromosomal abnormalities using high resolution arrays in clinical cancer research. J Biomed Inform 2011; 44:936-42. [PMID: 21703362 DOI: 10.1016/j.jbi.2011.06.003] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.1] [Received: 08/24/2010] [Revised: 05/11/2011] [Accepted: 06/06/2011] [Indexed: 01/15/2023]
Abstract
In clinical cancer research, high throughput genomic technologies are increasingly used to identify copy number aberrations. However, the admixture of tumor and stromal cells and the inherent karyotypic heterogeneity of most of the solid tumor samples make this task highly challenging. Here, we propose a robust two-step strategy to detect copy number aberrations in such a context. A spatial mixture model is first used to fit the preprocessed data. Then, a calling algorithm is applied to classify the genomic segments in three biologically meaningful states (copy loss, copy gain and modal copy). The results of a simulation study show the good properties of the proposed procedure with complex patterns of genomic aberrations. The interest of the proposed procedure in clinical cancer research is then illustrated by the analysis of real lung adenocarcinoma samples.
Affiliation(s)
- Cyril Dalmasso
- Genome Institute of Singapore, 60 Biopolis Street, 02-01 Genome, Singapore.
|
117
|
Olshen AB, Bengtsson H, Neuvial P, Spellman PT, Olshen RA, Seshan VE. Parent-specific copy number in paired tumor-normal studies using circular binary segmentation. Bioinformatics 2011; 27:2038-46. [PMID: 21666266 DOI: 10.1093/bioinformatics/btr329] [Citation(s) in RCA: 94] [Impact Index Per Article: 6.7] [Indexed: 01/23/2023]
Abstract
MOTIVATION High-throughput techniques facilitate the simultaneous measurement of DNA copy number at hundreds of thousands of sites on a genome. Older techniques allow measurement only of total copy number, the sum of the copy number contributions from the two parental chromosomes. Newer single nucleotide polymorphism (SNP) techniques can in addition enable quantifying parent-specific copy number (PSCN). The raw data from such experiments are two-dimensional, but are unphased. Consequently, inference based on them necessitates development of new analytic methods. METHODS We have adapted and enhanced the circular binary segmentation (CBS) algorithm for this purpose with focus on paired test and reference samples. The essence of paired parent-specific CBS (Paired PSCBS) is to utilize the original CBS algorithm to identify regions of equal total copy number and then to further segment these regions where there have been changes in PSCN. For the final set of regions, calls are made of equal parental copy number and loss of heterozygosity (LOH). PSCN estimates are computed both before and after calling. RESULTS The methodology is evaluated by simulation and on glioblastoma data. In the simulation, PSCBS compares favorably to established methods. On the glioblastoma data, PSCBS identifies interesting genomic regions, such as copy-neutral LOH. AVAILABILITY The Paired PSCBS method is implemented in an open-source R package named PSCBS, available on CRAN (http://cran.r-project.org/).
Affiliation(s)
- Adam B Olshen
- Department of Epidemiology and Biostatistics, University of California, San Francisco, CA, USA.
|
118
|
Nowak G, Hastie T, Pollack JR, Tibshirani R. A fused lasso latent feature model for analyzing multi-sample aCGH data. Biostatistics 2011; 12:776-91. [PMID: 21642389 DOI: 10.1093/biostatistics/kxr012] [Citation(s) in RCA: 43] [Impact Index Per Article: 3.1] [Indexed: 11/13/2022]
Abstract
Array-based comparative genomic hybridization (aCGH) enables the measurement of DNA copy number across thousands of locations in a genome. The main goals of analyzing aCGH data are to identify the regions of copy number variation (CNV) and to quantify the amount of CNV. Although there are many methods for analyzing single-sample aCGH data, the analysis of multi-sample aCGH data is a relatively new area of research. Further, many of the current approaches for analyzing multi-sample aCGH data do not appropriately utilize the additional information present in the multiple samples. We propose a procedure called the Fused Lasso Latent Feature Model (FLLat) that provides a statistical framework for modeling multi-sample aCGH data and identifying regions of CNV. The procedure involves modeling each sample of aCGH data as a weighted sum of a fixed number of features. Regions of CNV are then identified through an application of the fused lasso penalty to each feature. Some simulation analyses show that FLLat outperforms single-sample methods when the simulated samples share common information. We also propose a method for estimating the false discovery rate. An analysis of an aCGH data set obtained from human breast tumors, focusing on chromosomes 8 and 17, shows that FLLat and Significance Testing of Aberrant Copy number (an alternative, existing approach) identify similar regions of CNV that are consistent with previous findings. However, through the estimated features and their corresponding weights, FLLat is further able to discern specific relationships between the samples, for example, identifying 3 distinct groups of samples based on their patterns of CNV for chromosome 17.
Affiliation(s)
- Gen Nowak
- Department of Biostatistics, Harvard University, Boston, MA 02115, USA.
|
119
|
|
120
|
Ryba T, Battaglia D, Pope BD, Hiratani I, Gilbert DM. Genome-scale analysis of replication timing: from bench to bioinformatics. Nat Protoc 2011; 6:870-95. [PMID: 21637205 PMCID: PMC3111951 DOI: 10.1038/nprot.2011.328] [Citation(s) in RCA: 83] [Impact Index Per Article: 5.9] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/09/2022]
Abstract
Replication timing profiles are cell type-specific and reflect genome organization changes during differentiation. In this protocol, we describe how to analyze genome-wide replication timing (RT) in mammalian cells. Asynchronously cycling cells are pulse labeled with the nucleotide analog 5-bromo-2-deoxyuridine (BrdU) and sorted into S-phase fractions on the basis of DNA content using flow cytometry. BrdU-labeled DNA from each fraction is immunoprecipitated, amplified, differentially labeled and co-hybridized to a whole-genome comparative genomic hybridization microarray, which is currently more cost-effective than high-throughput sequencing and equally capable of resolving features at the biologically relevant level of tens to hundreds of kilobases. We also present a guide to analyzing the resulting data sets based on methods we use routinely. Subjects include normalization, scaling and data quality measures, LOESS (local polynomial) smoothing of RT values, segmentation of data into domains and assignment of timing values to gene promoters. Finally, we cover clustering methods and means to relate changes in the replication program to gene expression and other genetic and epigenetic data sets. Some experience with R or similar programming languages is assumed. Altogether, the protocol takes ∼3 weeks per batch of samples.
Affiliation(s)
- Tyrone Ryba
- Department of Biological Science, Florida State University, Tallahassee, FL 32306
- Dana Battaglia
- Department of Biological Science, Florida State University, Tallahassee, FL 32306
- Benjamin D. Pope
- Department of Biological Science, Florida State University, Tallahassee, FL 32306
- Ichiro Hiratani
- Biological Macromolecules Laboratory, National Institute of Genetics, Japan
- David M. Gilbert
- Department of Biological Science, Florida State University, Tallahassee, FL 32306
|
121
|
Siegmund D, Yakir B, Zhang NR. Detecting simultaneous variant intervals in aligned sequences. Ann Appl Stat 2011. [DOI: 10.1214/10-aoas400] [Citation(s) in RCA: 49] [Impact Index Per Article: 3.5] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/19/2022]
|
122
|
Abstract
Most existing methods for identifying aberrant regions with array CGH data are confined to a single target sample. Focusing on the comparison of multiple samples from two different groups, we develop a new penalized regression approach with a fused adaptive lasso penalty to accommodate the spatial dependence of the clones. The nonrandom aberrant genomic segments are determined by assessing the significance of the differences between neighboring clones and neighboring segments. The algorithm proposed in this article is a first attempt to simultaneously detect the common aberrant regions within each group, and the regions where the two groups differ in copy number changes. The simulation study suggests that the proposed procedure outperforms the commonly used single-sample aberration detection methods for segmentation in terms of both false positives and false negatives. To further assess the value of the proposed method, we analyze a data set from a study that identified the aberrant genomic regions associated with grade subgroups of breast cancer tumors.
Affiliation(s)
- Huixia Judy Wang
- Department of Statistics, North Carolina State University, Raleigh, North Carolina 27695, USA.
|
123
|
Eckel-Passow JE, Atkinson EJ, Maharjan S, Kardia SLR, de Andrade M. Software comparison for evaluating genomic copy number variation for Affymetrix 6.0 SNP array platform. BMC Bioinformatics 2011; 12:220. [PMID: 21627824 PMCID: PMC3146450 DOI: 10.1186/1471-2105-12-220] [Citation(s) in RCA: 50] [Impact Index Per Article: 3.6] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/22/2010] [Accepted: 05/31/2011] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Copy number data are routinely being extracted from genome-wide association study chips using a variety of software. We empirically evaluated and compared four freely-available software packages designed for Affymetrix SNP chips to estimate copy number: Affymetrix Power Tools (APT), Aroma.Affymetrix, PennCNV and CRLMM. Our evaluation used 1,418 GENOA samples that were genotyped on the Affymetrix Genome-Wide Human SNP Array 6.0. We compared bias and variance in the locus-level copy number data, the concordance amongst regions of copy number gains/deletions and the false-positive rate amongst deleted segments. RESULTS APT had median locus-level copy numbers closest to a value of two, whereas PennCNV and Aroma.Affymetrix had the smallest variability associated with the median copy number. Of those evaluated, only PennCNV provides copy number specific quality-control metrics and identified 136 poor CNV samples. Regions of copy number variation (CNV) were detected using the hidden Markov models provided within PennCNV and CRLMM/VanillaIce. PennCNV detected more CNVs than CRLMM/VanillaIce; the median number of CNVs detected per sample was 39 and 30, respectively. PennCNV detected most of the regions that CRLMM/VanillaIce did as well as additional CNV regions. The median concordance between PennCNV and CRLMM/VanillaIce was 47.9% for duplications and 51.5% for deletions. The estimated false-positive rate associated with deletions was similar for PennCNV and CRLMM/VanillaIce. CONCLUSIONS If the objective is to perform statistical tests on the locus-level copy number data, our empirical results suggest that PennCNV or Aroma.Affymetrix is optimal. If the objective is to perform statistical tests on the summarized segmented data then PennCNV would be preferred over CRLMM/VanillaIce. Specifically, PennCNV allows the analyst to estimate locus-level copy number, perform segmentation and evaluate CNV-specific quality-control metrics within a single software package. 
PennCNV has relatively small bias, small variability and detects more regions while maintaining a similar estimated false-positive rate as CRLMM/VanillaIce. More generally, we advocate that software developers need to provide guidance with respect to evaluating and choosing optimal settings in order to obtain optimal results for an individual dataset. Until such guidance exists, we recommend trying multiple algorithms, evaluating concordance/discordance and subsequently consider the union of regions for downstream association tests.
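The closing recommendation (run several callers, then take the union of regions) can be sketched in a few lines of Python. This is a minimal illustration, not the authors' procedure: the `(chromosome, start, end)` tuple layout, the closed-interval convention, the function name `union_of_regions` and the example coordinates are all assumptions made here.

```python
from typing import Iterable, List, Tuple

# Hypothetical region layout: (chromosome, start, end), closed interval.
Region = Tuple[str, int, int]


def union_of_regions(*call_sets: Iterable[Region]) -> List[Region]:
    """Merge CNV regions from several callers into non-overlapping union regions."""
    # Pool all calls and sort by chromosome, then start, then end.
    regions = sorted(r for calls in call_sets for r in calls)
    merged: List[Region] = []
    for chrom, start, end in regions:
        if merged and merged[-1][0] == chrom and start <= merged[-1][2]:
            # Overlaps (or touches) the previous region on the same chromosome:
            # extend that region instead of starting a new one.
            prev_chrom, prev_start, prev_end = merged[-1]
            merged[-1] = (prev_chrom, prev_start, max(prev_end, end))
        else:
            merged.append((chrom, start, end))
    return merged


# Made-up calls from two callers, for illustration only.
penncnv_calls = [("chr1", 100, 200), ("chr2", 50, 80)]
vanillaice_calls = [("chr1", 150, 260), ("chr1", 400, 450)]
union = union_of_regions(penncnv_calls, vanillaice_calls)
```

Taking the union favors sensitivity; an intersection would instead keep only the regions that every caller agrees on, which is the more conservative choice when false positives are the main concern.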
Affiliation(s)
- Jeanette E Eckel-Passow
- Division of Biomedical Statistics and Informatics, Mayo Clinic, 200 First Street SW, Rochester, MN 55905, USA.
|
124
|
Chen CH, Lee HC, Ling Q, Chen HR, Ko YA, Tsou TS, Wang SC, Wu LC, Lee HC. An all-statistics, high-speed algorithm for the analysis of copy number variation in genomes. Nucleic Acids Res 2011; 39:e89. [PMID: 21576227 PMCID: PMC3141250 DOI: 10.1093/nar/gkr137] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/14/2022] Open
Abstract
Detection of copy number variation (CNV) in DNA has recently become an important method for understanding the pathogenesis of cancer. While existing algorithms for extracting CNV from microarray data have worked reasonably well, the trend towards ever larger sample sizes and higher resolution microarrays has vastly increased the challenges they face. Here, we present Segmentation analysis of DNA (SAD), a clustering algorithm constructed with a strategy in which all operational decisions are based on simple and rigorous applications of statistical principles, measurement theory and precise mathematical relations. Compared with existing packages, SAD is simpler in formulation, more user friendly, much faster and less thirsty for memory, offers higher accuracy and supplies quantitative statistics for its predictions. Unique among such algorithms, SAD's running time scales linearly with array size; on a typical modern notebook, it completes high-quality CNV analyses for a 250 thousand-probe array in ∼1 s and a 1.8 million-probe array in ∼8 s.
Affiliation(s)
- Chih-Hao Chen
- Graduate Institute of Systems Biology and Bioinformatics, National Central University, Chungli, Taiwan 32001
|
125
|
Asimit JL, Andrulis IL, Bull SB. Regression models, scan statistics and reappearance probabilities to detect regions of association between gene expression and copy number. Stat Med 2011; 30:1157-78. [PMID: 21337593 DOI: 10.1002/sim.4193] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/27/2010] [Accepted: 12/17/2010] [Indexed: 12/22/2022]
Abstract
Early studies of breast cancer microarray data used linear models to quantify the relationship between measures of gene expression (GE) and copy number (CN) obtained from tumour samples. Motivated by a study of women with axillary node-negative breast cancer, we propose a regression-based scan statistic to identify within-chromosome clusters of genetic probes that exhibit association between GE and CN, while accounting for tumour characteristics known to be prognostic for clinical outcome. As a measure of the association between GE and CN, for each genetic probe available from a microarray we regress GE on CN, and include subject-specific covariates. In the development of the scan statistic, the within-chromosome spatial distribution of the subset of probes with a statistically significant association is approximated by a Poisson process. By incorporating the distance between the probe positions, the scan statistic accounts for the spatial nature of CN alterations. Regions identified as clusters of significant associations are hypothesized to harbour genes involved in breast cancer progression. Using simulations, we examine the sensitivity of the method to certain factors, and to address issues of repeatability, we consider reappearance probabilities for each probe within detected regions and assess the utility of a quantity estimated by bootstrap sample frequencies. Applications of the proposed method to joint analysis of GE and CN in breast tumours, with and without an informative covariate, and comparisons with alternative methods suggest that inclusion of covariates and the use of a regional test statistic can serve to refine regions for further investigation including the analysis of their association with outcome.
Affiliation(s)
- Jennifer L Asimit
- Samuel Lunenfeld Research Institute of Mount Sinai Hospital, University of Toronto, Toronto, ON, Canada
|
126
|
Pinto D, Darvishi K, Shi X, Rajan D, Rigler D, Fitzgerald T, Lionel AC, Thiruvahindrapuram B, Macdonald JR, Mills R, Prasad A, Noonan K, Gribble S, Prigmore E, Donahoe PK, Smith RS, Park JH, Hurles ME, Carter NP, Lee C, Scherer SW, Feuk L. Comprehensive assessment of array-based platforms and calling algorithms for detection of copy number variants. Nat Biotechnol 2011; 29:512-20. [PMID: 21552272 PMCID: PMC3270583 DOI: 10.1038/nbt.1852] [Citation(s) in RCA: 332] [Impact Index Per Article: 23.7] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/08/2010] [Accepted: 03/22/2011] [Indexed: 11/09/2022]
Abstract
We have systematically compared copy number variant (CNV) detection on eleven microarrays to evaluate data quality and CNV calling, reproducibility, concordance across array platforms and laboratory sites, breakpoint accuracy and analysis tool variability. Different analytic tools applied to the same raw data typically yield CNV calls with <50% concordance. Moreover, reproducibility in replicate experiments is <70% for most platforms. Nevertheless, these findings should not preclude detection of large CNVs for clinical diagnostic purposes because large CNVs with poor reproducibility are found primarily in complex genomic regions and would typically be removed by standard clinical data curation. The striking differences between CNV calls from different platforms and analytic tools highlight the importance of careful assessment of experimental design in discovery and association studies and of strict data curation and filtering in diagnostics. The CNV resource presented here allows independent data evaluation and provides a means to benchmark new algorithms.
Affiliation(s)
- Dalila Pinto
- The Centre for Applied Genomics, The Hospital for Sick Children, Toronto, Ontario, Canada
|
127
|
Abstract
High-throughput tools for nucleic acid characterization now provide the means to conduct comprehensive analyses of all somatic alterations in the cancer genomes. Both large-scale and focused efforts have identified new targets of translational potential. The deluge of information that emerges from these genome-scale investigations has stimulated a parallel development of new analytical frameworks and tools. The complexity of somatic genomic alterations in cancer genomes also requires the development of robust methods for the interrogation of the function of genes identified by these genomics efforts. Here we provide an overview of the current state of cancer genomics, appraise the current portals and tools for accessing and analyzing cancer genomic data, and discuss emerging approaches to exploring the functions of somatically altered genes in cancer.
Affiliation(s)
- Lynda Chin
- Department of Medical Oncology, Dana-Farber Cancer Institute, Boston, Massachusetts 02115, USA.
|
128
|
Ritz A, Paris PL, Ittmann MM, Collins C, Raphael BJ. Detection of recurrent rearrangement breakpoints from copy number data. BMC Bioinformatics 2011; 12:114. [PMID: 21510904 PMCID: PMC3112242 DOI: 10.1186/1471-2105-12-114] [Citation(s) in RCA: 21] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/30/2010] [Accepted: 04/21/2011] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Copy number variants (CNVs), including deletions, amplifications, and other rearrangements, are common in human and cancer genomes. Copy number data from array comparative genome hybridization (aCGH) and next-generation DNA sequencing is widely used to measure copy number variants. Comparison of copy number data from multiple individuals reveals recurrent variants. Typically, the interior of a recurrent CNV is examined for genes or other loci associated with a phenotype. However, in some cases, such as gene truncations and fusion genes, the target of the variant lies at the boundary of the variant. RESULTS We introduce Neighborhood Breakpoint Conservation (NBC), an algorithm for identifying rearrangement breakpoints that are highly conserved at the same locus in multiple individuals. NBC detects recurrent breakpoints at varying levels of resolution, including breakpoints whose location is exactly conserved and breakpoints whose location varies within a gene. NBC also identifies pairs of recurrent breakpoints such as those that result from fusion genes. We apply NBC to aCGH data from 36 primary prostate tumors and identify 12 novel rearrangements, one of which is the well-known TMPRSS2-ERG fusion gene. We also apply NBC to 227 glioblastoma tumors and predict 93 novel rearrangements which we further classify as gene truncations, germline structural variants, and fusion genes. A number of these variants involve the protein phosphatase PTPN12, suggesting that deregulation of PTPN12, via a variety of rearrangements, is common in glioblastoma. CONCLUSIONS We demonstrate that NBC is useful for detection of recurrent breakpoints resulting from copy number variants or other structural variants, and in particular identifies recurrent breakpoints that result in gene truncations or fusion genes. Software is available at http://cs.brown.edu/people/braphael/software.html.
Affiliation(s)
- Anna Ritz
- Department of Computer Science, Brown University, Providence, RI, USA.
|
129
|
Seifert M, Strickert M, Schliep A, Grosse I. Exploiting prior knowledge and gene distances in the analysis of tumor expression profiles with extended Hidden Markov Models. Bioinformatics 2011; 27:1645-52. [PMID: 21511716 DOI: 10.1093/bioinformatics/btr199] [Citation(s) in RCA: 13] [Impact Index Per Article: 0.9] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/13/2022]
Abstract
MOTIVATION Changes in gene expression levels play a central role in tumors. Additional information about the distribution of gene expression levels and distances between adjacent genes on chromosomes should be integrated into the analysis of tumor expression profiles. RESULTS We use a Hidden Markov Model with distance-scaled transition matrices (DSHMM) to incorporate chromosomal distances of adjacent genes on chromosomes into the identification of differentially expressed genes in breast cancer. We train the DSHMM by integrating prior knowledge about potential distributions of expression levels of differentially expressed and unchanged genes in tumor. We find that especially the combination of these data and to a lesser extent the modeling of distances between adjacent genes contribute to a substantial improvement of the identification of differentially expressed genes in comparison to other existing methods. This performance benefit is also supported by the identification of genes well known to be associated with breast cancer. That suggests applications of DSHMMs for screening of other tumor expression profiles. AVAILABILITY The DSHMM is available as part of the open-source Java library Jstacs (www.jstacs.de/index.php/DSHMM).
Affiliation(s)
- Michael Seifert
- Department of Molecular Genetics, Leibniz Institute of Plant Genetics and Crop Plant Research (IPK), Gatersleben, Germany.
|
130
|
He D, Hormozdiari F, Furlotte N, Eskin E. Efficient algorithms for tandem copy number variation reconstruction in repeat-rich regions. Bioinformatics 2011; 27:1513-20. [PMID: 21505028 DOI: 10.1093/bioinformatics/btr169] [Citation(s) in RCA: 18] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/07/2023] Open
Abstract
MOTIVATION Structural variations and in particular copy number variations (CNVs) have dramatic effects on disease and traits. Technologies for identifying CNVs have been an active area of research for over 10 years. The current generation of high-throughput sequencing techniques presents new opportunities for identification of CNVs. Methods that utilize these technologies map sequencing reads to a reference genome and look for signatures which might indicate the presence of a CNV. These methods work well when CNVs lie within unique genomic regions. However, the problem of CNV identification and reconstruction becomes much more challenging when CNVs are in repeat-rich regions, due to the multiple mapping positions of the reads. RESULTS In this study, we propose an efficient algorithm to handle these multi-mapping reads such that the CNVs can be reconstructed with high accuracy even for repeat-rich regions. To our knowledge, this is the first attempt to both identify and reconstruct CNVs in repeat-rich regions. Our experiments show that our method is not only computationally efficient but also accurate.
Affiliation(s)
- Dan He
- Department of Computer Science, University of California Los Angeles, Los Angeles, CA 90095, USA
|
131
|
Halper-Stromberg E, Frelin L, Ruczinski I, Scharpf R, Jie C, Carvalho B, Hao H, Hetrick K, Jedlicka A, Dziedzic A, Doheny K, Scott AF, Baylin S, Pevsner J, Spencer F, Irizarry RA. Performance assessment of copy number microarray platforms using a spike-in experiment. Bioinformatics 2011; 27:1052-60. [PMID: 21478196 PMCID: PMC3072561 DOI: 10.1093/bioinformatics/btr106] [Citation(s) in RCA: 20] [Impact Index Per Article: 1.4] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/29/2010] [Revised: 01/20/2011] [Accepted: 02/17/2011] [Indexed: 01/01/2023] Open
Abstract
MOTIVATION Changes in the copy number of chromosomal DNA segments [copy number variants (CNVs)] have been implicated in human variation, heritable diseases and cancers. Microarray-based platforms are the current established technology of choice for studies reporting these discoveries and constitute the benchmark against which emergent sequence-based approaches will be evaluated. Research that depends on CNV analysis is rapidly increasing, and systematic platform assessments that distinguish strengths and weaknesses are needed to guide informed choice. RESULTS We evaluated the sensitivity and specificity of six platforms, provided by four leading vendors, using a spike-in experiment. NimbleGen and Agilent platforms outperformed Illumina and Affymetrix in accuracy and precision of copy number dosage estimates. However, Illumina and Affymetrix algorithms that leverage single nucleotide polymorphism (SNP) information make up for this disadvantage and perform well at variant detection. Overall, the NimbleGen 2.1M platform outperformed others, but only with the use of an alternative data analysis pipeline to the one offered by the manufacturer. AVAILABILITY The data is available from http://rafalab.jhsph.edu/cnvcomp/. CONTACT pevsner@jhmi.edu; fspencer@jhmi.edu; rafa@jhu.edu SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Eitan Halper-Stromberg
- Department of Biostatistics, Bloomberg School of Public Health, Johns Hopkins University, Baltimore, MD, USA
|
132
|
Koike A, Nishida N, Yamashita D, Tokunaga K. Comparative analysis of copy number variation detection methods and database construction. BMC Genet 2011; 12:29. [PMID: 21385384 PMCID: PMC3058066 DOI: 10.1186/1471-2156-12-29] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.2] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/29/2010] [Accepted: 03/07/2011] [Indexed: 12/13/2022] Open
Abstract
Background Array-based detection of copy number variations (CNVs) is widely used for identifying disease-specific genetic variations. However, the accuracy of CNV detection is not sufficient and results differ depending on the detection programs used and their parameters. In this study, we evaluated five widely used CNV detection programs, Birdsuite (mainly consisting of the Birdseye and Canary modules), Birdseye (part of Birdsuite), PennCNV, CGHseg, and DNAcopy from the viewpoint of performance on the Affymetrix platform using HapMap data and other experimental data. Furthermore, we identified CNVs of 180 healthy Japanese individuals using parameters that showed the best performance in the HapMap data and investigated their characteristics. Results The results indicate that Hidden Markov model-based programs PennCNV and Birdseye (part of Birdsuite), or Birdsuite show better detection performance than other programs when the high reproducibility rates of the same individuals and the low Mendelian inconsistencies are considered. Furthermore, when rates of overlap with other experimental results were taken into account, Birdsuite showed the best performance from the view point of sensitivity but was expected to include many false negatives and some false positives. The results of 180 healthy Japanese demonstrate that the ratio containing repeat sequences, not only segmental repeats but also long interspersed nuclear element (LINE) sequences both in the start and end regions of the CNVs, is higher in CNVs that are commonly detected among multiple individuals than that in randomly selected regions, and the conservation score based on primates is lower in these regions than in randomly selected regions. Similar tendencies were observed in HapMap data and other experimental data. Conclusions Our results suggest that not only segmental repeats but also interspersed repeats, especially LINE sequences, are deeply involved in CNVs, particularly in common CNV formations. 
The detected CNVs are stored in the CNV repository database newly constructed by the "Japanese integrated database project" for sharing data among researchers. http://gwas.lifesciencedb.jp/cgi-bin/cnvdb/cnv_top.cgi
Affiliation(s)
- Asako Koike
- Central Research Laboratory, Hitachi Ltd., Tokyo, Japan.
|
133
|
Miecznikowski JC, Gaile DP, Liu S, Shepherd L, Nowak N. A new normalizing algorithm for BAC CGH arrays with quality control metrics. J Biomed Biotechnol 2011; 2011:860732. [PMID: 21403910 PMCID: PMC3043322 DOI: 10.1155/2011/860732] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/01/2010] [Revised: 11/23/2010] [Accepted: 12/18/2010] [Indexed: 11/17/2022] Open
Abstract
The main focus in pin-tip (or print-tip) microarray analysis is determining which probes, genes, or oligonucleotides are differentially expressed. Specifically in array comparative genomic hybridization (aCGH) experiments, researchers search for chromosomal imbalances in the genome. To model this data, scientists apply statistical methods to the structure of the experiment and assume that the data consist of the signal plus random noise. In this paper we propose "SmoothArray", a new method to preprocess comparative genomic hybridization (CGH) bacterial artificial chromosome (BAC) arrays and we show the effects on a cancer dataset. As part of our R software package "aCGHplus," this freely available algorithm removes the variation due to the intensity effects, pin/print-tip, the spatial location on the microarray chip, and the relative location from the well plate. Removal of this variation improves the downstream analysis and subsequent inferences made on the data. Further, we present measures to evaluate the quality of the dataset according to the arrayer pins, 384-well plates, plate rows, and plate columns. We compare our method against competing methods using several metrics to measure the biological signal. With this novel normalization algorithm and quality control measures, the user can improve their inferences on datasets and pinpoint problems that may arise in their BAC aCGH technology.
|
134
|
Ortiz-Estevez M, De Las Rivas J, Fontanillo C, Rubio A. Segmentation of genomic and transcriptomic microarrays data reveals major correlation between DNA copy number aberrations and gene–loci expression. Genomics 2011; 97:86-93. [DOI: 10.1016/j.ygeno.2010.10.008] [Citation(s) in RCA: 23] [Impact Index Per Article: 1.6] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/04/2010] [Revised: 10/20/2010] [Accepted: 10/22/2010] [Indexed: 11/26/2022]
|
135
|
Wang S, Wang Y, Xie Y, Xiao G. A novel approach to DNA copy number data segmentation. J Bioinform Comput Biol 2011; 9:131-48. [PMID: 21328710 PMCID: PMC3084615 DOI: 10.1142/s0219720011005343] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.2] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/10/2010] [Revised: 11/02/2010] [Accepted: 11/04/2010] [Indexed: 11/18/2022]
Abstract
DNA copy number (DCN) is the number of copies of DNA at a region of a genome. The alterations of DCN are highly associated with the development of different tumors. Recently, microarray technologies are being employed to detect DCN changes at many loci at the same time in tumor samples. The resulting DCN data are often very noisy, and the tumor sample is often contaminated by normal cells. The goal of computational analysis of array-based DCN data is to infer the underlying DCNs from raw DCN data. Previous methods for this task do not model the tumor/normal cell mixture ratio explicitly and they cannot output segments with DCN annotations. We developed a novel model-based method using the minimum description length (MDL) principle for DCN data segmentation. Our new method can output underlying DCN for each chromosomal segment, and at the same time, infer the underlying tumor proportion in the test samples. Empirical results show that our method achieves better accuracies on average as compared to three previous methods, namely Circular Binary Segmentation, Hidden Markov Model and Ultrasome.
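To see why normal-cell contamination matters for segmentation, a commonly used mixing model (assumed here for illustration; the abstract does not state the paper's exact model) treats the measured signal as a weighted average of tumor and normal copy number. The function name `expected_log2_ratio` and its parameters are hypothetical.

```python
import math


def expected_log2_ratio(tumor_copies: int,
                        tumor_fraction: float,
                        normal_copies: int = 2) -> float:
    """Expected log2 ratio for a segment under a simple tumor/normal mixture.

    Assumption (not taken from the paper): the sample is a mixture in which a
    fraction `tumor_fraction` of cells carry `tumor_copies` copies and the
    remaining cells carry `normal_copies` copies.
    """
    if not 0.0 <= tumor_fraction <= 1.0:
        raise ValueError("tumor_fraction must lie in [0, 1]")
    # Average copy number across the mixed cell population.
    mixed = tumor_fraction * tumor_copies + (1.0 - tumor_fraction) * normal_copies
    if mixed == 0:
        # Homozygous deletion in a pure tumor sample: no signal at all.
        return float("-inf")
    return math.log2(mixed / normal_copies)


# A single-copy loss looks much weaker at 50% tumor content than in a pure sample.
pure_loss = expected_log2_ratio(tumor_copies=1, tumor_fraction=1.0)
mixed_loss = expected_log2_ratio(tumor_copies=1, tumor_fraction=0.5)
```

Under this model the same single-copy loss gives a log2 ratio of -1 in a pure tumor but only log2(0.75) at 50% tumor content, which is why a method that ignores the mixture ratio can mislabel or miss segments.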
Affiliation(s)
- Siling Wang
- Department of Computer Science and Engineering, Southern Methodist University, Dallas, Texas 75205, USA.
|
136
|
Chen H, Xing H, Zhang NR. Estimation of parent specific DNA copy number in tumors using high-density genotyping arrays. PLoS Comput Biol 2011; 7:e1001060. [PMID: 21298078 PMCID: PMC3029233 DOI: 10.1371/journal.pcbi.1001060] [Citation(s) in RCA: 30] [Impact Index Per Article: 2.1] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/07/2010] [Accepted: 12/17/2010] [Indexed: 01/01/2023] Open
Abstract
Chromosomal gains and losses comprise an important type of genetic change in tumors, and can now be assayed using microarray hybridization-based experiments. Most current statistical models for DNA copy number estimate total copy number, which does not distinguish between the underlying quantities of the two inherited chromosomes. This latter information, sometimes called parent specific copy number, is important for identifying allele-specific amplifications and deletions, for quantifying normal cell contamination, and for giving a more complete molecular portrait of the tumor. We propose a stochastic segmentation model for parent-specific DNA copy number in tumor samples, and give an estimation procedure that is computationally efficient and can be applied to data from the current high density genotyping platforms. The proposed method does not require matched normal samples, and can estimate the unknown genotypes simultaneously with the parent specific copy number. The new method is used to analyze 223 glioblastoma samples from the Cancer Genome Atlas (TCGA) project, giving a more comprehensive summary of the copy number events in these samples. Detailed case studies on these samples reveal the additional insights that can be gained from an allele-specific copy number analysis, such as the quantification of fractional gains and losses, the identification of copy neutral loss of heterozygosity, and the characterization of regions of simultaneous changes of both inherited chromosomes.
Many genetic diseases are related to copy number aberrations of some regions of the genome. Each chromosome normally has two copies. However, under some circumstances, for some regions, either one or both of the chromosomes change. Genotyping microarray data provide the copy number of the two alleles of polymorphic sites along the chromosomes, which makes the inference of the copy number aberrations of the chromosome feasible. One difficulty is that genotyping microarray data cannot provide the haplotype of the two copies of a chromosome. In this paper, we model the copy number along the chromosome as a two-dimensional Markov chain. Using the observed copy number of both alleles of all the sites, we can determine the parent specific copy number along the chromosome as well as infer the haplotypes of the two copies of the inherited chromosomes in regions where there is allelic imbalance. Simulation results show high sensitivity and specificity of the method. Applying this method to glioblastoma samples from the Cancer Genome Atlas data illustrates the insights gained from allele-specific copy number analysis.
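To make the terms in this abstract concrete (fractional gains and losses, copy neutral loss of heterozygosity, simultaneous changes of both inherited chromosomes), the following is a minimal sketch of how a segment could be labeled once its two parent-specific copy numbers have been estimated. It is not the authors' stochastic segmentation model: the function name `classify_segment`, the tolerance `tol`, and the label strings are all assumptions chosen for illustration.

```python
def classify_segment(minor: float, major: float, tol: float = 0.25) -> str:
    """Label a segment from two (hypothetical) parent-specific copy number estimates.

    `minor` and `major` are the estimated copy numbers of the two inherited
    chromosomes; `tol` is an illustrative tolerance, not a value from the paper.
    """
    if minor > major:
        minor, major = major, minor
    total = minor + major
    one_parent_lost = minor <= tol          # one inherited copy is (nearly) absent
    balanced = (major - minor) <= tol       # both copies present in equal amounts

    if abs(total - 2.0) <= tol:             # total copy number looks normal
        if one_parent_lost:
            return "copy-neutral LOH"       # invisible to total-copy-number methods
        return "normal" if balanced else "allelic imbalance at normal total"
    if total > 2.0:
        return "balanced gain" if balanced else "allele-specific gain"
    return "loss with LOH" if one_parent_lost else "fractional loss"


if __name__ == "__main__":
    for pair in [(1, 1), (0, 2), (1, 2), (2, 2), (0, 1), (0.6, 1.0)]:
        print(pair, "->", classify_segment(*pair))
```

The (0, 2) case shows why total copy number alone is insufficient: the total is 2, yet one inherited chromosome has been lost and the other duplicated.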
Affiliation(s)
- Hao Chen
- Department of Statistics, Stanford University, Stanford, California, United States of America
- Haipeng Xing
- Department of Applied Mathematics and Statistics, SUNY at Stony Brook, Stony Brook, New York, United States of America
- Nancy R. Zhang
- Department of Statistics, Stanford University, Stanford, California, United States of America
|
137
|
Yu X, Randolph TW, Tang H, Hsu L. Detecting genomic aberrations using products in a multiscale analysis. Biometrics 2011; 66:684-93. [PMID: 19817738 DOI: 10.1111/j.1541-0420.2009.01337.x] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.3] [Indexed: 11/28/2022]
Abstract
Genomic instability, such as copy-number losses and gains, occurs in many genetic diseases. Recent technology developments enable researchers to measure copy numbers at tens of thousands of markers simultaneously. In this article, we propose a nonparametric approach for detecting the locations of copy-number changes and provide a measure of significance for each change point. The proposed test is based on seeking scale-based changes in the sequence of copy numbers, which is ordered by the marker locations along the chromosome. The method leads to a natural way to estimate the null distribution for the test of a change point and adjusted p-values for the significance of a change point using a step-down maxT permutation algorithm to control the family-wise error rate. A simulation study investigates the finite sample performance of the proposed method and compares it with a more standard sequential testing method. The method is illustrated using two real data sets.
Affiliation(s)
- Xuesong Yu
- Statistical Center for HIV/AIDS Research and Prevention, Fred Hutchinson Cancer Research Center, Seattle, Washington 98109, USA.
|
138
|
Picard F, Lebarbier E, Hoebeke M, Rigaill G, Thiam B, Robin S. Joint segmentation, calling, and normalization of multiple CGH profiles. Biostatistics 2011; 12:413-28. [PMID: 21209153 DOI: 10.1093/biostatistics/kxq076] [Citation(s) in RCA: 76] [Impact Index Per Article: 5.4] [Indexed: 01/03/2023] Open
Abstract
The statistical analysis of array comparative genomic hybridization (CGH) data has now shifted to the joint assessment of copy number variations at the cohort level. Considering multiple profiles gives the opportunity to correct for systematic biases observed on single profiles, such as probe GC content or the so-called "wave effect." In this article, we extend the segmentation model developed in the univariate case to the joint analysis of multiple CGH profiles. Our contribution is manifold: we propose an integrated model to perform joint segmentation, normalization, and calling for multiple array CGH profiles. This model shows great flexibility, especially in the modeling of the wave effect, which gives a likelihood framework to approaches proposed by others. We propose a new dynamic programming algorithm for break point positioning, as well as a model selection criterion based on a modified Bayesian information criterion proposed in the univariate case. The performance of our method is assessed using simulated and real data sets. Our method is implemented in the R package cghseg.
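For readers unfamiliar with break point positioning by dynamic programming, the sketch below shows the classical univariate version: it splits a single profile into a fixed number of segments so that the within-segment sum of squared errors is minimal. This is only the textbook single-profile recursion, not the new multi-profile algorithm the abstract announces and not the cghseg API; the function name `segment_dp` and its return format are assumptions.

```python
import numpy as np


def segment_dp(y, n_segments):
    """Optimal piecewise-constant segmentation of one profile by dynamic programming.

    Returns (breaks, cost): `breaks` lists the start index of every segment
    after the first, and `cost` is the minimal total within-segment squared error.
    """
    y = np.asarray(y, dtype=float)
    n = len(y)
    if not 1 <= n_segments <= n:
        raise ValueError("n_segments must be between 1 and len(y)")

    # Prefix sums let us evaluate the squared error of any segment in O(1).
    s1 = np.concatenate(([0.0], np.cumsum(y)))
    s2 = np.concatenate(([0.0], np.cumsum(y * y)))

    def cost(i, j):
        """Sum of squared deviations from the mean for y[i:j] (j > i)."""
        seg_sum = s1[j] - s1[i]
        return (s2[j] - s2[i]) - seg_sum * seg_sum / (j - i)

    # best[k, j]: minimal cost of splitting y[:j] into k segments.
    best = np.full((n_segments + 1, n + 1), np.inf)
    back = np.zeros((n_segments + 1, n + 1), dtype=int)
    best[0, 0] = 0.0
    for k in range(1, n_segments + 1):
        for j in range(k, n + 1):
            for i in range(k - 1, j):
                c = best[k - 1, i] + cost(i, j)
                if c < best[k, j]:
                    best[k, j] = c
                    back[k, j] = i

    # Walk the back-pointers from the last segment to the second one.
    breaks = []
    j = n
    for k in range(n_segments, 1, -1):
        j = int(back[k, j])
        breaks.append(j)
    return sorted(breaks), float(best[n_segments, n])


if __name__ == "__main__":
    profile = [0.0] * 5 + [1.0] * 4 + [0.0] * 6
    print(segment_dp(profile, 3))
```

The triple loop costs O(K n^2) for K segments and n probes, which is why the number of segments is usually chosen afterwards with a penalized criterion such as the modified BIC mentioned in the abstract.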
Affiliation(s)
- Franck Picard
- Laboratoire de Biometrie et Biologie Evolutive, UMR CNRS 5558 - Univ. Lyon 1, F-69622, Villeurbanne, France.
|
139
|
Vandeweyer G, Reyniers E, Wuyts W, Rooms L, Kooy RF. CNV-WebStore: online CNV analysis, storage and interpretation. BMC Bioinformatics 2011; 12:4. [PMID: 21208430 PMCID: PMC3024943 DOI: 10.1186/1471-2105-12-4] [Citation(s) in RCA: 46] [Impact Index Per Article: 3.3] [Received: 08/18/2010] [Accepted: 01/05/2011] [Indexed: 02/02/2023] Open
Abstract
Background: Microarray technology allows the analysis of genomic aberrations at an ever increasing resolution, making functional interpretation of these vast amounts of data the main bottleneck in routine implementation of high resolution array platforms, and emphasising the need for a centralised and easy to use CNV data management and interpretation system. Results: We present CNV-WebStore, an online platform to streamline the processing and downstream interpretation of microarray data in a clinical context, tailored towards but not limited to the Illumina BeadArray platform. Provided analysis tools include CNV analysis, parent of origin and uniparental disomy detection. Interpretation tools include data visualisation, gene prioritisation, automated PubMed searching, linking data to several genome browsers and annotation of CNVs based on several public databases. Finally, a module is provided for uniform reporting of results. Conclusion: CNV-WebStore is able to present copy number data in an intuitive way to both lab technicians and clinicians, making it a useful tool in daily clinical practice.
Affiliation(s)
- Geert Vandeweyer
- Department of Medical Genetics, University Hospital Antwerp, Antwerp, Belgium
|
140
|
Guo B, Villagran A, Vannucci M, Wang J, Davis C, Man TK, Lau C, Guerra R. Bayesian estimation of genomic copy number with single nucleotide polymorphism genotyping arrays. BMC Res Notes 2010; 3:350. [PMID: 21192799 PMCID: PMC3023756 DOI: 10.1186/1756-0500-3-350] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/29/2010] [Accepted: 12/30/2010] [Indexed: 11/19/2022] Open
Abstract
Background: The identification of copy number aberration in the human genome is an important area in cancer research. We develop a model for determining genomic copy numbers using high-density single nucleotide polymorphism genotyping microarrays. The method is based on a Bayesian spatial normal mixture model with an unknown number of components corresponding to true copy numbers. A reversible jump Markov chain Monte Carlo algorithm is used to implement the model and perform posterior inference. Results: The performance of the algorithm is examined on both simulated and real cancer data, and it is compared with the popular CNAG algorithm for copy number detection. Conclusions: We demonstrate that our Bayesian mixture model performs at least as well as the hidden Markov model based CNAG algorithm and in certain cases does better. One of the added advantages of our method is the flexibility of modeling normal cell contamination in tumor samples.
Affiliation(s)
- Beibei Guo
- Department of Statistics, Rice University, 6100 Main, Houston, TX 77005-1827, USA.
|
141
|
He D, Furlotte N, Eskin E. Detection and reconstruction of tandemly organized de novo copy number variations. BMC Bioinformatics 2010; 11 Suppl 11:S12. [PMID: 21172047 PMCID: PMC3024866 DOI: 10.1186/1471-2105-11-s11-s12] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.5] [Indexed: 11/23/2022] Open
Abstract
Background: The characterization of structural variations (SV) such as insertions, deletions and copy number variations is a critical step in the process of understanding the full genetic architecture of organisms. Copy number variations (CNV) have attracted much recent attention due to their effects on gene expression and disease status. Results: In this paper, we present a method that utilizes next-generation sequencing technologies (NGS) in order to both detect and reconstruct CNVs. We focus on a special type of CNV, namely tandemly organized de novo CNVs, which have been shown to occur with high frequency in the mouse genome. Conclusions: We apply our method to CNV regions randomly inserted into the reference mouse genome and show that our method achieves good performance for both detection and reconstruction of tandemly organized de novo CNVs.
Affiliation(s)
- Dan He
- Dept. of Comp. Sci., Univ. of California Los Angeles, Los Angeles, CA 90095, USA.
|
142
|
A first comparative map of copy number variations in the sheep genome. Genomics 2010; 97:158-65. [PMID: 21111040 DOI: 10.1016/j.ygeno.2010.11.005] [Citation(s) in RCA: 90] [Impact Index Per Article: 6.0] [Received: 09/08/2010] [Revised: 11/12/2010] [Accepted: 11/16/2010] [Indexed: 12/16/2022]
Abstract
We carried out a cross species cattle-sheep array comparative genome hybridization experiment to identify copy number variations (CNVs) in the sheep genome analysing ewes of Italian dairy or dual-purpose breeds (Bagnolese, Comisana, Laticauda, Massese, Sarda, and Valle del Belice) using a tiling oligonucleotide array with ~385,000 probes designed on the bovine genome. We identified 135 CNV regions (CNVRs; 24 reported in more than one animal) covering ~10.5 Mb of the virtual sheep genome referred to the bovine genome (0.398%) with a mean and a median equal to 77.6 and 55.9 kb, respectively. A comparative analysis between the identified sheep CNVRs and those reported in cattle and goat genomes indicated that overlaps between sheep and both other species CNVRs are highly significant (P<0.0001), suggesting that several chromosome regions might contain recurrent interspecies CNVRs. Many sheep CNVRs include genes with important biological functions. Further studies are needed to evaluate their functional relevance.
|
143
|
Muggeo VMR, Adelfio G. Efficient change point detection for genomic sequences of continuous measurements. Bioinformatics 2010; 27:161-6. [PMID: 21088029 DOI: 10.1093/bioinformatics/btq647] [Citation(s) in RCA: 61] [Impact Index Per Article: 4.1] [Indexed: 11/14/2022]
Abstract
MOTIVATION: Knowing the exact locations of multiple change points in genomic sequences serves several biological needs, for instance when data represent aCGH profiles and it is of interest to identify possibly damaged genes involved in cancer and other diseases. Only a few of the currently available methods deal explicitly with estimation of the number and location of change points, and moreover these methods may be somewhat vulnerable to deviations from the model assumptions usually employed. RESULTS: We present a computationally efficient method to obtain estimates of the number and location of the change points. The method is based on a simple transformation of data and it provides results quite robust to model misspecifications. The efficiency of the method guarantees moderate computational times regardless of the series length and the number of change points. AVAILABILITY: The methods described in this article are implemented in the new R package cumSeg available from the Comprehensive R Archive Network at http://CRAN.R-project.org/package=cumSeg.
Affiliation(s)
- Vito M R Muggeo
- Dipartimento di Scienze Statistiche e Matematiche Vianelli, Università di Palermo, Palermo, Italy.
|
144
|
Zhang ZD, Gerstein MB. Detection of copy number variation from array intensity and sequencing read depth using a stepwise Bayesian model. BMC Bioinformatics 2010; 11:539. [PMID: 21034510 PMCID: PMC2992546 DOI: 10.1186/1471-2105-11-539] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.4] [Received: 05/17/2010] [Accepted: 10/31/2010] [Indexed: 11/17/2022] Open
Abstract
Background: Copy number variants (CNVs) have been demonstrated to occur at a high frequency and are now widely believed to make a significant contribution to the phenotypic variation in human populations. Array-based comparative genomic hybridization (array-CGH) and the newly developed read-depth approach through ultrahigh throughput genomic sequencing both provide rapid, robust, and comprehensive methods to identify CNVs on a whole-genome scale. Results: We developed a Bayesian statistical analysis algorithm for the detection of CNVs from both types of genomic data. The algorithm can analyze such data obtained from PCR-based bacterial artificial chromosome arrays, high-density oligonucleotide arrays, and more recently developed high-throughput DNA sequencing. Treating parameters--e.g., the number of CNVs, the position of each CNV, and the data noise level--that define the underlying data generating process as random variables, our approach derives the posterior distribution of the genomic CNV structure given the observed data. Sampling from the posterior distribution using a Markov chain Monte Carlo method, we get not only best estimates for these unknown parameters but also Bayesian credible intervals for the estimates. We illustrate the characteristics of our algorithm by applying it to both synthetic and experimental data sets in comparison to other segmentation algorithms. Conclusions: In particular, the synthetic data comparison shows that our method is more sensitive than other approaches at low false positive rates. Furthermore, given its Bayesian origin, our method can also be seen as a technique to refine CNVs identified by fast point-estimate methods and also as a framework to integrate array-CGH and sequencing data with other CNV-related biological knowledge, all through informative priors.
Affiliation(s)
- Zhengdong D Zhang
- Department of Genetics, Albert Einstein College of Medicine, Bronx, NY 10461, USA.
|
145
|
A Bayesian analysis for identifying DNA copy number variations using a compound Poisson process. EURASIP JOURNAL ON BIOINFORMATICS & SYSTEMS BIOLOGY 2010; 2010:268513. [PMID: 20976296 DOI: 10.1155/2010/268513] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 05/03/2010] [Revised: 07/29/2010] [Accepted: 08/06/2010] [Indexed: 11/17/2022]
Abstract
To study chromosomal aberrations that may lead to cancer formation or genetic diseases, the array-based Comparative Genomic Hybridization (aCGH) technique is often used for detecting DNA copy number variants (CNVs). Various methods have been developed for obtaining CNV information from aCGH data. However, most of these methods make use of the log-intensity ratios in aCGH data without taking advantage of other information such as the DNA probe (e.g., biomarker) positions/distances contained in the data. Motivated by the specific features of aCGH data, we developed a novel method that takes into account the estimation of a change point or locus of the CNV in aCGH data with its associated biomarker position on the chromosome using a compound Poisson process. We used a Bayesian approach to derive the posterior probability for the estimation of the CNV locus. To detect loci of multiple CNVs in the data, a sliding window process combined with our derived Bayesian posterior probability was proposed. To evaluate the performance of the method in the estimation of the CNV locus, we first performed simulation studies. Finally, we applied our approach to real data from aCGH experiments, demonstrating its applicability.
|
146
|
Morganella S, Cerulo L, Viglietto G, Ceccarelli M. VEGA: variational segmentation for copy number detection. Bioinformatics 2010; 26:3020-7. [PMID: 20959380 DOI: 10.1093/bioinformatics/btq586] [Citation(s) in RCA: 18] [Impact Index Per Article: 1.2] [Indexed: 11/14/2022]
Abstract
MOTIVATION: Genomic copy number (CN) information is useful to study genetic traits of many diseases. Using array comparative genomic hybridization (aCGH), researchers are able to measure the copy number of thousands of DNA loci at the same time. Therefore, a current challenge in bioinformatics is the development of efficient algorithms to detect the map of aberrant chromosomal regions. METHODS: We describe an approach for the segmentation of copy number aCGH data. Variational estimator for genomic aberrations (VEGA) adopts a variational model used in image segmentation. The optimal segmentation is modeled as the minimum of an energy functional encompassing both the quality of interpolation of the data and the complexity of the solution measured by the length of the boundaries between segmented regions. This solution is obtained by a region growing process where the stop condition is completely data driven. RESULTS: VEGA is compared with three algorithms that represent the state of the art in CN segmentation. Performance assessment is made both on synthetic and real data. Synthetic data simulate different noise conditions. Results on these data show the robustness with respect to noise of variational models and the accuracy of VEGA in terms of recall and precision. Eight mantle cell lymphoma cell lines and two samples of glioblastoma multiforme are used to evaluate the behavior of VEGA on real biological data. Comparison between results and current biological knowledge shows the ability of the proposed method to detect known chromosomal aberrations. AVAILABILITY: VEGA has been implemented in R and is available at the address http://www.dsba.unisannio.it/Members/ceccarelli/vega in the section Download.
Affiliation(s)
- Sandro Morganella
- Department of Biological and Environmental Studies, University of Sannio, Benevento, Italy
|
147
|
Gao X, Huang J. A robust penalized method for the analysis of noisy DNA copy number data. BMC Genomics 2010; 11:517. [PMID: 20868505 PMCID: PMC3247090 DOI: 10.1186/1471-2164-11-517] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.3] [Received: 11/17/2009] [Accepted: 09/25/2010] [Indexed: 11/20/2022] Open
Abstract
BACKGROUND: Deletions and amplifications of the human genomic DNA copy number are the causes of numerous diseases, such as various forms of cancer. Therefore, the detection of DNA copy number variations (CNV) is important in understanding the genetic basis of many diseases. Various techniques and platforms have been developed for genome-wide analysis of DNA copy number, such as array-based comparative genomic hybridization (aCGH) and high-resolution mapping with high-density tiling oligonucleotide arrays. Since complicated biological and experimental processes are often associated with these platforms, data can be potentially contaminated by outliers. RESULTS: We propose a penalized LAD regression model with the adaptive fused lasso penalty for detecting CNV. This method has robustness properties and incorporates both the spatial dependence and sparsity of CNV into the analysis. Our simulation studies and real data analysis indicate that the proposed method can correctly detect the numbers and locations of the true breakpoints while appropriately controlling the false positives. CONCLUSIONS: The proposed method has three advantages for detecting CNV change points: it has robustness properties; it incorporates both spatial dependence and sparsity; and it estimates the true values at each marker accurately.
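As a reading aid, a penalized least absolute deviations (LAD) criterion with a fused lasso penalty can be written generically as below. This is a sketch of the general form only, not the exact objective from the paper: the symbols $y_i$ (observed log-ratio at marker $i$), $\beta_i$ (true value at marker $i$), the tuning parameters $\lambda_1, \lambda_2$, and the adaptive weights $w_i, v_i$ are notation assumed here for illustration.

```latex
\hat{\beta} \;=\; \arg\min_{\beta \in \mathbb{R}^n}
  \sum_{i=1}^{n} \lvert y_i - \beta_i \rvert
  \;+\; \lambda_1 \sum_{i=1}^{n} w_i \,\lvert \beta_i \rvert
  \;+\; \lambda_2 \sum_{i=2}^{n} v_i \,\lvert \beta_i - \beta_{i-1} \rvert
```

The absolute-value loss is what gives robustness to outliers, the $\lambda_1$ term encodes sparsity (most markers are unaltered), and the $\lambda_2$ term encodes spatial dependence by penalizing jumps between neighbouring markers, so that nonzero jumps mark candidate breakpoints.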
Affiliation(s)
- Xiaoli Gao
- Department of Mathematics and Statistics, Oakland University, Rochester, MI 48309, USA
- Jian Huang
- Department of Statistics and Actuarial Science, University of Iowa, Iowa City, IA 52246, USA
- Department of Biostatistics, University of Iowa, Iowa City, IA 52246, USA
|
148
|
Oh M, Song B, Lee H. CAM: a web tool for combining array CGH and microarray gene expression data from multiple samples. Comput Biol Med 2010; 40:781-5. [PMID: 20728879 DOI: 10.1016/j.compbiomed.2010.07.006] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.1] [Received: 04/14/2009] [Revised: 05/06/2010] [Accepted: 07/30/2010] [Indexed: 11/16/2022]
Abstract
We develop a web-based tool for Combining Array CGH copy number aberration data and Microarray gene expression data (CAM). This tool analyzes these two data sets from multiple samples to detect genes having both DNA copy number aberrations (CNAs) and gene expression changes. CAM provides several statistical methods for identifying CNAs, which are consistent across multiple samples. Identified CNAs and their correlated gene expression changes are then visualized along the chromosomes. As a result, CAM is a useful tool for identifying disease related genes when these two types of data sets are available. To illustrate the various analysis outputs of CAM, we subsequently provide ten sets of example data from seven cancer types.
Affiliation(s)
- Mira Oh
- Department of Information and Communications, Gwangju Institute of Science and Technology, Gwangju 500-712, Republic of Korea
|
149
|
Kim TM, Luquette LJ, Xi R, Park PJ. rSW-seq: algorithm for detection of copy number alterations in deep sequencing data. BMC Bioinformatics 2010; 11:432. [PMID: 20718989 PMCID: PMC2939611 DOI: 10.1186/1471-2105-11-432] [Citation(s) in RCA: 44] [Impact Index Per Article: 2.9] [Received: 12/31/2009] [Accepted: 08/18/2010] [Indexed: 02/05/2023] Open
Abstract
Background: Recent advances in sequencing technologies have enabled generation of large-scale genome sequencing data. These data can be used to characterize a variety of genomic features, including the DNA copy number profile of a cancer genome. A robust and reliable method for screening chromosomal alterations would allow a detailed characterization of the cancer genome with unprecedented accuracy. Results: We develop a method for identification of copy number alterations in a tumor genome compared to its matched control, based on application of the Smith-Waterman algorithm to single-end sequencing data. In a performance test with simulated data, our algorithm shows >90% sensitivity and >90% precision in detecting a single copy number change that contains approximately 500 reads for the normal sample. With 100-bp reads, this corresponds to a ~50 kb region for 1X genome coverage of the human genome. We further refine the algorithm to develop rSW-seq (recursive Smith-Waterman-seq), to identify alterations in a complex configuration, which are commonly observed in the human cancer genome. To validate our approach, we compare our algorithm with an existing algorithm using simulated and publicly available datasets. We also compare the sequencing-based profiles to microarray-based results. Conclusion: We propose rSW-seq as an efficient method for detecting copy number changes in the tumor genome.
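The idea of applying a Smith-Waterman style scan to reads can be illustrated with a small sketch: tumor and normal reads are merged and sorted by position, each read contributes a signed weight, and the running score is reset whenever it drops to zero, so the highest-scoring run marks a candidate gain. This is only an illustration under stated assumptions, not the rSW-seq implementation: the names `best_gain_segment` and `expected_span_bp`, the `"T"`/`"N"` labels, and the weights of +1 and -1 are all hypothetical. The second helper reproduces the abstract's arithmetic relating read count, read length, and coverage to region size.

```python
def best_gain_segment(labels, w_tumor=1.0, w_normal=-1.0):
    """Find the maximum-scoring run of position-sorted reads (candidate gain).

    `labels` holds "T" for a tumor read and "N" for a normal read.
    Returns (score, start, end) with `end` exclusive; (0.0, 0, 0) if no run scores > 0.
    The weights are illustrative and assume comparable sequencing depth in both samples.
    """
    best = (0.0, 0, 0)
    running, start = 0.0, 0
    for idx, label in enumerate(labels):
        running += w_tumor if label == "T" else w_normal
        if running <= 0.0:
            # Local-alignment style reset: never carry a non-positive score forward.
            running, start = 0.0, idx + 1
        elif running > best[0]:
            best = (running, start, idx + 1)
    return best


def expected_span_bp(n_reads, read_length_bp, coverage):
    """Genomic span expected to contain `n_reads` reads at the given coverage."""
    return n_reads * read_length_bp / coverage


if __name__ == "__main__":
    reads = list("NTNTTTTNTTTNTNN")
    print(best_gain_segment(reads))          # highest-scoring tumor-enriched run
    print(expected_span_bp(500, 100, 1.0))   # 500 reads of 100 bp at 1X coverage
```

Swapping the sign of the two weights would make the same scan look for losses instead of gains.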
Affiliation(s)
- Tae-Min Kim
- Center for Biomedical Informatics, Harvard Medical School, 10 Shattuck St, Boston, Massachusetts 02115, USA
|
150
|
Rapaport F, Leslie C. Determining frequent patterns of copy number alterations in cancer. PLoS One 2010; 5:e12028. [PMID: 20711339 PMCID: PMC2920822 DOI: 10.1371/journal.pone.0012028] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.5] [Received: 04/27/2010] [Accepted: 07/02/2010] [Indexed: 01/18/2023] Open
Abstract
Cancer progression is often driven by an accumulation of genetic changes but also accompanied by increasing genomic instability. These processes lead to a complicated landscape of copy number alterations (CNAs) within individual tumors and great diversity across tumor samples. High resolution array-based comparative genomic hybridization (aCGH) is being used to profile CNAs of ever larger tumor collections, and better computational methods for processing these data sets and identifying potential driver CNAs are needed. Typical studies of aCGH data sets take a pipeline approach, starting with segmentation of profiles, calls of gains and losses, and finally determination of frequent CNAs across samples. A drawback of pipelines is that choices at each step may produce different results, and biases are propagated forward. We present a mathematically robust new method that exploits probe-level correlations in aCGH data to discover subsets of samples that display common CNAs. Our algorithm is related to recent work on maximum-margin clustering. It does not require pre-segmentation of the data and also provides grouping of recurrent CNAs into clusters. We tested our approach on a large cohort of glioblastoma aCGH samples from The Cancer Genome Atlas and recovered almost all CNAs reported in the initial study. We also found additional significant CNAs missed by the original analysis but supported by earlier studies, and we identified significant correlations between CNAs.
Affiliation(s)
- Christina Leslie
- Computational Biology Program, Sloan-Kettering Institute, New York, New York, United States of America
|