1
|
Song L, Bhuvaneshwar K, Wang Y, Feng Y, Shih IM, Madhavan S, Gusev Y. CINdex: A Bioconductor Package for Analysis of Chromosome Instability in DNA Copy Number Data. Cancer Inform 2017; 16:1176935117746637. [PMID: 29343938 PMCID: PMC5761903 DOI: 10.1177/1176935117746637] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/10/2017] [Accepted: 10/26/2017] [Indexed: 01/10/2023] Open
Abstract
The CINdex Bioconductor package addresses an important area of high-throughput genomic analysis. It calculates the chromosome instability (CIN) index, a novel measurement that quantitatively characterizes genome-wide copy number alterations (CNAs) as a measure of CIN. The advantage of this package is an ability to compare CIN index values between several groups for patients (case and control groups), which is a typical use case in translational research. The differentially changed cytobands or chromosomes can then be linked to genes located in the affected genomic regions, as well as pathways. This enables in-depth systems biology-based network analysis and assessment of the impact of CNA on various biological processes or clinical outcomes. This package was successfully applied to analysis of DNA copy number data in colorectal cancer as a part of multi-omics integrative study as well as for analysis of several other cancer types. The source code, along with an end-to-end tutorial, and example data are freely available in Bioconductor at http://bioconductor.org/packages/CINdex/.
Collapse
Affiliation(s)
- Lei Song
- Innovation Center for Biomedical Informatics, Georgetown University, Washington, DC, USA
| | - Krithika Bhuvaneshwar
- Innovation Center for Biomedical Informatics, Georgetown University, Washington, DC, USA
| | - Yue Wang
- The Bradley Department of Electrical and Computer Engineering, Virginia Polytechnic Institute and State University, Arlington, VA, USA
| | - Yuanjian Feng
- The Bradley Department of Electrical and Computer Engineering, Virginia Polytechnic Institute and State University, Arlington, VA, USA
| | - Ie-Ming Shih
- Department of Gynecology and Obstetrics, The Johns Hopkins University School of Medicine, Baltimore, MD, USA
| | - Subha Madhavan
- Innovation Center for Biomedical Informatics, Georgetown University, Washington, DC, USA
| | - Yuriy Gusev
- Innovation Center for Biomedical Informatics, Georgetown University, Washington, DC, USA
| |
Collapse
|
2
|
Jang H, Hur Y, Lee H. Identification of cancer-driver genes in focal genomic alterations from whole genome sequencing data. Sci Rep 2016; 6:25582. [PMID: 27156852 PMCID: PMC4860638 DOI: 10.1038/srep25582] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/22/2016] [Accepted: 04/19/2016] [Indexed: 11/18/2022] Open
Abstract
DNA copy number alterations (CNAs) are the main genomic events that occur during the initiation and development of cancer. Distinguishing driver aberrant regions from passenger regions, which might contain candidate target genes for cancer therapies, is an important issue. Several methods for identifying cancer-driver genes from multiple cancer patients have been developed for single nucleotide polymorphism (SNP) arrays. However, for NGS data, methods for the SNP array cannot be directly applied because of different characteristics of NGS such as higher resolutions of data without predefined probes and incorrectly mapped reads to reference genomes. In this study, we developed a wavelet-based method for identification of focal genomic alterations for sequencing data (WIFA-Seq). We applied WIFA-Seq to whole genome sequencing data from glioblastoma multiforme, ovarian serous cystadenocarcinoma and lung adenocarcinoma, and identified focal genomic alterations, which contain candidate cancer-related genes as well as previously known cancer-driver genes.
Collapse
Affiliation(s)
- Ho Jang
- Gwangju Institute of Science and Technology, School of Electrical Engineering and Computer Science, Gwangju, 500-712, South Korea
| | - Youngmi Hur
- Yonsei University, Department of Mathematics, Seoul, 120-749, South Korea
| | - Hyunju Lee
- Gwangju Institute of Science and Technology, School of Electrical Engineering and Computer Science, Gwangju, 500-712, South Korea
| |
Collapse
|
3
|
Gamazon ER, Cox NJ, Davis LK. Structural architecture of SNP effects on complex traits. Am J Hum Genet 2014; 95:477-89. [PMID: 25307299 PMCID: PMC4225594 DOI: 10.1016/j.ajhg.2014.09.009] [Citation(s) in RCA: 20] [Impact Index Per Article: 1.8] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/15/2014] [Accepted: 09/16/2014] [Indexed: 12/16/2022] Open
Abstract
Despite the discovery of copy-number variation (CNV) across the genome nearly 10 years ago, current SNP-based analysis methodologies continue to collapse the homozygous (i.e., A/A), hemizygous (i.e., A/0), and duplicative (i.e., A/A/A) genotype states, treating the genotype variable as irreducible or unaltered by other colocalizing forms of genetic (e.g., structural) variation. Our understanding of common, genome-wide CNVs suggests that the canonical genotype construct might belie the enormous complexity of the genome. Here we present multiple analyses of several phenotypes and provide methods supporting a conceptual shift that embraces the structural dimension of genotype. We comprehensively investigate the impact of the structural dimension of genotype on (1) GWAS methods, (2) interpretation of rare LOF variants, (3) characterization of genomic architecture, and (4) implications for mapping loci involved in complex disease. Taken together, these results argue for the inclusion of a structural dimension and suggest that some portion of the "missing" heritability might be recovered through integration of the structural dimension of SNP effects on complex traits.
Collapse
Affiliation(s)
- Eric R Gamazon
- Section of Genetic Medicine, Department of Medicine, The University of Chicago, Chicago, IL 60637, USA
| | - Nancy J Cox
- Section of Genetic Medicine, Department of Medicine, The University of Chicago, Chicago, IL 60637, USA
| | - Lea K Davis
- Section of Genetic Medicine, Department of Medicine, The University of Chicago, Chicago, IL 60637, USA.
| |
Collapse
|
4
|
Subramanian A, Shackney S, Schwartz R. Novel multisample scheme for inferring phylogenetic markers from whole genome tumor profiles. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2013; 10:1422-1431. [PMID: 24407301 PMCID: PMC3830698 DOI: 10.1109/tcbb.2013.33] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/03/2023]
Abstract
Computational cancer phylogenetics seeks to enumerate the temporal sequences of aberrations in tumor evolution, thereby delineating the evolution of possible tumor progression pathways, molecular subtypes, and mechanisms of action. We previously developed a pipeline for constructing phylogenies describing evolution between major recurring cell types computationally inferred from whole-genome tumor profiles. The accuracy and detail of the phylogenies, however, depend on the identification of accurate, high-resolution molecular markers of progression, i.e., reproducible regions of aberration that robustly differentiate different subtypes and stages of progression. Here, we present a novel hidden Markov model (HMM) scheme for the problem of inferring such phylogenetically significant markers through joint segmentation and calling of multisample tumor data. Our method classifies sets of genome-wide DNA copy number measurements into a partitioning of samples into normal (diploid) or amplified at each probe. It differs from other similar HMM methods in its design specifically for the needs of tumor phylogenetics, by seeking to identify robust markers of progression conserved across a set of copy number profiles. We show an analysis of our method in comparison to other methods on both synthetic and real tumor data, which confirms its effectiveness for tumor phylogeny inference and suggests avenues for future advances.
Collapse
Affiliation(s)
- Ayshwarya Subramanian
- Graduate student at the Department of Biological Sciences, Carnegie Mellon University, Pittsburgh, PA, 15213.
| | | | | |
Collapse
|
5
|
Abstract
MOTIVATION Data quality is a critical issue in the analyses of DNA copy number alterations obtained from microarrays. It is commonly assumed that copy number alteration data can be modeled as piecewise constant and the measurement errors of different probes are independent. However, these assumptions do not always hold in practice. In some published datasets, we find that measurement errors are highly correlated between probes that interrogate nearby genomic loci, and the piecewise-constant model does not fit the data well. The correlated errors cause problems in downstream analysis, leading to a large number of DNA segments falsely identified as having copy number gains and losses. METHOD We developed a simple tool, called autocorrelation scanning profile, to assess the dependence of measurement error between neighboring probes. RESULTS Autocorrelation scanning profile can be used to check data quality and refine the analysis of DNA copy number data, which we demonstrate in some typical datasets. CONTACT lzhangli@mdanderson.org. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
- Liangcai Zhang
- Department of Bioinformatics and Computational Biology, The University of Texas MD Anderson Cancer Center, Houston, TX 77230, USA and Department of Biophysics, College of Bioinformatics Science and Technology, Harbin Medical University, Harbin, Heilongjiang 150081, China
| | | |
Collapse
|
6
|
Comparative analysis of methods for identifying recurrent copy number alterations in cancer. PLoS One 2012; 7:e52516. [PMID: 23285074 PMCID: PMC3527554 DOI: 10.1371/journal.pone.0052516] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/07/2012] [Accepted: 11/14/2012] [Indexed: 11/19/2022] Open
Abstract
Recurrent copy number alterations (CNAs) play an important role in cancer genesis. While a number of computational methods have been proposed for identifying such CNAs, their relative merits remain largely unknown in practice since very few efforts have been focused on comparative analysis of the methods. To facilitate studies of recurrent CNA identification in cancer genome, it is imperative to conduct a comprehensive comparison of performance and limitations among existing methods. In this paper, six representative methods proposed in the latest six years are compared. These include one-stage and two-stage approaches, working with raw intensity ratio data and discretized data respectively. They are based on various techniques such as kernel regression, correlation matrix diagonal segmentation, semi-parametric permutation and cyclic permutation schemes. We explore multiple criteria including type I error rate, detection power, Receiver Operating Characteristics (ROC) curve and the area under curve (AUC), and computational complexity, to evaluate performance of the methods under multiple simulation scenarios. We also characterize their abilities on applications to two real datasets obtained from cancers with lung adenocarcinoma and glioblastoma. This comparison study reveals general characteristics of the existing methods for identifying recurrent CNAs, and further provides new insights into their strengths and weaknesses. It is believed helpful to accelerate the development of novel and improved methods.
Collapse
|
7
|
Abstract
Since the emergence of high-throughput genome sequencing platforms and more recently the next-generation platforms, the genome databases are growing at an astronomical rate. Tremendous efforts have been invested in recent years in understanding intriguing complexities beneath the vast ocean of genomic data. This is apparent in the spurt of computational methods for interpreting these data in the past few years. Genomic data interpretation is notoriously difficult, partly owing to the inherent heterogeneities appearing at different scales. Methods developed to interpret these data often suffer from their inability to adequately measure the underlying heterogeneities and thus lead to confounding results. Here, we present an information entropy-based approach that unravels the distinctive patterns underlying genomic data efficiently and thus is applicable in addressing a variety of biological problems. We show the robustness and consistency of the proposed methodology in addressing three different biological problems of significance—identification of alien DNAs in bacterial genomes, detection of structural variants in cancer cell lines and alignment-free genome comparison.
Collapse
Affiliation(s)
- Rajeev K Azad
- Department of Biological Sciences, University of Pittsburgh, Pittsburgh, PA 15260, USA.
| | | |
Collapse
|
8
|
Hsu FH, Chen HIH, Tsai MH, Lai LC, Huang CC, Tu SH, Chuang EY, Chen Y. A model-based circular binary segmentation algorithm for the analysis of array CGH data. BMC Res Notes 2011; 4:394. [PMID: 21985277 PMCID: PMC3224564 DOI: 10.1186/1756-0500-4-394] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/02/2011] [Accepted: 10/10/2011] [Indexed: 12/22/2022] Open
Abstract
Background Circular Binary Segmentation (CBS) is a permutation-based algorithm for array Comparative Genomic Hybridization (aCGH) data analysis. CBS accurately segments data by detecting change-points using a maximal-t test; but extensive computational burden is involved for evaluating the significance of change-points using permutations. A recent implementation utilizing a hybrid method and early stopping rules (hybrid CBS) to improve the performance in speed was subsequently proposed. However, a time analysis revealed that a major portion of computation time of the hybrid CBS was still spent on permutation. In addition, what the hybrid method provides is an approximation of the significance upper bound or lower bound, not an approximation of the significance of change-points itself. Results We developed a novel model-based algorithm, extreme-value based CBS (eCBS), which limits permutations and provides robust results without loss of accuracy. Thousands of aCGH data under null hypothesis were simulated in advance based on a variety of non-normal assumptions, and the corresponding maximal-t distribution was modeled by the Generalized Extreme Value (GEV) distribution. The modeling results, which associate characteristics of aCGH data to the GEV parameters, constitute lookup tables (eXtreme model). Using the eXtreme model, the significance of change-points could be evaluated in a constant time complexity through a table lookup process. Conclusions A novel algorithm, eCBS, was developed in this study. The current implementation of eCBS consistently outperforms the hybrid CBS 4× to 20× in computation time without loss of accuracy. Source codes, supplementary materials, supplementary figures, and supplementary tables can be found at http://ntumaps.cgm.ntu.edu.tw/eCBSsupplementary.
Collapse
Affiliation(s)
- Fang-Han Hsu
- Graduate Institute of Biomedical Electronics and Bioinformatics, Department of Electrical Engineering, National Taiwan University, Taipei 106, Taiwan.
| | | | | | | | | | | | | | | |
Collapse
|
9
|
Morganella S, Pagnotta SM, Ceccarelli M. Finding recurrent copy number alterations preserving within-sample homogeneity. ACTA ACUST UNITED AC 2011; 27:2949-56. [PMID: 21873327 DOI: 10.1093/bioinformatics/btr488] [Citation(s) in RCA: 22] [Impact Index Per Article: 1.6] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/06/2023]
Abstract
MOTIVATION Copy number alterations (CNAs) represent an important component of genetic variation and play a significant role in many human diseases. Development of array comparative genomic hybridization (aCGH) technology has made it possible to identify CNAs. Identification of recurrent CNAs represents the first fundamental step to provide a list of genomic regions which form the basis for further biological investigations. The main problem in recurrent CNAs discovery is related to the need to distinguish between functional changes and random events without pathological relevance. Within-sample homogeneity represents a common feature of copy number profile in cancer, so it can be used as additional source of information to increase the accuracy of the results. Although several algorithms aimed at the identification of recurrent CNAs have been proposed, no attempt of a comprehensive comparison of different approaches has yet been published. RESULTS We propose a new approach, called Genomic Analysis of Important Alterations (GAIA), to find recurrent CNAs where a statistical hypothesis framework is extended to take into account within-sample homogeneity. Statistical significance and within-sample homogeneity are combined into an iterative procedure to extract the regions that likely are involved in functional changes. Results show that GAIA represents a valid alternative to other proposed approaches. In addition, we perform an accurate comparison by using two real aCGH datasets and a carefully planned simulation study. AVAILABILITY GAIA has been implemented as R/Bioconductor package. It can be downloaded from the following page http://bioinformatics.biogem.it/download/gaia. CONTACT ceccarelli@unisannio.it; morganella@unisannio.it. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
- Sandro Morganella
- Department of Science, University of Sannio, 82100, Benevento, Italy.
| | | | | |
Collapse
|
10
|
Liu Y, Lee YF, Ng MK. SNP and gene networks construction and analysis from classification of copy number variations data. BMC Bioinformatics 2011; 12 Suppl 5:S4. [PMID: 21989070 PMCID: PMC3226254 DOI: 10.1186/1471-2105-12-s5-s4] [Citation(s) in RCA: 12] [Impact Index Per Article: 0.9] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 02/01/2023] Open
Abstract
BACKGROUND Detection of genomic DNA copy number variations (CNVs) can provide a complete and more comprehensive view of human disease. It is interesting to identify and represent relevant CNVs from a genome-wide data due to high data volume and the complexity of interactions. RESULTS In this paper, we incorporate the DNA copy number variation data derived from SNP arrays into a computational shrunken model and formalize the detection of copy number variations as a case-control classification problem. More than 80% accuracy can be obtained using our classification model and by shrinkage, the number of relevant CNVs to disease can be determined. In order to understand relevant CNVs, we study their corresponding SNPs in the genome and a statistical software PLINK is employed to compute the pair-wise SNP-SNP interactions, and identify SNP networks based on their P-values. Our selected SNP networks are statistically significant compared with random SNP networks and play a role in the biological process. For the unique genes that those SNPs are located in, a gene-gene similarity value is computed using GOSemSim and gene pairs that have similarity values being greater than a threshold are selected to construct gene networks. A gene enrichment analysis show that our gene networks are functionally important.Experimental results demonstrate that our selected SNP and gene networks based on the selected CNVs contain some functional relationships directly or indirectly to disease study. CONCLUSIONS Two datasets are given to demonstrate the effectiveness of the introduced method. Some statistical and biological analysis show that this shrunken classification model is effective in identifying CNVs from genome-wide data and our proposed framework has a potential to become a useful analysis tool for SNP data sets.
Collapse
Affiliation(s)
- Yang Liu
- Department of Mathematics, Hong Kong Baptist University, Kowloon Tong, Hong Kong
| | | | | |
Collapse
|
11
|
Data-driven approach to detect common copy-number variations and frequency profiles in a population-based Korean cohort. Eur J Hum Genet 2011; 19:1167-72. [PMID: 21731056 DOI: 10.1038/ejhg.2011.103] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/26/2022] Open
Abstract
To date, hundreds of thousands of copy-number variation (CNV) data have been reported using various platforms. The proportion of Asians in these data is, however, relatively small as compared with that of other ethnic groups, such as Caucasians and Yorubas. Because of limitations in platform resolution and the high noise level in signal intensity, in most CNV studies (particularly those using single nucleotide polymorphism arrays), the average number of CNVs in an individual is less than the number of known CNVs. In this study, we ascertained reliable, common CNV regions (CNVRs) and identified actual frequency rates in the Korean population to provide more CNV information. We performed two-stage analyses for detecting structural variations with two platforms. We discovered 576 common CNVRs (88 CNV segments on average in an individual), and 87% (501 of 576) of these CNVRs overlapped by ≥1 bp with previously validated CNV events. Interestingly, from the frequency analysis of CNV profiles, 52 of 576 CNVRs had a frequency rate of <1% in the 8842 individuals. Compared with other common CNV studies, this study found six common CNVRs that were not reported in previous CNV studies. In conclusion, we propose the data-driven detection approach to discover common CNVRs including those of unreported in the previous Korean CNV study while minimizing false positives. Through our approach, we successfully discovered more common CNVRs than previous Korean CNV study and conducted frequency analysis. These results will be a valuable resource for the effective level of CNVs in the Korean population.
Collapse
|
12
|
Picard F, Lebarbier E, Hoebeke M, Rigaill G, Thiam B, Robin S. Joint segmentation, calling, and normalization of multiple CGH profiles. Biostatistics 2011; 12:413-28. [PMID: 21209153 DOI: 10.1093/biostatistics/kxq076] [Citation(s) in RCA: 76] [Impact Index Per Article: 5.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/03/2023] Open
Abstract
The statistical analysis of array comparative genomic hybridization (CGH) data has now shifted to the joint assessment of copy number variations at the cohort level. Considering multiple profiles gives the opportunity to correct for systematic biases observed on single profiles, such as probe GC content or the so-called "wave effect." In this article, we extend the segmentation model developed in the univariate case to the joint analysis of multiple CGH profiles. Our contribution is multiple: we propose an integrated model to perform joint segmentation, normalization, and calling for multiple array CGH profiles. This model shows great flexibility, especially in the modeling of the wave effect that gives a likelihood framework to approaches proposed by others. We propose a new dynamic programming algorithm for break point positioning, as well as a model selection criterion based on a modified bayesian information criterion proposed in the univariate case. The performance of our method is assessed using simulated and real data sets. Our method is implemented in the R package cghseg.
Collapse
Affiliation(s)
- Franck Picard
- Laboratoire de Biometrie et Biologie Evolutive, UMR CNRS 5558 - Univ. Lyon 1, F-69622, Villeurbanne, France.
| | | | | | | | | | | |
Collapse
|
13
|
Zhang ZD, Gerstein MB. Detection of copy number variation from array intensity and sequencing read depth using a stepwise Bayesian model. BMC Bioinformatics 2010; 11:539. [PMID: 21034510 PMCID: PMC2992546 DOI: 10.1186/1471-2105-11-539] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/17/2010] [Accepted: 10/31/2010] [Indexed: 11/17/2022] Open
Abstract
Background Copy number variants (CNVs) have been demonstrated to occur at a high frequency and are now widely believed to make a significant contribution to the phenotypic variation in human populations. Array-based comparative genomic hybridization (array-CGH) and newly developed read-depth approach through ultrahigh throughput genomic sequencing both provide rapid, robust, and comprehensive methods to identify CNVs on a whole-genome scale. Results We developed a Bayesian statistical analysis algorithm for the detection of CNVs from both types of genomic data. The algorithm can analyze such data obtained from PCR-based bacterial artificial chromosome arrays, high-density oligonucleotide arrays, and more recently developed high-throughput DNA sequencing. Treating parameters--e.g., the number of CNVs, the position of each CNV, and the data noise level--that define the underlying data generating process as random variables, our approach derives the posterior distribution of the genomic CNV structure given the observed data. Sampling from the posterior distribution using a Markov chain Monte Carlo method, we get not only best estimates for these unknown parameters but also Bayesian credible intervals for the estimates. We illustrate the characteristics of our algorithm by applying it to both synthetic and experimental data sets in comparison to other segmentation algorithms. Conclusions In particular, the synthetic data comparison shows that our method is more sensitive than other approaches at low false positive rates. Furthermore, given its Bayesian origin, our method can also be seen as a technique to refine CNVs identified by fast point-estimate methods and also as a framework to integrate array-CGH and sequencing data with other CNV-related biological knowledge, all through informative priors.
Collapse
Affiliation(s)
- Zhengdong D Zhang
- Department of Genetics, Albert Einstein College of Medicine, Bronx, NY 10461, USA.
| | | |
Collapse
|