1
|
Alshawaqfeh M, Al Kawam A, Serpedin E, Datta A. Robust Recurrent CNV Detection in the Presence of Inter-Subject Variability. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2020; 17:1056-1067. [PMID: 30387737 DOI: 10.1109/tcbb.2018.2878560] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/08/2023]
Abstract
The study of recurrent copy number variations (CNVs) plays an important role in understanding the onset and evolution of complex diseases such as cancer. Array-based comparative genomic hybridization (aCGH) is a widely used microarray based technology for identifying CNVs. However, due to high noise levels and inter-sample variability, detecting recurrent CNVs from aCGH data remains a challenging topic. This paper proposes a novel method for identification of the recurrent CNVs. In the proposed method, the noisy aCGH data is modeled as the superposition of three matrices: a full-rank matrix of weighted piece-wise generating signals accounting for the clean aCGH data, a Gaussian noise matrix to model the inherent experimentation errors and other sources of error, and a sparse matrix to capture the sparse inter-sample (sample-specific) variations. We demonstrated the ability of our method to separate accurately recurrent CNVs from sample-specific variations and noise in both simulated (artificial) data and real data. The proposed method produced more accurate results than current state-of-the-art methods used in recurrent CNV detection and exhibited robustness to noise and sample-specific variations.
Collapse
|
2
|
Mabire C, Duarte J, Darracq A, Pirani A, Rimbert H, Madur D, Combes V, Vitte C, Praud S, Rivière N, Joets J, Pichon JP, Nicolas SD. High throughput genotyping of structural variations in a complex plant genome using an original Affymetrix® axiom® array. BMC Genomics 2019; 20:848. [PMID: 31722668 PMCID: PMC6854671 DOI: 10.1186/s12864-019-6136-9] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/01/2019] [Accepted: 09/23/2019] [Indexed: 12/19/2022] Open
Abstract
Background Insertions/deletions (InDels) and more specifically presence/absence variations (PAVs) are pervasive in several species and have strong functional and phenotypic effect by removing or drastically modifying genes. Genotyping of such variants on large panels remains poorly addressed, while necessary for approaches such as association mapping or genomic selection. Results We have developed, as a proof of concept, a new high-throughput and affordable approach to genotype InDels. We first identified 141,000 InDels by aligning reads from the B73 line against the genome of three temperate maize inbred lines (F2, PH207, and C103) and reciprocally. Next, we designed an Affymetrix® Axiom® array to target these InDels, with a combination of probes selected at breakpoint sites (13%) or within the InDel sequence, either at polymorphic (25%) or non-polymorphic sites (63%) sites. The final array design is composed of 662,772 probes and targets 105,927 InDels, including PAVs ranging from 35 bp to 129kbp. After Affymetrix® quality control, we successfully genotyped 86,648 polymorphic InDels (82% of all InDels interrogated by the array) on 445 maize DNA samples with 422,369 probes. Genotyping InDels using this approach produced a highly reliable dataset, with low genotyping error (~ 3%), high call rate (~ 98%), and high reproducibility (> 95%). This reliability can be further increased by combining genotyping of several probes calling the same InDels (< 0.1% error rate and > 99.9% of call rate for 5 probes). This “proof of concept” tool was used to estimate the kinship matrix between 362 maize lines with 57,824 polymorphic InDels. This InDels kinship matrix was highly correlated with kinship estimated using SNPs from Illumina 50 K SNP arrays. Conclusions We efficiently genotyped thousands of small to large InDels on a sizeable number of individuals using a new Affymetrix® Axiom® array. This powerful approach opens the way to studying the contribution of InDels to trait variation and heterosis in maize. The approach is easily extendable to other species and should contribute to decipher the biological impact of InDels at a larger scale.
Collapse
Affiliation(s)
- Clément Mabire
- GQE - Le Moulon, INRA, Univ. Paris-Sud, CNRS, AgroParisTech, Université Paris-Saclay, 91190, Gif-sur-Yvette, France
| | - Jorge Duarte
- Biogemma - Centre de Recherche de Chappes, CS 90126, 63720, Chappes, France
| | - Aude Darracq
- GQE - Le Moulon, INRA, Univ. Paris-Sud, CNRS, AgroParisTech, Université Paris-Saclay, 91190, Gif-sur-Yvette, France
| | - Ali Pirani
- Thermo Fisher Scientific, 3450 Central Expressway, Santa Clara, CA, 95051, USA
| | - Hélène Rimbert
- Biogemma - Centre de Recherche de Chappes, CS 90126, 63720, Chappes, France.,Present address: GDEC, INRA, Université Clermont Auvergne, 63000, Clermont-Ferrand, France
| | - Delphine Madur
- GQE - Le Moulon, INRA, Univ. Paris-Sud, CNRS, AgroParisTech, Université Paris-Saclay, 91190, Gif-sur-Yvette, France
| | - Valérie Combes
- GQE - Le Moulon, INRA, Univ. Paris-Sud, CNRS, AgroParisTech, Université Paris-Saclay, 91190, Gif-sur-Yvette, France
| | - Clémentine Vitte
- GQE - Le Moulon, INRA, Univ. Paris-Sud, CNRS, AgroParisTech, Université Paris-Saclay, 91190, Gif-sur-Yvette, France
| | - Sébastien Praud
- Biogemma - Centre de Recherche de Chappes, CS 90126, 63720, Chappes, France
| | - Nathalie Rivière
- Biogemma - Centre de Recherche de Chappes, CS 90126, 63720, Chappes, France
| | - Johann Joets
- GQE - Le Moulon, INRA, Univ. Paris-Sud, CNRS, AgroParisTech, Université Paris-Saclay, 91190, Gif-sur-Yvette, France
| | | | - Stéphane D Nicolas
- GQE - Le Moulon, INRA, Univ. Paris-Sud, CNRS, AgroParisTech, Université Paris-Saclay, 91190, Gif-sur-Yvette, France.
| |
Collapse
|
3
|
Girimurugan SB, Liu Y, Lung PY, Vera DL, Dennis JH, Bass HW, Zhang J. iSeg: an efficient algorithm for segmentation of genomic and epigenomic data. BMC Bioinformatics 2018; 19:131. [PMID: 29642840 PMCID: PMC5896135 DOI: 10.1186/s12859-018-2140-3] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.1] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/05/2017] [Accepted: 03/26/2018] [Indexed: 11/16/2022] Open
Abstract
Background Identification of functional elements of a genome often requires dividing a sequence of measurements along a genome into segments where adjacent segments have different properties, such as different mean values. Despite dozens of algorithms developed to address this problem in genomics research, methods with improved accuracy and speed are still needed to effectively tackle both existing and emerging genomic and epigenomic segmentation problems. Results We designed an efficient algorithm, called iSeg, for segmentation of genomic and epigenomic profiles. iSeg first utilizes dynamic programming to identify candidate segments and test for significance. It then uses a novel data structure based on two coupled balanced binary trees to detect overlapping significant segments and update them simultaneously during searching and refinement stages. Refinement and merging of significant segments are performed at the end to generate the final set of segments. By using an objective function based on the p-values of the segments, the algorithm can serve as a general computational framework to be combined with different assumptions on the distributions of the data. As a general segmentation method, it can segment different types of genomic and epigenomic data, such as DNA copy number variation, nucleosome occupancy, nuclease sensitivity, and differential nuclease sensitivity data. Using simple t-tests to compute p-values across multiple datasets of different types, we evaluate iSeg using both simulated and experimental datasets and show that it performs satisfactorily when compared with some other popular methods, which often employ more sophisticated statistical models. Implemented in C++, iSeg is also very computationally efficient, well suited for large numbers of input profiles and data with very long sequences. Conclusions We have developed an efficient general-purpose segmentation tool and showed that it had comparable or more accurate results than many of the most popular segment-calling algorithms used in contemporary genomic data analysis. iSeg is capable of analyzing datasets that have both positive and negative values. Tunable parameters allow users to readily adjust the statistical stringency to best match the biological nature of individual datasets, including widely or sparsely mapped genomic datasets or those with non-normal distributions. Electronic supplementary material The online version of this article (10.1186/s12859-018-2140-3) contains supplementary material, which is available to authorized users.
Collapse
Affiliation(s)
| | - Yuhang Liu
- Department of Statistics, Florida State University, Tallahassee, FL, USA
| | - Pei-Yau Lung
- Department of Statistics, Florida State University, Tallahassee, FL, USA
| | - Daniel L Vera
- Center for Genomics and Personalized Medicine, Florida State University, Tallahassee, FL, USA
| | - Jonathan H Dennis
- Department of Biological Science, Florida State University, Tallahassee, FL, USA
| | - Hank W Bass
- Department of Biological Science, Florida State University, Tallahassee, FL, USA
| | - Jinfeng Zhang
- Department of Statistics, Florida State University, Tallahassee, FL, USA.
| |
Collapse
|
4
|
Algama M, Keith JM. Investigating genomic structure using changept: A Bayesian segmentation model. Comput Struct Biotechnol J 2014; 10:107-15. [PMID: 25349679 PMCID: PMC4204429 DOI: 10.1016/j.csbj.2014.08.003] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/06/2023] Open
Abstract
Genomes are composed of a wide variety of elements with distinct roles and characteristics. Some of these elements are well-characterised functional components such as protein-coding exons. Other elements play regulatory or structural roles, encode functional non-protein-coding RNAs, or perform some other function yet to be characterised. Still others may have no functional importance, though they may nevertheless be of interest to biologists. One technique for investigating the composition of genomes is to segment sequences into compositionally homogenous blocks. This technique, known as 'sequence segmentation' or 'change-point analysis', is used to identify patterns of variation across genomes such as GC-rich and GC-poor regions, coding and non-coding regions, slowly evolving and rapidly evolving regions and many other types of variation. In this mini-review we outline many of the genome segmentation methods currently available and then focus on a Bayesian DNA segmentation algorithm, with examples of its various applications.
Collapse
Affiliation(s)
- Manjula Algama
- School of Mathematical Sciences, Monash University, Clayton, VIC 3800, Australia
| | - Jonathan M Keith
- School of Mathematical Sciences, Monash University, Clayton, VIC 3800, Australia
| |
Collapse
|
5
|
Zhou X, Liu J, Wan X, Yu W. Piecewise-constant and low-rank approximation for identification of recurrent copy number variations. Bioinformatics 2014; 30:1943-9. [PMID: 24642062 DOI: 10.1093/bioinformatics/btu131] [Citation(s) in RCA: 27] [Impact Index Per Article: 2.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/23/2022] Open
Abstract
MOTIVATION The post-genome era sees urgent need for more novel approaches to extracting useful information from the huge amount of genetic data. The identification of recurrent copy number variations (CNVs) from array-based comparative genomic hybridization (aCGH) data can help understand complex diseases, such as cancer. Most of the previous computational methods focused on single-sample analysis or statistical testing based on the results of single-sample analysis. Finding recurrent CNVs from multi-sample data remains a challenging topic worth further study. RESULTS We present a general and robust method to identify recurrent CNVs from multi-sample aCGH profiles. We express the raw dataset as a matrix and demonstrate that recurrent CNVs will form a low-rank matrix. Hence, we formulate the problem as a matrix recovering problem, where we aim to find a piecewise-constant and low-rank approximation (PLA) to the input matrix. We propose a convex formulation for matrix recovery and an efficient algorithm to globally solve the problem. We demonstrate the advantages of PLA compared with alternative methods using synthesized datasets and two breast cancer datasets. The experimental results show that PLA can successfully reconstruct the recurrent CNV patterns from raw data and achieve better performance compared with alternative methods under a wide range of scenarios. AVAILABILITY AND IMPLEMENTATION The MATLAB code is available at http://bioinformatics.ust.hk/pla.zip.
Collapse
Affiliation(s)
- Xiaowei Zhou
- Department of Electronic and Computer Engineering, The Hong Kong University of Science and Technology, Clear Water Bay, Kowloon and Department of Computer Science and Institute of Theoretical and Computational Study, Hong Kong Baptist University, Kowloon Tong, Hong Kong, China
| | - Jiming Liu
- Department of Electronic and Computer Engineering, The Hong Kong University of Science and Technology, Clear Water Bay, Kowloon and Department of Computer Science and Institute of Theoretical and Computational Study, Hong Kong Baptist University, Kowloon Tong, Hong Kong, China
| | - Xiang Wan
- Department of Electronic and Computer Engineering, The Hong Kong University of Science and Technology, Clear Water Bay, Kowloon and Department of Computer Science and Institute of Theoretical and Computational Study, Hong Kong Baptist University, Kowloon Tong, Hong Kong, China
| | - Weichuan Yu
- Department of Electronic and Computer Engineering, The Hong Kong University of Science and Technology, Clear Water Bay, Kowloon and Department of Computer Science and Institute of Theoretical and Computational Study, Hong Kong Baptist University, Kowloon Tong, Hong Kong, China
| |
Collapse
|
6
|
Plummer PJ, Chen J. A Bayesian approach for locating change points in a compound Poisson process with application to detecting DNA copy number variations. J Appl Stat 2013. [DOI: 10.1080/02664763.2013.840272] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/26/2022]
|
7
|
Zhou X, Yang C, Wan X, Zhao H, Yu W. Multisample aCGH data analysis via total variation and spectral regularization. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2013; 10:230-235. [PMID: 23702561 PMCID: PMC3715577 DOI: 10.1109/tcbb.2012.166] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.2] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/02/2023]
Abstract
DNA copy number variation (CNV) accounts for a large proportion of genetic variation. One commonly used approach to detecting CNVs is array-based comparative genomic hybridization (aCGH). Although many methods have been proposed to analyze aCGH data, it is not clear how to combine information from multiple samples to improve CNV detection. In this paper, we propose to use a matrix to approximate the multisample aCGH data and minimize the total variation of each sample as well as the nuclear norm of the whole matrix. In this way, we can make use of the smoothness property of each sample and the correlation among multiple samples simultaneously in a convex optimization framework. We also developed an efficient and scalable algorithm to handle large-scale data. Experiments demonstrate that the proposed method outperforms the state-of-the-art techniques under a wide range of scenarios and it is capable of processing large data sets with millions of probes.
Collapse
Affiliation(s)
- Xiaowei Zhou
- Department of Electronic and Computer Engineering, Hong Kong University of Science and Technology, Hong Kong, China.
| | | | | | | | | |
Collapse
|
8
|
Rueda OM, Rueda C, Diaz-Uriarte R. A Bayesian HMM with random effects and an unknown number of states for DNA copy number analysis. J STAT COMPUT SIM 2013. [DOI: 10.1080/00949655.2011.609818] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/17/2022]
|
9
|
Seifert M, Gohr A, Strickert M, Grosse I. Parsimonious higher-order hidden Markov models for improved array-CGH analysis with applications to Arabidopsis thaliana. PLoS Comput Biol 2012; 8:e1002286. [PMID: 22253580 PMCID: PMC3257270 DOI: 10.1371/journal.pcbi.1002286] [Citation(s) in RCA: 18] [Impact Index Per Article: 1.4] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/05/2011] [Accepted: 10/11/2011] [Indexed: 12/19/2022] Open
Abstract
Array-based comparative genomic hybridization (Array-CGH) is an important technology in molecular biology for the detection of DNA copy number polymorphisms between closely related genomes. Hidden Markov Models (HMMs) are popular tools for the analysis of Array-CGH data, but current methods are only based on first-order HMMs having constrained abilities to model spatial dependencies between measurements of closely adjacent chromosomal regions. Here, we develop parsimonious higher-order HMMs enabling the interpolation between a mixture model ignoring spatial dependencies and a higher-order HMM exhaustively modeling spatial dependencies. We apply parsimonious higher-order HMMs to the analysis of Array-CGH data of the accessions C24 and Col-0 of the model plant Arabidopsis thaliana. We compare these models against first-order HMMs and other existing methods using a reference of known deletions and sequence deviations. We find that parsimonious higher-order HMMs clearly improve the identification of these polymorphisms. Moreover, we perform a functional analysis of identified polymorphisms revealing novel details of genomic differences between C24 and Col-0. Additional model evaluations are done on widely considered Array-CGH data of human cell lines indicating that parsimonious HMMs are also well-suited for the analysis of non-plant specific data. All these results indicate that parsimonious higher-order HMMs are useful for Array-CGH analyses. An implementation of parsimonious higher-order HMMs is available as part of the open source Java library Jstacs (www.jstacs.de/index.php/PHHMM). Array-based comparative genomics is a standard approach for the identification of DNA copy number polymorphisms between closely related genomes. The huge amounts of data produced by these experiments require efficient and accurate bioinformatics tools for the identification of copy number polymorphisms. Hidden Markov Models (HMMs) are frequently used for analyzing such data sets, but current models are based on first-order HMMs only having limited capabilities to model spatial dependencies between measurements of closely adjacent chromosomal regions. We develop parsimonious higher-order HMMs enabling the interpolation between a mixture model ignoring spatial dependencies and a higher-order HMM exhaustively modeling these dependencies to overcome this limitation. In an in-depth case study with Arabidopsis thaliana, we find that parsimonious higher-order HMMs clearly improve the identification of copy number polymorphisms in comparison to standard first-order HMMs and other frequently used methods. Functional analysis of identified polymorphisms revealed details of genomic differences between the accessions C24 and Col-0 of Arabidopsis thaliana. An additional study on human cell lines further indicates that parsimonious HMMs are well-suited for the analysis of Array-CGH data.
Collapse
Affiliation(s)
- Michael Seifert
- Department of Molecular Genetics, Leibniz Institute of Plant Genetics and Crop Plant Research (IPK), Gatersleben, Germany.
| | | | | | | |
Collapse
|
10
|
Nowak G, Hastie T, Pollack JR, Tibshirani R. A fused lasso latent feature model for analyzing multi-sample aCGH data. Biostatistics 2011; 12:776-91. [PMID: 21642389 DOI: 10.1093/biostatistics/kxr012] [Citation(s) in RCA: 43] [Impact Index Per Article: 3.1] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/13/2022] Open
Abstract
Array-based comparative genomic hybridization (aCGH) enables the measurement of DNA copy number across thousands of locations in a genome. The main goals of analyzing aCGH data are to identify the regions of copy number variation (CNV) and to quantify the amount of CNV. Although there are many methods for analyzing single-sample aCGH data, the analysis of multi-sample aCGH data is a relatively new area of research. Further, many of the current approaches for analyzing multi-sample aCGH data do not appropriately utilize the additional information present in the multiple samples. We propose a procedure called the Fused Lasso Latent Feature Model (FLLat) that provides a statistical framework for modeling multi-sample aCGH data and identifying regions of CNV. The procedure involves modeling each sample of aCGH data as a weighted sum of a fixed number of features. Regions of CNV are then identified through an application of the fused lasso penalty to each feature. Some simulation analyses show that FLLat outperforms single-sample methods when the simulated samples share common information. We also propose a method for estimating the false discovery rate. An analysis of an aCGH data set obtained from human breast tumors, focusing on chromosomes 8 and 17, shows that FLLat and Significance Testing of Aberrant Copy number (an alternative, existing approach) identify similar regions of CNV that are consistent with previous findings. However, through the estimated features and their corresponding weights, FLLat is further able to discern specific relationships between the samples, for example, identifying 3 distinct groups of samples based on their patterns of CNV for chromosome 17.
Collapse
Affiliation(s)
- Gen Nowak
- Department of Biostatistics, Harvard University, Boston, MA 02115, USA.
| | | | | | | |
Collapse
|
11
|
Stjernqvist S, Rydén T, Greenman CD. Model-integrated estimation of normal tissue contamination for cancer SNP allelic copy number data. Cancer Inform 2011; 10:159-73. [PMID: 21695067 PMCID: PMC3118450 DOI: 10.4137/cin.s6873] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/12/2023] Open
Abstract
SNP allelic copy number data provides intensity measurements for the two different alleles separately. We present a method that estimates the number of copies of each allele at each SNP position, using a continuous-index hidden Markov model. The method is especially suited for cancer data, since it includes the fraction of normal tissue contamination, often present when studying data from cancer tumors, into the model. The continuous-index structure takes into account the distances between the SNPs, and is thereby appropriate also when SNPs are unequally spaced. In a simulation study we show that the method performs favorably compared to previous methods even with as much as 70% normal contamination. We also provide results from applications to clinical data produced using the Affymetrix genome-wide SNP 6.0 platform.
Collapse
Affiliation(s)
- Susann Stjernqvist
- Centre for Mathematical Sciences, Lund University, Box 118, 221 00 Lund, Sweden, Department of Mathematics, Royal Institute of Technology, 100 44 Stockholm, Sweden
| | | | | |
Collapse
|
12
|
Picard F, Lebarbier E, Hoebeke M, Rigaill G, Thiam B, Robin S. Joint segmentation, calling, and normalization of multiple CGH profiles. Biostatistics 2011; 12:413-28. [PMID: 21209153 DOI: 10.1093/biostatistics/kxq076] [Citation(s) in RCA: 76] [Impact Index Per Article: 5.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/03/2023] Open
Abstract
The statistical analysis of array comparative genomic hybridization (CGH) data has now shifted to the joint assessment of copy number variations at the cohort level. Considering multiple profiles gives the opportunity to correct for systematic biases observed on single profiles, such as probe GC content or the so-called "wave effect." In this article, we extend the segmentation model developed in the univariate case to the joint analysis of multiple CGH profiles. Our contribution is multiple: we propose an integrated model to perform joint segmentation, normalization, and calling for multiple array CGH profiles. This model shows great flexibility, especially in the modeling of the wave effect that gives a likelihood framework to approaches proposed by others. We propose a new dynamic programming algorithm for break point positioning, as well as a model selection criterion based on a modified bayesian information criterion proposed in the univariate case. The performance of our method is assessed using simulated and real data sets. Our method is implemented in the R package cghseg.
Collapse
Affiliation(s)
- Franck Picard
- Laboratoire de Biometrie et Biologie Evolutive, UMR CNRS 5558 - Univ. Lyon 1, F-69622, Villeurbanne, France.
| | | | | | | | | | | |
Collapse
|
13
|
Yau C, Papaspiliopoulos O, Roberts GO, Holmes C. Bayesian Nonparametric Hidden Markov Models with application to the analysis of copy-number-variation in mammalian genomes. J R Stat Soc Series B Stat Methodol 2011; 73:37-57. [PMID: 21687778 PMCID: PMC3116623 DOI: 10.1111/j.1467-9868.2010.00756.x] [Citation(s) in RCA: 69] [Impact Index Per Article: 4.9] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Download PDF] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/29/2022]
Abstract
We consider the development of Bayesian Nonparametric methods for product partition models such as Hidden Markov Models and change point models. Our approach uses a Mixture of Dirichlet Process (MDP) model for the unknown sampling distribution (likelihood) for the observations arising in each state and a computationally efficient data augmentation scheme to aid inference. The method uses novel MCMC methodology which combines recent retrospective sampling methods with the use of slice sampler variables. The methodology is computationally efficient, both in terms of MCMC mixing properties, and robustness to the length of the time series being investigated. Moreover, the method is easy to implement requiring little or no user-interaction. We apply our methodology to the analysis of genomic copy number variation.
Collapse
Affiliation(s)
- C. Yau
- Department of Statistics and the Oxford-Man Institute for Quantitative Finance, University of Oxford, ,
| | | | | | - C. Holmes
- Department of Statistics and the Oxford-Man Institute for Quantitative Finance, University of Oxford, ,
| |
Collapse
|
14
|
Mitra R, Gupta M. A continuous-index Bayesian hidden Markov model for prediction of nucleosome positioning in genomic DNA. Biostatistics 2010; 12:462-77. [PMID: 21193724 DOI: 10.1093/biostatistics/kxq077] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.2] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/13/2022] Open
Abstract
Nucleosomes are units of chromatin structure, consisting of DNA sequence wrapped around proteins called "histones." Nucleosomes occur at variable intervals throughout genomic DNA and prevent transcription factor (TF) binding by blocking TF access to the DNA. A map of nucleosomal locations would enable researchers to detect TF binding sites with greater efficiency. Our objective is to construct an accurate genomic map of nucleosome-free regions (NFRs) based on data from high-throughput genomic tiling arrays in yeast. These high-volume data typically have a complex structure in the form of dependence on neighboring probes as well as underlying DNA sequence, variable-sized gaps, and missing data. We propose a novel continuous-index model appropriate for non-equispaced tiling array data that simultaneously incorporates DNA sequence features relevant to nucleosome formation. Simulation studies and an application to a yeast nucleosomal assay demonstrate the advantages of using the new modeling framework, as well as its robustness to distributional misspecifications. Our results reinforce the previous biological hypothesis that higher-order nucleotide combinations are important in distinguishing nucleosomal regions from NFRs.
Collapse
Affiliation(s)
- Ritendranath Mitra
- Department of Biostatistics, The University of Texas M. D. Anderson Cancer Center, Houston, TX 77230, USA
| | | |
Collapse
|
15
|
Coe BP, Chari R, MacAulay C, Lam WL. FACADE: a fast and sensitive algorithm for the segmentation and calling of high resolution array CGH data. Nucleic Acids Res 2010; 38:e157. [PMID: 20551132 PMCID: PMC2926625 DOI: 10.1093/nar/gkq548] [Citation(s) in RCA: 13] [Impact Index Per Article: 0.9] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/27/2023] Open
Abstract
The availability of high resolution array comparative genomic hybridization (CGH) platforms has led to increasing complexities in data analysis. Specifically, defining contiguous regions of alterations or segmentation can be computationally intensive and popular algorithms can take hours to days for the processing of arrays comprised of hundreds of thousands to millions of elements. Additionally, tumors tend to demonstrate subtle copy number alterations due to heterogeneity, ploidy and hybridization effects. Thus, there is a need for fast, sensitive array CGH segmentation and alteration calling algorithms. Here, we describe Fast Algorithm for Calling After Detection of Edges (FACADE), a highly sensitive and easy to use algorithm designed to rapidly segment and call high resolution array data.
Collapse
Affiliation(s)
- Bradley P Coe
- Department of Integrative Oncology, British Columbia Cancer Research Centre, Vancouver, BC, Canada.
| | | | | | | |
Collapse
|
16
|
Choi H, Qin ZS, Ghosh D. A double-layered mixture model for the joint analysis of DNA copy number and gene expression data. J Comput Biol 2010; 17:121-37. [PMID: 20170400 DOI: 10.1089/cmb.2009.0019] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/13/2022] Open
Abstract
Copy number aberration is a common form of genomic instability in cancer. Gene expression is closely tied to cytogenetic events by the central dogma of molecular biology, and serves as a mediator of copy number changes in disease phenotypes. Accordingly, it is of interest to develop proper statistical methods for jointly analyzing copy number and gene expression data. This work describes a novel Bayesian inferential approach for a double-layered mixture model (DLMM) which directly models the stochastic nature of copy number data and identifies abnormally expressed genes due to aberrant copy number. Simulation studies were conducted to illustrate the robustness of DLMM under various settings of copy number aberration frequency, confounding effects, and signal-to-noise ratio in gene expression data. Analysis of a real breast cancer data shows that DLMM is able to identify expression changes specifically attributable to copy number aberration in tumors and that a sample-specific index built based on the selected genes is correlated with relevant clinical information.
Collapse
Affiliation(s)
- Hyungwon Choi
- Department of Pathology, University of Michigan, Ann Arbor, Michigan, USA
| | | | | |
Collapse
|
17
|
van de Wiel MA, Picard F, van Wieringen WN, Ylstra B. Preprocessing and downstream analysis of microarray DNA copy number profiles. Brief Bioinform 2010; 12:10-21. [PMID: 20172948 DOI: 10.1093/bib/bbq004] [Citation(s) in RCA: 44] [Impact Index Per Article: 2.9] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/14/2022] Open
Abstract
Analysis of DNA copy number profiles requires methods tailored to the specific nature of these data. The number of available data analysis methods has grown enormously in the last 5 years. We discuss the typical characteristics of DNA copy number data, as measured by microarray technology and review the extensive literature on preprocessing methods such as segmentation and calling. Subsequently, the focus narrows to applications of DNA copy number in cancer, in particular, several downstream analyses of multi-sample data sets such as testing, clustering and classification. Finally, we look ahead: what should we prepare for and which methodology-related topics may deserve attention in the near future?
Collapse
Affiliation(s)
- Mark A van de Wiel
- Department of Epidemiology & Biostatistics, VU University Medical Center, Amsterdam, The Netherlands.
| | | | | | | |
Collapse
|
18
|
Kim KY, Lee GY, Kim J, Jeung HC, Chung HC, Rha SY. Identification of significant regional genetic variations using continuous CNV values in aCGH data. Genomics 2009; 94:317-23. [DOI: 10.1016/j.ygeno.2009.08.006] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.2] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/26/2009] [Revised: 07/20/2009] [Accepted: 08/11/2009] [Indexed: 11/26/2022]
|
19
|
Greenman CD, Bignell G, Butler A, Edkins S, Hinton J, Beare D, Swamy S, Santarius T, Chen L, Widaa S, Futreal PA, Stratton MR. PICNIC: an algorithm to predict absolute allelic copy number variation with microarray cancer data. Biostatistics 2009; 11:164-75. [PMID: 19837654 PMCID: PMC2800165 DOI: 10.1093/biostatistics/kxp045] [Citation(s) in RCA: 157] [Impact Index Per Article: 9.8] [Reference Citation Analysis] [Abstract] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 02/03/2023] Open
Abstract
High-throughput oligonucleotide microarrays are commonly employed to investigate genetic disease, including cancer. The algorithms employed to extract genotypes and copy number variation function optimally for diploid genomes usually associated with inherited disease. However, cancer genomes are aneuploid in nature leading to systematic errors when using these techniques. We introduce a preprocessing transformation and hidden Markov model algorithm bespoke to cancer. This produces genotype classification, specification of regions of loss of heterozygosity, and absolute allelic copy number segmentation. Accurate prediction is demonstrated with a combination of independent experimental techniques. These methods are exemplified with affymetrix genome-wide SNP6.0 data from 755 cancer cell lines, enabling inference upon a number of features of biological interest. These data and the coded algorithm are freely available for download.
Collapse
Affiliation(s)
- Chris D Greenman
- Cancer Genome Project, Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SA, UK.
| | | | | | | | | | | | | | | | | | | | | | | |
Collapse
|
20
|
Stjernqvist S, Rydén T. A continuous-index hidden Markov jump process for modeling DNA copy number data. Biostatistics 2009; 10:773-8. [PMID: 19628640 DOI: 10.1093/biostatistics/kxp030] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.1] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/13/2022] Open
Abstract
The number of copies of DNA in human cells can be measured using array comparative genomic hybridization (aCGH), which provides intensity ratios of sample to reference DNA at genomic locations corresponding to probes on a microarray. In the present paper, we devise a statistical model, based on a latent continuous-index Markov jump process, that is aimed to capture certain features of aCGH data, including probes that are unevenly long, unevenly spaced, and overlapping. The model has a continuous state space, with 1 state representing a normal copy number of 2, and the rest of the states being either amplifications or deletions. We adopt a Bayesian approach and apply Markov chain Monte Carlo (MCMC) methods for estimating the parameters and the Markov process. The model can be applied to data from both tiling bacterial artificial chromosome arrays and oligonucleotide arrays. We also compare a model with normal distributed noise to a model with t-distributed noise, showing that the latter is more robust to outliers.
Collapse
Affiliation(s)
- Susann Stjernqvist
- Centre for Mathematical Sciences, Lund University, Box 118, 22100 Lund, Sweden.
| | | |
Collapse
|
21
|
Chen Z, Vijayan S, Barbieri R, Wilson MA, Brown EN. Discrete- and continuous-time probabilistic models and algorithms for inferring neuronal UP and DOWN states. Neural Comput 2009; 21:1797-862. [PMID: 19323637 PMCID: PMC2799196 DOI: 10.1162/neco.2009.06-08-799] [Citation(s) in RCA: 27] [Impact Index Per Article: 1.7] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/04/2022]
Abstract
UP and DOWN states, the periodic fluctuations between increased and decreased spiking activity of a neuronal population, are a fundamental feature of cortical circuits. Understanding UP-DOWN state dynamics is important for understanding how these circuits represent and transmit information in the brain. To date, limited work has been done on characterizing the stochastic properties of UP-DOWN state dynamics. We present a set of Markov and semi-Markov discrete- and continuous-time probability models for estimating UP and DOWN states from multiunit neural spiking activity. We model multiunit neural spiking activity as a stochastic point process, modulated by the hidden (UP and DOWN) states and the ensemble spiking history. We estimate jointly the hidden states and the model parameters by maximum likelihood using an expectation-maximization (EM) algorithm and a Monte Carlo EM algorithm that uses reversible-jump Markov chain Monte Carlo sampling in the E-step. We apply our models and algorithms in the analysis of both simulated multiunit spiking activity and actual multi- unit spiking activity recorded from primary somatosensory cortex in a behaving rat during slow-wave sleep. Our approach provides a statistical characterization of UP-DOWN state dynamics that can serve as a basis for verifying and refining mechanistic descriptions of this process.
Collapse
Affiliation(s)
- Zhe Chen
- Neuroscience Statistics Research Laboratory, Department of Anesthesia and Critical Care, Massachusetts General Hospital, Harvard Medical School, Boston, MA 02114, U.S.A., and Department of Brain and Cognitive Sciences, Massachusetts Institute of Technology, Cambridge, MA 02139, U.S.A
| | - Sujith Vijayan
- Program in Neuroscience, Harvard University, Cambridge, MA 02139, U.S.A., and Picower Institute for Learning and Memory, Massachusetts Institute of Technology, Cambridge, MA 02139, U.S.A
| | - Riccardo Barbieri
- Neuroscience Statistics Research Laboratory, Department of Anesthesia and Critical Care, Massachusetts General Hospital, Harvard Medical School, Boston, MA 02114, U.S.A
| | - Matthew A. Wilson
- Picower Institute for Learning and Memory, RIKEN-MIT Neuroscience Research Center, Department of Brain and Cognitive Sciences and Department of Biology, Massachusetts Institute of Technology, Cambridge, MA 02139, U.S.A
| | - Emery N. Brown
- Neuroscience Statistics Research Laboratory, Department of Anesthesia and Critical Care, Massachusetts General Hospital, Harvard Medical School, Boston, MA 02114, U.S.A., Harvard-MIT Division of Health Sciences and Technology, Cambridge, MA 02139, U.S.A., and Department of Brain and Cognitive Sciences, Massachusetts Institute of Technology, Cambridge, MA 02139, U.S.A
| |
Collapse
|
22
|
Nicolas P, Leduc A, Robin S, Rasmussen S, Jarmer H, Bessières P. Transcriptional landscape estimation from tiling array data using a model of signal shift and drift. ACTA ACUST UNITED AC 2009; 25:2341-7. [PMID: 19561016 PMCID: PMC2735659 DOI: 10.1093/bioinformatics/btp395] [Citation(s) in RCA: 19] [Impact Index Per Article: 1.2] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/13/2022]
Abstract
Motivation: High-density oligonucleotide tiling array technology holds the promise of a better description of the complexity and the dynamics of transcriptional landscapes. In organisms such as bacteria and yeasts, transcription can be measured on a genome-wide scale with a resolution >25 bp. The statistical models currently used to handle these data remain however very simple, the most popular being the piecewise constant Gaussian model with a fixed number of breakpoints. Results: This article describes a new methodology based on a hidden Markov model that embeds the segmentation of a continuous-valued signal in a probabilistic setting. For a computationally affordable cost, this framework (i) alleviates the difficulty of choosing a fixed number of breakpoints, and (ii) permits retrieving more information than a unique segmentation by giving access to the whole probability distribution of the transcription profile. Importantly, the model is also enriched and accounts for subtle effects such as signal ‘drift’ and covariates. Relevance of this framework is demonstrated on a Bacillus subtilis dataset. Availability: A software is distributed under the GPL. Contact:pierre.nicolas@jouy.inra.fr Supplementary information:Supplementary data is available at Bioinformatics online.
Collapse
Affiliation(s)
- Pierre Nicolas
- INRA, Mathématique Informatique et Génome UR1077, 78350 Jouy-en-Josas, France.
| | | | | | | | | | | |
Collapse
|
23
|
Cancer gene discovery in mouse and man. Biochim Biophys Acta Rev Cancer 2009; 1796:140-61. [PMID: 19285540 PMCID: PMC2756404 DOI: 10.1016/j.bbcan.2009.03.001] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [Abstract] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/02/2009] [Revised: 03/03/2009] [Accepted: 03/05/2009] [Indexed: 12/31/2022]
Abstract
The elucidation of the human and mouse genome sequence and developments in high-throughput genome analysis, and in computational tools, have made it possible to profile entire cancer genomes. In parallel with these advances mouse models of cancer have evolved into a powerful tool for cancer gene discovery. Here we discuss the approaches that may be used for cancer gene identification in both human and mouse and discuss how a cross-species 'oncogenomics' approach to cancer gene discovery represents a powerful strategy for finding genes that drive tumourigenesis.
Collapse
|
24
|
Han ST, Kang HC, Choi HS, Jang MS. A Study on Development of Scoring Campaign System. KOREAN JOURNAL OF APPLIED STATISTICS 2009. [DOI: 10.5351/kjas.2009.22.1.001] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/11/2022]
|
25
|
Kim BS, Kim SC. A Penalized Spline Based Method for Detecting the DNA Copy Number Alteration in an Array-CGH Experiment. KOREAN JOURNAL OF APPLIED STATISTICS 2009. [DOI: 10.5351/kjas.2009.22.1.115] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/11/2022]
|
26
|
Budinska E, Gelnarova E, Schimek MG. MSMAD: a computationally efficient method for the analysis of noisy array CGH data. ACTA ACUST UNITED AC 2009; 25:703-13. [PMID: 19147666 DOI: 10.1093/bioinformatics/btp022] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/13/2022]
Abstract
MOTIVATION Genome analysis has become one of the most important tools for understanding the complex process of cancerogenesis. With increasing resolution of CGH arrays, the demand for computationally efficient algorithms arises, which are effective in the detection of aberrations even in very noisy data. RESULTS We developed a rather simple, non-parametric technique of high computational efficiency for CGH array analysis that adopts a median absolute deviation concept for breakpoint detection, comprising median smoothing for pre-processing. The resulting algorithm has the potential to outperform any single smoothing approach as well as several recently proposed segmentation techniques. We show its performance through the application of simulated and real datasets in comparison to three other methods for array CGH analysis. IMPLEMENTATION Our approach is implemented in the R-language and environment for statistical computing (version 2.6.1 for Windows, R-project, 2007). The code is available at: http://www.iba.muni.cz/~budinska/msmad.html. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
- Eva Budinska
- Institute of Biostatistics and Analyses, Masaryk University, Kamenice 126/3, 625 00 Brno, Czech Republic.
| | | | | |
Collapse
|
27
|
Wang LY, Abyzov A, Korbel JO, Snyder M, Gerstein M. MSB: a mean-shift-based approach for the analysis of structural variation in the genome. Genome Res 2008; 19:106-17. [PMID: 19037015 DOI: 10.1101/gr.080069.108] [Citation(s) in RCA: 31] [Impact Index Per Article: 1.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/21/2022]
Abstract
Genome structural variation includes segmental duplications, deletions, and other rearrangements, and array-based comparative genomic hybridization (array-CGH) is a popular technology for determining this. Drawing relevant conclusions from array-CGH requires computational methods for partitioning the chromosome into segments of elevated, reduced, or unchanged copy number. Several approaches have been described, most of which attempt to explicitly model the underlying distribution of data based on particular assumptions. Often, they optimize likelihood functions for estimating model parameters, by expectation maximization or related approaches; however, this requires good parameter initialization through prespecifying the number of segments. Moreover, convergence is difficult to achieve, since many parameters are required to characterize an experiment. To overcome these limitations, we propose a nonparametric method without a global criterion to be optimized. Our method involves mean-shift-based (MSB) procedures; it considers the observed array-CGH signal as sampling from a probability-density function, uses a kernel-based approach to estimate local gradients for this function, and iteratively follows them to determine local modes of the signal. Overall, our method achieves robust discontinuity-preserving smoothing, thus accurately segmenting chromosomes into regions of duplication and deletion. It does not require the number of segments as input, nor does its convergence depend on this. We successfully applied our method to both simulated data and array-CGH experiments on glioblastoma and adenocarcinoma. We show that it performs at least as well as, and often better than, 10 previously published algorithms. Finally, we show that our approach can be extended to segmenting the signal resulting from the depth-of-coverage of mapped reads from next-generation sequencing.
Collapse
Affiliation(s)
- Lu-Yong Wang
- Program in Computational Biology and Bioinformatics, Yale University, New Haven, Connecticut 06520, USA
| | | | | | | | | |
Collapse
|
28
|
Andersson R, Bruder CEG, Piotrowski A, Menzel U, Nord H, Sandgren J, Hvidsten TR, Diaz de Ståhl T, Dumanski JP, Komorowski J. A segmental maximum a posteriori approach to genome-wide copy number profiling. ACTA ACUST UNITED AC 2008; 24:751-8. [PMID: 18204059 DOI: 10.1093/bioinformatics/btn003] [Citation(s) in RCA: 27] [Impact Index Per Article: 1.6] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/15/2022]
Abstract
MOTIVATION Copy number profiling methods aim at assigning DNA copy numbers to chromosomal regions using measurements from microarray-based comparative genomic hybridizations. Among the proposed methods to this end, Hidden Markov Model (HMM)-based approaches seem promising since DNA copy number transitions are naturally captured in the model. Current discrete-index HMM-based approaches do not, however, take into account heterogeneous information regarding the genomic overlap between clones. Moreover, the majority of existing methods are restricted to chromosome-wise analysis. RESULTS We introduce a novel Segmental Maximum A Posteriori approach, SMAP, for DNA copy number profiling. Our method is based on discrete-index Hidden Markov Modeling and incorporates genomic distance and overlap between clones. We exploit a priori information through user-controllable parameterization that enables the identification of copy number deviations of various lengths and amplitudes. The model parameters may be inferred at a genome-wide scale to avoid overfitting of model parameters often resulting from chromosome-wise model inference. We report superior performances of SMAP on synthetic data when compared with two recent methods. When applied on our new experimental data, SMAP readily recognizes already known genetic aberrations including both large-scale regions with aberrant DNA copy number and changes affecting only single features on the array. We highlight the differences between the prediction of SMAP and the compared methods and show that SMAP accurately determines copy number changes and benefits from overlap consideration.
Collapse
Affiliation(s)
- Robin Andersson
- The Linnaeus Centre for Bioinformatics, Uppsala University, 751 24 Uppsala, Sweden
| | | | | | | | | | | | | | | | | | | |
Collapse
|