1
|
Liu X, Duan J, Gong D. MSigSeg: An R package for multiple signals segmentation. Comput Methods Programs Biomed 2025; 265:108744. [PMID: 40199111 DOI: 10.1016/j.cmpb.2025.108744] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/18/2024] [Revised: 03/07/2025] [Accepted: 03/26/2025] [Indexed: 04/10/2025]
Abstract
BACKGROUND AND OBJECTIVE Identifying breakpoints in signals is crucial for uncovering important features in scientific data. In the biomedical field, the heterogeneity of signals increases the complexity of identifying breakpoints. While existing methods and software packages mostly focus on detecting breakpoints in individual signals, a significant challenge in this field is to detect common breakpoints across multiple signals. To address this challenge, a fast and optimal method has been developed and implemented in the R package MSigSeg as a practical tool. METHODS The proposed method uses an optimization approach with an ℓ0-norm penalty to efficiently and accurately detect the locations of common breakpoints in multiple signals. This article provides a detailed description of the mathematical problem, the fast optimization algorithm implemented in the package, and the usage of the core functions along with example datasets. RESULTS To evaluate the performance of the proposed method, a simulation study is conducted, comparing it with other segmentation approaches. Real-world problems are also processed to demonstrate the practical value of the package. Our results show a substantial efficiency gain. CONCLUSIONS Our R package MSigSeg implements an efficient and sensitive method for detecting common breakpoints across multiple signals, serving as a valuable resource for the analysis of intricate biomedical signals. The proposed package is available on the Comprehensive R Archive Network (CRAN) repository https://CRAN.R-project.org/package=MSigSeg.
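The idea of an ℓ0-penalized search for breakpoints shared by several signals can be sketched as follows. This is only an illustration under stated assumptions: it uses a textbook optimal-partitioning dynamic program with a squared-error cost, it is written in Python rather than R, and the function name `common_breakpoints` and its `penalty` argument are hypothetical. It is not the MSigSeg algorithm or its API.

```python
def common_breakpoints(signals, penalty):
    """Illustrative l0-penalised joint segmentation (not the MSigSeg API).

    Minimises the squared error around per-segment, per-signal means,
    summed over all signals, plus `penalty` for every breakpoint.
    Returns the sorted indices at which a new segment starts.
    """
    n = len(signals[0])
    # Prefix sums of values and squared values give O(1) segment costs.
    pre, pre2 = [], []
    for s in signals:
        p, p2 = [0.0], [0.0]
        for v in s:
            p.append(p[-1] + v)
            p2.append(p2[-1] + v * v)
        pre.append(p)
        pre2.append(p2)

    def cost(i, j):
        # Squared error of segment [i, j), summed over all signals.
        total = 0.0
        for p, p2 in zip(pre, pre2):
            seg_sum = p[j] - p[i]
            total += (p2[j] - p2[i]) - seg_sum * seg_sum / (j - i)
        return total

    best = [0.0] + [float("inf")] * n  # best[j]: optimal cost of the first j points
    last = [0] * (n + 1)               # last[j]: start index of the final segment
    for j in range(1, n + 1):
        for i in range(j):
            # A segment starting at i > 0 introduces one breakpoint.
            c = best[i] + cost(i, j) + (penalty if i > 0 else 0.0)
            if c < best[j]:
                best[j], last[j] = c, i

    # Walk back through the stored segment starts to recover the breakpoints.
    bps, j = [], n
    while j > 0:
        i = last[j]
        if i > 0:
            bps.append(i)
        j = i
    return sorted(bps)


signals = [[0, 0, 0, 0, 5, 5, 5, 5],
           [1, 1, 1, 1, -1, -1, -1, -1]]
print(common_breakpoints(signals, penalty=1.0))
```

Because the cost is summed over all signals before the penalty is applied, a breakpoint is kept only when the joint reduction in error exceeds `penalty`; a larger penalty therefore yields fewer common breakpoints. This quadratic-time sketch is meant for small inputs only.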
Affiliation(s)
- Xuanyu Liu
  - Department of Oncology, The Second Affiliated Hospital of Xi'an Jiaotong University, Xi'an, China
- Junbo Duan
  - Key Laboratory of Biomedical Information Engineering of Ministry of Education, Department of Biomedical Engineering, School of Life Science and Technology, Xi'an Jiaotong University, Xi'an, China
- Dian Gong
  - Munich Institute of Biomedical Engineering, Technical University of Munich, Munich, Germany
|
2
|
Li H, Li S, Zhao Z, Kong L, Fu X, Zhu J, Feng J, Tang W, Wu D, Kong X. Noninvasive prenatal diagnosis (NIPD) of non-syndromic hearing loss (NSHL) for singleton and twin pregnancies in the first trimester. Orphanet J Rare Dis 2025; 20:40. [PMID: 39871362 PMCID: PMC11773923 DOI: 10.1186/s13023-025-03558-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/11/2024] [Accepted: 01/17/2025] [Indexed: 01/29/2025] Open
Abstract
BACKGROUND Noninvasive prenatal diagnosis (NIPD) has been proven feasible for non-syndromic hearing loss (NSHL) in singleton pregnancies. However, previous research is limited to the second trimester, and its application in twin pregnancies has not been explored. Here we provide a novel algorithmic approach to assess singleton and twin pregnancies in the first trimester. METHODS A 324.614 kb capture panel was designed to selectively enrich target regions. Parental haplotypes were constructed by target sequencing of blood samples from the parents and the proband. Single nucleotide polymorphisms (SNPs) within target regions were then classified into four categories in singleton pregnancies and six categories in twin pregnancies. By combining relative haplotype dosage change (RHDO) and the Bayes factor (BF), fetal fraction (FF) and fetal genotype were deduced in singleton and twin pregnancies. The pregnant women's NIPD results were validated by invasive prenatal diagnosis and Sanger sequencing. RESULTS Sixteen women with singleton pregnancies and one woman with a twin pregnancy were recruited. Among the 16 singleton pregnancies, NIPD was successfully applied in 15 families, and the coincidence rate with invasive prenatal diagnosis was 100% (15/15). Only one family received a "no call" NIPD result, because the imbalanced distribution of SNP sites made it difficult to estimate recombination events. Most of the pregnant women (13/15) were diagnosed in the first trimester, and the earliest gestational week was the 7th week. The twin pregnancy was a dichorionic diamniotic (DCDA) twin pregnancy. NIPD confirmed that one fetus was affected and the other was a carrier of c.299_300delAT in the GJB2 gene. CONCLUSION This study provides pioneering evidence demonstrating the feasibility of NIPD for NSHL in twin pregnancies. Moreover, it provides a novel and advanced diagnostic approach for families at high risk of NSHL during pregnancy, offering earlier detection, enhanced safety, and improved accuracy.
Affiliation(s)
- Huanyun Li
  - Genetic and Prenatal Diagnosis Center, Department of Obstetrics and Gynecology, The First Affiliated Hospital of Zhengzhou University, Zhengzhou, China
- Shaojun Li
  - Celula (China) Medical Technology Co., Ltd., Chengdu, China
- Zhenhua Zhao
  - Genetic and Prenatal Diagnosis Center, Department of Obstetrics and Gynecology, The First Affiliated Hospital of Zhengzhou University, Zhengzhou, China
- Lingrong Kong
  - Genetic and Prenatal Diagnosis Center, Department of Obstetrics and Gynecology, The First Affiliated Hospital of Zhengzhou University, Zhengzhou, China
  - Department of Fetal Medicine and Prenatal Diagnosis Center, Shanghai First Maternity and Infant Hospital, School of Medicine, Tongji University, Shanghai, China
- Xinyu Fu
  - Genetic and Prenatal Diagnosis Center, Department of Obstetrics and Gynecology, The First Affiliated Hospital of Zhengzhou University, Zhengzhou, China
- Jingqi Zhu
  - Genetic and Prenatal Diagnosis Center, Department of Obstetrics and Gynecology, The First Affiliated Hospital of Zhengzhou University, Zhengzhou, China
- Jun Feng
  - Celula (China) Medical Technology Co., Ltd., Chengdu, China
- Weiqin Tang
  - Celula (China) Medical Technology Co., Ltd., Chengdu, China
- Di Wu
  - Celula (China) Medical Technology Co., Ltd., Chengdu, China
- Xiangdong Kong
  - Genetic and Prenatal Diagnosis Center, Department of Obstetrics and Gynecology, The First Affiliated Hospital of Zhengzhou University, Zhengzhou, China
|
3
|
Wang J, Zhu QW, Cui AM, Lin MS, Lou HQ. Application of Genetic Origin Analysis of Copy Number Variations in Non-Invasive Prenatal Testing. Prenat Diagn 2025; 45:44-56. [PMID: 39425690 DOI: 10.1002/pd.6688] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/01/2024] [Revised: 09/24/2024] [Accepted: 10/03/2024] [Indexed: 10/21/2024]
Abstract
OBJECTIVE This study aimed to assess the application of origin analysis of copy number variations (CNVs) in non-invasive prenatal testing (NIPT) and provide a basis for expanding the clinical application of NIPT. METHOD We enrolled 35,317 patients who underwent NIPT between January 2019 and March 2023. Genome sequencing of copy number variation (CNV-Seq) analysis was performed using the CNV calling pipeline to identify subchromosomal abnormalities in maternal plasma. Genetic origin was determined by comparing the chimaerism ratio of the CNV with the concentration of cell-free foetal DNA (cffDNA). All pregnant women with a high risk of CNV, as indicated by the NIPT, were informed of the genetic origin of the CNV. Amniocentesis was recommended for detecting the CNVs in foetal chromosomes, and pregnancy outcomes were tracked. RESULTS A total of 109 pregnancies showed clinically significant positive results for CNV after NIPT, including 65 cases of maternal/foetal (M/F)-CNVs and 44 cases of F-CNVs. The occurrence of M/F-CNVs was independent of age, screening (serological or ultrasound) indications for abnormalities, and mode of pregnancy. The incidence of pathogenic/likely pathogenic (P/LP)-F-CNVs was high in cases where serological screening indicated intermediate or high risk or where ultrasound (US) findings were abnormal (p < 0.05). In the M/F-CNV group, most of the P/LP-CNVs were small fragments with low penetrance; 55 (84.62%) were less than 5 Mb in size, and nine (13.85%) were between 5 and 10 Mb. In the F-CNV group, foetal P/LP-CNV was detected in 36 of 42 cases undergoing prenatal diagnosis, and no significant bias was noted in the size distribution of P/LP-F-CNV fragments. The prenatal diagnostic rate and positive predictive value in the F-CNV group were 95.45% and 85.71%, respectively, which differed significantly from those in the M/F-CNV group (26.15% and 52.95%, respectively; p < 0.05). CONCLUSIONS Genetic origin analysis of CNV can effectively improve adherence to prenatal diagnosis in pregnant women and the accuracy of prenatal diagnosis.
Affiliation(s)
- Jing Wang
  - Prenatal Diagnosis Center, Affiliated Maternity and Child Health Care Hospital of Nantong University, Nantong, China
- Qing-Wen Zhu
  - Prenatal Diagnosis Center, Affiliated Maternity and Child Health Care Hospital of Nantong University, Nantong, China
- Ai-Ming Cui
  - Department of Obstetrics, Affiliated Maternity and Child Health Care Hospital of Nantong University, Nantong, China
- Meng-Si Lin
  - Prenatal Diagnosis Center, Affiliated Maternity and Child Health Care Hospital of Nantong University, Nantong, China
- Hai-Qin Lou
  - Women's Health Care Department, Affiliated Maternity and Child Health Care Hospital of Nantong University, Nantong, China
|
4
|
Peripolli E, Stafuzza NB, Machado MA, do Carmo Panetto JC, do Egito AA, Baldi F, da Silva MVGB. Assessment of copy number variants in three Brazilian locally adapted cattle breeds using whole-genome re-sequencing data. Anim Genet 2023; 54:254-270. [PMID: 36740987 DOI: 10.1111/age.13298] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Received: 12/13/2021] [Revised: 12/13/2021] [Accepted: 01/13/2023] [Indexed: 02/07/2023]
Abstract
Further characterization of genetic structural variations should strongly focus on small and endangered local breeds given their role in unraveling genes and structural variants underlying selective pressures and phenotype variation. A comprehensive genome-wide assessment of copy number variations (CNVs) based on whole-genome re-sequencing data was performed on three Brazilian locally adapted cattle breeds (Caracu Caldeano, Crioulo Lageano, and Pantaneiro) using the ARS-UCD1.2 genome assembly. Data from 36 individuals with an average coverage depth of 14.07× per individual was used. A total of 24 945 CNVs were identified distributed among the breeds (Caracu Caldeano = 7285, Crioulo Lageano = 7297, and Pantaneiro = 10 363). Deletion events were 1.75-2.07-fold higher than duplications, and the total length of CNVs is composed mostly of a high number of segments between 10 and 30 kb. CNV regions (CNVRs) are not uniformly scattered throughout the genomes (n = 463), and 105 CNVRs were found overlapping among the studied breeds. Functional annotation of the CNVRs revealed variants with high consequence on protein sequence harboring relevant genes, in which we highlighted the BOLA-DQB, BOLA-DQA5, CD1A, β-defensins, PRG3, and ULBP21 genes. Enrichment analysis based on the gene list retrieved from the CNVRs disclosed over-represented terms (p < 0.01) strongly associated with immunity and cattle resilience to harsh environments. Additionally, QTL associated with body conformation and dairy-related traits were also unveiled within the CNVRs. These results provide better understanding of the selective forces shaping the genome of such cattle breeds and identify traces of natural selection pressures by which these populations have been exposed to challenging environmental conditions.
Affiliation(s)
- Elisa Peripolli
  - School of Agricultural and Veterinarian Sciences, São Paulo State University (Unesp), Jaboticabal, Brazil
- Fernando Baldi
  - School of Agricultural and Veterinarian Sciences, São Paulo State University (Unesp), Jaboticabal, Brazil
|
5
|
WAVECNV: A New Approach for Detecting Copy Number Variation by Wavelet Clustering. Mathematics 2022. [DOI: 10.3390/math10122151] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 11/17/2022]
Abstract
Copy number variation (CNV) detection based on second-generation sequencing technology is the basis of much gene research, but the read depth is affected by mapping errors, repeated reads, and GC bias. The existing methods have low sensitivity to variation regions with a short length and small variation range. Therefore, it is necessary to improve the sensitivity of algorithms to short-variation fragments. This study proposes a new CNV-detection method named WAVECNV to solve this issue. The algorithm uses wavelet clustering to process the read depth and determine the normal cluster and abnormal cluster according to the size of the cluster. Then, according to the distance between genome bins and normal clusters, the outlier of each genome bin is evaluated. Finally, a statistical model is established, and the p-value test is used for calling CNVs. Through this method, the information of the short variation region is retained. WAVECNV was tested and compared with peer methods in terms of simulated data and real cancer-sequencing data. The results show that the sensitivity of WAVECNV is better than the existing methods. It also has high precision in data with low purity and coverage. In real data experiments, WAVECNV can detect more cancer genes than existing methods. Therefore, this method can be regarded as a conventional method in the field of genomic mutation analysis of cancer samples.
|
6
|
Banerjee S. Horseshoe shrinkage methods for Bayesian fusion estimation. Comput Stat Data Anal 2022. [DOI: 10.1016/j.csda.2022.107450] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/03/2022]
|
7
|
Jia S, Shi L. Efficient change-points detection for genomic sequences via cumulative segmented regression. Bioinformatics 2022; 38:311-317. [PMID: 34601562 DOI: 10.1093/bioinformatics/btab685] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 11/25/2020] [Revised: 07/08/2021] [Accepted: 09/28/2021] [Indexed: 02/03/2023] Open
Abstract
MOTIVATION Knowing the number and the exact locations of multiple change points in genomic sequences serves several biological needs. The cumulative-segmented algorithm (cumSeg) has recently been proposed as a computationally efficient approach for multiple change-point detection; it is based on a simple transformation of the data and provides results that are quite robust to model mis-specification. However, the errors are also accumulated in the transformed model, so heteroscedasticity and serial correlation appear. As a result, the variances of the estimated change points can differ considerably, even though the locations of the change points should be of equal importance in the original genomic sequences. RESULTS In this study, we develop two new change-point detection procedures in the framework of cumulative segmented regression. Simulations reveal that the proposed methods not only improve the efficiency of each change-point estimator substantially but also provide estimators with similar variations for all the change points. By applying the proposed algorithms to Coriell and SNP genotyping data, we illustrate their performance in detecting copy number variations. AVAILABILITY AND IMPLEMENTATION The proposed algorithms are implemented in R, and the code is provided in the online supplementary material. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
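The accumulation of errors described in this abstract can be made concrete with a short worked derivation. It is a sketch only: it assumes a piecewise-constant mean with $M$ change points $\tau_m$ and independent errors with common variance $\sigma^2$, and the symbols $\beta_0$, $\delta_m$ and $E_k$ are notation introduced here, not taken from the paper.

```latex
% Original (step-function) model for the observed sequence
y_i = \mu_i + \varepsilon_i, \qquad
\mu_i = \beta_0 + \sum_{m=1}^{M} \delta_m \,\mathbf{1}(i > \tau_m), \qquad i = 1, \dots, n

% Cumulative sums turn the step function into a piecewise-linear (segmented) trend
z_k = \sum_{i=1}^{k} y_i
    = \beta_0 k + \sum_{m=1}^{M} \delta_m (k - \tau_m)_{+} + E_k, \qquad
E_k = \sum_{i=1}^{k} \varepsilon_i

% The accumulated errors are heteroscedastic and serially correlated
\operatorname{Var}(E_k) = k\,\sigma^2, \qquad
\operatorname{Cov}(E_j, E_k) = \min(j, k)\,\sigma^2
```

Under these assumptions the error variance grows linearly with the position $k$, which is one way to see why change points located later in the sequence would be estimated with different precision from those located earlier.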
Affiliation(s)
- Shengji Jia
  - School of Statistics and Mathematics; Interdisciplinary Research Institute of Data Science, Shanghai Lixin University of Accounting and Finance, Shanghai 201209, China
- Lei Shi
  - Statistics and Mathematics School, Yunnan University of Finance and Economics, Kunming 650221, China
|
8
|
Chan NH, Ng WL, Yau CY, Yu H. Optimal change-point estimation in time series. Ann Stat 2021. [DOI: 10.1214/20-aos2039] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/19/2022]
Affiliation(s)
- Ngai Hang Chan
  - Department of Statistics, The Chinese University of Hong Kong
- Wai Leong Ng
  - Department of Mathematics, Statistics and Insurance, The Hang Seng University of Hong Kong
- Chun Yip Yau
  - Department of Statistics, The Chinese University of Hong Kong
- Haihan Yu
  - Department of Statistics, Iowa State University
|
9
|
Wei YC, Huang GH. CONY: A Bayesian procedure for detecting copy number variations from sequencing read depths. Sci Rep 2020; 10:10493. [PMID: 32591545 PMCID: PMC7319969 DOI: 10.1038/s41598-020-64353-1] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.0] [Received: 07/11/2019] [Accepted: 04/15/2020] [Indexed: 12/26/2022] Open
Abstract
Copy number variations (CNVs) are genomic structural mutations consisting of abnormal numbers of fragment copies. Next-generation sequencing of read-depth signals mirrors these variants. Some tools used to predict CNVs by depth have been published, but most of these tools can be applied to only a specific data type due to modeling limitations. We develop a tool for copy number variation detection by a Bayesian procedure, i.e., CONY, that adopts a Bayesian hierarchical model and an efficient reversible-jump Markov chain Monte Carlo inference algorithm for whole genome sequencing of read-depth data. CONY can be applied not only to individual samples for estimating the absolute number of copies but also to case-control pairs for detecting patient-specific variations. We evaluate the performance of CONY and compare CONY with competing approaches through simulations and by using experimental data from the 1000 Genomes Project. CONY outperforms the other methods in terms of accuracy in both single-sample and paired-samples analyses. In addition, CONY performs well regardless of whether the data coverage is high or low. CONY is useful for detecting both absolute and relative CNVs from read-depth data sequences. The package is available at https://github.com/weiyuchung/CONY.
Affiliation(s)
- Yu-Chung Wei
  - Graduate Institute of Statistics and Information Science, National Changhua University of Education, No.1 Jinde Road, Changhua City, Changhua County, 50007, Taiwan
- Guan-Hua Huang
  - Institute of Statistics, National Chiao Tung University, 1001 University Road, Hsinchu, 30010, Taiwan
|
10
|
Fang X, Li J, Siegmund D. Segmentation and estimation of change-point models: False positive control and confidence regions. Ann Stat 2020. [DOI: 10.1214/19-aos1861] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.2] [Indexed: 11/19/2022]
|
11
|
Alshawaqfeh M, Al Kawam A, Serpedin E, Datta A. Robust Recurrent CNV Detection in the Presence of Inter-Subject Variability. IEEE/ACM Trans Comput Biol Bioinform 2020; 17:1056-1067. [PMID: 30387737 DOI: 10.1109/tcbb.2018.2878560] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/08/2023]
Abstract
The study of recurrent copy number variations (CNVs) plays an important role in understanding the onset and evolution of complex diseases such as cancer. Array-based comparative genomic hybridization (aCGH) is a widely used microarray based technology for identifying CNVs. However, due to high noise levels and inter-sample variability, detecting recurrent CNVs from aCGH data remains a challenging topic. This paper proposes a novel method for identification of the recurrent CNVs. In the proposed method, the noisy aCGH data is modeled as the superposition of three matrices: a full-rank matrix of weighted piece-wise generating signals accounting for the clean aCGH data, a Gaussian noise matrix to model the inherent experimentation errors and other sources of error, and a sparse matrix to capture the sparse inter-sample (sample-specific) variations. We demonstrated the ability of our method to separate accurately recurrent CNVs from sample-specific variations and noise in both simulated (artificial) data and real data. The proposed method produced more accurate results than current state-of-the-art methods used in recurrent CNV detection and exhibited robustness to noise and sample-specific variations.
|
12
|
Wang S, Lee S, Chu C, Jain D, Kerpedjiev P, Nelson GM, Walsh JM, Alver BH, Park PJ. HiNT: a computational method for detecting copy number variations and translocations from Hi-C data. Genome Biol 2020; 21:73. [PMID: 32293513 PMCID: PMC7087379 DOI: 10.1186/s13059-020-01986-5] [Citation(s) in RCA: 55] [Impact Index Per Article: 11.0] [Received: 05/24/2019] [Accepted: 03/05/2020] [Indexed: 12/25/2022] Open
Abstract
The three-dimensional conformation of a genome can be profiled using Hi-C, a technique that combines chromatin conformation capture with high-throughput sequencing. However, structural variations often yield features that can be mistaken for chromosomal interactions. Here, we describe a computational method HiNT (Hi-C for copy Number variation and Translocation detection), which detects copy number variations and interchromosomal translocations within Hi-C data with breakpoints at single base-pair resolution. We demonstrate that HiNT outperforms existing methods on both simulated and real data. We also show that Hi-C can supplement whole-genome sequencing in structure variant detection by locating breakpoints in repetitive regions.
Affiliation(s)
- Su Wang
  - Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA
- Soohyun Lee
  - Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA
- Chong Chu
  - Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA
- Dhawal Jain
  - Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA
- Peter Kerpedjiev
  - Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA
- Geoffrey M Nelson
  - Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA
- Jennifer M Walsh
  - Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA
- Burak H Alver
  - Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA
- Peter J Park
  - Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA
|
13
|
Cheng D, He Z, Schwartzman A. Multiple testing of local extrema for detection of change points. Electron J Stat 2020. [DOI: 10.1214/20-ejs1751] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Indexed: 11/19/2022]
|
14
|
Wang X, Lebarbier E, Aubert J, Robin S. Variational Inference for Coupled Hidden Markov Models Applied to the Joint Detection of Copy Number Variations. Int J Biostat 2019; 15. [PMID: 30779702 DOI: 10.1515/ijb-2018-0023] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Received: 02/22/2018] [Accepted: 11/21/2018] [Indexed: 02/04/2023]
Abstract
Hidden Markov models provide a natural statistical framework for the detection of copy number variations (CNVs) in genomics. In this context, we define a hidden Markov process that underlies all individuals jointly in order to detect and classify genomic regions in different states (typically deletion, normal, or amplification). Structural variations from different individuals may be dependent. This is the case in agronomy, where varietal selection programs exist and species share a common phylogenetic past. We propose to take these dependencies into account in the HMM model. When dealing with a large number of series, maximum likelihood inference (performed classically using the EM algorithm) becomes intractable. We thus propose an approximate inference algorithm based on a variational approach (VEM), implemented in the CHMM R package. A simulation study is performed to assess the performance of the proposed method, and an application to the detection of structural variations in plant genomes is presented.
Affiliation(s)
- Xiaoqiang Wang
  - School of Mathematics and Statistics, Shandong University (Weihai), Weihai, Shandong, China
- Emilie Lebarbier
  - UMR MIA-Paris, AgroParisTech, INRA, Université Paris-Saclay, Paris, France
- Julie Aubert
  - UMR MIA-Paris, AgroParisTech, INRA, Université Paris-Saclay, Paris, France
- Stéphane Robin
  - UMR MIA-Paris, AgroParisTech, INRA, Université Paris-Saclay, Paris, France
|
15
|
Li H, Guo Q, Munk A. Multiscale change-point segmentation: beyond step functions. Electron J Stat 2019. [DOI: 10.1214/19-ejs1608] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.0] [Indexed: 11/19/2022]
|
16
|
Collilieux X, Lebarbier E, Robin S. A factor model approach for the joint segmentation with between‐series correlation. Scand Stat Theory Appl 2018. [DOI: 10.1111/sjos.12368] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/01/2022]
Affiliation(s)
- Xavier Collilieux
  - Laboratoire de Recherche en Géodésie (LAREG), l'Institut National de l'information Géographique et forestière (IGN), Université Paris Diderot, Paris, France
- Emilie Lebarbier
  - UMR MIA‐Paris, AgroParisTech, INRA, Université Paris‐Saclay, Paris, France
- Stéphane Robin
  - UMR MIA‐Paris, AgroParisTech, INRA, Université Paris‐Saclay, Paris, France
|
17
|
Nguyen N, Vo A, Sun H, Huang H. Heavy-Tailed Noise Suppression and Derivative Wavelet Scalogram for Detecting DNA Copy Number Aberrations. IEEE/ACM Trans Comput Biol Bioinform 2018; 15:1625-1635. [PMID: 28692986 DOI: 10.1109/tcbb.2017.2723884] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 06/07/2023]
Abstract
Most existing array comparative genomic hybridization (array CGH) data processing methods and evaluation models assume that the probability density function (pdf) of noise in array CGH data is Gaussian. In practice, however, the noise distribution is peaky and heavy-tailed. A Gaussian pdf is therefore not adequate to approximate the noise in array CGH data; it introduces wrong detections of chromosomal aberrations and leads to misunderstanding of disease pathogenesis. A more accurate and sufficient model of noise in array CGH data is necessary and beneficial to the detection of DNA copy number variations. We analyze real array CGH data from different platforms and show that the distribution of noise in array CGH data is fitted very well by the generalized Gaussian distribution (GGD). Based on our new noise model, we propose a novel array CGH processing method combining the advantages of both the smoothing and segmentation approaches. The new method uses a generalized Gaussian bivariate shrinkage function and a one-directional derivative wavelet scalogram in generalized Gaussian noise. In the smoothing step, with the new generalized Gaussian noise model, we derive the heavy-tailed noise suppression algorithm in the stationary wavelet domain. In the segmentation step, the 1D Gaussian derivative wavelet scalogram is employed to detect break points. Both real and simulated array CGH data with different noises (such as Gaussian noise, GGD noise, and real noise) are used in our experiments. We demonstrate that our new method outperforms other state-of-the-art methods in terms of both root mean squared errors and receiver operating characteristic curves.
|
18
|
Liu J, Zhou Y, Liu S, Song X, Yang XZ, Fan Y, Chen W, Akdemir ZC, Yan Z, Zuo Y, Du R, Liu Z, Yuan B, Zhao S, Liu G, Chen Y, Zhao Y, Lin M, Zhu Q, Niu Y, Liu P, Ikegawa S, Song YQ, Posey JE, Qiu G, Zhang F, Wu Z, Lupski JR, Wu N. The coexistence of copy number variations (CNVs) and single nucleotide polymorphisms (SNPs) at a locus can result in distorted calculations of the significance in associating SNPs to disease. Hum Genet 2018; 137:553-567. [PMID: 30019117 DOI: 10.1007/s00439-018-1910-3] [Citation(s) in RCA: 51] [Impact Index Per Article: 7.3] [Received: 03/18/2018] [Accepted: 07/07/2018] [Indexed: 01/25/2023]
Abstract
With recent advances in genome-wide association studies (GWAS), disease-associated single nucleotide polymorphisms (SNPs) and copy number variants (CNVs) have been extensively reported. Accordingly, the issue of incorrect identification of recombination events that can induce the distortion of multi-allelic or hemizygous variants has received more attention. However, the potential distorted calculation bias or significance of a detected association in a GWAS due to the coexistence of CNVs and SNPs in the same genomic region may remain under-recognized. Here we performed an association study within a congenital scoliosis (CS) cohort whose genetic etiology was recently elucidated as a compound inheritance model, including mostly one rare variant deletion CNV null allele and one common variant non-coding hypomorphic haplotype of the TBX6 gene. We demonstrated that the existence of a deletion in TBX6 led to an overestimation of the contribution of the SNPs on the hypomorphic allele. Furthermore, we generalized a model to explain the calculation bias, or distorted significance calculation for an association study, that can be 'induced' by CNVs at a locus. Meanwhile, the overlap between the disease-associated SNPs from published GWAS and common CNVs (overlap 10%) and pathogenic/likely pathogenic CNVs (overlap 99.69%) was significantly higher than the random distribution (p < 1 × 10⁻⁶ and p = 0.034, respectively), indicating that such co-existence of CNV and SNV alleles might generally influence data interpretation and potential outcomes of a GWAS. We also verified and assessed the influence of colocalizing CNVs on the detection sensitivity of disease-associated SNP variant alleles in another genome-wide association study, of adolescent idiopathic scoliosis (AIS). We propose that detecting co-existent CNVs when evaluating the association signals between SNPs and disease traits could improve genetic model analyses and better integrate GWAS with robust Mendelian principles.
Affiliation(s)
- Jiaqi Liu
- Department of Orthopedic Surgery, Peking Union Medical College Hospital, Peking Union Medical College and Chinese Academy of Medical Sciences, No.1 Shuaifuyuan, Beijing, 100730, China.,Beijing Key Laboratory for Genetic Research of Skeletal Deformity, Beijing, 100730, China.,Department of Breast Surgical Oncology, National Cancer Center/National Clinical Research Center for Cancer/Cancer Hospital, Chinese Academy of Medical Sciences and Peking Union Medical College, Beijing, 100021, China
- Yangzhong Zhou
- Beijing Key Laboratory for Genetic Research of Skeletal Deformity, Beijing, 100730, China.,Department of Internal Medicine, Peking Union Medical College Hospital, Peking Union Medical College and Chinese Academy of Medical Sciences, Beijing, 100730, China
- Sen Liu
- Department of Orthopedic Surgery, Peking Union Medical College Hospital, Peking Union Medical College and Chinese Academy of Medical Sciences, No.1 Shuaifuyuan, Beijing, 100730, China.,Beijing Key Laboratory for Genetic Research of Skeletal Deformity, Beijing, 100730, China.,Medical Research Center of Orthopedics, Chinese Academy of Medical Sciences, Beijing, 100730, China
- Xiaofei Song
- Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX, 77030, USA
- Xin-Zhuang Yang
- Department of Central Laboratory, Peking Union Medical College Hospital, Peking Union Medical College and Chinese Academy of Medical Sciences, Beijing, 100730, China
- Yanhui Fan
- School of Biomedical Sciences, The University of Hong Kong, Hong Kong, China
- Weisheng Chen
- Department of Orthopedic Surgery, Peking Union Medical College Hospital, Peking Union Medical College and Chinese Academy of Medical Sciences, No.1 Shuaifuyuan, Beijing, 100730, China.,Beijing Key Laboratory for Genetic Research of Skeletal Deformity, Beijing, 100730, China.,Medical Research Center of Orthopedics, Chinese Academy of Medical Sciences, Beijing, 100730, China
- Zeynep Coban Akdemir
- Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX, 77030, USA
- Zihui Yan
- Department of Orthopedic Surgery, Peking Union Medical College Hospital, Peking Union Medical College and Chinese Academy of Medical Sciences, No.1 Shuaifuyuan, Beijing, 100730, China.,Beijing Key Laboratory for Genetic Research of Skeletal Deformity, Beijing, 100730, China.,Medical Research Center of Orthopedics, Chinese Academy of Medical Sciences, Beijing, 100730, China
- Yuzhi Zuo
- Department of Orthopedic Surgery, Peking Union Medical College Hospital, Peking Union Medical College and Chinese Academy of Medical Sciences, No.1 Shuaifuyuan, Beijing, 100730, China.,Beijing Key Laboratory for Genetic Research of Skeletal Deformity, Beijing, 100730, China.,Medical Research Center of Orthopedics, Chinese Academy of Medical Sciences, Beijing, 100730, China
- Renqian Du
- Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX, 77030, USA
- Zhenlei Liu
- Beijing Key Laboratory for Genetic Research of Skeletal Deformity, Beijing, 100730, China.,Department of Neurosurgery, Xuanwu Hospital, Capital Medical University, Beijing, 100053, China
- Bo Yuan
- Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX, 77030, USA
- Sen Zhao
- Department of Orthopedic Surgery, Peking Union Medical College Hospital, Peking Union Medical College and Chinese Academy of Medical Sciences, No.1 Shuaifuyuan, Beijing, 100730, China.,Beijing Key Laboratory for Genetic Research of Skeletal Deformity, Beijing, 100730, China.,Medical Research Center of Orthopedics, Chinese Academy of Medical Sciences, Beijing, 100730, China
- Gang Liu
- Department of Orthopedic Surgery, Peking Union Medical College Hospital, Peking Union Medical College and Chinese Academy of Medical Sciences, No.1 Shuaifuyuan, Beijing, 100730, China.,Beijing Key Laboratory for Genetic Research of Skeletal Deformity, Beijing, 100730, China.,Medical Research Center of Orthopedics, Chinese Academy of Medical Sciences, Beijing, 100730, China
- Yixin Chen
- Department of Orthopedic Surgery, Peking Union Medical College Hospital, Peking Union Medical College and Chinese Academy of Medical Sciences, No.1 Shuaifuyuan, Beijing, 100730, China.,Beijing Key Laboratory for Genetic Research of Skeletal Deformity, Beijing, 100730, China.,Medical Research Center of Orthopedics, Chinese Academy of Medical Sciences, Beijing, 100730, China
- Yanxue Zhao
- Department of Orthopedic Surgery, Peking Union Medical College Hospital, Peking Union Medical College and Chinese Academy of Medical Sciences, No.1 Shuaifuyuan, Beijing, 100730, China.,Beijing Key Laboratory for Genetic Research of Skeletal Deformity, Beijing, 100730, China.,Medical Research Center of Orthopedics, Chinese Academy of Medical Sciences, Beijing, 100730, China
- Mao Lin
- Department of Orthopedic Surgery, Peking Union Medical College Hospital, Peking Union Medical College and Chinese Academy of Medical Sciences, No.1 Shuaifuyuan, Beijing, 100730, China.,Beijing Key Laboratory for Genetic Research of Skeletal Deformity, Beijing, 100730, China.,Medical Research Center of Orthopedics, Chinese Academy of Medical Sciences, Beijing, 100730, China
- Qiankun Zhu
- Department of Orthopedic Surgery, Peking Union Medical College Hospital, Peking Union Medical College and Chinese Academy of Medical Sciences, No.1 Shuaifuyuan, Beijing, 100730, China.,Beijing Key Laboratory for Genetic Research of Skeletal Deformity, Beijing, 100730, China.,Medical Research Center of Orthopedics, Chinese Academy of Medical Sciences, Beijing, 100730, China
- Yuchen Niu
- Beijing Key Laboratory for Genetic Research of Skeletal Deformity, Beijing, 100730, China.,Medical Research Center of Orthopedics, Chinese Academy of Medical Sciences, Beijing, 100730, China.,Department of Central Laboratory, Peking Union Medical College Hospital, Peking Union Medical College and Chinese Academy of Medical Sciences, Beijing, 100730, China
- Pengfei Liu
- Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX, 77030, USA
- Shiro Ikegawa
- Laboratory of Bone and Joint Diseases, Center for Integrative Medical Sciences, RIKEN, Tokyo, 108-8639, Japan
- You-Qiang Song
- School of Biomedical Sciences, The University of Hong Kong, Hong Kong, China
- Jennifer E Posey
- Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX, 77030, USA
- Guixing Qiu
- Department of Orthopedic Surgery, Peking Union Medical College Hospital, Peking Union Medical College and Chinese Academy of Medical Sciences, No.1 Shuaifuyuan, Beijing, 100730, China.,Beijing Key Laboratory for Genetic Research of Skeletal Deformity, Beijing, 100730, China.,Medical Research Center of Orthopedics, Chinese Academy of Medical Sciences, Beijing, 100730, China
- Feng Zhang
- Obstetrics and Gynecology Hospital, Institute of Reproduction and Development, Fudan University, Shanghai, 200433, China.,Collaborative Innovation Center for Genetics and Development, School of Life Sciences, Fudan University, Shanghai, 200433, China
- Zhihong Wu
- Beijing Key Laboratory for Genetic Research of Skeletal Deformity, Beijing, 100730, China.,Medical Research Center of Orthopedics, Chinese Academy of Medical Sciences, Beijing, 100730, China.,Department of Central Laboratory, Peking Union Medical College Hospital, Peking Union Medical College and Chinese Academy of Medical Sciences, Beijing, 100730, China
- James R Lupski
- Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX, 77030, USA.,Department of Pediatrics, Baylor College of Medicine, Houston, TX, 77030, USA.,Texas Children's Hospital, Houston, TX, 77030, USA
- Nan Wu
- Department of Orthopedic Surgery, Peking Union Medical College Hospital, Peking Union Medical College and Chinese Academy of Medical Sciences, No.1 Shuaifuyuan, Beijing, 100730, China.,Beijing Key Laboratory for Genetic Research of Skeletal Deformity, Beijing, 100730, China.,Medical Research Center of Orthopedics, Chinese Academy of Medical Sciences, Beijing, 100730, China.
|
19
|
Montoril MH, Pinheiro A, Vidakovic B. Wavelet‐based estimators for mixture regression. Scand Stat Theory Appl 2018. [DOI: 10.1111/sjos.12344] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/28/2022]
Affiliation(s)
- Michel H. Montoril
- Department of Statistics, Federal University of Juiz de Fora, Juiz de Fora, Brazil
- Brani Vidakovic
- H. Milton Stewart School of Industrial and Systems Engineering, Georgia Institute of Technology, Atlanta, GA, USA
|
20
|
Girimurugan SB, Liu Y, Lung PY, Vera DL, Dennis JH, Bass HW, Zhang J. iSeg: an efficient algorithm for segmentation of genomic and epigenomic data. BMC Bioinformatics 2018; 19:131. [PMID: 29642840 PMCID: PMC5896135 DOI: 10.1186/s12859-018-2140-3] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.1] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/05/2017] [Accepted: 03/26/2018] [Indexed: 11/16/2022] Open
Abstract
Background Identification of functional elements of a genome often requires dividing a sequence of measurements along a genome into segments where adjacent segments have different properties, such as different mean values. Despite dozens of algorithms developed to address this problem in genomics research, methods with improved accuracy and speed are still needed to effectively tackle both existing and emerging genomic and epigenomic segmentation problems. Results We designed an efficient algorithm, called iSeg, for segmentation of genomic and epigenomic profiles. iSeg first utilizes dynamic programming to identify candidate segments and test for significance. It then uses a novel data structure based on two coupled balanced binary trees to detect overlapping significant segments and update them simultaneously during searching and refinement stages. Refinement and merging of significant segments are performed at the end to generate the final set of segments. By using an objective function based on the p-values of the segments, the algorithm can serve as a general computational framework to be combined with different assumptions on the distributions of the data. As a general segmentation method, it can segment different types of genomic and epigenomic data, such as DNA copy number variation, nucleosome occupancy, nuclease sensitivity, and differential nuclease sensitivity data. Using simple t-tests to compute p-values across multiple datasets of different types, we evaluate iSeg using both simulated and experimental datasets and show that it performs satisfactorily when compared with some other popular methods, which often employ more sophisticated statistical models. Implemented in C++, iSeg is also very computationally efficient, well suited for large numbers of input profiles and data with very long sequences. 
Conclusions We have developed an efficient general-purpose segmentation tool and showed that it had comparable or more accurate results than many of the most popular segment-calling algorithms used in contemporary genomic data analysis. iSeg is capable of analyzing datasets that have both positive and negative values. Tunable parameters allow users to readily adjust the statistical stringency to best match the biological nature of individual datasets, including widely or sparsely mapped genomic datasets or those with non-normal distributions. Electronic supplementary material The online version of this article (10.1186/s12859-018-2140-3) contains supplementary material, which is available to authorized users.
Affiliation(s)
- Yuhang Liu
- Department of Statistics, Florida State University, Tallahassee, FL, USA
- Pei-Yau Lung
- Department of Statistics, Florida State University, Tallahassee, FL, USA
- Daniel L Vera
- Center for Genomics and Personalized Medicine, Florida State University, Tallahassee, FL, USA
- Jonathan H Dennis
- Department of Biological Science, Florida State University, Tallahassee, FL, USA
- Hank W Bass
- Department of Biological Science, Florida State University, Tallahassee, FL, USA
- Jinfeng Zhang
- Department of Statistics, Florida State University, Tallahassee, FL, USA.
|
21
|
|
22
|
Antunes de Lemos MV, Berton MP, Ferreira de Camargo GM, Peripolli E, de Oliveira Silva RM, Ferreira Olivieri B, Cesar AS, Pereira ASC, de Albuquerque LG, de Oliveira HN, Tonhati H, Baldi F. Copy number variation regions in Nellore cattle: Evidences of environment adaptation. Livest Sci 2018. [DOI: 10.1016/j.livsci.2017.11.008] [Citation(s) in RCA: 15] [Impact Index Per Article: 2.1] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/14/2022]
|
23
|
Fan Z, Mackey L. Empirical Bayesian analysis of simultaneous changepoints in multiple data sequences. Ann Appl Stat 2017. [DOI: 10.1214/17-aoas1075] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/19/2022]
|
24
|
Van De Wiel MA, Van Wieringen WN. CGHregions: Dimension Reduction for Array CGH Data with Minimal Information Loss. Cancer Inform 2017. [DOI: 10.1177/117693510700300031] [Citation(s) in RCA: 39] [Impact Index Per Article: 4.9] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/17/2022] Open
Abstract
An algorithm to reduce multi-sample array CGH data from thousands of clones to tens or hundreds of clone regions is introduced. This reduction of the data is performed such that little information is lost, which is possible due to the high dependencies between neighboring clones. The algorithm is explained using a small example. The potential beneficial effects of the algorithm for downstream analysis are illustrated by re-analysis of previously published colorectal cancer data. Using multiple testing corrections suitable for these data, we provide statistical evidence for genomic differences on several clone regions between MSI+ and CIN+ tumors. The algorithm, named CGHregions, is available as an easy-to-use script in R.
Affiliation(s)
- Mark A. Van De Wiel
- Department of Pathology and Department of Biostatistics (KEB), VU University Medical Center, Amsterdam, The Netherlands
- Department of Mathematics, Vrije Universiteit, Amsterdam, The Netherlands
|
25
|
Chen H, Jiang Y, Maxwell KN, Nathanson KL, Zhang N. ALLELE-SPECIFIC COPY NUMBER ESTIMATION BY WHOLE EXOME SEQUENCING. Ann Appl Stat 2017; 11:1169-1192. [PMID: 28989557 DOI: 10.1214/17-aoas1043] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.9] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/25/2022]
Abstract
Whole exome sequencing is currently a technology of choice in large-scale cancer genomics studies, where the priority is to identify cancer-associated variants in coding regions. We describe a method for estimating allele-specific copy number using whole exome sequencing data from tumor and matched normal.
|
26
|
Delatola EI, Lebarbier E, Mary-Huard T, Radvanyi F, Robin S, Wong J. SegCorr a statistical procedure for the detection of genomic regions of correlated expression. BMC Bioinformatics 2017; 18:333. [PMID: 28697800 PMCID: PMC5504623 DOI: 10.1186/s12859-017-1742-5] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/07/2016] [Accepted: 06/26/2017] [Indexed: 01/27/2023] Open
Abstract
BACKGROUND Detecting local correlations in expression between neighboring genes along the genome has proved to be an effective strategy to identify possible causes of transcriptional deregulation in cancer. It has been successfully used to illustrate the role of mechanisms such as copy number variation (CNV) or epigenetic alterations as factors that may significantly alter expression in large chromosomal regions (gene silencing or gene activation). RESULTS The identification of correlated regions requires segmenting the gene expression correlation matrix into regions of homogeneously correlated genes and assessing whether the observed local correlation is significantly higher than the background chromosomal correlation. A unified statistical framework is proposed to achieve these two tasks, where optimal segmentation is efficiently performed using dynamic programming algorithm, and detection of highly correlated regions is then achieved using an exact test procedure. We also propose a simple and efficient procedure to correct the expression signal for mechanisms already known to impact expression correlation. The performance and robustness of the proposed procedure, called SegCorr, are evaluated on simulated data. The procedure is illustrated on cancer data, where the signal is corrected for correlations caused by copy number variation. It permitted the detection of regions with high correlations linked to epigenetic marks like DNA methylation. CONCLUSIONS SegCorr is a novel method that performs correlation matrix segmentation and applies a test procedure in order to detect highly correlated regions in gene expression.
Affiliation(s)
- Eleni Ioanna Delatola
- AgroParisTech UMR518, Paris, 75005, France.
- INRA UMR518, Paris, 75005, France.
- Institut Curie, PSL Research University, Cedex 05, Paris, 75248, France.
- CNRS UMR144, Equipe Labellisee par La Ligue Nationale contre le Cancer, Cedex 05, Paris, 75248, France.
- Emilie Lebarbier
- AgroParisTech UMR518, Paris, 75005, France
- INRA UMR518, Paris, 75005, France
- Tristan Mary-Huard
- AgroParisTech UMR518, Paris, 75005, France
- INRA UMR518, Paris, 75005, France
- INRA, UMR 0320 - UMR 8120 Genetique Quantitative et Evolution-Le Moulon, Gif-sur-Yvette, F-91190, France
- François Radvanyi
- Institut Curie, PSL Research University, Cedex 05, Paris, 75248, France
- CNRS UMR144, Equipe Labellisee par La Ligue Nationale contre le Cancer, Cedex 05, Paris, 75248, France
- Stéphane Robin
- AgroParisTech UMR518, Paris, 75005, France
- INRA UMR518, Paris, 75005, France
- Jennifer Wong
- Institut Curie, PSL Research University, Cedex 05, Paris, 75248, France
- CNRS UMR144, Equipe Labellisee par La Ligue Nationale contre le Cancer, Cedex 05, Paris, 75248, France
- Molecular Oncology Unit, Department of Biochemistry, Hospital Saint Louis, AP-HP, Cedex 10, Paris, 75475, France
- Université Paris Diderot, Sorbonne Paris Cité, CNRS UMR7212/INSERM U944, Cedex 10, Paris, 75475, France
|
27
|
SLMSuite: a suite of algorithms for segmenting genomic profiles. BMC Bioinformatics 2017; 18:321. [PMID: 28659129 PMCID: PMC5490196 DOI: 10.1186/s12859-017-1734-5] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/03/2016] [Accepted: 06/20/2017] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND The identification of copy number variants (CNVs) is essential to study human genetic variation and to understand the genetic basis of mendelian disorders and cancers. At present, genome-wide detection of CNVs can be achieved using microarray or second generation sequencing (SGS) data. Although these technologies are very different, the genomic profiles that they generate are mathematically very similar and consist of noisy signals in which a decrease or increase of consecutive data represent deletions or duplication of DNA. In this framework, the most important step of the analysis consists of segmenting genomic profiles for the identification of the boundaries of genomic regions with increased or decreased signal. RESULTS Here we introduce SLMSuite, a collection of algorithms, based on shifting level models (SLM), to segment genomic profiles from array and SGS experiments. The SLM algorithms take as input the log-transformed genomic profiles from SGS or microarray experiments and output segmentation results. We apply our method to the analysis of synthetic genomic profiles and real whole genome sequencing data and we demonstrate that it outperforms the state of the art circular binary segmentation algorithm in terms of sensitivity, specificity and computational speed. CONCLUSION The SLMSuite contains an R library with the segmentation methods and three wrappers that allow to use them in Python, Ruby and C++. SLMSuite is freely available at https://sourceforge.net/projects/slmsuite .
|
28
|
Chakar S, Lebarbier E, Lévy-Leduc C, Robin S. A robust approach for estimating change-points in the mean of an $\operatorname{AR}(1)$ process. BERNOULLI 2017. [DOI: 10.3150/15-bej782] [Citation(s) in RCA: 26] [Impact Index Per Article: 3.3] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/19/2022]
|
29
|
Gao Y, Jiang J, Yang S, Hou Y, Liu GE, Zhang S, Zhang Q, Sun D. CNV discovery for milk composition traits in dairy cattle using whole genome resequencing. BMC Genomics 2017; 18:265. [PMID: 28356085 PMCID: PMC5371188 DOI: 10.1186/s12864-017-3636-3] [Citation(s) in RCA: 65] [Impact Index Per Article: 8.1] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/08/2016] [Accepted: 03/17/2017] [Indexed: 01/08/2023] Open
Abstract
Background Copy number variations (CNVs) are important and widely distributed in the genome. CNV detection opens a new avenue for exploring genes associated with complex traits in humans, animals and plants. Herein, we present a genome-wide assessment of CNVs that are potentially associated with milk composition traits in dairy cattle. Results In this study, CNVs were detected based on whole genome re-sequencing data of eight Holstein bulls from four half- and/or full-sib families, with extremely high and low estimated breeding values (EBVs) of milk protein percentage and fat percentage. The range of coverage depth per individual was 8.2–11.9×. Using CNVnator, we identified a total of 14,821 CNVs, including 5025 duplications and 9796 deletions. Among them, 487 differential CNV regions (CNVRs) comprising ~8.23 Mb of the cattle genome were observed between the high and low groups. Annotation of these differential CNVRs were performed based on the cattle genome reference assembly (UMD3.1) and totally 235 functional genes were found within the CNVRs. By Gene Ontology and KEGG pathway analyses, we found that genes were significantly enriched for specific biological functions related to protein and lipid metabolism, insulin/IGF pathway-protein kinase B signaling cascade, prolactin signaling pathway and AMPK signaling pathways. These genes included INS, IGF2, FOXO3, TH, SCD5, GALNT18, GALNT16, ART3, SNCA and WNT7A, implying their potential association with milk protein and fat traits. In addition, 95 CNVRs were overlapped with 75 known QTLs that are associated with milk protein and fat traits of dairy cattle (Cattle QTLdb). Conclusions In conclusion, based on NGS of 8 Holstein bulls with extremely high and low EBVs for milk PP and FP, we identified a total of 14,821 CNVs, 487 differential CNVRs between groups, and 10 genes, which were suggested as promising candidate genes for milk protein and fat traits. 
Electronic supplementary material The online version of this article (doi:10.1186/s12864-017-3636-3) contains supplementary material, which is available to authorized users.
Affiliation(s)
- Yahui Gao
- Key Laboratory of Animal Genetics and Breeding of Ministry of Agriculture, National Engineering Laboratory of Animal Breeding, College of Animal Science and Technology, China Agricultural University, Beijing, 100193, China
- Jianping Jiang
- Key Laboratory of Animal Genetics and Breeding of Ministry of Agriculture, National Engineering Laboratory of Animal Breeding, College of Animal Science and Technology, China Agricultural University, Beijing, 100193, China
- Shaohua Yang
- Key Laboratory of Animal Genetics and Breeding of Ministry of Agriculture, National Engineering Laboratory of Animal Breeding, College of Animal Science and Technology, China Agricultural University, Beijing, 100193, China
- Yali Hou
- CAS Key Laboratory of Genomic and Precision Medicine, Beijing Institute of Genomics, Chinese Academy of Sciences, Beijing, 100101, China
- George E Liu
- Animal Genomics and Improvement Laboratory, BARC, USDA-ARS, Beltsville, Md, 20705, USA
- Shengli Zhang
- Key Laboratory of Animal Genetics and Breeding of Ministry of Agriculture, National Engineering Laboratory of Animal Breeding, College of Animal Science and Technology, China Agricultural University, Beijing, 100193, China
- Qin Zhang
- Key Laboratory of Animal Genetics and Breeding of Ministry of Agriculture, National Engineering Laboratory of Animal Breeding, College of Animal Science and Technology, China Agricultural University, Beijing, 100193, China
- Dongxiao Sun
- Key Laboratory of Animal Genetics and Breeding of Ministry of Agriculture, National Engineering Laboratory of Animal Breeding, College of Animal Science and Technology, China Agricultural University, Beijing, 100193, China.
|
30
|
Lim HK, Lee J, Cheon S. Stochastic approximation Monte Carlo EM for change-point analysis. J STAT COMPUT SIM 2017. [DOI: 10.1080/00949655.2016.1192630] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/21/2022]
Affiliation(s)
- Hwa Kyung Lim
- Department of Statistics, Seoul National University, Seoul, South Korea
- Jaejun Lee
- Center for Military Planning, Korea Institute for Defense Analyses, Seoul, South Korea
- Sooyoung Cheon
- Department of Applied Statistics, Korea University, Sejong, South Korea
|
31
|
Cleynen A, Lebarbier E. Model selection for the segmentation of multiparameter exponential family distributions. Electron J Stat 2017. [DOI: 10.1214/17-ejs1246] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/19/2022]
|
32
|
Bertin K, Collilieux X, Lebarbier E, Meza C. Semi-parametric segmentation of multiple series using a DP-Lasso strategy. J STAT COMPUT SIM 2016. [DOI: 10.1080/00949655.2016.1260726] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/20/2022]
|
33
|
Kaveh F, Baumbusch LO, Nebdal D, Børresen-Dale AL, Lingjærde OC, Edvardsen H, Kristensen VN, Solvang HK. A systematic comparison of copy number alterations in four types of female cancer. BMC Cancer 2016; 16:913. [PMID: 27876019 PMCID: PMC5120489 DOI: 10.1186/s12885-016-2899-4] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.2] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/22/2016] [Accepted: 10/30/2016] [Indexed: 01/06/2023] Open
Abstract
Background Detection and localization of genomic alterations and breakpoints are crucial in cancer research. The purpose of this study was to investigate, in a methodological and biological perspective, different female, hormone-dependent cancers to identify common and diverse DNA aberrations, genes, and pathways. Methods In this work, we analyzed tissue samples from patients with breast (n = 112), ovarian (n = 74), endometrial (n = 84), or cervical (n = 76) cancer. To identify genomic aberrations, the Circular Binary Segmentation (CBS) and Piecewise Constant Fitting (PCF) algorithms were used and segmentation thresholds optimized. The Genomic Identification of Significant Targets in Cancer (GISTIC) algorithm was applied to the segmented data to identify significantly altered regions and the associated genes were analyzed by Ingenuity Pathway Analysis (IPA) to detect over-represented pathways and functions within the identified gene sets. Results and Discussion Analyses of high-resolution copy number alterations in four different female cancer types are presented. For appropriately adjusted segmentation parameters the two segmentation algorithms CBS and PCF performed similarly. We identified one region at 8q24.3 with focal aberrations that was altered at significant frequency across all four cancer types. Considering both, broad regions and focal peaks, three additional regions with gains at significant frequency were revealed at 1p21.1, 8p22, and 13q21.33, respectively. Several of these events involve known cancer-related genes, like PPP2R2A, PSCA, PTP4A3, and PTK2. In the female reproductive system (ovarian, endometrial, and cervix [OEC]), we discovered three common events: copy number gains at 5p15.33 and 15q11.2, further a copy number loss at 8p21.2. Interestingly, as many as 75% of the aberrations (75% amplifications and 86% deletions) identified by GISTIC were specific for just one cancer type and represented distinct molecular pathways. 
Conclusions Our results disclose that some prominent copy number changes are shared in the four examined female, hormone-dependent cancer whereas others are definitive to specific cancer types. Electronic supplementary material The online version of this article (doi:10.1186/s12885-016-2899-4) contains supplementary material, which is available to authorized users.
Affiliation(s)
- Fatemeh Kaveh
- Department of Genetics, Institute for Cancer Research, Oslo University Hospital Radiumhospitalet, Oslo, Norway.,Medical Genetics Department, Oslo University Hospital Ullevål, Oslo, Norway.,Department of Pediatric Research, Division of Pediatric and Adolescent Medicine, Oslo University Hospital Rikshospitalet, Oslo, Norway
- Lars O Baumbusch
- Department of Genetics, Institute for Cancer Research, Oslo University Hospital Radiumhospitalet, Oslo, Norway.,Department of Pediatric Research, Division of Pediatric and Adolescent Medicine, Oslo University Hospital Rikshospitalet, Oslo, Norway
- Daniel Nebdal
- Department of Genetics, Institute for Cancer Research, Oslo University Hospital Radiumhospitalet, Oslo, Norway
- Anne-Lise Børresen-Dale
- Department of Genetics, Institute for Cancer Research, Oslo University Hospital Radiumhospitalet, Oslo, Norway
- Ole Christian Lingjærde
- Department of Genetics, Institute for Cancer Research, Oslo University Hospital Radiumhospitalet, Oslo, Norway.,Department of Computer Science, University of Oslo, Oslo, Norway
- Hege Edvardsen
- Department of Genetics, Institute for Cancer Research, Oslo University Hospital Radiumhospitalet, Oslo, Norway
- Vessela N Kristensen
- Department of Genetics, Institute for Cancer Research, Oslo University Hospital Radiumhospitalet, Oslo, Norway.,Department of Clinical Molecular Biology (EpiGen), Medical Division, Akershus University Hospital, Lørenskog, Norway.
- Hiroko K Solvang
- Marine Mammals Research Group, Institute of Marine Research, Bergen, Norway
|
34
|
CNARA: reliability assessment for genomic copy number profiles. BMC Genomics 2016; 17:799. [PMID: 27733115 PMCID: PMC5062840 DOI: 10.1186/s12864-016-3074-7] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/21/2016] [Accepted: 09/07/2016] [Indexed: 01/22/2023] Open
Abstract
Background DNA copy number profiles from microarray and sequencing experiments sometimes contain wave artefacts which may be introduced during sample preparation and cannot be removed completely by existing preprocessing methods. In addition, a large derivative log ratio spread (DLRS) of the probes, which correlates with poor DNA quality, is sometimes observed in genome screening experiments and may lead to unreliable copy number profiles. Depending on the extent of these artefacts and the resulting misidentification of copy number alterations/variations (CNA/CNV), it may be desirable to exclude such samples from analyses or to adapt the downstream data analysis strategy accordingly. Results Here, we propose a method to distinguish reliable genomic copy number profiles from those containing heavy wave artefacts and/or large DLRS. We define four features that adequately summarize the copy number profiles for reliability assessment, and train a classifier on a dataset of 1522 copy number profiles from various microarray platforms. The method can be applied to predict the reliability of copy number profiles irrespective of the underlying microarray platform and may be adapted for those sequencing platforms from which copy number estimates can be computed as a piecewise constant signal. Further details can be found at https://github.com/baudisgroup/CNARA. Conclusions We have developed a method for the assessment of genomic copy number profiling data, and suggest applying the method in addition to and after other state-of-the-art noise correction and quality control procedures. CNARA could be instrumental in improving the assessment of data used for genomic data mining experiments and support the reliable functional attribution of copy number aberrations, especially in cancer research.
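The abstract does not define DLRS. As a rough illustration only, one common convention (an assumption here, not taken from CNARA) takes the interquartile range of the differences between adjacent probe log ratios and divides it by 1.349·√2, so that the result approximates the noise standard deviation of the profile. The function name `dlrs` and the simulated profile below are hypothetical.

```python
import math
import random
import statistics

def dlrs(log_ratios):
    """Robust noise estimate from adjacent-probe differences.

    Assumed convention: IQR(diffs) / (1.349 * sqrt(2)). The factor 1.349
    converts a Gaussian IQR to a standard deviation, and sqrt(2) accounts
    for differencing two independent noisy probes.
    """
    diffs = [b - a for a, b in zip(log_ratios, log_ratios[1:])]
    q1, _, q3 = statistics.quantiles(diffs, n=4)
    return (q3 - q1) / (1.349 * math.sqrt(2))

# Simulated piecewise-constant profile: one gain of size 1, noise SD 0.2.
rng = random.Random(0)
profile = [rng.gauss(0.0, 0.2) for _ in range(2500)]
profile += [rng.gauss(1.0, 0.2) for _ in range(2500)]

estimate = dlrs(profile)
print(f"estimated DLRS: {estimate:.3f}")
```

Because the estimate is built from quartiles of the differences, the single large jump at the breakpoint has almost no influence, which is why a spread of this kind can serve as a quality measure independent of the true copy number changes.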
|
35
|
Fast Bayesian Inference of Copy Number Variants using Hidden Markov Models with Wavelet Compression. PLoS Comput Biol 2016; 12:e1004871. [PMID: 27177143 PMCID: PMC4866742 DOI: 10.1371/journal.pcbi.1004871] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Received: 11/26/2015] [Accepted: 03/14/2016] [Indexed: 11/22/2022]
Abstract
By integrating Haar wavelets with Hidden Markov Models, we achieve drastically reduced running times for Bayesian inference using Forward-Backward Gibbs sampling. We show that this improves detection of genomic copy number variants (CNV) in array CGH experiments compared to the state-of-the-art, including standard Gibbs sampling. The method concentrates computational effort on chromosomal segments which are difficult to call, by dynamically and adaptively recomputing consecutive blocks of observations likely to share a copy number. This makes routine diagnostic use and re-analysis of legacy data collections feasible; to this end, we also propose an effective automatic prior. An open source software implementation of our method is available at http://schlieplab.org/Software/HaMMLET/ (DOI: 10.5281/zenodo.46262). This paper was selected for oral presentation at RECOMB 2016, and an abstract is published in the conference proceedings.
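To make the role of Haar wavelets concrete, the sketch below applies a plain orthonormal Haar transform to a short piecewise-constant signal. It is not the HaMMLET implementation; the function name `haar_transform` and the toy signal are hypothetical. It only shows why such signals compress well: blocks of equal values yield zero detail coefficients.

```python
import math

def haar_transform(signal):
    """Orthonormal Haar transform of a signal whose length is a power of two.

    Returns [overall approximation, coarsest details, ..., finest details].
    """
    n = len(signal)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    approx = [float(v) for v in signal]
    details = []
    while len(approx) > 1:
        pairs = [(approx[i], approx[i + 1]) for i in range(0, len(approx), 2)]
        # Finer detail coefficients end up further to the right.
        details = [(a - b) / math.sqrt(2) for a, b in pairs] + details
        approx = [(a + b) / math.sqrt(2) for a, b in pairs]
    return approx + details

# Two constant segments: only the overall level and one coarse detail survive.
signal = [1, 1, 1, 1, 3, 3, 3, 3]
coeffs = haar_transform(signal)
nonzero = sum(1 for c in coeffs if abs(c) > 1e-9)
print(coeffs, nonzero)
```

In this toy case eight observations are represented by two nonzero coefficients, which is the kind of redundancy a block-wise sampler can exploit.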
|
36
|
Huang MC, Chuang TP, Chen CH, Wu JY, Chen YT, Li LH, Yang HC. An integrated analysis tool for analyzing hybridization intensities and genotypes using new-generation population-optimized human arrays. BMC Genomics 2016; 17:266. [PMID: 27029637 PMCID: PMC4815280 DOI: 10.1186/s12864-016-2478-8] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Received: 11/09/2015] [Accepted: 02/16/2016] [Indexed: 12/19/2022]
Abstract
BACKGROUND Affymetrix Axiom single nucleotide polymorphism (SNP) arrays provide a cost-effective, high-density, and high-throughput genotyping solution for population-optimized analyses. However, no public software is available for the integrated genomic analysis of hybridization intensities and genotypes for this new-generation population-optimized genotyping platform. RESULTS A set of statistical methods was developed for an integrated analysis of allele frequency (AF), allelic imbalance (AI), loss of heterozygosity (LOH), long contiguous stretch of homozygosity (LCSH), and copy number variation or alteration (CNV/CNA) on the basis of SNP probe hybridization intensities and genotypes. This study analyzed 3,236 samples that were genotyped using different SNP platforms. The proposed AF adjustment method considerably increased the accuracy of AF estimation. The proposed quick circular binary segmentation algorithm for segmenting copy number reduced the computation time of the original segmentation method by 30-67%. The proposed CNV/CNA detection, which integrates AI and LOH/LCSH detection, had a promising true positive rate and well-controlled false positive rate in simulation studies. Moreover, our real-time quantitative polymerase chain reaction experiments successfully validated the CNVs/CNAs that were identified in the Axiom data analyses using the proposed methods; some of the validated CNVs/CNAs were not detected in the Affymetrix Array 6.0 data analysis using the Affymetrix Genotyping Console. All the analysis functions are packaged into the ALICE (AF/LOH/LCSH/AI/CNV/CNA Enterprise) software. CONCLUSIONS ALICE and the genomic reference databases used, which can be downloaded from http://hcyang.stat.sinica.edu.tw/software/ALICE.html, are useful resources for analyzing genomic data from the Axiom and other SNP arrays.
Affiliation(s)
- Mei-Chu Huang
- Bioinformatics Program, Taiwan International Graduate Program, Institute of Information Science, Academia Sinica, Taipei, 115, Taiwan.,Institute of Statistical Science, Academia Sinica, No 128, Academia Rd, Sec 2, Nankang, Taipei, 115, Taiwan.,Institute of Biomedical Informatics, National Yang-Ming University, Taipei, 112, Taiwan
| | - Tzu-Po Chuang
- Taiwan International Graduate Program in Molecular Medicine, National Yang-Ming University and Academia Sinica, Taipei, 115, Taiwan.,Institute of Biochemistry and Molecular Biology, National Yang-Ming University, Taipei, 112, Taiwan
| | - Chien-Hsiun Chen
- Institute of Biomedical Sciences, Academia Sinica, Academia Rd, Sec 2, Nankang, Taipei, 115, Taiwan
| | - Jer-Yuarn Wu
- Institute of Biomedical Sciences, Academia Sinica, Academia Rd, Sec 2, Nankang, Taipei, 115, Taiwan
| | - Yuan-Tsong Chen
- Institute of Biomedical Sciences, Academia Sinica, Academia Rd, Sec 2, Nankang, Taipei, 115, Taiwan
| | - Ling-Hui Li
- Institute of Biomedical Sciences, Academia Sinica, Academia Rd, Sec 2, Nankang, Taipei, 115, Taiwan.
| | - Hsin-Chou Yang
- Bioinformatics Program, Taiwan International Graduate Program, Institute of Information Science, Academia Sinica, Taipei, 115, Taiwan.,Institute of Statistical Science, Academia Sinica, No 128, Academia Rd, Sec 2, Nankang, Taipei, 115, Taiwan.,Institute of Public Health, National Yang Ming University, Taipei, 112, Taiwan.,Department of Statistics, National Cheng Kung University, Tainan, 701, Taiwan.,Institute of Statistics, National Tsing Hua University, Hsinchu, 300, Taiwan.,School of Public Health, National Defense Medical Center, Taipei, 114, Taiwan.
|
37
|
Abunimer AN, Salazar J, Noursi DP, Abu-Asab MS. A Systems Biology Interpretation of Array Comparative Genomic Hybridization (aCGH) Data through Phylogenetics. OMICS: A JOURNAL OF INTEGRATIVE BIOLOGY 2016; 20:169-79. [PMID: 26983023 PMCID: PMC4799695 DOI: 10.1089/omi.2015.0184] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Indexed: 12/17/2022]
Abstract
Array Comparative Genomic Hybridization (aCGH) is a rapid screening technique to detect gene deletions and duplications, providing an overview of chromosomal aberrations throughout the entire genome of a tumor, without the need for cell culturing. However, the heterogeneity of aCGH data confounds existing methods of data analysis. Analysis of aCGH data from a systems biology perspective or in the context of total aberrations is largely absent in the published literature. We present here a novel alternative to the functional analysis of aCGH data using the phylogenetic paradigm, which is well-suited to high dimensional datasets of heterogeneous nature but has not been widely adapted to aCGH data. Maximum parsimony phylogenetic analysis sorts out genetic data through the simplest presentation of the data on a cladogram, a graphical evolutionary tree, thus providing a powerful and efficient method for aCGH data analysis. For example, the cladogram models the multiphasic changes in the cancer genome and identifies shared early mutations in the disease progression, providing a simple yet powerful means of aCGH data interpretation. As such, applying maximum parsimony phylogenetic analysis to aCGH results allows for the differentiation between driver and passenger gene aberrations in cancer specimens. In addition to offering a novel methodology to analyze aCGH results, we present here a software suite that we wrote to carry out the analysis. In a broader context, we wish to underscore that phylogenetic analysis of aCGH data is a non-parametric method that circumvents the pitfalls and frustrations of standard analytical techniques that rely on parametric statistics. Organizing the data in a cladogram as explained in this research article provides insights into the aberrations common to the disease, as well as into the disease subtypes and the shared aberrations (the synapomorphies) of each subtype. Hence, we report the method and make the software suite publicly and freely available at http://software.phylomcs.com so that researchers can test alternative and innovative approaches to the analysis of aCGH data.
Affiliation(s)
- Ayman N. Abunimer
- Virginia Tech Carilion School of Medicine and Research Institute, Roanoke, Virginia
| | - Jose Salazar
- The Electrical Engineering and Computer Science Department, Massachusetts Institute of Technology, Cambridge, Massachusetts
| | | | - Mones S. Abu-Asab
- National Eye Institute, National Institutes of Health, Bethesda, Maryland
|
38
|
Gao X. Penalized weighted low-rank approximation for robust recovery of recurrent copy number variations. BMC Bioinformatics 2015; 16:407. [PMID: 26652207 PMCID: PMC4676147 DOI: 10.1186/s12859-015-0835-2] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Received: 09/22/2015] [Accepted: 11/23/2015] [Indexed: 11/10/2022]
Abstract
BACKGROUND Copy number variation (CNV) analysis has become one of the most important research areas for understanding complex disease. With increasing resolution of array-based comparative genomic hybridization (aCGH) arrays, more and more raw copy number data are collected for multiple arrays. It is natural to expect the co-existence of both recurrent and individual-specific CNVs, together with possible data contamination during the data generation process. Therefore, there is a great need for an efficient and robust statistical model for simultaneous recovery of both recurrent and individual-specific CNVs. RESULT We develop a penalized weighted low-rank approximation method (WPLA) for robust recovery of recurrent CNVs. In particular, we formulate multiple aCGH arrays as a realization of a hidden low-rank matrix with some random noise and let an additional weight matrix account for individual-specific effects. Thus, we do not restrict the random noise to be normally distributed, or even homogeneous. We show its performance through three real datasets and twelve synthetic datasets from different types of recurrent CNV regions associated with either normal random errors or heavily contaminated errors. CONCLUSION Our numerical experiments have demonstrated that the WPLA can successfully recover the recurrent CNV patterns from raw data under different scenarios. Compared with two other recent methods, it performs best in its ability to simultaneously detect both recurrent and individual-specific CNVs under normal random errors. More importantly, the WPLA is the only method which can effectively recover the recurrent CNV regions when the data are heavily contaminated.
Affiliation(s)
- Xiaoli Gao
- Department of Mathematics and Statistics, University of North Carolina at Greensboro, 1400 Spring Garden St, Greensboro, NC, USA.
|
39
|
Mohammadi M, Hodtani GA, Yassi M. A robust Correntropy-based method for analyzing multisample aCGH data. Genomics 2015; 106:257-64. [DOI: 10.1016/j.ygeno.2015.07.008] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.4] [Received: 04/18/2015] [Revised: 07/14/2015] [Accepted: 07/20/2015] [Indexed: 11/16/2022]
|
40
|
|
41
|
Anjum S, Morganella S, D'Angelo F, Iavarone A, Ceccarelli M. VEGAWES: variational segmentation on whole exome sequencing for copy number detection. BMC Bioinformatics 2015; 16:315. [PMID: 26416038 PMCID: PMC4587906 DOI: 10.1186/s12859-015-0748-0] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Received: 04/17/2015] [Accepted: 09/16/2015] [Indexed: 11/10/2022]
Abstract
BACKGROUND Copy number variations are important in the detection and progression of significant tumors and diseases. Recently, whole exome sequencing (WES) has been gaining popularity for copy number variation detection owing to its low cost and better efficiency. In this work, we developed VEGAWES for accurate and robust detection of copy number variations on WES data. VEGAWES is an extension of a variational segmentation algorithm, VEGA (Variational Estimator for Genomic Aberrations), which has previously outperformed several algorithms on segmenting array comparative genomic hybridization data. RESULTS We tested this algorithm on synthetic data and 100 Glioblastoma Multiforme primary tumor samples. The results on the real data were analyzed with segmentation obtained from single-nucleotide polymorphism data as ground truth. We compared our results with two other segmentation algorithms and assessed the performance based on accuracy and time. CONCLUSIONS In terms of both accuracy and time, VEGAWES provided better results on the synthetic data and tumor samples, demonstrating its potential in robust detection of aberrant regions in the genome.
Affiliation(s)
- Samreen Anjum
- Computational Sciences and Engineering, Qatar Computing Research Institute, Doha, P. O. Box 5825, Qatar.
| | - Sandro Morganella
- European Molecular Biology Laboratory, European Bioinformatics Institute, (EMBL -EBI), Wellcome Trust Genome Campus, Cambridge, CB10 1SD, UK.
| | | | - Antonio Iavarone
- Institute for Cancer Genetics, Columbia University, New York, 10027, USA.
| | - Michele Ceccarelli
- Computational Sciences and Engineering, Qatar Computing Research Institute, Doha, P. O. Box 5825, Qatar.,Department of Science and Technology, University of Sannio, Benevento, 82100, Italy.
|
42
|
Masecchia S, Coco S, Barla A, Verri A, Tonini GP. Genome instability model of metastatic neuroblastoma tumorigenesis by a dictionary learning algorithm. BMC Med Genomics 2015; 8:57. [PMID: 26358114 PMCID: PMC4566396 DOI: 10.1186/s12920-015-0132-y] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.9] [Received: 01/09/2015] [Accepted: 08/28/2015] [Indexed: 12/21/2022]
Abstract
Background Metastatic neuroblastoma (NB) occurs in pediatric patients as stage 4S or stage 4 and is characterized by heterogeneous clinical behavior associated with diverse genotypes. Tumors of stage 4 contain several structural copy number aberrations (CNAs) rarely found in stage 4S. To date, NB tumorigenesis is still not elucidated, although it is evident that genomic instability plays a critical role in the genesis of the tumor. Here we propose a mathematical approach to decipher genomic data and we provide a new model of NB metastatic tumorigenesis. Method We elucidate NB tumorigenesis using the Enhanced Fused Lasso Latent Feature Model (E-FLLat), modeling the array comparative chromosome hybridization (aCGH) data of 190 metastatic NBs (63 stage 4S and 127 stage 4). This model for aCGH segmentation, based on the minimization of functional dictionary learning (DL), combines several penalties tailored to the specificities of aCGH data. In DL, the original signal is approximated by a linear weighted combination of atoms: the elements of the learned dictionary. Results The hierarchical structure for stage 4S shows several whole chromosome gains at the first level of the oncogenetic tree, except for the unbalanced gains of 17q, 2p and 2q. Conversely, the high CNA complexity found in stage 4 tumors requires two different trees. Both stage 4 oncogenetic trees are markedly divergent, with up to five sublevels, and the 17q gain is the most common event at the first level (2/3 nodes). Moreover, the 11q deletion, one of the major unfavorable markers of disease progression, occurs before 3p loss, indicating that critical chromosome aberrations appear at early stages of tumorigenesis. Finally, we also observed a significant (p = 0.025) association between patient age and chromosome loss in stage 4 cases. Conclusion These results led us to propose a progressive genome instability model in which NB cells initiate with DNA synthesis uncoupled from cell division, which leads to stage 4S tumors, primarily characterized by numerical aberrations, or stage 4 tumors with high levels of genome instability resulting in complex chromosome rearrangements associated with high tumor aggressiveness and rapid disease progression.
Affiliation(s)
| | - Simona Coco
- Lung Cancer Unit; IRCCS A.O.U. San Martino - IST, Genova, Italy.
| | - Annalisa Barla
- DIBRIS, Università degli Studi di Genova, Genova, Italy.
| | | | - Gian Paolo Tonini
- Neuroblastoma Laboratory, Onco/Hematology Laboratory, Department of Woman and Child Health, University of Padua, Pediatric Research Institute, Fondazione Città della Speranza, Padua, Corso Stati Uniti, 4, 35127, Padua, Italy.
|
43
|
Arsuaga J, Borrman T, Cavalcante R, Gonzalez G, Park C. Identification of Copy Number Aberrations in Breast Cancer Subtypes Using Persistence Topology. MICROARRAYS (BASEL, SWITZERLAND) 2015; 4:339-69. [PMID: 27600228 PMCID: PMC4996377 DOI: 10.3390/microarrays4030339] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.1] [Received: 04/09/2015] [Accepted: 08/03/2015] [Indexed: 01/01/2023]
Abstract
DNA copy number aberrations (CNAs) are of biological and medical interest because they help identify regulatory mechanisms underlying tumor initiation and evolution. Identification of tumor-driving CNAs (driver CNAs), however, remains a challenging task, because they are frequently hidden by CNAs that are the product of random events that take place during tumor evolution. Experimental detection of CNAs is commonly accomplished through array comparative genomic hybridization (aCGH) assays followed by supervised and/or unsupervised statistical methods that combine the segmented profiles of all patients to identify driver CNAs. Here, we extend a previously-presented supervised algorithm for the identification of CNAs that is based on a topological representation of the data. Our method associates a two-dimensional (2D) point cloud with each aCGH profile and generates a sequence of simplicial complexes, mathematical objects that generalize the concept of a graph. This representation of the data permits segmenting the data at different resolutions and identifying CNAs by interrogating the topological properties of these simplicial complexes. We tested our approach on a published dataset with the goal of identifying specific breast cancer CNAs associated with specific molecular subtypes. Identification of CNAs associated with each subtype was performed by analyzing each subtype separately from the others and by taking the rest of the subtypes as the control. We found a new amplification in 11q at the location of the progesterone receptor in the Luminal A subtype. Aberrations in the Luminal B subtype were found only upon removal of the basal-like subtype from the control set. Under those conditions, all regions found in the original publication, except for 17q, were confirmed; all aberrations, except those in chromosome arms 8q and 12q, were confirmed in the basal-like subtype. These two chromosome arms, however, were detected only upon removal of three patients with exceedingly large copy number values. More importantly, we detected 10 and 21 additional regions in the Luminal B and basal-like subtypes, respectively. Most of the additional regions were validated on an independent dataset and/or using GISTIC. Furthermore, we found three new CNAs in the basal-like subtype: a combination of gains and losses in 1p, a gain in 2p and a loss in 14q. Based on these results, we suggest that topological approaches that incorporate multiresolution analyses and that interrogate topological properties of the data can help in the identification of copy number changes in cancer.
Affiliation(s)
- Javier Arsuaga
- Department of Mathematics, University of California Davis, 1 Shields Avenue, Davis, CA 95616, USA.
- Department of Molecular and Cellular Biology, University of California Davis, 1 Shields Avenue, Davis, CA 95616, USA.
| | - Tyler Borrman
- Program in Bioinformatics and Integrative Biology, University of Massachusetts Medical School, Worcester, MA 01605, USA.
| | - Raymond Cavalcante
- Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI 48109, USA.
| | - Georgina Gonzalez
- Department of Mathematics, San Francisco State University, 1600 Holloway Avenue, San Francisco, CA 94132, USA.
| | - Catherine Park
- Helen Diller Comprehensive Cancer Center, University of California San Francisco, 1600 Divisadero Street, San Francisco, CA 94143, USA.
|
44
|
Yokoyama T, Miura F, Araki H, Okamura K, Ito T. Changepoint detection in base-resolution methylome data reveals a robust signature of methylated domain landscape. BMC Genomics 2015; 16:594. [PMID: 26265481 PMCID: PMC4534107 DOI: 10.1186/s12864-015-1809-5] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.7] [Received: 12/09/2014] [Accepted: 08/03/2015] [Indexed: 01/08/2023]
Abstract
BACKGROUND Base-resolution methylome data generated by whole-genome bisulfite sequencing (WGBS) is often used to segment the genome into domains with distinct methylation levels. However, most segmentation methods include many parameters to be carefully tuned and/or fail to exploit the unsurpassed resolution of the data. Furthermore, there is no simple method that displays the composition of the domains to grasp global trends in each methylome. RESULTS We propose to use changepoint detection for domain demarcation based on base-resolution methylome data. While the proposed method segments the methylome in a largely comparable manner to conventional approaches, it has only a single parameter to be tuned. Furthermore, it fully exploits the base-resolution of the data to enable simultaneous detection of methylation changes in even contrasting size ranges, such as focal hypermethylation and global hypomethylation in cancer methylomes. We also propose a simple plot termed methylated domain landscape (MDL) that globally displays the size, the methylation level and the number of the domains thus defined, thereby enabling one to intuitively grasp trends in each methylome. Since the pattern of MDL often reflects cell lineages and is largely unaffected by data size, it can serve as a novel signature of methylome. CONCLUSIONS Changepoint detection in base-resolution methylome data followed by MDL plotting provides a novel method for methylome characterization and will facilitate global comparison among various WGBS data differing in size and even species origin.
Affiliation(s)
- Takao Yokoyama
- Department of Computational Biology, Graduate School of Frontier Sciences, University of Tokyo, 5-1-5 Kashiwanoha, Kashiwa 277-8561, Japan.
| | - Fumihito Miura
- Department of Biochemistry, Kyushu University Graduate School of Medical Sciences, 3-1-1 Maidashi, Higashi-ku, Fukuoka 812-8582, Japan.,Core Research for Evolutional Science and Technology (CREST), Japan Science and Technology Agency (JST), 3-1-1 Maidashi, Higashi-ku, Fukuoka 812-8582, Japan.
| | - Hiromitsu Araki
- Department of Biochemistry, Kyushu University Graduate School of Medical Sciences, 3-1-1 Maidashi, Higashi-ku, Fukuoka 812-8582, Japan.
| | - Kohji Okamura
- Department of Systems Biomedicine, National Research Institute for Child Health and Development, National Center for Child Health and Development, 2-10-1 Okura, Setagaya-ku, Tokyo 157-8535, Japan.
| | - Takashi Ito
- Department of Biochemistry, Kyushu University Graduate School of Medical Sciences, 3-1-1 Maidashi, Higashi-ku, Fukuoka 812-8582, Japan.,Core Research for Evolutional Science and Technology (CREST), Japan Science and Technology Agency (JST), 3-1-1 Maidashi, Higashi-ku, Fukuoka 812-8582, Japan.
|
45
|
Nam JY, Kim NKD, Kim SC, Joung JG, Xi R, Lee S, Park PJ, Park WY. Evaluation of somatic copy number estimation tools for whole-exome sequencing data. Brief Bioinform 2015. [PMID: 26210357 DOI: 10.1093/bib/bbv055] [Citation(s) in RCA: 36] [Impact Index Per Article: 3.6] [Indexed: 11/12/2022]
Abstract
Whole-exome sequencing (WES) has become a standard method for detecting genetic variants in human diseases. Although the primary use of WES data has been the identification of single nucleotide variations and indels, these data also offer a possibility of detecting copy number variations (CNVs) at high resolution. However, WES data have uneven read coverage along the genome owing to the target capture step, and the development of a robust WES-based CNV tool is challenging. Here, we evaluate six WES somatic CNV detection tools: ADTEx, CONTRA, Control-FREEC, EXCAVATOR, ExomeCNV and Varscan2. Using WES data from 50 kidney chromophobe, 50 bladder urothelial carcinoma, and 50 stomach adenocarcinoma patients from The Cancer Genome Atlas, we compared the CNV calls from the six tools with a reference CNV set that was identified by both single nucleotide polymorphism array 6.0 and whole-genome sequencing data. We found that these algorithms gave highly variable results: visual inspection reveals significant differences between the WES-based segmentation profiles and the reference profile, as well as among the WES-based profiles. Using a 50% overlap criterion, 13-77% of WES CNV calls were covered by CNVs from the reference set, up to 21% of the copy gains were called as losses or vice versa, and dramatic differences in CNV sizes and CNV numbers were observed. Overall, ADTEx and EXCAVATOR had the best performance with relatively high precision and sensitivity. We suggest that the current algorithms for somatic CNV detection from WES data are limited in their performance and that more robust algorithms are needed.
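The abstract does not spell out how the 50% overlap criterion was computed. One plausible reading, sketched here as an assumption, is that a WES call counts as covered when at least half of its length is overlapped by reference CNVs. The helper names `covered_fraction` and `is_supported` are hypothetical, and intervals are treated as half-open `(start, end)` coordinates on a single chromosome.

```python
def covered_fraction(call, reference):
    """Fraction of the call's length overlapped by reference intervals."""
    start, end = call
    covered = 0
    cursor = start  # rightmost position already counted
    for ref_start, ref_end in sorted(reference):
        lo = max(ref_start, cursor)
        hi = min(ref_end, end)
        if hi > lo:
            covered += hi - lo
            cursor = hi  # avoids double counting overlapping references
    return covered / (end - start)

def is_supported(call, reference, threshold=0.5):
    """True when the call meets the (assumed) overlap threshold."""
    return covered_fraction(call, reference) >= threshold

reference_cnvs = [(150, 180), (160, 300)]
print(covered_fraction((100, 200), reference_cnvs))  # 0.5
print(is_supported((100, 200), [(190, 400)]))        # False
```

A reciprocal criterion, which would also require the call to cover half of the reference segment, would be stricter; the sketch implements only the one-directional version.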
|
46
|
Zhou L, Palais RA, Paxton CN, Geiersbach KB, Wittwer CT. Copy Number Assessment by Competitive PCR with Limiting Deoxynucleotide Triphosphates and High-Resolution Melting. Clin Chem 2015; 61:724-33. [DOI: 10.1373/clinchem.2014.236208] [Citation(s) in RCA: 24] [Impact Index Per Article: 2.4] [Received: 11/22/2014] [Accepted: 02/02/2015] [Indexed: 11/06/2022]
Abstract
BACKGROUND
DNA copy number variation is associated with genetic disorders and cancer. Available methods to discern variation in copy number are typically costly, slow, require specialized equipment, and/or lack precision.
METHODS
Multiplex PCR with different primer pairs and limiting deoxynucleotide triphosphates (dNTPs) (3–12 μmol/L) were used for relative quantification and copy number assessment. Small PCR products (50–121 bp) were designed with 1 melting domain, well-separated Tms, minimal internal sequence variation, and no common homologs. PCR products were displayed as melting curves on derivative plots and normalized to the reference peak. Different copy numbers of each target clustered together and were grouped by unbiased hierarchical clustering.
RESULTS
Duplex PCR of a reference gene and a target gene was used to detect copy number variation in chromosomes X, Y, 13, 18, 21, epidermal growth factor receptor (EGFR), survival of motor neuron 1, telomeric (SMN1), and survival of motor neuron 2, centromeric (SMN2). Triplex PCR was used for X and Y and CFTR exons 2 and 3. Blinded studies of 50 potential trisomic samples (13, 18, 21, or normal) and 50 samples with potential sex chromosome abnormalities were concordant with karyotyping, except for 2 samples that were originally mosaics that displayed a single karyotype after growth. Large cystic fibrosis transmembrane conductance regulator (ATP-binding cassette sub-family C, member 7) (CFTR) deletions, EGFR amplifications, and SMN1 and SMN2 copy number assessments were also demonstrated. Under ideal conditions, copy number changes of 1.11-fold or lower could be discerned with CVs of about 1%.
CONCLUSIONS
Relative quantification by restricting the dNTP concentration with melting curve display is a simple and precise way to assess targeted copy number variation.
Affiliation(s)
- Luming Zhou
- Department of Pathology, University of Utah School of Medicine, Salt Lake City, UT
| | | | - Christian N Paxton
- ARUP Institute for Clinical and Experimental Pathology, Salt Lake City, UT
| | - Katherine B Geiersbach
- Department of Pathology, University of Utah School of Medicine, Salt Lake City, UT
- ARUP Institute for Clinical and Experimental Pathology, Salt Lake City, UT
| | - Carl T Wittwer
- Department of Pathology, University of Utah School of Medicine, Salt Lake City, UT
- ARUP Institute for Clinical and Experimental Pathology, Salt Lake City, UT
|
47
|
Hybrid algorithms for multiple change-point detection in biological sequences. ADVANCES IN EXPERIMENTAL MEDICINE AND BIOLOGY 2015; 823:41-61. [PMID: 25381101 DOI: 10.1007/978-3-319-10984-8_3] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Indexed: 02/10/2023]
Abstract
Array comparative genomic hybridization (aCGH) is one of the techniques that can be used to detect copy number variations in DNA sequences in high resolution. It has been identified that abrupt changes in the human genome play a vital role in the progression and development of many complex diseases. In this study we propose two distinct hybrid algorithms that combine efficient sequential change-point detection procedures (the Shiryaev-Roberts procedure and the cumulative sum control chart (CUSUM) procedure) with the Cross-Entropy method, which is an evolutionary stochastic optimization technique to estimate both the number of change-points and their corresponding locations in aCGH data. The proposed hybrid algorithms are applied to both artificially generated data and real aCGH experimental data to illustrate their usefulness. Our results show that the proposed methodologies are effective in detecting multiple change-points in biological sequences of continuous measurements.
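As a minimal illustration of the CUSUM idea mentioned in the abstract, the sketch below locates a single mean shift by maximising the absolute cumulative sum of deviations from the overall mean. This is a textbook offline variant, not the hybrid Cross-Entropy algorithms the authors propose, and the function name `cusum_changepoint` and the toy data are hypothetical.

```python
def cusum_changepoint(values):
    """Return (k, statistic) for the most likely single change in mean.

    k is the number of observations in the first segment; the statistic is
    the largest absolute cumulative sum of deviations from the global mean.
    """
    n = len(values)
    mean = sum(values) / n
    running = 0.0
    best_k, best_stat = 0, 0.0
    for k in range(1, n):
        running += values[k - 1] - mean
        if abs(running) > best_stat:
            best_k, best_stat = k, abs(running)
    return best_k, best_stat

# Noise-free toy profile: log ratio 0 for 50 probes, then 1 for 50 probes.
toy_profile = [0.0] * 50 + [1.0] * 50
k, stat = cusum_changepoint(toy_profile)
print(k, stat)  # 50 25.0
```

Detecting several change-points requires either applying such a statistic recursively to the resulting segments or, as in the abstract, embedding it in a search over the number and locations of change-points.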
|
48
|
Priyadarshana WJRM, Sofronov G. Multiple Break-Points Detection in Array CGH Data via the Cross-Entropy Method. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2015; 12:487-498. [PMID: 26357234 DOI: 10.1109/tcbb.2014.2361639] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.5] [Indexed: 06/05/2023]
Abstract
Array comparative genome hybridization (aCGH) is a widely used methodology to detect copy number variations of a genome in high resolution. Knowing the number of break-points and their corresponding locations in genomic sequences serves different biological needs. Primarily, it helps to identify disease-causing genes that have functional importance in characterizing genome-wide diseases. For human autosomes the normal copy number is two, whereas at the sites of oncogenes it increases (gain of DNA) and at the tumour suppressor genes it decreases (loss of DNA). The majority of the current detection methods are deterministic in their set-up and use dynamic programming or different smoothing techniques to obtain the estimates of copy number variations. These approaches limit the search space of the problem because of the assumptions built into the methods, and they do not represent the true nature of the uncertainty associated with the unknown break-points in genomic sequences. We propose the Cross-Entropy method, which is a model-based stochastic optimization technique, as an exact search method to estimate both the number and locations of the break-points in aCGH data. We model the continuous scale log-ratio data obtained by the aCGH technique as a multiple break-point problem. The proposed methodology is compared with well-established publicly available methods using both artificially generated data and real data. Results show that the proposed procedure is an effective way of estimating the number and especially the locations of break-points with a high level of precision. Availability: The methods described in this article are implemented in the new R package breakpoint, which is available from the Comprehensive R Archive Network at http://CRAN.R-project.org/package=breakpoint.
|
49
|
Zhao C, Tynan J, Ehrich M, Hannum G, McCullough R, Saldivar JS, Oeth P, van den Boom D, Deciu C. Detection of fetal subchromosomal abnormalities by sequencing circulating cell-free DNA from maternal plasma. Clin Chem 2015; 61:608-16. [PMID: 25710461 DOI: 10.1373/clinchem.2014.233312] [Citation(s) in RCA: 118] [Impact Index Per Article: 11.8] [Indexed: 11/06/2022]
Abstract
BACKGROUND The development of sequencing-based noninvasive prenatal testing (NIPT) has been largely focused on whole-chromosome aneuploidies (chromosomes 13, 18, 21, X, and Y). Collectively, they account for only 30% of all live births with a chromosome abnormality. Various structural chromosome changes, such as microdeletion/microduplication (MD) syndromes, are more common but more challenging to detect. Recently, several publications have shown results on noninvasive detection of MDs by deep sequencing. These approaches demonstrated the proof of concept but are not economically feasible for large-scale clinical applications. METHODS We present a novel approach that uses low-coverage whole genome sequencing (approximately 0.2×) to detect MDs genome-wide without requiring prior knowledge of the event's location. We developed a normalization method to reduce sequencing noise. We then applied a statistical method to search for consistently increased or decreased regions. A decision tree was used to differentiate whole-chromosome events from MDs. RESULTS We demonstrated via a simulation study that the sensitivity difference between our method and the theoretical limit was <5% for MDs ≥9 Mb. We tested the performance in a blinded study in which the MDs ranged from 3 to 40 Mb. In this study, our algorithm correctly identified 17 of 18 cases with MDs and 156 of 157 unaffected cases. CONCLUSIONS The limit of detection for any given MD syndrome is constrained by 4 factors: fetal fraction, MD size, coverage, and biological and technical variability of the event region. Our algorithm takes these factors into account and achieved 94.4% sensitivity and 99.4% specificity.
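The sensitivity and specificity quoted in the conclusions follow directly from the blinded-study counts in the results (17 of 18 affected cases, 156 of 157 unaffected cases). A small sketch of that arithmetic, with the helper name `percent` chosen here purely for illustration:

```python
def percent(correct: int, total: int) -> float:
    """Share of correctly classified cases, as a percentage."""
    return 100.0 * correct / total


# Counts taken from the abstract's blinded study.
sensitivity = percent(17, 18)    # affected cases correctly identified
specificity = percent(156, 157)  # unaffected cases correctly identified

print(f"sensitivity = {sensitivity:.1f}%, specificity = {specificity:.1f}%")
```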
|
50
|
Du C, Kao CLM, Kou SC. Stepwise Signal Extraction via Marginal Likelihood. J Am Stat Assoc 2015; 111:314-330. [PMID: 27212739 PMCID: PMC4874345 DOI: 10.1080/01621459.2015.1006365] [Citation(s) in RCA: 18] [Impact Index Per Article: 1.8] [Received: 11/01/2013] [Revised: 12/01/2014] [Indexed: 10/24/2022]
Abstract
This paper studies the estimation of stepwise signals. To determine the number and locations of change-points of the stepwise signal, we formulate a maximum marginal likelihood estimator, which can be computed with a quadratic cost using dynamic programming. We carry out an extensive investigation of the choice of the prior distribution and study the asymptotic properties of the maximum marginal likelihood estimator. We propose to treat each possible set of change-points equally and adopt an empirical Bayes approach to specify the prior distribution of segment parameters. A detailed simulation study is performed to compare the effectiveness of this method with other existing methods. We demonstrate our method on single-molecule enzyme reaction data and on DNA array CGH data. Our study shows that this method is applicable to a wide range of models and offers appealing results in practice.
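To make the "quadratic cost using dynamic programming" remark concrete, here is a minimal sketch of a dynamic program for stepwise-signal segmentation. It is not the paper's estimator: the marginal likelihood is swapped for a plain within-segment sum of squared errors, and the number of segments is assumed to be given rather than estimated. The function name `segment_least_squares` and its parameters are hypothetical.

```python
from typing import List, Sequence


def segment_least_squares(values: Sequence[float], n_segments: int) -> List[int]:
    """Return the sorted start indices of segments 2..K (the change-points).

    Assumption: segments are scored by squared error around their mean,
    a simple stand-in for the marginal likelihood used in the paper.
    """
    n = len(values)
    if not 1 <= n_segments <= n:
        raise ValueError("n_segments must be between 1 and len(values)")

    # Prefix sums give the squared error of any segment in O(1).
    prefix = [0.0] * (n + 1)
    prefix_sq = [0.0] * (n + 1)
    for i, x in enumerate(values):
        prefix[i + 1] = prefix[i] + x
        prefix_sq[i + 1] = prefix_sq[i] + x * x

    def cost(start: int, end: int) -> float:
        # Sum of squared deviations from the mean over values[start:end].
        total = prefix[end] - prefix[start]
        return (prefix_sq[end] - prefix_sq[start]) - total * total / (end - start)

    inf = float("inf")
    # best[k][j]: minimal cost of splitting values[:j] into k segments.
    best = [[inf] * (n + 1) for _ in range(n_segments + 1)]
    back = [[0] * (n + 1) for _ in range(n_segments + 1)]
    best[0][0] = 0.0
    for k in range(1, n_segments + 1):
        for j in range(k, n + 1):
            for i in range(k - 1, j):
                candidate = best[k - 1][i] + cost(i, j)
                if candidate < best[k][j]:
                    best[k][j] = candidate
                    back[k][j] = i

    # Walk the back-pointers from the end to recover segment starts.
    change_points = []
    j = n
    for k in range(n_segments, 1, -1):
        j = back[k][j]
        change_points.append(j)
    return sorted(change_points)


# Toy stepwise signal with level changes at indices 5 and 10.
signal = [0.0] * 5 + [3.0] * 5 + [1.0] * 5
change_points = segment_least_squares(signal, n_segments=3)
print(change_points)
```

For a fixed number of segments the two inner loops visit every (start, end) pair once, which is where the quadratic dependence on the signal length comes from.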
Affiliation(s)
- Chao Du: Statistics, University of Virginia, Charlottesville, VA 22904
- Chu-Lan Michael Kao: Research Center of Adaptive Data Analysis, National Central University, Taoyuan County 32001, Taiwan
- S C Kou: Statistics, Harvard University, Cambridge, MA, 02138
|