1. Li F, Xiao Y, Chen Z. Estimation of common breaks in linear panel data models via screening and ranking algorithm. Sci Rep 2025; 15:11338. [PMID: 40175690 PMCID: PMC11965510 DOI: 10.1038/s41598-025-96322-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/26/2024] [Accepted: 03/27/2025] [Indexed: 04/04/2025] Open
Abstract
In this paper, we consider the estimation of common breaks for linear panel data models by means of a screening and ranking algorithm. For static and dynamic panel data models, we estimate the regression coefficients using covariance estimation and the generalized method of moments, respectively, and apply a screening and ranking algorithm on this basis. The possible break points are first screened by constructing local statistics based on the coefficient estimators, then filtered by a thresholding rule, and finally selected by an information criterion. Monte Carlo simulations demonstrate that the proposed methods work well in finite samples. We apply the screening and ranking algorithm to study the influence of rural residents' consumption demand on China's economic growth using a panel of 31 provinces from 2005 to 2023 and find a break point in the model.
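The screening and thresholding steps described in this abstract can be illustrated with a minimal sketch. It is not the authors' panel-data procedure: it assumes the simpler setting of mean shifts in a single sequence, uses a difference of window means as the local statistic, and omits the final information-criterion step. The function names, the bandwidth `h`, and the `threshold` value are hypothetical choices made for illustration.

```python
import numpy as np


def local_diagnostic(y, h):
    """Local statistic D[t] = |mean(y[t:t+h]) - mean(y[t-h:t])|.

    Defined for h <= t <= n - h; other positions are left at 0.
    """
    y = np.asarray(y, dtype=float)
    n = len(y)
    c = np.concatenate(([0.0], np.cumsum(y)))  # prefix sums for O(1) window means
    d = np.zeros(n)
    for t in range(h, n - h + 1):
        right = (c[t + h] - c[t]) / h
        left = (c[t] - c[t - h]) / h
        d[t] = abs(right - left)
    return d


def sara_candidates(y, h, threshold):
    """Screening: keep positions where D is a local maximum within +/- h.
    Thresholding: keep only those local maxima with D >= threshold.
    (Illustrative assumption: no information-criterion refinement.)
    """
    d = local_diagnostic(y, h)
    n = len(d)
    picks = []
    for t in range(h, n - h + 1):
        lo, hi = max(0, t - h), min(n, t + h + 1)
        if d[t] >= threshold and d[t] == d[lo:hi].max():
            picks.append(t)
    return picks
```

On a noise-free sequence that jumps from 0 to 3 at index 50, calling `sara_candidates(y, 10, 1.0)` flags position 50 only. With noisy data, the bandwidth and threshold would need to be tuned, and a final model-selection step would be applied to the surviving candidates.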
Affiliation(s)
- Fuxiao Li
- Department of Applied Mathematics, Xi'an University of Technology, Xi'an, 710054, China
- Yanting Xiao
- Department of Applied Mathematics, Xi'an University of Technology, Xi'an, 710054, China
- Zhanshou Chen
- School of Mathematics and Statistics, Qinghai Normal University, Xining, 810008, China
- Academy of Plateau Science and Sustainability, Qinghai Normal University, Xining, 810008, China

2. Gudkov M, Thibaut L, Khushi M, Blue GM, Winlaw DS, Dunwoodie SL, Giannoulatou E. ConanVarvar: a versatile tool for the detection of large syndromic copy number variation from whole-genome sequencing data. BMC Bioinformatics 2023; 24:49. [PMID: 36792982 PMCID: PMC9930243 DOI: 10.1186/s12859-023-05154-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/02/2022] [Accepted: 01/19/2023] [Indexed: 02/17/2023] Open
Abstract
BACKGROUND: A wide range of tools are available for the detection of copy number variants (CNVs) from whole-genome sequencing (WGS) data. However, none of them focus on clinically relevant CNVs, such as those that are associated with known genetic syndromes. Such variants are often large in size, typically 1-5 Mb, but currently available CNV callers have been developed and benchmarked for the discovery of smaller variants. Thus, the ability of these programs to detect tens of real syndromic CNVs remains largely unknown. RESULTS: Here we present ConanVarvar, a tool which implements a complete workflow for the targeted analysis of large germline CNVs from WGS data. ConanVarvar comes with an intuitive R Shiny graphical user interface and annotates identified variants with information about 56 associated syndromic conditions. We benchmarked ConanVarvar and four other programs on a dataset containing real and simulated syndromic CNVs larger than 1 Mb. In comparison to other tools, ConanVarvar reports 10-30 times fewer false-positive variants without compromising sensitivity and is quicker to run, especially on large batches of samples. CONCLUSIONS: ConanVarvar is a useful instrument for primary analysis in disease sequencing studies, where large CNVs could be the cause of disease.
Affiliation(s)
- Mikhail Gudkov
- Victor Chang Cardiac Research Institute, Sydney, NSW 2010, Australia; School of Biomedical Engineering, The University of Sydney, Sydney, NSW 2006, Australia; St Vincent’s Clinical Campus, School of Clinical Medicine, Faculty of Medicine and Health, UNSW Sydney, Sydney, NSW 2010, Australia
- Loïc Thibaut
- Victor Chang Cardiac Research Institute, Sydney, NSW 2010, Australia; School of Mathematics and Statistics, UNSW Sydney, Sydney, NSW 2052, Australia
- Matloob Khushi
- School of Computer Science, The University of Sydney, Sydney, NSW 2006, Australia
- Gillian M. Blue
- Sydney Medical School, The University of Sydney, Sydney, NSW 2006, Australia; Heart Centre for Children, The Children’s Hospital at Westmead, Sydney, NSW 2145, Australia
- David S. Winlaw
- Sydney Medical School, The University of Sydney, Sydney, NSW 2006, Australia; Heart Centre for Children, The Children’s Hospital at Westmead, Sydney, NSW 2145, Australia
- Sally L. Dunwoodie
- Victor Chang Cardiac Research Institute, Sydney, NSW 2010, Australia; St Vincent’s Clinical Campus, School of Clinical Medicine, Faculty of Medicine and Health, UNSW Sydney, Sydney, NSW 2010, Australia; School of Biotechnology and Biomolecular Sciences, UNSW Sydney, Sydney, NSW 2052, Australia
- Eleni Giannoulatou
- Victor Chang Cardiac Research Institute, Sydney, NSW 2010, Australia; St Vincent's Clinical Campus, School of Clinical Medicine, Faculty of Medicine and Health, UNSW Sydney, Sydney, NSW 2010, Australia

3. Cheng A, Mao D, Zhang Y, Glaz J, Ouyang Z. Translocation detection from Hi-C data via scan statistics. Biometrics 2022. [PMID: 35861170 DOI: 10.1111/biom.13724] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/04/2019] [Accepted: 05/02/2022] [Indexed: 11/30/2022]
Abstract
Recent Hi-C technology enables more comprehensive chromosomal conformation research, including the detection of structural variations, especially translocations. In this paper, we formulate inter-chromosomal translocation detection as a problem of scan clustering in a spatial point process. We then develop TranScan, a new translocation detection method based on scan statistics with control of false discovery. The simulation shows that TranScan is more powerful than an existing sophisticated scan clustering method, especially under strong signal situations. Evaluation of TranScan against current translocation detection methods on realistic breakpoint simulations generated from real data suggests better discriminative power under the receiver operating characteristic curve. Power analysis also highlights TranScan's consistent outperformance when sequencing depth and heterozygosity rate are varied. Comparatively, its Type I error rate is lowest when evaluated using a karyotypically normal cell line. Both the simulation and real data analysis indicate that TranScan has great potential in inter-chromosomal translocation detection using Hi-C data.
Affiliation(s)
- Anthony Cheng
- Department of Genetics and Genome Sciences, University of Connecticut, Farmington, CT, USA; The Jackson Laboratory for Genomic Medicine, Farmington, CT, USA
- Disheng Mao
- Department of Statistics, University of Connecticut, Storrs, CT, USA
- Yuping Zhang
- Department of Statistics, University of Connecticut, Storrs, CT, USA
- Joseph Glaz
- Department of Statistics, University of Connecticut, Storrs, CT, USA
- Zhengqing Ouyang
- Department of Biostatistics and Epidemiology, University of Massachusetts Amherst, Amherst, MA, USA

4. Cho H, Kirch C. Two-stage data segmentation permitting multiscale change points, heavy tails and dependence. Ann Inst Stat Math 2021. [DOI: 10.1007/s10463-021-00811-5] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 01/06/2023]

5. Luo X, Qin F, Cai G, Xiao F. Integrating genomic correlation structure improves copy number variations detection. Bioinformatics 2021; 37:312-317. [PMID: 32805016 DOI: 10.1093/bioinformatics/btaa737] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 04/29/2020] [Revised: 07/23/2020] [Accepted: 08/12/2020] [Indexed: 12/16/2022] Open
Abstract
MOTIVATION: Copy number variation plays important roles in human complex diseases. Detecting copy number variants (CNVs) amounts to identifying mean shifts in genetic intensities to locate chromosomal breakpoints, a step referred to as chromosomal segmentation. Many segmentation algorithms have been developed with a strong assumption of independent observations in the genetic loci, and they assume each locus has an equal chance to be a breakpoint (i.e. boundary of CNVs). However, this assumption is violated from the genetics perspective due to the existence of correlation among genomic positions, such as linkage disequilibrium (LD). Our study showed that the LD structure is related to the location distribution of CNVs, which indeed presents a non-random pattern on the genome. To generate more accurate CNVs, we proposed a novel algorithm, LDcnv, that models the CNV data with its biological characteristics relating to genetic dependence structure (i.e. LD). RESULTS: We theoretically demonstrated the correlation structure of CNV data in SNP array, which further supports the necessity of integrating biological structure in statistical methods for CNV detection. Therefore, we developed LDcnv, which integrates the genomic correlation structure with a local search strategy into statistical modeling of the CNV intensities. To evaluate the performance of LDcnv, we conducted extensive simulations and analyzed large-scale HapMap datasets. We showed that LDcnv presented high accuracy, stability and robustness in CNV detection and higher precision in detecting short CNVs compared to existing methods. This new segmentation algorithm has a wide scope of potential application with data from various high-throughput technology platforms. AVAILABILITY AND IMPLEMENTATION: https://github.com/FeifeiXiaoUSC/LDcnv. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Xizhi Luo
- Department of Epidemiology and Biostatistics, Arnold School of Public Health, University of South Carolina, Columbia, SC 29208, USA
- Fei Qin
- Department of Epidemiology and Biostatistics, Arnold School of Public Health, University of South Carolina, Columbia, SC 29208, USA
- Guoshuai Cai
- Department of Environmental Health Science, Arnold School of Public Health, University of South Carolina, Columbia, SC 29208, USA
- Feifei Xiao
- Department of Epidemiology and Biostatistics, Arnold School of Public Health, University of South Carolina, Columbia, SC 29208, USA

6. Hao N, Niu YS, Xiao F, Zhang H. A super scalable algorithm for short segment detection. Stat Biosci 2021; 13:18-33. [PMID: 33737983 DOI: 10.1007/s12561-020-09278-z] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 11/24/2022]
Abstract
In many applications such as copy number variant (CNV) detection, the goal is to identify short segments on which the observations have different means or medians from the background. Those segments are usually short and hidden in a long sequence, and hence are very challenging to find. We study a super scalable short segment (4S) detection algorithm in this paper. This nonparametric method detects segments by clustering the locations where the observations exceed a threshold. It is computationally efficient and does not rely on a Gaussian noise assumption. Moreover, we develop a framework to assign significance levels for detected segments. We demonstrate the advantages of our proposed method by theoretical, simulation, and real data studies.
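The idea of clustering threshold exceedances can be sketched as follows. This is a simplified illustration rather than the published 4S procedure: it assumes a background centered at zero, and the parameters `max_gap` (largest index gap allowed inside one cluster) and `min_points` (smallest number of exceedances for a cluster to be reported) are hypothetical names chosen for this example. The significance-level framework mentioned in the abstract is not covered.

```python
def short_segments(y, threshold, max_gap=1, min_points=3):
    """Return (start, end) index pairs of clustered threshold exceedances.

    A point is an exceedance if |y[i]| > threshold. Consecutive exceedances
    whose indices differ by at most max_gap are merged into one cluster, and
    clusters with fewer than min_points exceedances are discarded.
    """
    exceed = [i for i, v in enumerate(y) if abs(v) > threshold]
    segments = []
    start = prev = None
    count = 0
    for i in exceed:
        if start is None:
            start, prev, count = i, i, 1
        elif i - prev <= max_gap:
            prev = i
            count += 1
        else:
            # Close the current cluster and start a new one at i.
            if count >= min_points:
                segments.append((start, prev))
            start, prev, count = i, i, 1
    if start is not None and count >= min_points:
        segments.append((start, prev))
    return segments
```

Because the method makes a single pass over the data and never estimates a noise distribution, the cost is linear in the sequence length, which is in the spirit of the scalability and nonparametric claims above. Isolated spikes are dropped by the `min_points` rule, while `max_gap` lets a segment survive an occasional observation that falls below the threshold.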
Affiliation(s)
- Ning Hao
- Department of Mathematics, University of Arizona, Tucson, AZ 85721
- Yue Selena Niu
- Department of Mathematics, University of Arizona, Tucson, AZ 85721
- Feifei Xiao
- Department of Epidemiology and Biostatistics, University of South Carolina, Columbia, SC 29201
- Heping Zhang
- Department of Biostatistics, Yale School of Public Health, New Haven, CT 06510

7. Zhao Z, Yau CY. Alternating Pruned Dynamic Programming for Multiple Epidemic Change-Point Estimation. J Comput Graph Stat 2021. [DOI: 10.1080/10618600.2020.1868304] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 10/22/2022]
Affiliation(s)
- Zifeng Zhao
- Mendoza College of Business, University of Notre Dame, Notre Dame, IN
- Chun Yip Yau
- Department of Statistics, Chinese University of Hong Kong, Shatin, NT, Hong Kong

8. Yan Q, Liu Y, Liu S, Ma T. Change-point detection based on adjusted shape context cost method. Inf Sci (N Y) 2021. [DOI: 10.1016/j.ins.2020.08.112] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/01/2022]

9. Xiao F, Luo X, Hao N, Niu YS, Xiao X, Cai G, Amos CI, Zhang H. An accurate and powerful method for copy number variation detection. Bioinformatics 2020; 35:2891-2898. [PMID: 30649252 DOI: 10.1093/bioinformatics/bty1041] [Citation(s) in RCA: 12] [Impact Index Per Article: 2.4] [Received: 09/06/2018] [Revised: 11/28/2018] [Accepted: 01/09/2019] [Indexed: 12/12/2022] Open
Abstract
MOTIVATION: Integration of multiple genetic sources for copy number variation (CNV) detection is a powerful approach to improve the identification of variants associated with complex traits. Although it has been shown that the widely used change point based methods can increase statistical power to identify variants, it remains challenging to effectively detect CNVs with weak signals due to the noisy nature of genotyping intensity data. We previously developed modSaRa, a normal mean-based model on a screening and ranking algorithm for copy number variation identification which presented desirable sensitivity with high computational efficiency. To boost statistical power for the identification of variants, here we present a novel improvement, which we call modSaRa2, that integrates the relative allelic intensity with external information from empirical statistics with modeling. RESULTS: Simulation studies illustrated that modSaRa2 markedly improved both sensitivity and specificity over existing methods for analyzing array-based data. The improvement in weak CNV signal detection is the most substantial, and the method also improves stability when CNV size varies. The application of the new method to a whole genome melanoma dataset identified novel candidate melanoma risk associated deletions on chromosome bands 1p22.2 and duplications on 6p22, 6q25 and 19p13 regions, which may facilitate the understanding of the possible roles of germline copy number variants in the etiology of melanoma. AVAILABILITY AND IMPLEMENTATION: http://c2s2.yale.edu/software/modSaRa2 or https://github.com/FeifeiXiaoUSC/modSaRa2. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Feifei Xiao
- Department of Epidemiology and Biostatistics, University of South Carolina, Columbia, SC, USA
- Xizhi Luo
- Department of Epidemiology and Biostatistics, University of South Carolina, Columbia, SC, USA
- Ning Hao
- Department of Mathematics, University of Arizona, Tucson, AZ, USA
- Yue S Niu
- Department of Mathematics, University of Arizona, Tucson, AZ, USA
- Xiangjun Xiao
- Department of Quantitative Sciences, Baylor College of Medicine, Houston, TX, USA
- Guoshuai Cai
- Department of Environmental Health Science, University of South Carolina, Columbia, SC, USA
- Christopher I Amos
- Department of Quantitative Sciences, Baylor College of Medicine, Houston, TX, USA
- Heping Zhang
- Department of Biostatistics, Yale School of Public Health, New Haven, CT, USA

10. Chen J, Deng S. Detection of Copy Number Variation Regions Using the DNA-Sequencing Data from Multiple Profiles with Correlated Structure. J Comput Biol 2018; 25:1128-1140. [PMID: 30052071 DOI: 10.1089/cmb.2018.0053] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Indexed: 11/12/2022] Open
Abstract
In this article, we investigate the problem of detecting boundaries of DNA copy number variation (CNV) regions using the DNA-sequencing data from multiple subject samples. Genomic features along the linear realization of the actual genome are correlated, especially within the vicinity of a locus, and so are the sequencing reads along the genome. It is therefore crucial, from both statistical and computational viewpoints, to take the correlated structure of such high-throughput genomic data into consideration when modeling DNA-sequencing data for CNV detection. We use the framework of a fused Lasso latent feature model to solve the problem, and propose a modified information criterion for selecting the tuning parameter when searching for common CNVs shared by multiple subjects. Simulation studies and an application to multiple subjects' next-generation sequencing data, downloaded from the 1000 Genome Project, showed that the proposed approach can effectively identify individual CNVs of a single subject profile and common CNVs shared by multiple subjects.
Affiliation(s)
- Jie Chen
- Division of Biostatistics and Data Science, Department of Population Health Sciences, Medical College of Georgia, Augusta University, Augusta, Georgia
- Shirong Deng
- School of Mathematics and Statistics, Wuhan University, Wuhan, China

11. Xiao F, Niu Y, Hao N, Xu Y, Jin Z, Zhang H. modSaRa: a computationally efficient R package for CNV identification. Bioinformatics 2018; 33:2384-2385. [PMID: 28453611 DOI: 10.1093/bioinformatics/btx212] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.1] [Received: 02/15/2017] [Indexed: 01/02/2023] Open
Abstract
Summary: Chromosomal copy number variation (CNV) refers to a polymorphism in which a DNA segment is deleted or duplicated in the population. The computational algorithms developed to identify this type of variation are usually of high computational complexity. Here we present a user-friendly R package, modSaRa, designed to perform copy number variant identification. The package is based on a change-point method with optimal computational complexity and desirable accuracy. The current version of the modSaRa package is a comprehensive tool that integrates preprocessing steps and the main CNV calling steps. Availability and Implementation: modSaRa is an R package written in R, C++ and Rcpp and is freely available for download at http://c2s2.yale.edu/software/modSaRa. Contact: heping.zhang@yale.edu. Supplementary information: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Feifei Xiao
- Department of Epidemiology and Biostatistics, University of South Carolina, Columbia, SC 29201, USA
- Yue Niu
- Department of Mathematics, University of Arizona, Tucson, AZ 85721, USA
- Ning Hao
- Department of Mathematics, University of Arizona, Tucson, AZ 85721, USA
- Yanxun Xu
- Department of Applied Mathematics & Statistics, Johns Hopkins University, Baltimore, MD 21218, USA
- Zhilin Jin
- Department of Applied Mathematics & Statistics, Johns Hopkins University, Baltimore, MD 21218, USA
- Heping Zhang
- Department of Biostatistics, Yale School of Public Health, New Haven, CT 06520, USA

12. Song C, Min X, Zhang H. The screening and ranking algorithm for change-points detection in multiple samples. Ann Appl Stat 2017; 10:2102-2129. [PMID: 28090239 DOI: 10.1214/16-aoas966] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.9] [Indexed: 11/19/2022]
Abstract
Chromosome copy number variation (CNV) is the deviation of genomic regions from their normal copy number states, which may be associated with many human diseases. Current genetic studies usually collect hundreds to thousands of samples to study the association between CNV and diseases. CNVs can be called by detecting the change-points in mean for sequences of array-based intensity measurements. Although multiple samples are of interest, the majority of the available CNV calling methods are single sample based. Only a few multiple sample methods have been proposed using scan statistics that are computationally intensive and designed toward either common or rare change-points detection. In this paper, we propose a novel multiple sample method by adaptively combining the scan statistic of the screening and ranking algorithm (SaRa), which is computationally efficient and is able to detect both common and rare change-points. We prove that asymptotically this method can find the true change-points with almost certainty and show in theory that multiple sample methods are superior to single sample methods when shared change-points are of interest. Additionally, we report extensive simulation studies to examine the performance of our proposed method. Finally, using our proposed method as well as two competing approaches, we attempt to detect CNVs in the data from the Primary Open-Angle Glaucoma Genes and Environment study, and conclude that our method is faster and requires less information while our ability to detect the CNVs is comparable or better.