1
Yu X, Qin F, Liu S, Brown NJ, Lu Q, Cai G, Guler JL, Xiao F. HapCNV: A Comprehensive Framework for CNV Detection in Low-input DNA Sequencing Data. bioRxiv 2025:2024.12.19.629494. [PMID: 39763944 PMCID: PMC11702719 DOI: 10.1101/2024.12.19.629494] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/15/2025]
Abstract
Copy number variants (CNVs) are prevalent in both diploid and haploid genomes, with the latter containing a single copy of each gene. Studying CNVs in genomes from single or few cells is significantly advancing our knowledge in human disorders and disease susceptibility. Low-input including low-cell and single-cell sequencing data for haploid and diploid organisms generally displays shallow and highly non-uniform read counts resulting from the whole genome amplification steps that introduce amplification biases. In addition, haploid organisms typically possess relatively short genomes and require a higher degree of DNA amplification compared to diploid organisms. However, most CNV detection methods are specifically developed for diploid genomes without specific consideration of effects on haploid genomes. Challenges also reside in reference samples or normal controls which are used to provide baseline signals for defining copy number losses or gains. In traditional methods, references are usually pre-specified from cells that are assumed to be normal or disease-free. However, the use of pre-defined reference cells can bias results if common CNVs are present. Here, we present the development of a comprehensive statistical framework for data normalization and CNV detection in haploid single- or low-cell DNA sequencing data called HapCNV. The prominent advancement is the construction of a novel genomic location specific pseudo-reference that selects unbiased references using a preliminary cell clustering method. This approach effectively preserves common CNVs. Using simulations, we demonstrated that HapCNV outperformed existing methods by generating more accurate CNV detection, especially for short CNVs. Superior performance of HapCNV was also validated in detecting known CNVs in a real P. falciparum parasite dataset. 
In conclusion, HapCNV provides a novel and useful approach for CNV detection in haploid low-input sequencing datasets, with easy applicability to diploids.
Affiliation(s)
- Xuanxuan Yu
  - Department of Epidemiology and Biostatistics, Arnold School of Public Health, University of South Carolina, Columbia, SC, USA
- Fei Qin
  - Division of Cancer Epidemiology and Genetics, National Cancer Institute, 9609 Medical Center Drive, Rockville, MD, 20850, USA
- Shiwei Liu
  - Center for Neuroimaging, Department of Radiology and Imaging Sciences, Indiana University School of Medicine, Indianapolis, Indiana, USA
- Noah J. Brown
  - Department of Biology, University of Virginia, Charlottesville, VA, USA
- Qing Lu
  - Department of Biostatistics, College of Public Health and Health Promotions & College of Medicine, University of Florida, Gainesville, FL, USA
- Guoshuai Cai
  - Department of Surgery, College of Medicine, University of Florida, Gainesville, FL, USA
- Jennifer L. Guler
  - Department of Biology, University of Virginia, Charlottesville, VA, USA
- Feifei Xiao
  - Department of Biostatistics, College of Public Health and Health Promotions & College of Medicine, University of Florida, Gainesville, FL, USA
2
Brown N, Luniewski A, Yu X, Warthan M, Liu S, Zulawinska J, Ahmad S, Congdon M, Santos W, Xiao F, Guler JL. Replication stress increases de novo CNVs across the malaria parasite genome. bioRxiv 2024:2024.12.19.629492. [PMID: 39803504 PMCID: PMC11722320 DOI: 10.1101/2024.12.19.629492] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Indexed: 01/18/2025]
Abstract
Changes in the copy number of large genomic regions, termed copy number variations (CNVs), contribute to important phenotypes in many organisms. CNVs are readily identified using conventional approaches when present in a large fraction of the cell population. However, CNVs that are present in only a few genomes across a population are often overlooked but important; if beneficial under specific conditions, a de novo CNV that arises in a single genome can expand during selection to create a larger population of cells with novel characteristics. While the reach of single cell methods to study de novo CNVs is increasing, we continue to lack information about CNV dynamics in rapidly evolving microbial populations. Here, we investigated de novo CNVs in the genome of the Plasmodium parasite that causes human malaria. The highly AT-rich P. falciparum genome readily accumulates CNVs that facilitate rapid adaptation to new drugs and host environments. We employed a low-input genomics approach optimized for this unique genome as well as specialized computational tools to evaluate the de novo CNV rate both before and after the application of stress. We observed a significant increase in genomewide de novo CNVs following treatment with a replication inhibitor. These stress-induced de novo CNVs encompassed genes that contribute to various cellular pathways and tended to be altered in clinical parasite genomes. This snapshot of CNV dynamics emphasizes the connection between replication stress, DNA repair, and CNV generation in this important microbial pathogen.
Affiliation(s)
- Noah Brown
  - University of Virginia, Department of Biology, Charlottesville, VA, USA
- Xuanxuan Yu
  - University of Florida, Department of Biostatistics, Gainesville, FL, USA
  - University of Florida, Department of Surgery, College of Medicine, Gainesville, FL, USA
- Michelle Warthan
  - University of Virginia, Department of Biology, Charlottesville, VA, USA
- Shiwei Liu
  - University of Virginia, Department of Biology, Charlottesville, VA, USA
  - Current affiliation: Indiana University School of Medicine, Indianapolis, IN, USA
- Julia Zulawinska
  - University of Virginia, Department of Biology, Charlottesville, VA, USA
- Syed Ahmad
  - University of Virginia, Department of Biology, Charlottesville, VA, USA
- Molly Congdon
  - Virginia Tech, Department of Chemistry, Blacksburg, VA, USA
- Webster Santos
  - Virginia Tech, Department of Chemistry, Blacksburg, VA, USA
- Feifei Xiao
  - University of Florida, Department of Biostatistics, Gainesville, FL, USA
- Jennifer L Guler
  - University of Virginia, Department of Biology, Charlottesville, VA, USA
3
Yu X, Luo X, Cai G, Xiao F. OSCAA: A two-dimensional Gaussian mixture model for copy number variation association analysis. Genet Epidemiol 2024. [PMID: 38533840 DOI: 10.1002/gepi.22558] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/15/2023] [Revised: 01/30/2024] [Accepted: 03/05/2024] [Indexed: 03/28/2024]
Abstract
Copy number variants (CNVs) are prevalent in the human genome and are found to have a profound effect on genomic organization and human diseases. Discovering disease-associated CNVs is critical for understanding the pathogenesis of diseases and aiding their diagnosis and treatment. However, traditional methods for assessing the association between CNVs and disease risks adopt a two-stage strategy conducting quantitative CNV measurements first and then testing for association, which may lead to biased association estimation and low statistical power, serving as a major barrier in routine genome-wide assessment of such variation. In this article, we developed One-Stage CNV-disease Association Analysis (OSCAA), a flexible algorithm to discover disease-associated CNVs for both quantitative and qualitative traits. OSCAA employs a two-dimensional Gaussian mixture model that is built upon the PCs from copy number intensities, accounting for technical biases in CNV detection while simultaneously testing for their effect on outcome traits. In OSCAA, CNVs are identified and their associations with disease risk are evaluated simultaneously in a single step, taking into account the uncertainty of CNV identification in the statistical model. Our simulations demonstrated that OSCAA outperformed the existing one-stage method and traditional two-stage methods by yielding a more accurate estimate of the CNV-disease association, especially for short CNVs or CNVs with weak signals. In conclusion, OSCAA is a powerful and flexible approach for CNV association testing with high sensitivity and specificity, which can be easily applied to different traits and clinical risk predictions.
Affiliation(s)
- Xuanxuan Yu
  - Department of Epidemiology and Biostatistics, Arnold School of Public Health, University of South Carolina, Columbia, South Carolina, USA
- Xizhi Luo
  - Data and Statistical Sciences, AbbVie Inc., North Chicago, Illinois, USA
- Guoshuai Cai
  - Department of Surgery, College of Medicine, University of Florida, Gainesville, Florida, USA
- Feifei Xiao
  - Department of Biostatistics, College of Public Health and Health Promotion & College of Medicine, University of Florida, Gainesville, Florida, USA
4
Yu X, Luo X, Cai G, Xiao F. OSCAA: A Two-Dimensional Gaussian Mixture Model for Copy Number Variation Association Analysis. bioRxiv 2023:2023.09.25.559392. [PMID: 37808739 PMCID: PMC10557568 DOI: 10.1101/2023.09.25.559392] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/10/2023]
Abstract
Copy number variants (CNVs) are prevalent in the human genome which provide profound effect on genomic organization and human diseases. Discovering disease associated CNVs is critical for understanding the pathogenesis of diseases and aiding their diagnosis and treatment. However, traditional methods for assessing the association between CNVs and disease risks adopt a two-stage strategy conducting quantitative CNV measurements first and then testing for association, which may lead to biased association estimation and low statistical power, serving as a major barrier in routine genome wide assessment of such variation. In this article, we developed OSCAA, a flexible algorithm to discover disease associated CNVs for both quantitative and qualitative traits. OSCAA employs a two-dimensional Gaussian mixture model that is built upon the principal components from copy number intensities, accounting for technical biases in CNV detection while simultaneously testing for their effect on outcome traits. In OSCAA, CNVs are identified and their associations with disease risk are evaluated simultaneously in a single step, taking into account the uncertainty of CNV identification in the statistical model. Our simulations demonstrated that OSCAA outperformed the existing one-stage method and traditional two-stage methods by yielding a more accurate estimate of the CNV-disease association, especially for short CNVs or CNVs with weak signal. In conclusion, OSCAA is a powerful and flexible approach for CNV association testing with high sensitivity and specificity, which can be easily applied to different traits and clinical risk predictions.
5
Cho H, Kirch C. Two-stage data segmentation permitting multiscale change points, heavy tails and dependence. Ann Inst Stat Math 2021. [DOI: 10.1007/s10463-021-00811-5] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 01/06/2023]
6
Qin F, Luo X, Cai G, Xiao F. Shall genomic correlation structure be considered in copy number variants detection? Brief Bioinform 2021; 22:6295811. [PMID: 34114005 DOI: 10.1093/bib/bbab215] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 04/16/2021] [Revised: 04/16/2021] [Accepted: 05/17/2021] [Indexed: 11/14/2022] Open
Abstract
Copy number variation has been identified as a major source of genomic variation associated with disease susceptibility. With the advent of whole-exome sequencing (WES) technology, massive WES data have been generated, allowing for the identification of copy number variants (CNVs) in the protein-coding regions with direct functional interpretation. We have previously shown evidence of the genomic correlation structure in array data and developed a novel chromosomal breakpoint detection algorithm, LDcnv, which showed significantly improved detection power through integrating the correlation structure in a systematic modeling manner. However, it remains unexplored whether the genomic correlation exists in WES data and how such correlation structure integration can improve the CNV detection accuracy. In this study, we first explored the correlation structure of the WES data using the 1000 Genomes Project data. Both real raw read depth and median-normalized data showed strong evidence of the correlation structure. Motivated by this fact, we proposed a correlation-based method, CORRseq, as a novel release of the LDcnv algorithm in profiling WES data. The performance of CORRseq was evaluated in extensive simulation studies and real data analysis from the 1000 Genomes Project. CORRseq outperformed the existing methods in detecting medium and large CNVs. In conclusion, it would be more advantageous to model genomic correlation structure in detecting relatively long CNVs. This study provides great insights for methodology development of CNV detection with NGS data.
Affiliation(s)
- Fei Qin
  - Department of Epidemiology and Biostatistics, Arnold School of Public Health, University of South Carolina (USC), Discovery 449, 915 Greene St, Columbia, SC 29208, USA
- Xizhi Luo
  - Department of Epidemiology and Biostatistics, Arnold School of Public Health, USC, Discovery 449, 915 Greene St, Columbia, SC 29208, USA
- Guoshuai Cai
  - Department of Environmental Health Science, Arnold School of Public Health, USC, Discovery 449, 915 Greene St, Columbia, SC 29208, USA
- Feifei Xiao
  - Department of Epidemiology and Biostatistics, Arnold School of Public Health, USC, Discovery 449, 915 Greene St, Columbia, SC 29208, USA
7
Luo X, Qin F, Cai G, Xiao F. Integrating genomic correlation structure improves copy number variations detection. Bioinformatics 2021; 37:312-317. [PMID: 32805016 DOI: 10.1093/bioinformatics/btaa737] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 04/29/2020] [Revised: 07/23/2020] [Accepted: 08/12/2020] [Indexed: 12/16/2022] Open
Abstract
MOTIVATION: Copy number variation plays important roles in human complex diseases. The detection of copy number variants (CNVs) is identifying mean shift in genetic intensities to locate chromosomal breakpoints, the step of which is referred to as chromosomal segmentation. Many segmentation algorithms have been developed with a strong assumption of independent observations in the genetic loci, and they assume each locus has an equal chance to be a breakpoint (i.e. boundary of CNVs). However, this assumption is violated in the genetics perspective due to the existence of correlation among genomic positions, such as linkage disequilibrium (LD). Our study showed that the LD structure is related to the location distribution of CNVs, which indeed presents a non-random pattern on the genome. To generate more accurate CNVs, we proposed a novel algorithm, LDcnv, that models the CNV data with its biological characteristics relating to genetic dependence structure (i.e. LD).
RESULTS: We theoretically demonstrated the correlation structure of CNV data in SNP array, which further supports the necessity of integrating biological structure in statistical methods for CNV detection. Therefore, we developed the LDcnv that integrated the genomic correlation structure with a local search strategy into statistical modeling of the CNV intensities. To evaluate the performance of LDcnv, we conducted extensive simulations and analyzed large-scale HapMap datasets. We showed that LDcnv presented high accuracy, stability and robustness in CNV detection and higher precision in detecting short CNVs compared to existing methods. This new segmentation algorithm has a wide scope of potential application with data from various high-throughput technology platforms.
AVAILABILITY AND IMPLEMENTATION: https://github.com/FeifeiXiaoUSC/LDcnv.
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Xizhi Luo
  - Department of Epidemiology and Biostatistics, Arnold School of Public Health, University of South Carolina, Columbia, SC 29208, USA
- Fei Qin
  - Department of Epidemiology and Biostatistics, Arnold School of Public Health, University of South Carolina, Columbia, SC 29208, USA
- Guoshuai Cai
  - Department of Environmental Health Science, Arnold School of Public Health, University of South Carolina, Columbia, SC 29208, USA
- Feifei Xiao
  - Department of Epidemiology and Biostatistics, Arnold School of Public Health, University of South Carolina, Columbia, SC 29208, USA
8
Xiao F, Luo X, Hao N, Niu YS, Xiao X, Cai G, Amos CI, Zhang H. An accurate and powerful method for copy number variation detection. Bioinformatics 2020; 35:2891-2898. [PMID: 30649252 DOI: 10.1093/bioinformatics/bty1041] [Citation(s) in RCA: 12] [Impact Index Per Article: 2.4] [Received: 09/06/2018] [Revised: 11/28/2018] [Accepted: 01/09/2019] [Indexed: 12/12/2022] Open
Abstract
MOTIVATION: Integration of multiple genetic sources for copy number variation detection (CNV) is a powerful approach to improve the identification of variants associated with complex traits. Although it has been shown that the widely used change point based methods can increase statistical power to identify variants, it remains challenging to effectively detect CNVs with weak signals due to the noisy nature of genotyping intensity data. We previously developed modSaRa, a normal mean-based model on a screening and ranking algorithm for copy number variation identification which presented desirable sensitivity with high computational efficiency. To boost statistical power for the identification of variants, here we present a novel improvement that integrates the relative allelic intensity with external information from empirical statistics with modeling, which we called modSaRa2.
RESULTS: Simulation studies illustrated that modSaRa2 markedly improved both sensitivity and specificity over existing methods for analyzing array-based data. The improvement in weak CNV signal detection is the most substantial, while it also simultaneously improves stability when CNV size varies. The application of the new method to a whole genome melanoma dataset identified novel candidate melanoma risk associated deletions on chromosome bands 1p22.2 and duplications on 6p22, 6q25 and 19p13 regions, which may facilitate the understanding of the possible roles of germline copy number variants in the etiology of melanoma.
AVAILABILITY AND IMPLEMENTATION: http://c2s2.yale.edu/software/modSaRa2 or https://github.com/FeifeiXiaoUSC/modSaRa2.
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Feifei Xiao
  - Department of Epidemiology and Biostatistics, University of South Carolina, Columbia, SC, USA
- Xizhi Luo
  - Department of Epidemiology and Biostatistics, University of South Carolina, Columbia, SC, USA
- Ning Hao
  - Department of Mathematics, University of Arizona, Tucson, AZ, USA
- Yue S Niu
  - Department of Mathematics, University of Arizona, Tucson, AZ, USA
- Xiangjun Xiao
  - Department of Quantitative Sciences, Baylor College of Medicine, Houston, TX, USA
- Guoshuai Cai
  - Department of Environmental Health Science, University of South Carolina, Columbia, SC, USA
- Christopher I Amos
  - Department of Quantitative Sciences, Baylor College of Medicine, Houston, TX, USA
- Heping Zhang
  - Department of Biostatistics, Yale School of Public Health, New Haven, CT, USA