1
Yu Z, Liu F, Li Y. scTCA: a hybrid Transformer-CNN architecture for imputation and denoising of scDNA-seq data. Brief Bioinform 2024; 25:bbae577. [PMID: 39523623 PMCID: PMC11551055 DOI: 10.1093/bib/bbae577] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/19/2024] [Revised: 10/05/2024] [Accepted: 10/29/2024] [Indexed: 11/16/2024] Open
Abstract
Single-cell DNA sequencing (scDNA-seq) has been widely used to uncover tumor copy number alterations (CNAs) at single-cell resolution. Although arm-level CNAs can be accurately detected from single-cell read counts, focal CNAs are difficult to identify precisely because the read counts are high-dimensional, highly sparse, and have a low signal-to-noise ratio. This creates a pressing need for reconstructing high-quality scDNA-seq data. We develop a new method called scTCA for imputation and denoising of single-cell read counts, thus aiding downstream analysis of both arm-level and focal CNAs. scTCA employs hybrid Transformer-CNN architectures to identify local and non-local correlations between genes for precise recovery of the read counts. Unlike conventional Transformers, the Transformer block in scTCA is a two-stage attention module containing a stepwise self-attention layer and a window Transformer, and it can efficiently handle high-dimensional read count data. We demonstrate the superior performance of scTCA through comparison with state-of-the-art methods on both synthetic and real datasets. The results indicate that it is highly effective in imputation and denoising of scDNA-seq data.
Affiliation(s)
- Zhenhua Yu
  - School of Information Engineering, Ningxia University, 750021 Ningxia, China
  - Ningxia Key Laboratory of Artificial Intelligence and Information Security for Channeling Computing Resources from the East to the West, Ningxia University, 750021 Ningxia, China
- Furui Liu
  - School of Information Engineering, Ningxia University, 750021 Ningxia, China
- Yang Li
  - School of Information Engineering, Ningxia University, 750021 Ningxia, China
2
Liu F, Shi F, Du F, Cao X, Yu Z. CoT: a transformer-based method for inferring tumor clonal copy number substructure from scDNA-seq data. Brief Bioinform 2024; 25:bbae187. [PMID: 38670159 PMCID: PMC11052634 DOI: 10.1093/bib/bbae187] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/06/2023] [Revised: 03/08/2024] [Accepted: 04/16/2024] [Indexed: 04/28/2024] Open
Abstract
Single-cell DNA sequencing (scDNA-seq) is an effective means of resolving intra-tumor heterogeneity, but joint inference of tumor clones and their respective copy number profiles remains challenging because scDNA-seq data are noisy. We introduce a new bioinformatics method called CoT for deciphering clonal copy number substructure. The backbone of CoT is a Copy number Transformer autoencoder that leverages a multi-head attention mechanism to explore correlations between different genomic regions, and thus captures global features to create latent embeddings for the cells. CoT first infers cell subpopulations from the learned embeddings, and then estimates single-cell copy numbers through joint analysis of the read count data for cells belonging to the same cluster. Exploiting clonal substructure information in copy number analysis helps to alleviate the effect of read count non-uniformity and yields robust estimates of the tumor copy numbers. Performance evaluation on synthetic and real datasets shows that CoT outperforms state-of-the-art methods and is highly useful for deciphering clonal copy number substructure.
Affiliation(s)
- Furui Liu
  - School of Information Engineering, Ningxia University, 750021, Ningxia, China
- Fangyuan Shi
  - School of Information Engineering, Ningxia University, 750021, Ningxia, China
  - Collaborative Innovation Center for Ningxia Big Data and Artificial Intelligence Co-founded by Ningxia Municipality and Ministry of Education, Ningxia University, 750021, Ningxia, China
- Fang Du
  - School of Information Engineering, Ningxia University, 750021, Ningxia, China
  - Collaborative Innovation Center for Ningxia Big Data and Artificial Intelligence Co-founded by Ningxia Municipality and Ministry of Education, Ningxia University, 750021, Ningxia, China
- Xiangmei Cao
  - Basic Medical School, Ningxia Medical University, 750001, Ningxia, China
- Zhenhua Yu
  - School of Information Engineering, Ningxia University, 750021, Ningxia, China
  - Collaborative Innovation Center for Ningxia Big Data and Artificial Intelligence Co-founded by Ningxia Municipality and Ministry of Education, Ningxia University, 750021, Ningxia, China
3
Liu F, Shi F, Yu Z. Inferring single-cell copy number profiles through cross-cell segmentation of read counts. BMC Genomics 2024; 25:25. [PMID: 38166601 PMCID: PMC10762977 DOI: 10.1186/s12864-023-09901-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/17/2023] [Accepted: 12/12/2023] [Indexed: 01/05/2024] Open
Abstract
BACKGROUND Copy number alteration (CNA) is one of the major genomic variations that frequently occur in cancers, and accurate inference of CNAs is essential for revealing intra-tumor heterogeneity (ITH) and tumor evolutionary history. Single-cell DNA sequencing (scDNA-seq) makes it convenient to profile CNAs at single-cell resolution, and thus aids in better characterization of ITH. Although several computational methods have been proposed to decipher single-cell CNAs, their performance is limited in either breakpoint detection or copy number estimation due to the high dimensionality and noisy nature of read count data.
RESULTS By treating breakpoint detection as the segmentation of a high-dimensional read count sequence, we develop a novel method called DeepCNA for cross-cell segmentation of read count sequences and per-cell inference of CNAs. To cope with the difficulty of segmentation, DeepCNA employs an autoencoder (AE) network to project the original data into a low-dimensional space, where breakpoints can be efficiently detected along each latent dimension and then merged to obtain the final breakpoints. Unlike existing methods that manually calculate certain statistics of read counts to find breakpoints, the AE model learns the representations automatically. Based on the inferred breakpoints, we employ a mixture model to predict the copy numbers of segments for each cell, and leverage an expectation-maximization algorithm to efficiently estimate cell ploidy by exploring the most abundant copy number state. Benchmarking results on simulated and real data demonstrate that our method accurately infers breakpoints as well as absolute copy numbers and surpasses existing methods under different test conditions. DeepCNA can be accessed at https://github.com/zhyu-lab/deepcna.
CONCLUSIONS Profiling single-cell CNAs based on deep learning is becoming a new paradigm of scDNA-seq data analysis, and DeepCNA is an enhancement to the current arsenal of computational methods for investigating cancer genomics.
Affiliation(s)
- Furui Liu
  - School of Information Engineering, Ningxia University, Yinchuan, 750021, China
- Fangyuan Shi
  - School of Information Engineering, Ningxia University, Yinchuan, 750021, China
  - Collaborative Innovation Center for Ningxia Big Data and Artificial Intelligence Co-Founded By Ningxia Municipality and Ministry of Education, Ningxia University, Yinchuan, 750021, China
- Zhenhua Yu
  - School of Information Engineering, Ningxia University, Yinchuan, 750021, China
  - Collaborative Innovation Center for Ningxia Big Data and Artificial Intelligence Co-Founded By Ningxia Municipality and Ministry of Education, Ningxia University, Yinchuan, 750021, China
4
Rossi N, Gigante N, Vitacolonna N, Piazza C. Inferring Markov Chains to Describe Convergent Tumor Evolution With CIMICE. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2024; 21:106-119. [PMID: 38015671 DOI: 10.1109/tcbb.2023.3337258] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/30/2023]
Abstract
The field of tumor phylogenetics focuses on studying the differences within cancer cell populations. Considerable effort has been devoted within the scientific community to building cancer progression models that help explain the heterogeneity of such diseases. These models are highly dependent on the kind of data used for their construction; therefore, as experimental technologies evolve, it is of major importance to exploit their peculiarities. In this work we describe a cancer progression model based on single-cell DNA sequencing data. When constructing the model, we focus on tailoring the formalism to the specificity of the data. We define a minimal set of assumptions needed to reconstruct a flexible DAG-structured model, capable of identifying progression beyond the limitation of the infinite sites assumption. Our proposal is conservative in the sense that we aim to neither discard nor infer knowledge which is not represented in the data. We provide simulations and analytical results to show the features of our model, test it on real data, and show how it can be integrated with other approaches to cope with input noise. Moreover, our framework can be exploited to produce simulated data that follows our theoretical assumptions. Finally, we provide an open source R implementation of our approach, called CIMICE, that is publicly available on Bioconductor.
5
Feng X, Chen L. SCSilicon: a tool for synthetic single-cell DNA sequencing data generation. BMC Genomics 2022; 23:359. [PMID: 35546390 PMCID: PMC9092674 DOI: 10.1186/s12864-022-08566-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/18/2022] [Accepted: 04/19/2022] [Indexed: 11/25/2022] Open
Abstract
Background Single-cell DNA sequencing is becoming indispensable in the study of cell-specific cancer genomics. The performance of computational tools that tackle single-cell genome aberrations may nevertheless be undervalued or overvalued, owing to the insufficient size of benchmarking data. In silico simulation is a cost-effective approach to generating as many single-cell genomes as needed in a controlled manner for reliable and valid benchmarking.
Results This study proposes a new tool, SCSilicon, which efficiently generates single-cell in silico DNA reads with minimal manual intervention. SCSilicon automatically creates a set of genomic aberrations, including SNPs, SNVs, indels, and CNVs. In addition, SCSilicon yields the ground truth of CNV segmentation breakpoints and subclone cell labels. We have manually inspected a series of synthetic variations. We conducted a sanity check of state-of-the-art single-cell CNV callers and found SCYN to be the most robust.
Conclusions SCSilicon is a user-friendly software package for developing and benchmarking single-cell CNV callers. Source code of SCSilicon is available at https://github.com/xikanfeng2/SCSilicon.
Supplementary Information The online version contains supplementary material available at 10.1186/s12864-022-08566-w.
Affiliation(s)
- Xikang Feng
  - School of Software, Northwestern Polytechnical University, Xi'an, Shaanxi, 710072, China
- Lingxi Chen
  - Department of Computer Science, City University of Hong Kong, Tat Chee Avenue, Kowloon, Hong Kong, China
6
Xi J, Yu Z. Editorial: Unsupervised Learning Models for Unlabeled Genomic, Transcriptomic & Proteomic Data. Front Genet 2021; 12:781698. [PMID: 34858487 PMCID: PMC8631860 DOI: 10.3389/fgene.2021.781698] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/23/2021] [Accepted: 10/25/2021] [Indexed: 11/13/2022] Open
Affiliation(s)
- Jianing Xi
  - School of Artificial Intelligence, Optics and Electronics (iOPEN), Northwestern Polytechnical University, Xi'an, China
- Zhenhua Yu
  - School of Information Engineering, Ningxia University, Yinchuan, China
7
Feng X, Chen L, Qing Y, Li R, Li C, Li SC. SCYN: single cell CNV profiling method using dynamic programming. BMC Genomics 2021; 22:651. [PMID: 34789142 PMCID: PMC8596905 DOI: 10.1186/s12864-021-07941-3] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Received: 08/17/2021] [Accepted: 08/20/2021] [Indexed: 11/11/2022] Open
Abstract
BACKGROUND Copy number variation is crucial in deciphering the mechanism and cure of complex disorders and cancers. Recent advances in scDNA sequencing technology make it possible to address intratumor heterogeneity, detect rare subclones, and reconstruct tumor evolution lineages at single-cell resolution. Nevertheless, the current circular binary segmentation based approach fails to efficiently and effectively identify copy number shifts on some exceptional trails.
RESULTS Here, we propose SCYN, a CNV segmentation method powered by dynamic programming. SCYN achieves precise segmentation on an in silico dataset. We then verified that SCYN infers copy numbers accurately on triple negative breast cancer scDNA data, with array comparative genomic hybridization results of purified bulk samples as ground truth validation. We tested SCYN on two datasets of the newly emerged 10x Genomics CNV solution. SCYN successfully recognizes gastric cancer cells from 1% and 10% spike-in 10x datasets. Moreover, SCYN is about 150 times faster than a state-of-the-art tool when dealing with datasets of approximately 2000 cells.
CONCLUSIONS SCYN robustly and efficiently detects segmentations and infers copy number profiles on single-cell DNA sequencing data. It serves to reveal intra-tumor heterogeneity. The source code of SCYN can be accessed at https://github.com/xikanfeng2/SCYN.
Affiliation(s)
- Xikang Feng
  - School of Software, Northwestern Polytechnical University, Xi’an, Shaanxi, 710072, China
  - Department of Computer Science, City University of Hong Kong, Tat Chee Avenue, Kowloon, Hong Kong, China
- Lingxi Chen
  - Department of Computer Science, City University of Hong Kong, Tat Chee Avenue, Kowloon, Hong Kong, China
- Yuhao Qing
  - Department of Computer Science, City University of Hong Kong, Tat Chee Avenue, Kowloon, Hong Kong, China
- Ruikang Li
  - Department of Computer Science, City University of Hong Kong, Tat Chee Avenue, Kowloon, Hong Kong, China
- Chaohui Li
  - Department of Computer Science, City University of Hong Kong, Tat Chee Avenue, Kowloon, Hong Kong, China
- Shuai Cheng Li
  - Department of Computer Science, City University of Hong Kong, Tat Chee Avenue, Kowloon, Hong Kong, China
  - Department of Biomedical Engineering, City University of Hong Kong, Tat Chee Avenue, Kowloon, Hong Kong, China
8
Giguere C, Dubey HV, Sarsani VK, Saddiki H, He S, Flaherty P. SCSIM: Jointly simulating correlated single-cell and bulk next-generation DNA sequencing data. BMC Bioinformatics 2020; 21:215. [PMID: 32456609 PMCID: PMC7249349 DOI: 10.1186/s12859-020-03550-1] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Received: 02/03/2020] [Accepted: 05/18/2020] [Indexed: 11/21/2022] Open
Abstract
Background Recently, it has become possible to collect next-generation DNA sequencing data sets that are composed of multiple samples from multiple biological units, where each of these samples may be from a single cell or bulk tissue. However, no tool yet exists for simulating DNA sequencing data from such a nested sampling arrangement with single-cell and bulk samples so that developers of analysis methods can assess accuracy and precision.
Results We have developed a tool that simulates DNA sequencing data from hierarchically grouped (correlated) samples where each sample is designated bulk or single-cell. Our tool uses a simple configuration file to define the experimental arrangement and can be integrated into software pipelines for testing variant callers or other genomic tools.
Conclusions The DNA sequencing data generated by our simulator is representative of real data and integrates seamlessly with standard downstream analysis tools.
Affiliation(s)
- Collin Giguere
  - Department of Mathematics & Statistics, University of Massachusetts Amherst, 710 N. Pleasant St., Amherst, 01003, USA
- Harsh Vardhan Dubey
  - Department of Mathematics & Statistics, University of Massachusetts Amherst, 710 N. Pleasant St., Amherst, 01003, USA
- Vishal Kumar Sarsani
  - Department of Mathematics & Statistics, University of Massachusetts Amherst, 710 N. Pleasant St., Amherst, 01003, USA
- Hachem Saddiki
  - School of Public Health, University of Massachusetts Amherst, Amherst, 01003, USA
- Shai He
  - Department of Mathematics & Statistics, University of Massachusetts Amherst, 710 N. Pleasant St., Amherst, 01003, USA
- Patrick Flaherty
  - Department of Mathematics & Statistics, University of Massachusetts Amherst, 710 N. Pleasant St., Amherst, 01003, USA