1
Ogundijo OE, Zhu K, Wang X, Anastassiou D. Characterizing Intra-Tumor Heterogeneity From Somatic Mutations Without Copy-Neutral Assumption. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2021; 18:2271-2280. [PMID: 32070995 DOI: 10.1109/tcbb.2020.2973635] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/10/2023]
Abstract
Bulk samples from the same patient are heterogeneous, comprising different subpopulations (subclones) of cancer cells. Cells in a tumor subclone are characterized by a unique mutational genotype profile. Resolving tumor heterogeneity by estimating the genotypes, the cellular proportions, and the number of subclones present in the tumor can help in understanding cancer progression and treatment. We present a novel method, ChaClone2, to efficiently deconvolve the observed variant allele fractions (VAFs), with consideration for possible effects from copy number aberrations at the mutation loci. Our method describes a state-space formulation of the feature allocation model, deconvolving the observed VAFs from samples of the same patient into three matrices: subclonal total and variant copy numbers for mutated genes, and proportions of subclones in each sample. We describe an efficient sequential Monte Carlo (SMC) algorithm to estimate these matrices. Extensive simulation shows that ChaClone2 yields better accuracy than other state-of-the-art methods addressing similar problems and that it scales to large datasets. ChaClone2 can also refine its model parameter estimates whenever new mutation data from freshly sequenced genomic locations become available. MATLAB code and datasets are available to download at: https://github.com/moyanre/method2.
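To make the three-matrix decomposition concrete, the following is a minimal Python sketch of one plausible forward model, not the authors' MATLAB implementation. It assumes that the expected VAF of a mutation in a sample is the proportion-weighted variant copy number divided by the proportion-weighted total copy number; that formula, the function name `expected_vaf`, and the toy matrices are all illustrative assumptions.

```python
def expected_vaf(variant_cn, total_cn, proportions):
    """Return a mutations-by-samples table of expected VAFs.

    variant_cn[j][c]  : variant copies of mutation j in subclone c (assumed input)
    total_cn[j][c]    : total copies at the locus of mutation j in subclone c
    proportions[t][c] : fraction of sample t made up of subclone c (rows sum to 1)
    """
    vafs = []
    for v_row, t_row in zip(variant_cn, total_cn):
        row = []
        for w in proportions:
            # Assumed model: weighted variant copies over weighted total copies.
            variant = sum(wc * vc for wc, vc in zip(w, v_row))
            total = sum(wc * tc for wc, tc in zip(w, t_row))
            row.append(variant / total)
        vafs.append(row)
    return vafs


# Toy data: column 0 stands for a normal (unmutated, diploid) population.
variant_cn = [[0, 1, 1], [0, 0, 2]]
total_cn = [[2, 2, 2], [2, 2, 4]]
proportions = [[0.5, 0.25, 0.25], [0.2, 0.4, 0.4]]
vafs = expected_vaf(variant_cn, total_cn, proportions)
```

The second mutation shows why copy number matters: its locus is amplified to four copies in the last subclone, so the denominator grows and the expected VAF differs from what a copy-neutral model would give.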
2
Ogundijo OE, Wang X. SeqClone: sequential Monte Carlo based inference of tumor subclones. BMC Bioinformatics 2019; 20:6. [PMID: 30611189 PMCID: PMC6320595 DOI: 10.1186/s12859-018-2562-y] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.5] [Received: 07/23/2018] [Accepted: 12/06/2018] [Indexed: 11/13/2022] Open
Abstract
Background: Tumor samples are heterogeneous. They consist of varying cell populations, or subclones, and each subclone is characterized by a distinct single nucleotide variant (SNV) profile. This explains the source of the genetic heterogeneity observed in tumor sequencing data. To make a precise prognosis and design effective therapy for cancer, ascertaining the subclonal composition of a tumor is of great importance. Results: In this paper, we propose a state-space formulation of the feature allocation model. This model is interpreted as the blind deconvolution of the expected variant allele fractions (VAFs). VAFs are deconvolved into a binary matrix of genotypes and a matrix of genotype proportions in the samples. Specifically, we consider a sequential construction of the genotype matrix, which we model with an Indian buffet process (IBP). We describe an efficient sequential Monte Carlo (SMC) algorithm, SeqClone, that jointly estimates the genotypes of subclones and their proportions in the samples. When compared to other methods for resolving tumor heterogeneity, SeqClone provides comparable and sometimes better estimates of model parameters. By design, SeqClone conveniently handles any number of probed SNVs in the samples. In particular, we can analyze VAFs from newly probed SNVs to improve existing estimates, an attribute not present in existing solutions. Conclusions: We show that the SMC algorithm for deconvolving VAFs from tumor sequencing data is a robust and promising alternative for explaining the observed genetic heterogeneity in tumor samples. Electronic supplementary material: The online version of this article (10.1186/s12859-018-2562-y) contains supplementary material, which is available to authorized users.
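The sequential construction of a binary genotype matrix under an Indian buffet process can be sketched as follows. This is only the textbook IBP generative draw written in Python, not the SeqClone SMC algorithm itself; the function names, the concentration parameter `alpha`, and the reading of rows as SNVs and columns as subclones are illustrative assumptions.

```python
import math
import random


def sample_poisson(rate, rng):
    """Draw from Poisson(rate) with Knuth's multiplication method."""
    threshold = math.exp(-rate)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def sample_ibp(num_snvs, alpha, rng):
    """Sequentially build a binary SNV-by-subclone matrix from an IBP prior."""
    column_counts = []  # how many earlier SNVs each existing subclone carries
    rows = []
    for i in range(1, num_snvs + 1):
        # An existing subclone k carries SNV i with probability m_k / i.
        row = [1 if rng.random() < m / i else 0 for m in column_counts]
        # SNV i also opens Poisson(alpha / i) brand-new subclones.
        new_columns = sample_poisson(alpha / i, rng)
        row.extend([1] * new_columns)
        # zip stops at the old width, so only existing columns are updated here.
        column_counts = [m + z for m, z in zip(column_counts, row)] + [1] * new_columns
        rows.append(row)
    width = len(column_counts)
    # Pad earlier rows with zeros for subclones that appeared later.
    return [r + [0] * (width - len(r)) for r in rows]


genotypes = sample_ibp(num_snvs=6, alpha=2.0, rng=random.Random(0))
```

Because each row depends only on the counts accumulated so far, a newly probed SNV can be appended as one more row without redrawing the earlier ones, which matches the incremental property the abstract highlights.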
Affiliation(s)
- Oyetunji E Ogundijo
  - Department of Electrical Engineering, Columbia University, New York, NY 10027, USA.
- Xiaodong Wang
  - Department of Electrical Engineering, Columbia University, New York, NY 10027, USA.
3
Noor A, Ahmad A, Serpedin E. SparseNCA: Sparse Network Component Analysis for Recovering Transcription Factor Activities with Incomplete Prior Information. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2018; 15:387-395. [PMID: 26529780 DOI: 10.1109/tcbb.2015.2495224] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 06/05/2023]
Abstract
Network component analysis (NCA) is an important method for inferring transcriptional regulatory networks (TRNs) and recovering transcription factor activities (TFAs) using gene expression data and prior information about the connectivity matrix. The algorithms currently available depend crucially on the completeness of this prior information. However, inaccuracies in the measurement process may leave the available knowledge about the connectivity matrix incomplete. Hence, computationally efficient algorithms are needed to overcome the possible incompleteness in the available data. We present a sparse network component analysis algorithm (sparseNCA), which incorporates the effect of incompleteness in the estimation of TRNs by imposing an additional sparsity constraint using the norm, which results in greater estimation accuracy. To improve computational efficiency, an iterative re-weighted method is proposed for the NCA problem, which not only promotes sparsity but is also hundreds of times faster than the norm based solution. The performance of sparseNCA is rigorously compared to that of FastNCA and NINCA using synthetic data as well as real data. It is shown that sparseNCA outperforms the existing state-of-the-art algorithms in terms of both estimation accuracy and consistency, with the added advantage of low computational complexity. The advantage of sparseNCA over its predecessors is particularly pronounced when the prior information about the sparsity of the network is incomplete. Subnetwork analysis is performed on the E. coli data, which confirms the superior consistency of the proposed algorithm.
4
Local network component analysis for quantifying transcription factor activities. Methods 2017; 124:25-35. [PMID: 28710010 DOI: 10.1016/j.ymeth.2017.06.018] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.8] [Received: 03/30/2017] [Revised: 05/02/2017] [Accepted: 06/17/2017] [Indexed: 12/16/2022] Open
Abstract
Transcription factors (TFs) can regulate physiological transitions or determine stable phenotypic diversity. Accurate estimation of TF regulatory signals or functional activities is of great significance for guiding biological experiments and elucidating molecular mechanisms, but it remains challenging. Traditional methods identify TF regulatory signals at the population level, which masks heterogeneous regulation mechanisms in individuals or subgroups and thus results in inaccurate analyses. Here, we propose a novel computational framework, local network component analysis (LNCA), to exploit data heterogeneity and automatically quantify accurate transcription factor activity (TFA) in practical terms by integrating the partitioned expression sets (i.e., local information) and prior TF-gene regulatory knowledge. Specifically, LNCA adopts an adaptive optimization strategy, which evaluates the local similarities of regulation controls and corrects biases during data integration, to construct the TFA landscape. We first numerically demonstrate the effectiveness of LNCA on simulated data sets, compared with traditional methods such as FastNCA, ROBNCA and NINCA. Then, we apply our model to two real data sets with implicit temporal or spatial regulation variations. The results show that LNCA not only recognizes the periodic mode along the S. cerevisiae cell cycle process, but also substantially outperforms other methods in terms of accuracy and consistency. In addition, the cross-validation study for glioblastoma multiforme (GBM) indicates that the TFAs identified by LNCA can better distinguish clinically distinct tumor groups than the expression values of the corresponding TFs, thus opening a new way to classify tumor subtypes and providing novel insight into cancer heterogeneity.
AVAILABILITY: LNCA was implemented as a Matlab package, which is available at http://sysbio.sibcb.ac.cn/cb/chenlab/software.htm/LNCApackage_0.1.rar.
5
Alcántara-Silva R, Alvarado-Hermida M, Díaz-Contreras G, Sánchez-Barrios M, Carrera S, Galván SC. PISMA: A Visual Representation of Motif Distribution in DNA Sequences. Bioinform Biol Insights 2017; 11:1177932217700907. [PMID: 28469418 PMCID: PMC5390925 DOI: 10.1177/1177932217700907] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.4] [Received: 11/01/2016] [Accepted: 02/19/2017] [Indexed: 11/17/2022] Open
Abstract
Background: Because the graphical presentation and analysis of motif distribution can provide insights for experimental hypotheses, PISMA aims at identifying motifs in DNA sequences, counting them, and showing them graphically. The motif length ranges from 2 to 10 bases, and the DNA sequences range up to 10 kb. The motif distribution is shown as a bar-code-like scheme, a gene-map-like scheme, and a transcript scheme. Results: We obtained graphical schemes of the CpG site distribution from 91 human papillomavirus genomes. We also present 2 analyses: one of DNA motifs associated with either methylation-resistant or methylation-sensitive CpG islands and another of motifs associated with exosome RNA secretion. Availability and Implementation: PISMA is developed in Java; it runs on any type of hardware and on diverse operating systems. PISMA is freely available to noncommercial users. The English version and the User Manual are provided in Supplementary Files 1 and 2, and a Spanish version is available at www.biomedicas.unam.mx/wp-content/software/pisma.zip and www.biomedicas.unam.mx/wp-content/pdf/manual/pisma.pdf.
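As a rough illustration of motif counting and a bar-code-like display, here is a minimal Python sketch. It is not PISMA (which is a Java application); the function names, the text rendering, and the toy sequence are assumptions made for the example, and only the 2 to 10 base motif length limit comes from the abstract.

```python
def motif_positions(sequence, motif):
    """Return 0-based start positions of every (possibly overlapping) motif hit."""
    sequence, motif = sequence.upper(), motif.upper()
    if not 2 <= len(motif) <= 10:
        raise ValueError("motif length must be between 2 and 10 bases")
    positions = []
    start = sequence.find(motif)
    while start != -1:
        positions.append(start)
        # Advance by one base so overlapping occurrences are also found.
        start = sequence.find(motif, start + 1)
    return positions


def barcode(sequence, motif):
    """Mark each motif start with '|' and every other base with '.'."""
    hits = set(motif_positions(sequence, motif))
    return "".join("|" if i in hits else "." for i in range(len(sequence)))


seq = "ACGTCGCGAACG"
print(motif_positions(seq, "CG"))  # [1, 4, 6, 10]
print(barcode(seq, "CG"))          # .|..|.|...|.
```

Counting CpG sites, as in the papillomavirus analysis, then reduces to `len(motif_positions(genome, "CG"))` for each genome string.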
Affiliation(s)
- Rogelio Alcántara-Silva
  - División de Ingeniería Eléctrica, Facultad de Ingeniería, Universidad Nacional Autónoma de México (UNAM), México City, México
- Moisés Alvarado-Hermida
  - División de Ingeniería Eléctrica, Facultad de Ingeniería, Universidad Nacional Autónoma de México (UNAM), México City, México
- Gibrán Díaz-Contreras
  - División de Ingeniería Eléctrica, Facultad de Ingeniería, Universidad Nacional Autónoma de México (UNAM), México City, México
- Martha Sánchez-Barrios
  - Unidad de Posgrado, Facultad de Química, Universidad Nacional Autónoma de México (UNAM), México City, México
- Samantha Carrera
  - Faculty of Biology, Medicine and Health, The University of Manchester, UK
- Silvia Carolina Galván
  - Instituto de Investigaciones Biomédicas, Universidad Nacional Autónoma de México (UNAM), México City, México
6
Elmas A, Wang X, Samoilov MS. Reconstruction of novel transcription factor regulons through inference of their binding sites. BMC Bioinformatics 2015; 16:299. [PMID: 26388177 PMCID: PMC4576408 DOI: 10.1186/s12859-015-0685-y] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Received: 02/01/2015] [Accepted: 07/24/2015] [Indexed: 02/04/2023] Open
Abstract
Background: In most sequenced organisms the number of known regulatory genes (e.g., transcription factors (TFs)) vastly exceeds the number of experimentally verified regulons that could be associated with them. At present, identification of TF regulons is mostly done through comparative genomics approaches. Such methods could miss organism-specific regulatory interactions and often require expensive and time-consuming experimental techniques to generate the underlying data. Results: In this work, we present an efficient algorithm that aims to identify a given transcription factor’s regulon through inference of its unknown binding sites, based on the discovery of its binding motif. The proposed approach relies on computational methods that utilize gene expression data sets and knockout fitness data sets, which are available or may be straightforwardly obtained for many organisms. We computationally constructed the profiles of putative regulons for the TFs LexA, PurR and Fur in E. coli K12 and identified their binding motifs. Comparisons with an experimentally verified database showed high recovery rates of the known regulon members, and indicated good predictions for the newly found genes with high biological significance. The proposed approach is also applicable to novel organisms for predicting unknown regulons of the transcriptional regulators. Results for the hypothetical protein Dde0289 in D. alaskensis include the discovery of a Fis-type TF binding motif. Conclusions: The proposed motif-based regulon inference approach can discover organism-specific regulatory interactions on a single genome, which may be missed by current comparative genomics techniques due to their limitations. Electronic supplementary material: The online version of this article (doi:10.1186/s12859-015-0685-y) contains supplementary material, which is available to authorized users.
Affiliation(s)
- Abdulkadir Elmas
  - Department of Electrical Engineering, Columbia University, 500W 120th Street, New York, 10027, NY, USA.
- Xiaodong Wang
  - Department of Electrical Engineering, Columbia University, 500W 120th Street, New York, 10027, NY, USA.
- Michael S Samoilov
  - Department of Bioengineering, QB3 California Institute for Quantitative Biosciences UC Berkeley, 1700 4th St #214, Berkeley, 94720, California, USA.
7
Noor A, Ahmad A, Serpedin E, Nounou M, Nounou H. ROBNCA: robust network component analysis for recovering transcription factor activities. Bioinformatics 2013; 29:2410-8. [PMID: 23940252 DOI: 10.1093/bioinformatics/btt433] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.3] [Indexed: 11/13/2022]
Abstract
MOTIVATION: Network component analysis (NCA) is an efficient method of reconstructing the transcription factor activity (TFA), which makes use of the gene expression data and prior information available about transcription factor (TF)-gene regulations. Most contemporary algorithms either suffer from inconsistency and poor reliability or incur prohibitive computational complexity. In addition, the existing algorithms cannot counteract the presence of outliers in the microarray data. Hence, robust and computationally efficient algorithms are needed to enable practical applications. RESULTS: We propose ROBust Network Component Analysis (ROBNCA), a novel iterative algorithm that explicitly models the possible outliers in the microarray data. An attractive feature of the ROBNCA algorithm is the derivation of a closed form solution for estimating the connectivity matrix, which was not available in prior contributions. The ROBNCA algorithm is compared with FastNCA and the non-iterative NCA (NI-NCA). On synthetic data, ROBNCA estimates the TF activity profiles as well as the TF-gene control strength matrix with a much higher degree of accuracy than FastNCA and NI-NCA, irrespective of varying noise, correlation and/or amount of outliers. The ROBNCA algorithm is also tested on Saccharomyces cerevisiae data and Escherichia coli data, and it is observed to outperform the existing algorithms. The run time of the ROBNCA algorithm is comparable with that of FastNCA, and ROBNCA is hundreds of times faster than NI-NCA. AVAILABILITY: The ROBNCA software is available at http://people.tamu.edu/~amina/ROBNCA
Affiliation(s)
- Amina Noor
- Department of Electrical and Computer Engineering, Texas A&M University, College Station, TX 77843, USA, Corporate Research and Development, Qualcomm Technologies Inc., San Diego, CA 92121, USA, Department of Chemical Engineering and Department of Electrical Engineering, Texas A&M University at Qatar, Doha Qatar
8
Yu Q, Huo H, Zhang Y, Guo H, Guo H. PairMotif+: a fast and effective algorithm for de novo motif discovery in DNA sequences. Int J Biol Sci 2013; 9:412-24. [PMID: 23678291 PMCID: PMC3654438 DOI: 10.7150/ijbs.5786] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.6] [Received: 01/01/2013] [Accepted: 04/15/2013] [Indexed: 11/25/2022] Open
Abstract
The planted (l, d) motif search is one of the most widely studied problems in bioinformatics and plays an important role in the identification of transcription factor binding sites in DNA sequences. However, identifying highly degenerate motifs is still challenging, since current algorithms either output the exact results at a high computational cost or finish quickly but very often fall into a local optimum. To strike a better trade-off between accuracy and efficiency, we propose a new pattern-driven algorithm, named PairMotif+. First, some pairs of l-mers are extracted from the input sequences according to probabilistic analysis and statistical methods so that one or more pairs of motif instances are included in them. Then an approximate strategy for refining pairs of l-mers with high accuracy is adopted in order to avoid the verification of most candidate motifs. Experimental results on simulated data show that PairMotif+ can solve various (l, d) problems within an hour on a PC with a 2.67 GHz processor, and has better identification accuracy than the compared algorithms MEME, AlignACE and VINE. The validity of the proposed algorithm is also tested on multiple real data sets.
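For readers unfamiliar with the planted (l, d) problem, the verification step that pattern-driven methods try to avoid running on most candidates can be sketched in Python as below. This is a brute-force check, not PairMotif+ itself; it assumes the usual definition that a candidate l-mer is an (l, d) motif when every input sequence contains at least one l-mer within Hamming distance d of it. All function names and the toy sequences are illustrative.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def occurs_within(candidate, sequence, d):
    """True if some l-mer of `sequence` is within d mismatches of `candidate`."""
    l = len(candidate)
    return any(
        hamming(candidate, sequence[i:i + l]) <= d
        for i in range(len(sequence) - l + 1)
    )


def is_planted_motif(candidate, sequences, d):
    """Brute-force (l, d) verification: the candidate must occur in every sequence."""
    return all(occurs_within(candidate, s, d) for s in sequences)


sequences = ["TTACGTAA", "GGACCTCC", "ACGAGGGG"]
print(is_planted_motif("ACGT", sequences, d=1))  # True: ACGT, ACCT and ACGA are all within 1 mismatch
print(is_planted_motif("ACGT", sequences, d=0))  # False: only the first sequence has an exact copy
```

Each call scans every window of every sequence, so verifying all 4^l possible candidates this way grows quickly with l, which is the cost that motivates pruning candidates before verification.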
Affiliation(s)
- Qiang Yu
- School of Computer Science and Technology, Xidian University, Xi'an 710071, China