1
|
Khayatian E, Valiente G, Zhang L. The k-Robinson-Foulds Dissimilarity Measures for Comparison of Labeled Trees. J Comput Biol 2024; 31:328-344. [PMID: 38271573 PMCID: PMC11057537 DOI: 10.1089/cmb.2023.0312] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/27/2024] Open
Abstract
Understanding the mutational history of tumor cells is a critical endeavor in unraveling the mechanisms that drive the onset and progression of cancer. Modeling tumor cell evolution with labeled trees motivates researchers to develop different measures to compare labeled trees. Although the Robinson-Foulds (RF) distance is widely used for comparing species trees, its applicability to labeled trees reveals certain limitations. This study introduces the k-RF dissimilarity measures, tailored to address the challenges of labeled tree comparison. The RF distance is succinctly expressed as n-RF in the space of labeled trees with n nodes. Like the RF distance, the k-RF is a pseudometric for multiset-labeled trees and becomes a metric in the space of 1-labeled trees. By setting k to a small value, the k-RF dissimilarity can capture analogous local regions in two labeled trees with different size or different labels.
Collapse
Affiliation(s)
- Elahe Khayatian
- Department of Mathematics, National University of Singapore, Singapore, Singapore
| | - Gabriel Valiente
- Department of Computer Science, Technical University of Catalonia, Barcelona, Spain
| | - Louxin Zhang
- Department of Mathematics, National University of Singapore, Singapore, Singapore
| |
Collapse
|
2
|
Qi Y, El-Kebir M. Consensus Tree Under the Ancestor-Descendant Distance is NP-Hard. J Comput Biol 2024; 31:58-70. [PMID: 38010616 DOI: 10.1089/cmb.2023.0262] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/29/2023] Open
Abstract
Due to uncertainty in tumor phylogeny inference from sequencing data, many methods infer multiple, equally plausible phylogenies for the same cancer. To summarize the solution space T of tumor phylogenies, consensus tree methods seek a single best representative tree S under a specified pairwise tree distance function. One such distance function is the ancestor-descendant (AD) distance [Formula: see text] , which equals the size of the symmetric difference of the transitive closures of the edge sets [Formula: see text] and [Formula: see text] . Here, we show that finding a consensus tree S for tumor phylogenies T that minimizes the total AD distance [Formula: see text] is NP-hard.
Collapse
Affiliation(s)
- Yuanyuan Qi
- Department of Computer Science, University of Illinois Urbana-Champaign, Urbana, Illinois, USA
| | - Mohammed El-Kebir
- Department of Computer Science, University of Illinois Urbana-Champaign, Urbana, Illinois, USA
- Cancer Center at Illinois, University of Illinois Urbana-Champaign, Urbana, Illinois, USA
| |
Collapse
|
3
|
Guang Z, Smith-Erb M, Oesper L. A weighted distance-based approach for deriving consensus tumor evolutionary trees. Bioinformatics 2023; 39:i204-i212. [PMID: 37387177 DOI: 10.1093/bioinformatics/btad230] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 07/01/2023] Open
Abstract
MOTIVATION The acquisition of somatic mutations by a tumor can be modeled by a type of evolutionary tree. However, it is impossible to observe this tree directly. Instead, numerous algorithms have been developed to infer such a tree from different types of sequencing data. But such methods can produce conflicting trees for the same patient, making it desirable to have approaches that can combine several such tumor trees into a consensus or summary tree. We introduce The Weighted m-Tumor Tree Consensus Problem (W-m-TTCP) to find a consensus tree among multiple plausible tumor evolutionary histories, each assigned a confidence weight, given a specific distance measure between tumor trees. We present an algorithm called TuELiP that is based on integer linear programming which solves the W-m-TTCP, and unlike other existing consensus methods, allows the input trees to be weighted differently. RESULTS On simulated data we show that TuELiP outperforms two existing methods at correctly identifying the true underlying tree used to create the simulations. We also show that the incorporation of weights can lead to more accurate tree inference. On a Triple-Negative Breast Cancer dataset, we show that including confidence weights can have important impacts on the consensus tree identified. AVAILABILITY An implementation of TuELiP and simulated datasets are available at https://bitbucket.org/oesperlab/consensus-ilp/src/main/.
Collapse
Affiliation(s)
- Ziyun Guang
- Department of Computer Science, Carleton College, Northfield, MN 55057, USA
| | - Matthew Smith-Erb
- Department of Computer Science, Carleton College, Northfield, MN 55057, USA
| | - Layla Oesper
- Department of Computer Science, Carleton College, Northfield, MN 55057, USA
| |
Collapse
|
4
|
Baghaarabani L, Goliaei S, Foroughmand-Araabi MH, Shariatpanahi SP, Goliaei B. Conifer: clonal tree inference for tumor heterogeneity with single-cell and bulk sequencing data. BMC Bioinformatics 2021; 22:416. [PMID: 34461827 PMCID: PMC8404257 DOI: 10.1186/s12859-021-04338-7] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/21/2021] [Accepted: 08/16/2021] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Genetic heterogeneity of a cancer tumor that develops during clonal evolution is one of the reasons for cancer treatment failure, by increasing the chance of drug resistance. Clones are cell populations with different genotypes, resulting from differences in somatic mutations that occur and accumulate during cancer development. An appropriate approach for identifying clones is determining the variant allele frequency of mutations that occurred in the tumor. Although bulk sequencing data can be used to provide that information, the frequencies are not informative enough for identifying different clones with the same prevalence and their evolutionary relationships. On the other hand, single-cell sequencing data provides valuable information about branching events in the evolution of a cancerous tumor. However, the temporal order of mutations may be determined with ambiguities using only single-cell data, while variant allele frequencies from bulk sequencing data can provide beneficial information for inferring the temporal order of mutations with fewer ambiguities. RESULT In this study, a new method called Conifer (ClONal tree Inference For hEterogeneity of tumoR) is proposed which combines aggregated variant allele frequency from bulk sequencing data with branching event information from single-cell sequencing data to more accurately identify clones and their evolutionary relationships. It is proven that the accuracy of clone identification and clonal tree inference is increased by using Conifer compared to other existing methods on various sets of simulated data. In addition, it is discussed that the evolutionary tree provided by Conifer on real cancer data sets is highly consistent with information in both bulk and single-cell data. CONCLUSIONS In this study, we have provided an accurate and robust method to identify clones of tumor heterogeneity and their evolutionary history by combining single-cell and bulk sequencing data.
Collapse
Affiliation(s)
- Leila Baghaarabani
- Institute of Biochemistry and Biophysics, University of Tehran, Tehran, Iran
| | - Sama Goliaei
- Faculty of New Sciences and Technologies, University of Tehran, Tehran, Iran
| | | | | | - Bahram Goliaei
- Institute of Biochemistry and Biophysics, University of Tehran, Tehran, Iran.
| |
Collapse
|
5
|
Jahn K, Beerenwinkel N, Zhang L. The Bourque distances for mutation trees of cancers. Algorithms Mol Biol 2021; 16:9. [PMID: 34112201 PMCID: PMC8193869 DOI: 10.1186/s13015-021-00188-3] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/01/2021] [Accepted: 06/02/2021] [Indexed: 12/02/2022] Open
Abstract
Background Mutation trees are rooted trees in which nodes are of arbitrary degree and labeled with a mutation set. These trees, also referred to as clonal trees, are used in computational oncology to represent the mutational history of tumours. Classical tree metrics such as the popular Robinson–Foulds distance are of limited use for the comparison of mutation trees. One reason is that mutation trees inferred with different methods or for different patients often contain different sets of mutation labels. Results We generalize the Robinson–Foulds distance into a set of distance metrics called Bourque distances for comparing mutation trees. We show the basic version of the Bourque distance for mutation trees can be computed in linear time. We also make a connection between the Robinson–Foulds distance and the nearest neighbor interchange distance. Supplementary Information The online version contains supplementary material available at 10.1186/s13015-021-00188-3.
Collapse
|
6
|
Ciccolella S, Bernardini G, Denti L, Bonizzoni P, Previtali M, Della Vedova G. Triplet-based similarity score for fully multilabeled trees with poly-occurring labels. Bioinformatics 2021; 37:178-184. [PMID: 32730595 PMCID: PMC8055217 DOI: 10.1093/bioinformatics/btaa676] [Citation(s) in RCA: 7] [Impact Index Per Article: 2.3] [Reference Citation Analysis] [Abstract] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/29/2020] [Revised: 06/29/2020] [Accepted: 07/22/2020] [Indexed: 01/06/2023] Open
Abstract
MOTIVATION The latest advances in cancer sequencing, and the availability of a wide range of methods to infer the evolutionary history of tumors, have made it important to evaluate, reconcile and cluster different tumor phylogenies. Recently, several notions of distance or similarities have been proposed in the literature, but none of them has emerged as the golden standard. Moreover, none of the known similarity measures is able to manage mutations occurring multiple times in the tree, a circumstance often occurring in real cases. RESULTS To overcome these limitations, in this article, we propose MP3, the first similarity measure for tumor phylogenies able to effectively manage cases where multiple mutations can occur at the same time and mutations can occur multiple times. Moreover, a comparison of MP3 with other measures shows that it is able to classify correctly similar and dissimilar trees, both on simulated and on real data. AVAILABILITY AND IMPLEMENTATION An open source implementation of MP3 is publicly available at https://github.com/AlgoLab/mp3treesim. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
- Simone Ciccolella
- Department of Informatics, Systems and Communication, University of Milano-Biccoca, Milan 20126, Italy
| | - Giulia Bernardini
- Department of Informatics, Systems and Communication, University of Milano-Biccoca, Milan 20126, Italy
| | - Luca Denti
- Department of Informatics, Systems and Communication, University of Milano-Biccoca, Milan 20126, Italy
| | - Paola Bonizzoni
- Department of Informatics, Systems and Communication, University of Milano-Biccoca, Milan 20126, Italy
| | - Marco Previtali
- Department of Informatics, Systems and Communication, University of Milano-Biccoca, Milan 20126, Italy
| | - Gianluca Della Vedova
- Department of Informatics, Systems and Communication, University of Milano-Biccoca, Milan 20126, Italy
| |
Collapse
|
7
|
Sadeqi Azer E, Rashidi Mehrabadi F, Malikić S, Li XC, Bartok O, Litchfield K, Levy R, Samuels Y, Schäffer AA, Gertz EM, Day CP, Pérez-Guijarro E, Marie K, Lee MP, Merlino G, Ergun F, Sahinalp SC. PhISCS-BnB: a fast branch and bound algorithm for the perfect tumor phylogeny reconstruction problem. Bioinformatics 2021; 36:i169-i176. [PMID: 32657358 DOI: 10.1093/bioinformatics/btaa464] [Citation(s) in RCA: 11] [Impact Index Per Article: 3.7] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/27/2022] Open
Abstract
MOTIVATION Recent advances in single-cell sequencing (SCS) offer an unprecedented insight into tumor emergence and evolution. Principled approaches to tumor phylogeny reconstruction via SCS data are typically based on general computational methods for solving an integer linear program, or a constraint satisfaction program, which, although guaranteeing convergence to the most likely solution, are very slow. Others based on Monte Carlo Markov Chain or alternative heuristics not only offer no such guarantee, but also are not faster in practice. As a result, novel methods that can scale up to handle the size and noise characteristics of emerging SCS data are highly desirable to fully utilize this technology. RESULTS We introduce PhISCS-BnB (phylogeny inference using SCS via branch and bound), a branch and bound algorithm to compute the most likely perfect phylogeny on an input genotype matrix extracted from an SCS dataset. PhISCS-BnB not only offers an optimality guarantee, but is also 10-100 times faster than the best available methods on simulated tumor SCS data. We also applied PhISCS-BnB on a recently published large melanoma dataset derived from the sublineages of a cell line involving 20 clones with 2367 mutations, which returned the optimal tumor phylogeny in <4 h. The resulting phylogeny agrees with and extends the published results by providing a more detailed picture on the clonal evolution of the tumor. AVAILABILITY AND IMPLEMENTATION https://github.com/algo-cancer/PhISCS-BnB. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
- Erfan Sadeqi Azer
- Department of Computer Science, Indiana University, Bloomington, IN 47408, USA
| | - Farid Rashidi Mehrabadi
- Department of Computer Science, Indiana University, Bloomington, IN 47408, USA.,Cancer Data Science Laboratory, Center for Cancer Research, National Cancer Institute, National Institutes of Health, Bethesda, MD 20892, USA
| | - Salem Malikić
- Department of Computer Science, Indiana University, Bloomington, IN 47408, USA
| | - Xuan Cindy Li
- Cancer Data Science Laboratory, Center for Cancer Research, National Cancer Institute, National Institutes of Health, Bethesda, MD 20892, USA.,Program in Computational Biology, Bioinformatics and Genomics, University of Maryland, College Park, MD 20742, USA
| | - Osnat Bartok
- Department of Molecular Cell Biology, Weizmann Institute of Science, Rehovot, Israel
| | - Kevin Litchfield
- Cancer Evolution and Genome Instability Laboratory, Francis Crick Institute, London NW1 1AT, UK.,Cancer Research UK Lung Cancer Centre of Excellence London, University College London Cancer Institute, London WC1E 6DD, UK
| | - Ronen Levy
- Department of Molecular Cell Biology, Weizmann Institute of Science, Rehovot, Israel
| | - Yardena Samuels
- Department of Molecular Cell Biology, Weizmann Institute of Science, Rehovot, Israel
| | - Alejandro A Schäffer
- Cancer Data Science Laboratory, Center for Cancer Research, National Cancer Institute, National Institutes of Health, Bethesda, MD 20892, USA
| | - E Michael Gertz
- Cancer Data Science Laboratory, Center for Cancer Research, National Cancer Institute, National Institutes of Health, Bethesda, MD 20892, USA
| | - Chi-Ping Day
- Laboratory of Cancer Biology and Genetics, Center for Cancer Research, National Cancer Institute, National Institutes of Health, Bethesda, MD 20892, USA
| | - Eva Pérez-Guijarro
- Laboratory of Cancer Biology and Genetics, Center for Cancer Research, National Cancer Institute, National Institutes of Health, Bethesda, MD 20892, USA
| | - Kerrie Marie
- Laboratory of Cancer Biology and Genetics, Center for Cancer Research, National Cancer Institute, National Institutes of Health, Bethesda, MD 20892, USA
| | - Maxwell P Lee
- Laboratory of Cancer Biology and Genetics, Center for Cancer Research, National Cancer Institute, National Institutes of Health, Bethesda, MD 20892, USA
| | - Glenn Merlino
- Laboratory of Cancer Biology and Genetics, Center for Cancer Research, National Cancer Institute, National Institutes of Health, Bethesda, MD 20892, USA
| | - Funda Ergun
- Department of Computer Science, Indiana University, Bloomington, IN 47408, USA
| | - S Cenk Sahinalp
- Cancer Data Science Laboratory, Center for Cancer Research, National Cancer Institute, National Institutes of Health, Bethesda, MD 20892, USA
| |
Collapse
|
8
|
Christensen S, Kim J, Chia N, Koyejo O, El-Kebir M. Detecting evolutionary patterns of cancers using consensus trees. Bioinformatics 2021; 36:i684-i691. [PMID: 33381820 DOI: 10.1093/bioinformatics/btaa801] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/14/2022] Open
Abstract
MOTIVATION While each cancer is the result of an isolated evolutionary process, there are repeated patterns in tumorigenesis defined by recurrent driver mutations and their temporal ordering. Such repeated evolutionary trajectories hold the potential to improve stratification of cancer patients into subtypes with distinct survival and therapy response profiles. However, current cancer phylogeny methods infer large solution spaces of plausible evolutionary histories from the same sequencing data, obfuscating repeated evolutionary patterns. RESULTS To simultaneously resolve ambiguities in sequencing data and identify cancer subtypes, we propose to leverage common patterns of evolution found in patient cohorts. We first formulate the Multiple Choice Consensus Tree problem, which seeks to select a tumor tree for each patient and assign patients into clusters in such a way that maximizes consistency within each cluster of patient trees. We prove that this problem is NP-hard and develop a heuristic algorithm, Revealing Evolutionary Consensus Across Patients (RECAP), to solve this problem in practice. Finally, on simulated data, we show RECAP outperforms existing methods that do not account for patient subtypes. We then use RECAP to resolve ambiguities in patient trees and find repeated evolutionary trajectories in lung and breast cancer cohorts. AVAILABILITY AND IMPLEMENTATION https://github.com/elkebir-group/RECAP. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
| | - Juho Kim
- Department of Electrical and Computer Engineering, University of Illinois at Urbana-Champaign, Urbana, IL 61801, USA
| | - Nicholas Chia
- Microbiome Program, Center for Individualized Medicine.,Division of Surgical Research, Department of Surgery, Mayo Clinic, Rochester, MN, 55905, USA
| | | | | |
Collapse
|
9
|
Sadeqi Azer E, Haghir Ebrahimabadi M, Malikić S, Khardon R, Sahinalp SC. Tumor Phylogeny Topology Inference via Deep Learning. iScience 2020; 23:101655. [PMID: 33117968 PMCID: PMC7582044 DOI: 10.1016/j.isci.2020.101655] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/20/2020] [Revised: 08/10/2020] [Accepted: 10/02/2020] [Indexed: 01/24/2023] Open
Abstract
Principled computational approaches for tumor phylogeny reconstruction via single-cell sequencing typically aim to build the most likely perfect phylogeny tree from the noisy genotype matrix - which represents genotype calls of single cells. This problem is NP-hard, and as a result, existing approaches aim to solve relatively small instances of it through combinatorial optimization techniques or Bayesian inference. As expected, even when the goal is to infer basic topological features of the tumor phylogeny, rather than reconstructing the topology entirely, these approaches could be prohibitively slow. In this paper, we introduce fast deep learning solutions to the problems of inferring whether the most likely tree has a linear (chain) or branching topology and whether a perfect phylogeny is feasible from a given genotype matrix. We also present a reinforcement learning approach for reconstructing the most likely tumor phylogeny. This preliminary work demonstrates that data-driven approaches can reconstruct key features of tumor evolution.
Collapse
Affiliation(s)
- Erfan Sadeqi Azer
- Department of Computer Science, Indiana University, Bloomington, IN 47408, USA
| | - Mohammad Haghir Ebrahimabadi
- Department of Computer Science, Indiana University, Bloomington, IN 47408, USA
- Cancer Data Science Laboratory, Center for Cancer Research, National Cancer Institute, National Institutes of Health, Bethesda, MD 20892, USA
| | - Salem Malikić
- Department of Computer Science, Indiana University, Bloomington, IN 47408, USA
- Cancer Data Science Laboratory, Center for Cancer Research, National Cancer Institute, National Institutes of Health, Bethesda, MD 20892, USA
| | - Roni Khardon
- Department of Computer Science, Indiana University, Bloomington, IN 47408, USA
| | - S. Cenk Sahinalp
- Cancer Data Science Laboratory, Center for Cancer Research, National Cancer Institute, National Institutes of Health, Bethesda, MD 20892, USA
| |
Collapse
|
10
|
Zhang Z, Wang W, Xia R, Pan G, Wang J, Tang J. Achieving large and distant ancestral genome inference by using an improved discrete quantum-behaved particle swarm optimization algorithm. BMC Bioinformatics 2020; 21:516. [PMID: 33176688 PMCID: PMC7656761 DOI: 10.1186/s12859-020-03833-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/05/2020] [Accepted: 10/23/2020] [Indexed: 11/16/2022] Open
Abstract
Background Reconstructing ancestral genomes is one of the central problems presented in genome rearrangement analysis since finding the most likely true ancestor is of significant importance in phylogenetic reconstruction. Large scale genome rearrangements can provide essential insights into evolutionary processes. However, when the genomes are large and distant, classical median solvers have failed to adequately address these challenges due to the exponential increase of the search space. Consequently, solving ancestral genome inference problems constitutes a task of paramount importance that continues to challenge the current methods used in this area, whose difficulty is further increased by the ongoing rapid accumulation of whole-genome data. Results In response to these challenges, we provide two contributions for ancestral genome inference. First, an improved discrete quantum-behaved particle swarm optimization algorithm (IDQPSO) by averaging two of the fitness values is proposed to address the discrete search space. Second, we incorporate DCJ sorting into the IDQPSO (IDQPSO-Median). In comparison with the other methods, when the genomes are large and distant, IDQPSO-Median has the lowest median score, the highest adjacency accuracy, and the closest distance to the true ancestor. In addition, we have integrated our IDQPSO-Median approach with the GRAPPA framework. Our experiments show that this new phylogenetic method is very accurate and effective by using IDQPSO-Median. Conclusions Our experimental results demonstrate the advantages of IDQPSO-Median approach over the other methods when the genomes are large and distant. When our experimental results are evaluated in a comprehensive manner, it is clear that the IDQPSO-Median approach we propose achieves better scalability compared to existing algorithms. Moreover, our experimental results by using simulated and real datasets confirm that the IDQPSO-Median, when integrated with the GRAPPA framework, outperforms other heuristics in terms of accuracy, while also continuing to infer phylogenies that were equivalent or close to the true trees within 5 days of computation, which is far beyond the difficulty level that can be handled by GRAPPA.
Collapse
Affiliation(s)
- Zhaojuan Zhang
- College of Computer Science and Technology, Zhejiang University of Technology, Liuhe Road, Hangzhou, China
| | - Wanliang Wang
- College of Computer Science and Technology, Zhejiang University of Technology, Liuhe Road, Hangzhou, China.
| | - Ruofan Xia
- Department of Computer Science and Engineering, University of South Carolina, Assembly Street, Columbia, USA
| | - Gaofeng Pan
- Department of Computer Science and Engineering, University of South Carolina, Assembly Street, Columbia, USA
| | - Jiandong Wang
- Department of Computer Science and Engineering, University of South Carolina, Assembly Street, Columbia, USA
| | - Jijun Tang
- Department of Computer Science and Engineering, University of South Carolina, Assembly Street, Columbia, USA.,Tianjin Key Laboratory of Cognitive Computing and Application, Tianjin University, Yaguan Road, Tianjin, China
| |
Collapse
|
11
|
Malikic S, Mehrabadi FR, Ciccolella S, Rahman MK, Ricketts C, Haghshenas E, Seidman D, Hach F, Hajirasouliha I, Sahinalp SC. PhISCS: a combinatorial approach for subperfect tumor phylogeny reconstruction via integrative use of single-cell and bulk sequencing data. Genome Res 2019; 29:1860-1877. [PMID: 31628256 PMCID: PMC6836735 DOI: 10.1101/gr.234435.118] [Citation(s) in RCA: 43] [Impact Index Per Article: 8.6] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/15/2018] [Accepted: 09/11/2019] [Indexed: 12/29/2022]
Abstract
Available computational methods for tumor phylogeny inference via single-cell sequencing (SCS) data typically aim to identify the most likely perfect phylogeny tree satisfying the infinite sites assumption (ISA). However, the limitations of SCS technologies including frequent allele dropout and variable sequence coverage may prohibit a perfect phylogeny. In addition, ISA violations are commonly observed in tumor phylogenies due to the loss of heterozygosity, deletions, and convergent evolution. In order to address such limitations, we introduce the optimal subperfect phylogeny problem which asks to integrate SCS data with matching bulk sequencing data by minimizing a linear combination of potential false negatives (due to allele dropout or variance in sequence coverage), false positives (due to read errors) among mutation calls, and the number of mutations that violate ISA (real or because of incorrect copy number estimation). We then describe a combinatorial formulation to solve this problem which ensures that several lineage constraints imposed by the use of variant allele frequencies (VAFs, derived from bulk sequence data) are satisfied. We express our formulation both in the form of an integer linear program (ILP) and—as a first in tumor phylogeny reconstruction—a Boolean constraint satisfaction problem (CSP) and solve them by leveraging state-of-the-art ILP/CSP solvers. The resulting method, which we name PhISCS, is the first to integrate SCS and bulk sequencing data while accounting for ISA violating mutations. In contrast to the alternative methods, typically based on probabilistic approaches, PhISCS provides a guarantee of optimality in reported solutions. Using simulated and real data sets, we demonstrate that PhISCS is more general and accurate than all available approaches.
Collapse
Affiliation(s)
- Salem Malikic
- School of Computing Science, Simon Fraser University, Burnaby, BC V5A 1S6, Canada
| | - Farid Rashidi Mehrabadi
- Department of Computer Science, Indiana University, Bloomington, Indiana 47408, USA.,Cancer Data Science Laboratory, National Cancer Institute, National Institutes of Health, Bethesda, Maryland 20892, USA
| | - Simone Ciccolella
- Department of Computer Systems and Communication, University of Milano-Bicocca, 20136 Milan, Italy.,Institute for Computational Biomedicine, Weill Cornell Medicine, New York, New York 10065, USA
| | - Md Khaledur Rahman
- Department of Computer Science, Indiana University, Bloomington, Indiana 47408, USA
| | - Camir Ricketts
- Institute for Computational Biomedicine, Weill Cornell Medicine, New York, New York 10065, USA.,Tri-I Computational Biology and Medicine Graduate Program, Cornell University, New York, New York 10065, USA
| | - Ehsan Haghshenas
- School of Computing Science, Simon Fraser University, Burnaby, BC V5A 1S6, Canada
| | - Daniel Seidman
- Tri-I Computational Biology and Medicine Graduate Program, Cornell University, New York, New York 10065, USA
| | - Faraz Hach
- School of Computing Science, Simon Fraser University, Burnaby, BC V5A 1S6, Canada.,Department of Urologic Sciences, University of British Columbia, Vancouver, BC V5Z 1M9, Canada.,Vancouver Prostate Centre, Vancouver, BC V6H 3Z6, Canada
| | - Iman Hajirasouliha
- Institute for Computational Biomedicine, Weill Cornell Medicine, New York, New York 10065, USA.,Department of Physiology and Biophysics, Englander Institute for Precision Medicine, The Meyer Cancer Center, Weill Cornell Medicine, New York, New York 10065, USA
| | - S Cenk Sahinalp
- Cancer Data Science Laboratory, National Cancer Institute, National Institutes of Health, Bethesda, Maryland 20892, USA
| |
Collapse
|