1
|
Sena F, Ingervo E, Khan S, Prjibelski A, Schmidt S, Tomescu A. Flowtigs: Safety in flow decompositions for assembly graphs. iScience 2024; 27:111208. [PMID: 39759024 PMCID: PMC11700653 DOI: 10.1016/j.isci.2024.111208] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/22/2024] [Revised: 09/30/2024] [Accepted: 10/15/2024] [Indexed: 01/07/2025] Open
Abstract
A decomposition of a network flow is a set of weighted walks whose superposition equals the flow. In this article, we give a simple and linear-time-verifiable complete characterization (flowtigs) of walks that are safe in such general flow decompositions, i.e., that are subwalks of any possible flow decomposition. We provide an O(mn)-time algorithm that identifies all maximal flowtigs and represents them inside a compact structure. On the practical side, we study flowtigs in the use-case of metagenomic assembly. By using the species abundances as flow values of the metagenomic assembly graph, we can model the possible assembly solutions as flow decompositions into weighted closed walks. On simulated data, compared to reporting unitigs or maximal safe walks based only on the graph structure, reporting flowtigs results in a notably more contiguous assembly. On real data, we frame flowtigs as a heuristic and provide an algorithm that is guided by this heuristic.
Collapse
Affiliation(s)
| | | | - Shahbaz Khan
- Indian Institute of Technology Roorkee, Roorkee, India
| | | | | | | |
Collapse
|
2
|
Chitpin JG, Perkins TJ. A Markov constraint to uniquely identify elementary flux mode weights in unimolecular metabolic networks. J Theor Biol 2023; 575:111632. [PMID: 37804942 DOI: 10.1016/j.jtbi.2023.111632] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/29/2023] [Revised: 09/21/2023] [Accepted: 10/01/2023] [Indexed: 10/09/2023]
Abstract
Elementary flux modes (EFMs) are minimal, steady state pathways characterizing a flux network. Fundamentally, all steady state fluxes in a network are decomposable into a linear combination of EFMs. While there is typically no unique set of EFM weights that reconstructs these fluxes, several optimization-based methods have been proposed to constrain the solution space by enforcing some notion of parsimony. However, it has long been recognized that optimization-based approaches may fail to uniquely identify EFM weights and return different feasible solutions across objective functions and solvers. Here we show that, for flux networks only involving single molecule transformations, these problems can be avoided by imposing a Markovian constraint on EFM weights. Our Markovian constraint guarantees a unique solution to the flux decomposition problem, and that solution is arguably more biophysically plausible than other solutions. We describe an algorithm for computing Markovian EFM weights via steady state analysis of a certain discrete-time Markov chain, based on the flux network, which we call the cycle-history Markov chain. We demonstrate our method with a differential analysis of EFM activity in a lipid metabolic network comparing healthy and Alzheimer's disease patients. Our method is the first to uniquely decompose steady state fluxes into EFM weights for any unimolecular metabolic network.
Collapse
Affiliation(s)
- Justin G Chitpin
- Ottawa Hospital Research Institute, 501 Smyth Road, Ottawa, K1H 8L6, Ontario, Canada; Ottawa Institute of Systems Biology, Department of Biochemistry, Microbiology and Immunology, University of Ottawa, 451 Smyth Road, Ottawa, K1H 8M5, Ontario, Canada.
| | - Theodore J Perkins
- Ottawa Hospital Research Institute, 501 Smyth Road, Ottawa, K1H 8L6, Ontario, Canada; Ottawa Institute of Systems Biology, Department of Biochemistry, Microbiology and Immunology, University of Ottawa, 451 Smyth Road, Ottawa, K1H 8M5, Ontario, Canada.
| |
Collapse
|
3
|
Dias FHC, Cáceres M, Williams L, Mumey B, Tomescu AI. A safety framework for flow decomposition problems via integer linear programming. Bioinformatics 2023; 39:btad640. [PMID: 37862229 PMCID: PMC10628435 DOI: 10.1093/bioinformatics/btad640] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/15/2022] [Revised: 09/05/2023] [Accepted: 10/19/2023] [Indexed: 10/22/2023] Open
Abstract
MOTIVATION Many important problems in Bioinformatics (e.g. assembly or multiassembly) admit multiple solutions, while the final objective is to report only one. A common approach to deal with this uncertainty is finding "safe" partial solutions (e.g. contigs) which are common to all solutions. Previous research on safety has focused on polynomially time solvable problems, whereas many successful and natural models are NP-hard to solve, leaving a lack of "safety tools" for such problems. We propose the first method for computing all safe solutions for an NP-hard problem, "minimum flow decomposition" (MFD). We obtain our results by developing a "safety test" for paths based on a general integer linear programming (ILP) formulation. Moreover, we provide implementations with practical optimizations aimed to reduce the total ILP time, the most efficient of these being based on a recursive group-testing procedure. RESULTS Experimental results on transcriptome datasets show that all safe paths for MFDs correctly recover up to 90% of the full RNA transcripts, which is at least 25% more than previously known safe paths. Moreover, despite the NP-hardness of the problem, we can report all safe paths for 99.8% of the over 27 000 non-trivial graphs of this dataset in only 1.5 h. Our results suggest that, on perfect data, there is less ambiguity than thought in the notoriously hard RNA assembly problem. AVAILABILITY AND IMPLEMENTATION https://github.com/algbio/mfd-safety.
Collapse
Affiliation(s)
- Fernando H C Dias
- Department of Computer Science, University of Helsinki, Helsinki 00560, Finland
| | - Manuel Cáceres
- Department of Computer Science, University of Helsinki, Helsinki 00560, Finland
| | - Lucia Williams
- School of Computing, Montana State University, Bozeman, MT 59717, United States
| | - Brendan Mumey
- School of Computing, Montana State University, Bozeman, MT 59717, United States
| | - Alexandru I Tomescu
- Department of Computer Science, University of Helsinki, Helsinki 00560, Finland
| |
Collapse
|
4
|
Khan S, Kortelainen M, Cáceres M, Williams L, Tomescu AI. Improving RNA Assembly via Safety and Completeness in Flow Decompositions. J Comput Biol 2022; 29:1270-1287. [PMID: 36288562 PMCID: PMC9807076 DOI: 10.1089/cmb.2022.0261] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/12/2023] Open
Abstract
Decomposing a network flow into weighted paths is a problem with numerous applications, ranging from networking, transportation planning, to bioinformatics. In some applications we look for a decomposition that is optimal with respect to some property, such as the number of paths used, robustness to edge deletion, or length of the longest path. However, in many bioinformatic applications, we seek a specific decomposition where the paths correspond to some underlying data that generated the flow. In these cases, no optimization criteria guarantee the identification of the correct decomposition. Therefore, we propose to instead report the safe paths, which are subpaths of at least one path in every flow decomposition. In this work, we give the first local characterization of safe paths for flow decompositions in directed acyclic graphs, leading to a practical algorithm for finding the complete set of safe paths. In addition, we evaluate our algorithm on RNA transcript data sets against a trivial safe algorithm (extended unitigs), the recently proposed safe paths for path covers (TCBB 2021) and the popular heuristic greedy-width. On the one hand, we found that besides maintaining perfect precision, our safe and complete algorithm reports a significantly higher coverage (≈50% more) compared with the other safe algorithms. On the other hand, the greedy-width algorithm although reporting a better coverage, it also reports a significantly lower precision on complex graphs (for genes expressing a large number of transcripts). Overall, our safe and complete algorithm outperforms (by ≈20%) greedy-width on a unified metric (F-score) considering both coverage and precision when the evaluated data set has a significant number of complex graphs. Moreover, it also has a superior time (4-5×) and space performance (1.2-2.2×), resulting in a better and more practical approach for bioinformatic applications of flow decomposition.
Collapse
Affiliation(s)
- Shahbaz Khan
- Department of Computer Science and Engineering, IIT Roorkee, Roorkee, India.,Department of Computer Science, University of Helsinki, Helsinki, Finland.,Address correspondence to: Prof. Shahbaz Khan, Department of Computer Science and Engineering, IIT Roorkee, Haridwar Highway, Roorkee 247667, Uttarakhand, India
| | - Milla Kortelainen
- Department of Computer Science, University of Helsinki, Helsinki, Finland
| | - Manuel Cáceres
- Department of Computer Science, University of Helsinki, Helsinki, Finland
| | - Lucia Williams
- School of Computing, Montana State University, Bozeman, Montana, USA
| | | |
Collapse
|
5
|
Dias FH, Williams L, Mumey B, Tomescu AI. Efficient Minimum Flow Decomposition via Integer Linear Programming. J Comput Biol 2022; 29:1252-1267. [PMID: 36260412 PMCID: PMC9700332 DOI: 10.1089/cmb.2022.0257] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/12/2022] Open
Abstract
Minimum flow decomposition (MFD) is an NP-hard problem asking to decompose a network flow into a minimum set of paths (together with associated weights). Variants of it are powerful models in multiassembly problems in Bioinformatics, such as RNA assembly. Owing to its hardness, practical multiassembly tools either use heuristics or solve simpler, polynomial time-solvable versions of the problem, which may yield solutions that are not minimal or do not perfectly decompose the flow. Here, we provide the first fast and exact solver for MFD on acyclic flow networks, based on Integer Linear Programming (ILP). Key to our approach is an encoding of all the exponentially many solution paths using only a quadratic number of variables. We also extend our ILP formulation to many practical variants, such as incorporating longer or paired-end reads, or minimizing flow errors. On both simulated and real-flow splicing graphs, our approach solves any instance in <13 seconds. We hope that our formulations can lie at the core of future practical RNA assembly tools. Our implementations are freely available on Github.
Collapse
Affiliation(s)
- Fernando H.C. Dias
- Department of Computer Science, University of Helsinki, Helsinki, Finland
| | - Lucia Williams
- School of Computing, Montana State University, Bozeman, Montana, USA
| | - Brendan Mumey
- School of Computing, Montana State University, Bozeman, Montana, USA
| | | |
Collapse
|