1
|
Sumanaweera D, Suo C, Cujba AM, Muraro D, Dann E, Polanski K, Steemers AS, Lee W, Oliver AJ, Park JE, Meyer KB, Dumitrascu B, Teichmann SA. Gene-level alignment of single-cell trajectories. Nat Methods 2025; 22:68-81. [PMID: 39300283 PMCID: PMC11725504 DOI: 10.1038/s41592-024-02378-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/18/2023] [Accepted: 07/12/2024] [Indexed: 09/22/2024]
Abstract
Single-cell data analysis can infer dynamic changes in cell populations, for example across time, space or in response to perturbation, thus deriving pseudotime trajectories. Current approaches comparing trajectories often use dynamic programming but are limited by assumptions such as the existence of a definitive match. Here we describe Genes2Genes, a Bayesian information-theoretic dynamic programming framework for aligning single-cell trajectories. It is able to capture sequential matches and mismatches of individual genes between a reference and query trajectory, highlighting distinct clusters of alignment patterns. Across both real world and simulated datasets, it accurately inferred alignments and demonstrated its utility in disease cell-state trajectory analysis. In a proof-of-concept application, Genes2Genes revealed that T cells differentiated in vitro match an immature in vivo state while lacking expression of genes associated with TNF signaling. This demonstrates that precise trajectory alignment can pinpoint divergence from the in vivo system, thus guiding the optimization of in vitro culture conditions.
Collapse
Affiliation(s)
- Dinithi Sumanaweera
- Wellcome Sanger Institute; Wellcome Genome Campus, Hinxton, Cambridge, UK
- Theory of Condensed Matter, Cavendish Laboratory, Department of Physics, University of Cambridge, Cambridge, UK
| | - Chenqu Suo
- Wellcome Sanger Institute; Wellcome Genome Campus, Hinxton, Cambridge, UK
- Department of Paediatrics, Cambridge University Hospitals; Hills Road, Cambridge, UK
| | - Ana-Maria Cujba
- Wellcome Sanger Institute; Wellcome Genome Campus, Hinxton, Cambridge, UK
| | - Daniele Muraro
- Wellcome Sanger Institute; Wellcome Genome Campus, Hinxton, Cambridge, UK
| | - Emma Dann
- Wellcome Sanger Institute; Wellcome Genome Campus, Hinxton, Cambridge, UK
| | - Krzysztof Polanski
- Wellcome Sanger Institute; Wellcome Genome Campus, Hinxton, Cambridge, UK
| | - Alexander S Steemers
- Wellcome Sanger Institute; Wellcome Genome Campus, Hinxton, Cambridge, UK
- Princess Máxima Center for Pediatric Oncology, Utrecht, Netherlands
| | - Woochan Lee
- Wellcome Sanger Institute; Wellcome Genome Campus, Hinxton, Cambridge, UK
- Department of Biomedical Sciences, Seoul National University, Seoul, Korea
| | - Amanda J Oliver
- Wellcome Sanger Institute; Wellcome Genome Campus, Hinxton, Cambridge, UK
| | - Jong-Eun Park
- Wellcome Sanger Institute; Wellcome Genome Campus, Hinxton, Cambridge, UK
- Graduate School of Medical Science and Engineering, Korea Advanced Institute of Science and Technology (KAIST), Daejeon, Korea
| | - Kerstin B Meyer
- Wellcome Sanger Institute; Wellcome Genome Campus, Hinxton, Cambridge, UK
| | - Bianca Dumitrascu
- Department of Statistics, Columbia University, New York, NY, USA
- Irving Institute for Cancer Dynamics, Columbia University, New York, NY, USA
| | - Sarah A Teichmann
- Wellcome Sanger Institute; Wellcome Genome Campus, Hinxton, Cambridge, UK.
- Cambridge Stem Cell Institute, Jeffrey Cheah Biomedical Centre, Cambridge Biomedical Campus, University of Cambridge, Cambridge, UK.
- Department of Medicine, University of Cambridge, Cambridge, UK.
- Co-director of CIFAR Macmillan Research Program, Toronto, Ontario, Canada.
| |
Collapse
|
2
|
Sumanaweera D, Allison L, Konagurthu AS. Bridging the gaps in statistical models of protein alignment. Bioinformatics 2022; 38:i229-i237. [PMID: 35758809 PMCID: PMC9235498 DOI: 10.1093/bioinformatics/btac246] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/24/2022] Open
Abstract
Summary Sequences of proteins evolve by accumulating substitutions together with insertions and deletions (indels) of amino acids. However, it remains a common practice to disconnect substitutions and indels, and infer approximate models for each of them separately, to quantify sequence relationships. Although this approach brings with it computational convenience (which remains its primary motivation), there is a dearth of attempts to unify and model them systematically and together. To overcome this gap, this article demonstrates how a complete statistical model quantifying the evolution of pairs of aligned proteins can be constructed using a time-parameterized substitution matrix and a time-parameterized alignment state machine. Methods to derive all parameters of such a model from any benchmark collection of aligned protein sequences are described here. This has not only allowed us to generate a unified statistical model for each of the nine widely used substitution matrices (PAM, JTT, BLOSUM, JO, WAG, VTML, LG, MIQS and PFASUM), but also resulted in a new unified model, MMLSUM. Our underlying methodology measures the Shannon information content using each model to explain losslessly any given collection of alignments, which has allowed us to quantify the performance of all the above models on six comprehensive alignment benchmarks. Our results show that MMLSUM results in a new and clear overall best performance, followed by PFASUM, VTML, BLOSUM and MIQS, respectively, amongst the top five. We further analyze the statistical properties of MMLSUM model and contrast it with others. Supplementary information Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
- Dinithi Sumanaweera
- Department of Data Science and Artificial Intelligence, Faculty of Information Technology, Monash University, Clayton, VIC 3800, Australia
| | - Lloyd Allison
- Department of Data Science and Artificial Intelligence, Faculty of Information Technology, Monash University, Clayton, VIC 3800, Australia
| | - Arun S Konagurthu
- Department of Data Science and Artificial Intelligence, Faculty of Information Technology, Monash University, Clayton, VIC 3800, Australia
| |
Collapse
|
3
|
Waman VP, Orengo C, Kleywegt GJ, Lesk AM. Three-dimensional Structure Databases of Biological Macromolecules. Methods Mol Biol 2022; 2449:43-91. [PMID: 35507259 DOI: 10.1007/978-1-0716-2095-3_3] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 06/14/2023]
Abstract
Databases of three-dimensional structures of proteins (and their associated molecules) provide: (a) Curated repositories of coordinates of experimentally determined structures, including extensive metadata; for instance information about provenance, details about data collection and interpretation, and validation of results. (b) Information-retrieval tools to allow searching to identify entries of interest and provide access to them. (c) Links among databases, especially to databases of amino-acid and genetic sequences, and of protein function; and links to software for analysis of amino-acid sequence and protein structure, and for structure prediction. (d) Collections of predicted three-dimensional structures of proteins. These will become more and more important after the breakthrough in structure prediction achieved by AlphaFold2. The single global archive of experimentally determined biomacromolecular structures is the Protein Data Bank (PDB). It is managed by wwPDB, a consortium of five partner institutions: the Protein Data Bank in Europe (PDBe), the Research Collaboratory for Structural Bioinformatics (RCSB), the Protein Data Bank Japan (PDBj), the BioMagResBank (BMRB), and the Electron Microscopy Data Bank (EMDB). In addition to jointly managing the PDB repository, the individual wwPDB partners offer many tools for analysis of protein and nucleic acid structures and their complexes, including providing computer-graphic representations. Their collective and individual websites serve as hubs of the community of structural biologists, offering newsletters, reports from Task Forces, training courses, and "helpdesks," as well as links to external software.Many specialized projects are based on the information contained in the PDB. Especially important are SCOP, CATH, and ECOD, which present classifications of protein domains.
Collapse
Affiliation(s)
- Vaishali P Waman
- Institute of Structural and Molecular Biology, University College London, London, UK
| | - Christine Orengo
- Institute of Structural and Molecular Biology, University College London, London, UK
| | - Gerard J Kleywegt
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Cambridge, UK
| | - Arthur M Lesk
- Department of Biochemistry and Molecular Biology and Center for Computational Biology and Bioinformatics, The Pennsylvania State University, University Park, PA, USA.
| |
Collapse
|
4
|
Rajapaksa S, Sumanaweera D, Lesk AM, Allison L, Stuckey PJ, Garcia de la Banda M, Abramson D, Konagurthu AS. OUP accepted manuscript. Bioinformatics 2022; 38:i255-i263. [PMID: 35758808 PMCID: PMC9235515 DOI: 10.1093/bioinformatics/btac247] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Accepted: 04/09/2022] [Indexed: 11/13/2022] Open
Abstract
MOTIVATION Alignments are correspondences between sequences. How reliable are alignments of amino acid sequences of proteins, and what inferences about protein relationships can be drawn? Using techniques not previously applied to these questions, by weighting every possible sequence alignment by its posterior probability we derive a formal mathematical expectation, and develop an efficient algorithm for computation of the distance between alternative alignments allowing quantitative comparisons of sequence-based alignments with corresponding reference structure alignments. RESULTS By analyzing the sequences and structures of 1 million protein domain pairs, we report the variation of the expected distance between sequence-based and structure-based alignments, as a function of (Markov time of) sequence divergence. Our results clearly demarcate the 'daylight', 'twilight' and 'midnight' zones for interpreting residue-residue correspondences from sequence information alone. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
- Sandun Rajapaksa
- Department of Data Science and Artificial Intelligence, Faculty of Information Technology, Monash University, Clayton, VIC 3800, Australia
| | | | - Arthur M Lesk
- Department of Biochemistry and Molecular Biology and Center for Computational Biology and Bioinformatics, The Pennsylvania State University, University Park, PA 16802, USA
| | - Lloyd Allison
- Department of Data Science and Artificial Intelligence, Faculty of Information Technology, Monash University, Clayton, VIC 3800, Australia
| | - Peter J Stuckey
- Department of Data Science and Artificial Intelligence, Faculty of Information Technology, Monash University, Clayton, VIC 3800, Australia
| | - Maria Garcia de la Banda
- Department of Data Science and Artificial Intelligence, Faculty of Information Technology, Monash University, Clayton, VIC 3800, Australia
| | - David Abramson
- Research Computing Center, University of Queensland, St Lucia, QLD 4067, Australia
| | | |
Collapse
|
5
|
Maiolo M, Gatti L, Frei D, Leidi T, Gil M, Anisimova M. ProPIP: a tool for progressive multiple sequence alignment with Poisson Indel Process. BMC Bioinformatics 2021; 22:518. [PMID: 34689750 PMCID: PMC8543915 DOI: 10.1186/s12859-021-04442-8] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/29/2020] [Accepted: 10/13/2021] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Current alignment tools typically lack an explicit model of indel evolution, leading to artificially short inferred alignments (i.e., over-alignment) due to inconsistencies between the indel history and the phylogeny relating the input sequences. RESULTS We present a new progressive multiple sequence alignment tool ProPIP. The process of insertions and deletions is described using an explicit evolutionary model-the Poisson Indel Process or PIP. The method is based on dynamic programming and is implemented in a frequentist framework. The source code can be compiled on Linux, macOS and Microsoft Windows platforms. The algorithm is implemented in C++ as standalone program. The source code is freely available on GitHub at https://github.com/acg-team/ProPIP and is distributed under the terms of the GNU GPL v3 license. CONCLUSIONS The use of an explicit indel evolution model allows to avoid over-alignment, to infer gaps in a phylogenetically consistent way and to make inferences about the rates of insertions and deletions. Instead of the arbitrary gap penalties, the parameters used by ProPIP are the insertion and deletion rates, which have biological interpretation and are contextualized in a probabilistic environment. As a result, indel rate settings may be optimised in order to infer phylogenetically meaningful gap patterns.
Collapse
Affiliation(s)
- Massimo Maiolo
- Institute of Applied Simulation, School of Life Sciences and Facility Management, Zurich University of Applied Sciences (ZHAW), Schloss 1, Postfach, 8820, Wädenswil, Switzerland.,Swiss Institute of Bioinformatics (SIB), Quartier Sorge - Batiment Amphipole, 1015, Lausanne, Switzerland
| | - Lorenzo Gatti
- Institute of Applied Simulation, School of Life Sciences and Facility Management, Zurich University of Applied Sciences (ZHAW), Schloss 1, Postfach, 8820, Wädenswil, Switzerland.,Swiss Institute of Bioinformatics (SIB), Quartier Sorge - Batiment Amphipole, 1015, Lausanne, Switzerland
| | - Diego Frei
- Institute of Information Systems and Networking, University of Applied Sciences and Arts of Southern Switzerland, Galleria 2, Via Cantonale 2c, 6928, Manno, Switzerland
| | - Tiziano Leidi
- Institute of Information Systems and Networking, University of Applied Sciences and Arts of Southern Switzerland, Galleria 2, Via Cantonale 2c, 6928, Manno, Switzerland
| | - Manuel Gil
- Institute of Applied Simulation, School of Life Sciences and Facility Management, Zurich University of Applied Sciences (ZHAW), Schloss 1, Postfach, 8820, Wädenswil, Switzerland.,Swiss Institute of Bioinformatics (SIB), Quartier Sorge - Batiment Amphipole, 1015, Lausanne, Switzerland
| | - Maria Anisimova
- Institute of Applied Simulation, School of Life Sciences and Facility Management, Zurich University of Applied Sciences (ZHAW), Schloss 1, Postfach, 8820, Wädenswil, Switzerland. .,Swiss Institute of Bioinformatics (SIB), Quartier Sorge - Batiment Amphipole, 1015, Lausanne, Switzerland.
| |
Collapse
|