1. Plata G, Srinivasan K, Krishnamurthy M, Herron L, Dixit P. Designing host-associated microbiomes using the consumer/resource model. mSystems 2025; 10:e0106824. [PMID: 39651880 PMCID: PMC11748559 DOI: 10.1128/msystems.01068-24] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/22/2024] [Accepted: 11/06/2024] [Indexed: 12/18/2024] Open
Abstract
A key step toward rational microbiome engineering is in silico sampling of realistic microbial communities that correspond to desired host phenotypes, and vice versa. This remains challenging due to a lack of generative models that simultaneously capture compositions of host-associated microbiomes and host phenotypes. To that end, we present a generative model based on the mechanistic consumer/resource (C/R) framework. In the model, variation in microbial ecosystem composition arises due to differences in the availability of effective resources (inferred latent variables), while species' resource preferences remain conserved. Simultaneously, the latent variables are used to model phenotypic states of hosts. In silico microbiomes generated by our model accurately reproduce universal and dataset-specific statistics of bacterial communities. The model allows us to address three salient questions in host-associated microbial ecologies: (i) which host phenotypes maximally constrain the composition of the host-associated microbiomes? (ii) how context-specific are phenotype/microbiome associations, and (iii) what are plausible microbiome compositions that correspond to desired host phenotypes? Our approach aids the analysis and design of microbial communities associated with host phenotypes of interest.
IMPORTANCE Generative models are extremely popular in modern biology. They have been used to model the variation of protein sequences, entire genomes, and RNA sequencing profiles. Importantly, generative models have been used to extrapolate and interpolate to unobserved regimes of data to design biological systems with desired properties. For example, there has been a boom in machine-learning models aiding in the design of proteins with user-specified structures or functions. Host-associated microbiomes play important roles in animal health and disease, as well as the productivity and environmental footprint of livestock species. However, there are no generative models of host-associated microbiomes. One chief reason is that off-the-shelf machine-learning models are data hungry, and microbiome studies usually deal with large variability and small sample sizes. Moreover, microbiome compositions are heavily context dependent, with characteristics of the host and the abiotic environment leading to distinct patterns in host-microbiome associations. Consequently, off-the-shelf generative modeling has not been successfully applied to microbiomes.
To address these challenges, we develop a generative model for host-associated microbiomes derived from the consumer/resource (C/R) framework. This derivation allows us to fit the model to readily available cross-sectional microbiome profile data. Using data from three animal hosts, we show that this mechanistic generative model has several salient features: the model identifies a latent space that represents variables that determine the growth and, therefore, relative abundances of microbial species. Probabilistic modeling of variation in this latent space allows us to generate realistic in silico microbial communities. The model can assign probabilities to microbiomes, thereby allowing us to discriminate between dissimilar ecosystems. Importantly, the model predictively captures host-associated microbiomes and the corresponding hosts' phenotypes, enabling the design of microbial communities associated with user-specified host characteristics.
Affiliation(s)
- Germán Plata
  - Computational Sciences, BiomEdit, LLC., Fishers, Indiana, USA
- Karthik Srinivasan
  - Department of Biomedical Engineering, Yale University, New Haven, Connecticut, USA
- Lukas Herron
  - Department of Physics, University of Florida, Gainesville, Florida, USA
- Purushottam Dixit
  - Department of Biomedical Engineering, Yale University, New Haven, Connecticut, USA
  - Systems Biology Institute, Yale University, West Haven, Connecticut, USA
2. Jin L, Zhou Y, Zhang S, Chen SJ. mRNA vaccine sequence and structure design and optimization: Advances and challenges. J Biol Chem 2025; 301:108015. [PMID: 39608721 PMCID: PMC11728972 DOI: 10.1016/j.jbc.2024.108015] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/26/2024] [Revised: 11/13/2024] [Accepted: 11/16/2024] [Indexed: 11/30/2024] Open
Abstract
Messenger RNA (mRNA) vaccines have emerged as a powerful tool against communicable diseases and cancers, as demonstrated by their huge success during the coronavirus disease 2019 (COVID-19) pandemic. Despite the outstanding achievements, mRNA vaccines still face challenges such as stringent storage requirements, insufficient antigen expression, and unexpected immune responses. Since the intrinsic properties of mRNA molecules significantly impact vaccine performance, optimizing mRNA design is crucial in preclinical development. In this review, we outline four key principles for optimal mRNA sequence design: enhancing ribosome loading and translation efficiency through untranslated region (UTR) optimization, improving translation efficiency via codon optimization, increasing structural stability by refining global RNA sequence and extending in-cell lifetime and expression fidelity by adjusting local RNA structures. We also explore recent advancements in computational models for designing and optimizing mRNA vaccine sequences following these principles. By integrating current mRNA knowledge, addressing challenges, and examining advanced computational methods, this review aims to promote the application of computational approaches in mRNA vaccine development and inspire novel solutions to existing obstacles.
Affiliation(s)
- Lei Jin
  - Department of Physics and Astronomy, University of Missouri, Columbia, Missouri, USA
- Yuanzhe Zhou
  - Department of Physics and Astronomy, University of Missouri, Columbia, Missouri, USA
- Sicheng Zhang
  - Department of Physics and Astronomy, University of Missouri, Columbia, Missouri, USA
- Shi-Jie Chen
  - Department of Physics and Astronomy, University of Missouri, Columbia, Missouri, USA; Department of Biochemistry, MU Institute for Data Science and Informatics, University of Missouri, Columbia, Missouri, USA
3. Schuster V, Dann E, Krogh A, Teichmann SA. multiDGD: A versatile deep generative model for multi-omics data. Nat Commun 2024; 15:10031. [PMID: 39567490 PMCID: PMC11579284 DOI: 10.1038/s41467-024-53340-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/13/2024] [Accepted: 10/03/2024] [Indexed: 11/22/2024] Open
Abstract
Recent technological advancements in single-cell genomics have enabled joint profiling of gene expression and alternative modalities at unprecedented scale. Consequently, the complexity of multi-omics data sets is increasing massively. Existing models for multi-modal data are typically limited in functionality or scalability, making data integration and downstream analysis cumbersome. We present multiDGD, a scalable deep generative model providing a probabilistic framework to learn shared representations of transcriptome and chromatin accessibility. It shows outstanding performance on data reconstruction without feature selection. We demonstrate on several data sets from human and mouse that multiDGD learns well-clustered joint representations. We further find that probabilistic modeling of sample covariates enables post-hoc data integration without the need for fine-tuning. Additionally, we show that multiDGD can detect statistical associations between genes and regulatory regions conditioned on the learned representations. multiDGD is available as an scverse-compatible package on GitHub.
Affiliation(s)
- Viktoria Schuster
  - Department of Computer Science, University of Copenhagen, Universitetsparken 5, Copenhagen, 2100, Denmark
  - Center for Health Data Science, University of Copenhagen, Blegdamsvej 3B, Copenhagen, 2200, Denmark
- Emma Dann
  - Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton, Cambridge, CB10 1SA, United Kingdom
- Anders Krogh
  - Department of Computer Science, University of Copenhagen, Universitetsparken 5, Copenhagen, 2100, Denmark
  - Center for Health Data Science, University of Copenhagen, Blegdamsvej 3B, Copenhagen, 2200, Denmark
- Sarah A Teichmann
  - Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton, Cambridge, CB10 1SA, United Kingdom
  - Theory of Condensed Matter Group, Cavendish Laboratory, University of Cambridge, J J Thomson Avenue, Cambridge, CB3 0HE, United Kingdom
  - Cambridge Stem Cell Institute, Jeffrey Cheah Biomedical Centre, Puddicombe Way, Cambridge, CB2 0AW, United Kingdom
4. Boyeau P, Bates S, Ergen C, Jordan MI, Yosef N. VI-VS: calibrated identification of feature dependencies in single-cell multiomics. Genome Biol 2024; 25:294. [PMID: 39548591 PMCID: PMC11566124 DOI: 10.1186/s13059-024-03419-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/08/2024] [Accepted: 10/08/2024] [Indexed: 11/18/2024] Open
Abstract
Unveiling functional relationships between various molecular cell phenotypes from data using machine learning models is a key promise of multiomics. Existing methods either use flexible but hard-to-interpret models or simpler, misspecified models. VI-VS (Variational Inference for Variable Selection) balances flexibility and interpretability to identify relevant feature relationships in multiomic data. It uses deep generative models to identify conditionally dependent features, with false discovery rate control. VI-VS is available as an open-source Python package, providing a robust solution to identify features more likely representing genuine causal relationships.
Affiliation(s)
- Pierre Boyeau
  - Department of Electrical Engineering and Computer Sciences, University of California, Berkeley, USA
- Stephen Bates
  - Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, Cambridge, USA
- Can Ergen
  - Department of Electrical Engineering and Computer Sciences, University of California, Berkeley, USA
  - Center for Computational Biology, University of California, Berkeley, USA
- Michael I Jordan
  - Department of Electrical Engineering and Computer Sciences, University of California, Berkeley, USA
  - Department of Statistics, University of California, Berkeley, USA
  - Center for Computational Biology, University of California, Berkeley, USA
  - Inria, Paris, France
- Nir Yosef
  - Department of Electrical Engineering and Computer Sciences, University of California, Berkeley, USA
  - Department of Systems Immunology, Weizmann Institute of Science, Rehovot, Israel
5. He Z, Hu S, Chen Y, An S, Zhou J, Liu R, Shi J, Wang J, Dong G, Shi J, Zhao J, Ou-Yang L, Zhu Y, Bo X, Ying X. Mosaic integration and knowledge transfer of single-cell multimodal data with MIDAS. Nat Biotechnol 2024; 42:1594-1605. [PMID: 38263515 PMCID: PMC11471558 DOI: 10.1038/s41587-023-02040-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/13/2022] [Accepted: 10/23/2023] [Indexed: 01/25/2024]
Abstract
Integrating single-cell datasets produced by multiple omics technologies is essential for defining cellular heterogeneity. Mosaic integration, in which different datasets share only some of the measured modalities, poses major challenges, particularly regarding modality alignment and batch effect removal. Here, we present a deep probabilistic framework for the mosaic integration and knowledge transfer (MIDAS) of single-cell multimodal data. MIDAS simultaneously achieves dimensionality reduction, imputation and batch correction of mosaic data by using self-supervised modality alignment and information-theoretic latent disentanglement. We demonstrate its superiority to 19 other methods and reliability by evaluating its performance in trimodal and mosaic integration tasks. We also constructed a single-cell trimodal atlas of human peripheral blood mononuclear cells and tailored transfer learning and reciprocal reference mapping schemes to enable flexible and accurate knowledge transfer from the atlas to new data. Applications in mosaic integration, pseudotime analysis and cross-tissue knowledge transfer on bone marrow mosaic datasets demonstrate the versatility and superiority of MIDAS. MIDAS is available at https://github.com/labomics/midas .
Affiliation(s)
- Zhen He
  - Center for Computational Biology, Beijing Institute of Basic Medical Sciences, Beijing, China
- Shuofeng Hu
  - Center for Computational Biology, Beijing Institute of Basic Medical Sciences, Beijing, China
- Yaowen Chen
  - Center for Computational Biology, Beijing Institute of Basic Medical Sciences, Beijing, China
- Sijing An
  - Center for Computational Biology, Beijing Institute of Basic Medical Sciences, Beijing, China
- Jiahao Zhou
  - Center for Computational Biology, Beijing Institute of Basic Medical Sciences, Beijing, China
  - College of Electronics and Information Engineering, Shenzhen University, Shenzhen, China
- Runyan Liu
  - Center for Computational Biology, Beijing Institute of Basic Medical Sciences, Beijing, China
- Junfeng Shi
  - School of Automation, China University of Geosciences, Wuhan, China
- Jing Wang
  - Center for Computational Biology, Beijing Institute of Basic Medical Sciences, Beijing, China
- Guohua Dong
  - Center for Computational Biology, Beijing Institute of Basic Medical Sciences, Beijing, China
- Jinhui Shi
  - Center for Computational Biology, Beijing Institute of Basic Medical Sciences, Beijing, China
- Jiaxin Zhao
  - Center for Computational Biology, Beijing Institute of Basic Medical Sciences, Beijing, China
- Le Ou-Yang
  - College of Electronics and Information Engineering, Shenzhen University, Shenzhen, China
- Yuan Zhu
  - School of Automation, China University of Geosciences, Wuhan, China
- Xiaochen Bo
  - Institute of Health Service and Transfusion Medicine, Beijing, China
- Xiaomin Ying
  - Center for Computational Biology, Beijing Institute of Basic Medical Sciences, Beijing, China
6. Luo E, Hao M, Wei L, Zhang X. scDiffusion: conditional generation of high-quality single-cell data using diffusion model. Bioinformatics 2024; 40:btae518. [PMID: 39171840 PMCID: PMC11368386 DOI: 10.1093/bioinformatics/btae518] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/06/2024] [Revised: 08/10/2024] [Accepted: 08/20/2024] [Indexed: 08/23/2024] Open
Abstract
MOTIVATION Single-cell RNA sequencing (scRNA-seq) data are important for studying the laws of life at the single-cell level. However, it is still challenging to obtain enough high-quality scRNA-seq data. To mitigate the limited availability of data, generative models have been proposed to computationally generate synthetic scRNA-seq data. Nevertheless, the data generated with current models are not yet very realistic, especially when we need to generate data with controlled conditions. In the meantime, diffusion models have shown their power in generating data with high fidelity, providing a new opportunity for scRNA-seq generation.
RESULTS In this study, we developed scDiffusion, a generative model combining the diffusion model and foundation model to generate high-quality scRNA-seq data with controlled conditions. We designed multiple classifiers to guide the diffusion process simultaneously, enabling scDiffusion to generate data under multiple condition combinations. We also proposed a new control strategy called Gradient Interpolation. This strategy allows the model to generate continuous trajectories of cell development from a given cell state. Experiments showed that scDiffusion could generate single-cell gene expression data closely resembling real scRNA-seq data. Also, scDiffusion can conditionally produce data on specific cell types including rare cell types. Furthermore, we could use the multiple-condition generation of scDiffusion to generate cell types that were absent from the training data. Leveraging the Gradient Interpolation strategy, we generated a continuous developmental trajectory of mouse embryonic cells. These experiments demonstrate that scDiffusion is a powerful tool for augmenting the real scRNA-seq data and can provide insights into cell fate research.
AVAILABILITY AND IMPLEMENTATION scDiffusion is openly available at the GitHub repository https://github.com/EperLuo/scDiffusion or Zenodo https://zenodo.org/doi/10.5281/zenodo.13268742.
Affiliation(s)
- Erpai Luo
  - MOE Key Lab of Bioinformatics and Bioinformatics Division of BNRIST, Department of Automation, Tsinghua University, Beijing 100084, China
- Minsheng Hao
  - MOE Key Lab of Bioinformatics and Bioinformatics Division of BNRIST, Department of Automation, Tsinghua University, Beijing 100084, China
- Lei Wei
  - MOE Key Lab of Bioinformatics and Bioinformatics Division of BNRIST, Department of Automation, Tsinghua University, Beijing 100084, China
- Xuegong Zhang
  - MOE Key Lab of Bioinformatics and Bioinformatics Division of BNRIST, Department of Automation, Tsinghua University, Beijing 100084, China
  - School of Life Sciences and School of Medicine, Center for Synthetic and Systems Biology, Tsinghua University, Beijing 100084, China
7. Huang W, Liu H. Predicting single-cell cellular responses to perturbations using cycle consistency learning. Bioinformatics 2024; 40:i462-i470. [PMID: 38940153 PMCID: PMC11256949 DOI: 10.1093/bioinformatics/btae248] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/29/2024] Open
Abstract
SUMMARY Phenotype-based drug screening has emerged as a powerful approach for identifying compounds that actively interact with cells. Transcriptional and proteomic profiling of cell lines and individual cells provides insights into the cellular state alterations that occur at the molecular level in response to external perturbations, such as drugs or genetic manipulations. In this paper, we propose cycleCDR, a novel deep learning framework to predict cellular response to external perturbations. We leverage the autoencoder to map the unperturbed cellular states to a latent space, in which we postulate that the effects of drug perturbations on cellular states follow a linear additive model. Next, we introduce cycle consistency constraints to ensure that an unperturbed cellular state subjected to drug perturbation in the latent space produces the perturbed cellular state through the decoder. Conversely, removal of perturbations from the perturbed cellular states can restore the unperturbed cellular state. The cycle consistency constraints and linear modeling in the latent space enable the model to learn transferable representations of external perturbations, so that it can generalize well to drugs unseen during the training stage. We validate our model on four different types of datasets, including bulk transcriptional responses, bulk proteomic responses, and single-cell transcriptional responses to drug/gene perturbations. The experimental results demonstrate that our model consistently outperforms existing state-of-the-art methods, indicating our method is highly versatile and applicable to a wide range of scenarios.
AVAILABILITY AND IMPLEMENTATION The source code is available at: https://github.com/hliulab/cycleCDR.
Affiliation(s)
- Wei Huang
  - College of Computer and Information Engineering, Nanjing Tech University, Nanjing, Jiangsu 211816, China
- Hui Liu
  - College of Computer and Information Engineering, Nanjing Tech University, Nanjing, Jiangsu 211816, China
8. Rivero-Garcia I, Torres M, Sánchez-Cabo F. Deep generative models in single-cell omics. Comput Biol Med 2024; 176:108561. [PMID: 38749321 DOI: 10.1016/j.compbiomed.2024.108561] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/26/2024] [Revised: 04/30/2024] [Accepted: 05/05/2024] [Indexed: 05/31/2024]
Abstract
Deep Generative Models (DGMs) are becoming instrumental for inferring probability distributions inherent to complex processes, such as most questions in biomedical research. For many years, there was a lack of mathematical methods that would allow this inference in the scarce data scenario of biomedical research. The advent of single-cell omics has finally made square the so-called "skinny matrix", allowing the application of mathematical methods already extensively used in other areas. Moreover, it is now possible to integrate data at different molecular levels in thousands or even millions of samples, thanks to the number of single-cell atlases being collaboratively generated. Additionally, DGMs have proven useful in other frequent tasks in single-cell analysis pipelines, from dimensionality reduction and cell type annotation to RNA velocity inference. In spite of their promise, DGMs need to be used with caution in biomedical research, paying special attention to their use to answer the right questions and to the definition of appropriate error metrics and validation check points that confirm not only their correct use but also their relevance. All in all, DGMs provide an exciting tool that opens a bright future for the integrative analysis of single-cell -omics to understand health and disease.
Affiliation(s)
- Inés Rivero-Garcia
  - Universidad Politécnica de Madrid, Madrid, 28040, Spain; Centro Nacional de Investigaciones Cardiovasculares (CNIC), Madrid, 28029, Spain
- Miguel Torres
  - Centro Nacional de Investigaciones Cardiovasculares (CNIC), Madrid, 28029, Spain
- Fátima Sánchez-Cabo
  - Centro Nacional de Investigaciones Cardiovasculares (CNIC), Madrid, 28029, Spain
9. Maizels RJ. A dynamical perspective: moving towards mechanism in single-cell transcriptomics. Philos Trans R Soc Lond B Biol Sci 2024; 379:20230049. [PMID: 38432314 PMCID: PMC10909508 DOI: 10.1098/rstb.2023.0049] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 07/10/2023] [Accepted: 10/31/2023] [Indexed: 03/05/2024] Open
Abstract
As the field of single-cell transcriptomics matures, research is shifting focus from phenomenological descriptions of cellular phenotypes to a mechanistic understanding of the gene regulation underneath. This perspective considers the value of capturing dynamical information at single-cell resolution for gaining mechanistic insight; reviews the available technologies for recording and inferring temporal information in single cells; and explores whether better dynamical resolution is sufficient to adequately capture the causal relationships driving complex biological systems. This article is part of a discussion meeting issue 'Causes and consequences of stochastic processes in development and disease'.
Affiliation(s)
- Rory J. Maizels
  - The Francis Crick Institute, 1 Midland Road, London NW1 1AT, UK
  - University College London, London WC1E 6BT, UK
10. Cusworth S, Gkoutos GV, Acharjee A. A novel generative adversarial networks modelling for the class imbalance problem in high dimensional omics data. BMC Med Inform Decis Mak 2024; 24:90. [PMID: 38549123 PMCID: PMC10979623 DOI: 10.1186/s12911-024-02487-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/02/2023] [Accepted: 03/22/2024] [Indexed: 04/01/2024] Open
Abstract
Class imbalance remains a large problem in high-throughput omics analyses, causing bias towards the over-represented class when training machine learning-based classifiers. Oversampling is a common method used to balance classes, allowing for better generalization of the training data. More naive approaches can introduce other biases into the data, being especially sensitive to inaccuracies in the training data, a problem considering the characteristically noisy data obtained in healthcare. This is especially a problem with high-dimensional data. A generative adversarial network-based method is proposed for creating synthetic samples from small, high-dimensional data, to improve upon other more naive generative approaches. The method was compared with 'synthetic minority over-sampling technique' (SMOTE) and 'random oversampling' (RO). Generative methods were validated by training classifiers on the balanced data.
Affiliation(s)
- Samuel Cusworth
  - Institute of Applied Health Research, University of Birmingham, Birmingham, UK
  - NIHR Blood and Transplant Research Unit (BTRU) in Precision Transplant and Cellular Therapeutics, University of Birmingham, Birmingham, UK
- Georgios V Gkoutos
  - College of Medical and Dental Sciences, Institute of Cancer and Genomic Sciences, University of Birmingham, B15 2TT, Birmingham, UK
  - Institute of Translational Medicine, University Hospitals Birmingham NHS Foundation Trust, B15 2TT, Birmingham, UK
  - MRC Health Data Research UK (HDR), Midlands Site, UK
  - Centre for Health Data Research, University of Birmingham, B15 2TT, Birmingham, UK
  - NIHR Experimental Cancer Medicine Centre, B15 2TT, Birmingham, UK
- Animesh Acharjee
  - College of Medical and Dental Sciences, Institute of Cancer and Genomic Sciences, University of Birmingham, B15 2TT, Birmingham, UK
  - Institute of Translational Medicine, University Hospitals Birmingham NHS Foundation Trust, B15 2TT, Birmingham, UK
  - MRC Health Data Research UK (HDR), Midlands Site, UK
  - Centre for Health Data Research, University of Birmingham, B15 2TT, Birmingham, UK
11. Gayoso A, Weiler P, Lotfollahi M, Klein D, Hong J, Streets A, Theis FJ, Yosef N. Deep generative modeling of transcriptional dynamics for RNA velocity analysis in single cells. Nat Methods 2024; 21:50-59. [PMID: 37735568 PMCID: PMC10776389 DOI: 10.1038/s41592-023-01994-w] [Citation(s) in RCA: 30] [Impact Index Per Article: 30.0] [Received: 08/12/2022] [Accepted: 08/08/2023] [Indexed: 09/23/2023]
Abstract
RNA velocity has been rapidly adopted to guide interpretation of transcriptional dynamics in snapshot single-cell data; however, current approaches for estimating RNA velocity lack effective strategies for quantifying uncertainty and determining the overall applicability to the system of interest. Here, we present veloVI (velocity variational inference), a deep generative modeling framework for estimating RNA velocity. veloVI learns a gene-specific dynamical model of RNA metabolism and provides a transcriptome-wide quantification of velocity uncertainty. We show that veloVI compares favorably to previous approaches with respect to goodness of fit, consistency across transcriptionally similar cells and stability across preprocessing pipelines for quantifying RNA abundance. Further, we demonstrate that veloVI's posterior velocity uncertainty can be used to assess whether velocity analysis is appropriate for a given dataset. Finally, we highlight veloVI as a flexible framework for modeling transcriptional dynamics by adapting the underlying dynamical model to use time-dependent transcription rates.
Affiliation(s)
- Adam Gayoso
  - Center for Computational Biology, University of California, Berkeley, Berkeley, CA, USA
- Philipp Weiler
  - Institute of Computational Biology, Helmholtz Center Munich, Munich, Germany
  - Department of Mathematics, Technical University of Munich, Munich, Germany
- Mohammad Lotfollahi
  - Institute of Computational Biology, Helmholtz Center Munich, Munich, Germany
  - Wellcome Sanger Institute, Cambridge, UK
- Dominik Klein
  - Institute of Computational Biology, Helmholtz Center Munich, Munich, Germany
  - Department of Mathematics, Technical University of Munich, Munich, Germany
- Justin Hong
  - Center for Computational Biology, University of California, Berkeley, Berkeley, CA, USA
  - Department of Computer Science, Columbia University, New York, NY, USA
- Aaron Streets
  - Center for Computational Biology, University of California, Berkeley, Berkeley, CA, USA
  - Department of Bioengineering, University of California, Berkeley, Berkeley, CA, USA
  - Chan Zuckerberg Biohub, San Francisco, CA, USA
- Fabian J Theis
  - Institute of Computational Biology, Helmholtz Center Munich, Munich, Germany
  - Department of Mathematics, Technical University of Munich, Munich, Germany
  - TUM School of Life Sciences Weihenstephan, Technical University of Munich, Munich, Germany
- Nir Yosef
  - Center for Computational Biology, University of California, Berkeley, Berkeley, CA, USA
  - Department of Electrical Engineering and Computer Sciences, University of California, Berkeley, Berkeley, CA, USA
12. Plattner C, Lamberti G, Blattmann P, Kirchmair A, Rieder D, Loncova Z, Sturm G, Scheidl S, Ijsselsteijn M, Fotakis G, Noureen A, Lisandrelli R, Böck N, Nemati N, Krogsdam A, Daum S, Finotello F, Somarakis A, Schäfer A, Wilflingseder D, Gonzalez Acera M, Öfner D, Huber LA, Clevers H, Becker C, Farin HF, Greten FR, Aebersold R, de Miranda NF, Trajanoski Z. Functional and spatial proteomics profiling reveals intra- and intercellular signaling crosstalk in colorectal cancer. iScience 2023; 26:108399. [PMID: 38047086 PMCID: PMC10692669 DOI: 10.1016/j.isci.2023.108399] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Received: 12/15/2022] [Revised: 04/21/2023] [Accepted: 11/02/2023] [Indexed: 12/05/2023] Open
Abstract
Precision oncology approaches for patients with colorectal cancer (CRC) continue to lag behind other solid cancers. Functional precision oncology, a strategy that is based on perturbing primary tumor cells from cancer patients, could provide a road forward to personalize treatment. We extend this paradigm to measuring proteome activity landscapes by acquiring quantitative phosphoproteomic data from patient-derived organoids (PDOs). We show that kinase inhibitors induce inhibitor- and patient-specific off-target effects and pathway crosstalk. Reconstruction of the kinase networks revealed that the signaling rewiring is modestly affected by mutations. We show non-genetic heterogeneity of the PDOs and upregulation of stemness and differentiation genes by kinase inhibitors. Using imaging mass-cytometry-based profiling of the primary tumors, we characterize the tumor microenvironment (TME) and determine spatial heterocellular crosstalk and tumor-immune cell interactions. Collectively, we provide a framework for inferring tumor cell intrinsic signaling and external signaling from the TME to inform precision (immuno-) oncology in CRC.
Affiliation(s)
- Christina Plattner
- Biocenter, Institute of Bioinformatics, Medical University of Innsbruck, 6020 Innsbruck, Austria
| | - Giorgia Lamberti
- Biocenter, Institute of Bioinformatics, Medical University of Innsbruck, 6020 Innsbruck, Austria
| | - Peter Blattmann
- Department of Biology, Institute of Molecular Systems Biology, ETH Zurich, 8092 Zurich, Switzerland
| | - Alexander Kirchmair
- Biocenter, Institute of Bioinformatics, Medical University of Innsbruck, 6020 Innsbruck, Austria
| | - Dietmar Rieder
- Biocenter, Institute of Bioinformatics, Medical University of Innsbruck, 6020 Innsbruck, Austria
| | - Zuzana Loncova
- Biocenter, Institute of Bioinformatics, Medical University of Innsbruck, 6020 Innsbruck, Austria
| | - Gregor Sturm
- Biocenter, Institute of Bioinformatics, Medical University of Innsbruck, 6020 Innsbruck, Austria
| | - Stefan Scheidl
- Department of Visceral, Transplant and Thoracic Surgery, Medical University of Innsbruck, 6020 Innsbruck, Austria
| | - Marieke Ijsselsteijn
- Department of Pathology, Leiden University Medical Center, 2333 ZA Leiden, the Netherlands
| | - Georgios Fotakis
- Biocenter, Institute of Bioinformatics, Medical University of Innsbruck, 6020 Innsbruck, Austria
| | - Asma Noureen
- Biocenter, Institute of Bioinformatics, Medical University of Innsbruck, 6020 Innsbruck, Austria
| | - Rebecca Lisandrelli
- Biocenter, Institute of Bioinformatics, Medical University of Innsbruck, 6020 Innsbruck, Austria
| | - Nina Böck
- Biocenter, Institute of Bioinformatics, Medical University of Innsbruck, 6020 Innsbruck, Austria
| | - Niloofar Nemati
- Biocenter, Institute of Bioinformatics, Medical University of Innsbruck, 6020 Innsbruck, Austria
| | - Anne Krogsdam
- Biocenter, Institute of Bioinformatics, Medical University of Innsbruck, 6020 Innsbruck, Austria
| | - Sophia Daum
- Biocenter, Institute of Bioinformatics, Medical University of Innsbruck, 6020 Innsbruck, Austria
| | - Francesca Finotello
- Biocenter, Institute of Bioinformatics, Medical University of Innsbruck, 6020 Innsbruck, Austria
| | - Antonios Somarakis
- Department of Radiology, Leiden University Medical Center, 2333 ZA Leiden, the Netherlands
| | - Alexander Schäfer
- Department of Biology, Institute of Molecular Systems Biology, ETH Zurich, 8092 Zurich, Switzerland
| | - Doris Wilflingseder
- Institute of Hygiene and Medical Microbiology, Medical University of Innsbruck, 6020 Innsbruck, Austria
| | - Miguel Gonzalez Acera
- Department of Medicine 1, Friedrich-Alexander Universität Erlangen-Nürnberg (FAU) and Universitätsklinikum Erlangen, 91054 Erlangen, Germany
| | - Dietmar Öfner
- Department of Visceral, Transplant and Thoracic Surgery, Medical University of Innsbruck, 6020 Innsbruck, Austria
| | - Lukas A. Huber
- Biocenter, Institute of Cell Biology, Medical University of Innsbruck, 6020 Innsbruck, Austria
| | - Hans Clevers
- Hubrecht Institute, 3584 CT Utrecht, the Netherlands
| | - Christoph Becker
- Department of Medicine 1, Friedrich-Alexander Universität Erlangen-Nürnberg (FAU) and Universitätsklinikum Erlangen, 91054 Erlangen, Germany
| | - Henner F. Farin
- Institute for Tumor Biology and Experimental Therapy, Georg-Speyer-Haus, 60596 Frankfurt am Main, Germany
- Frankfurt Cancer Institute, Goethe University, 60596 Frankfurt am Main, Germany
- German Cancer Consortium (DKTK), partner site Frankfurt/Mainz, a partnership with DKFZ Heidelberg, Frankfurt/Mainz, Germany
| | - Florian R. Greten
- Institute for Tumor Biology and Experimental Therapy, Georg-Speyer-Haus, 60596 Frankfurt am Main, Germany
- Frankfurt Cancer Institute, Goethe University, 60596 Frankfurt am Main, Germany
- German Cancer Consortium (DKTK), partner site Frankfurt/Mainz, a partnership with DKFZ Heidelberg, Frankfurt/Mainz, Germany
| | - Ruedi Aebersold
- Department of Biology, Institute of Molecular Systems Biology, ETH Zurich, 8092 Zurich, Switzerland
| | - Noel F.C.C. de Miranda
- Department of Pathology, Leiden University Medical Center, 2333 ZA Leiden, the Netherlands
| | - Zlatko Trajanoski
- Biocenter, Institute of Bioinformatics, Medical University of Innsbruck, 6020 Innsbruck, Austria
|
13
|
Barghout RA, Xu Z, Betala S, Mahadevan R. Advances in generative modeling methods and datasets to design novel enzymes for renewable chemicals and fuels. Curr Opin Biotechnol 2023; 84:103007. [PMID: 37931573 DOI: 10.1016/j.copbio.2023.103007] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/19/2023] [Revised: 09/12/2023] [Accepted: 09/13/2023] [Indexed: 11/08/2023]
Abstract
Biotechnology has revolutionized the development of sustainable energy sources by harnessing biomass as a feedstock for energy production. However, challenges such as recalcitrant feedstocks and inefficient metabolic pathways hinder the large-scale integration of renewable energy systems. Enzyme engineering has emerged as a powerful tool to address these challenges by enhancing enzyme activity, specificity, and stability. Generative machine learning (ML) models have shown great promise in accelerating protein design, allowing for the generation of novel protein sequences with desired properties by navigating vast spaces. This review paper aims to summarize the state of the art in generative models for protein design and how they can be applied to bioenergy applications, including the underlying architectures and training strategies. Additionally, it highlights the importance of high-quality datasets for training and evaluating generative models, organizes available datasets for generative protein design, and discusses the potential of applying generative models to strain design for bioenergy production.
Affiliation(s)
- Rana A Barghout
- Department of Chemical Engineering & Applied Chemistry, University of Toronto, 200 College St, Toronto, ON, Canada.
| | - Zhiqing Xu
- Department of Chemical Engineering & Applied Chemistry, University of Toronto, 200 College St, Toronto, ON, Canada
| | - Siddharth Betala
- Department of Biotechnology, Bhupat and Jyoti Mehta School of Biosciences, Indian Institute of Technology Madras, Chennai, India
| | - Radhakrishnan Mahadevan
- Department of Chemical Engineering & Applied Chemistry, University of Toronto, 200 College St, Toronto, ON, Canada
|
14
|
Alexandrov T, Saez‐Rodriguez J, Saka SK. Enablers and challenges of spatial omics, a melting pot of technologies. Mol Syst Biol 2023; 19:e10571. [PMID: 37842805 PMCID: PMC10632737 DOI: 10.15252/msb.202110571] [Citation(s) in RCA: 11] [Impact Index Per Article: 5.5] [Received: 07/16/2021] [Revised: 07/31/2023] [Accepted: 08/03/2023] [Indexed: 10/17/2023] Open
Abstract
Spatial omics has emerged as a rapidly growing and fruitful field with hundreds of publications presenting novel methods for obtaining spatially resolved information for any omics data type on spatial scales ranging from subcellular to organismal. From a technology development perspective, spatial omics is a highly interdisciplinary field that integrates imaging and omics, spatial and molecular analyses, sequencing and mass spectrometry, and image analysis and bioinformatics. The emergence of this field has not only opened a window into spatial biology, but also created multiple novel opportunities, questions, and challenges for method developers. Here, we provide the perspective of technology developers on what makes the spatial omics field unique. After providing a brief overview of the state of the art, we discuss technological enablers and challenges and present our vision about the future applications and impact of this melting pot.
Affiliation(s)
- Theodore Alexandrov
- Structural and Computational Biology Unit, European Molecular Biology Laboratory, Heidelberg, Germany
- Molecular Medicine Partnership Unit, European Molecular Biology Laboratory, Heidelberg, Germany
- BioInnovation Institute, Copenhagen, Denmark
| | - Julio Saez‐Rodriguez
- Molecular Medicine Partnership Unit, European Molecular Biology Laboratory, Heidelberg, Germany
- Faculty of Medicine and Heidelberg University Hospital, Institute for Computational Biomedicine, Heidelberg University, Heidelberg, Germany
| | - Sinem K Saka
- Genome Biology Unit, European Molecular Biology Laboratory, Heidelberg, Germany
|
15
|
Michoel T, Zhang JD. Causal inference in drug discovery and development. Drug Discov Today 2023; 28:103737. [PMID: 37591410 DOI: 10.1016/j.drudis.2023.103737] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/11/2022] [Revised: 07/31/2023] [Accepted: 08/10/2023] [Indexed: 08/19/2023]
Abstract
To discover new drugs is to seek and to prove causality. As an emerging approach leveraging human knowledge and creativity, data, and machine intelligence, causal inference holds the promise of reducing cognitive bias and improving decision-making in drug discovery. Although it has been applied across the value chain, the concepts and practice of causal inference remain obscure to many practitioners. This article offers a nontechnical introduction to causal inference, reviews its recent applications, and discusses opportunities and challenges of adopting the causal language in drug discovery and development.
Affiliation(s)
- Tom Michoel
- Computational Biology Unit, Department of Informatics, University of Bergen, Postboks 7803, 5020 Bergen, Norway
| | - Jitao David Zhang
- Pharma Early Research and Development, Roche Innovation Centre Basel, F. Hoffmann-La Roche, Grenzacherstrasse 124, 4070 Basel, Switzerland; Department of Mathematics and Computer Science, University of Basel, Spiegelgasse 1, 4051 Basel, Switzerland.
|
16
|
Valeri JA, Soenksen LR, Collins KM, Ramesh P, Cai G, Powers R, Angenent-Mari NM, Camacho DM, Wong F, Lu TK, Collins JJ. BioAutoMATED: An end-to-end automated machine learning tool for explanation and design of biological sequences. Cell Syst 2023; 14:525-542.e9. [PMID: 37348466 PMCID: PMC10700034 DOI: 10.1016/j.cels.2023.05.007] [Citation(s) in RCA: 7] [Impact Index Per Article: 3.5] [Received: 04/30/2022] [Revised: 02/17/2023] [Accepted: 05/22/2023] [Indexed: 06/24/2023]
Abstract
The design choices underlying machine-learning (ML) models present important barriers to entry for many biologists who aim to incorporate ML in their research. Automated machine-learning (AutoML) algorithms can address many challenges that come with applying ML to the life sciences. However, these algorithms are rarely used in systems and synthetic biology studies because they typically do not explicitly handle biological sequences (e.g., nucleotide, amino acid, or glycan sequences) and cannot be easily compared with other AutoML algorithms. Here, we present BioAutoMATED, an AutoML platform for biological sequence analysis that integrates multiple AutoML methods into a unified framework. Users are automatically provided with relevant techniques for analyzing, interpreting, and designing biological sequences. BioAutoMATED predicts gene regulation, peptide-drug interactions, and glycan annotation, and designs optimized synthetic biology components, revealing salient sequence characteristics. By automating sequence modeling, BioAutoMATED allows life scientists to incorporate ML more readily into their work.
Affiliation(s)
- Jacqueline A Valeri
- Department of Biological Engineering, Massachusetts Institute of Technology, 77 Massachusetts Ave, Cambridge, MA 02139, USA; Institute for Medical Engineering and Science, Massachusetts Institute of Technology, 77 Massachusetts Ave, Cambridge, MA 02139, USA; Wyss Institute for Biologically Inspired Engineering, Harvard University, Boston, MA 02115, USA; Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA
| | - Luis R Soenksen
- Institute for Medical Engineering and Science, Massachusetts Institute of Technology, 77 Massachusetts Ave, Cambridge, MA 02139, USA; Wyss Institute for Biologically Inspired Engineering, Harvard University, Boston, MA 02115, USA; Department of Mechanical Engineering, Massachusetts Institute of Technology, 77 Massachusetts Ave, Cambridge, MA 02139, USA
| | - Katherine M Collins
- Wyss Institute for Biologically Inspired Engineering, Harvard University, Boston, MA 02115, USA; Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, 77 Massachusetts Ave, Cambridge, MA 02139, USA; Department of Brain and Cognitive Sciences, Massachusetts Institute of Technology, 77 Massachusetts Ave, Cambridge, MA 02139, USA; Department of Engineering, University of Cambridge, Trumpington St, Cambridge CB2 1PZ, UK
| | - Pradeep Ramesh
- Wyss Institute for Biologically Inspired Engineering, Harvard University, Boston, MA 02115, USA
| | - George Cai
- Wyss Institute for Biologically Inspired Engineering, Harvard University, Boston, MA 02115, USA
| | - Rani Powers
- Wyss Institute for Biologically Inspired Engineering, Harvard University, Boston, MA 02115, USA; Pluto Biosciences, Golden, CO 80402, USA
| | - Nicolaas M Angenent-Mari
- Department of Biological Engineering, Massachusetts Institute of Technology, 77 Massachusetts Ave, Cambridge, MA 02139, USA; Institute for Medical Engineering and Science, Massachusetts Institute of Technology, 77 Massachusetts Ave, Cambridge, MA 02139, USA; Wyss Institute for Biologically Inspired Engineering, Harvard University, Boston, MA 02115, USA
| | - Diogo M Camacho
- Wyss Institute for Biologically Inspired Engineering, Harvard University, Boston, MA 02115, USA
| | - Felix Wong
- Department of Biological Engineering, Massachusetts Institute of Technology, 77 Massachusetts Ave, Cambridge, MA 02139, USA; Institute for Medical Engineering and Science, Massachusetts Institute of Technology, 77 Massachusetts Ave, Cambridge, MA 02139, USA; Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA
| | - Timothy K Lu
- Department of Biological Engineering, Massachusetts Institute of Technology, 77 Massachusetts Ave, Cambridge, MA 02139, USA; Institute for Medical Engineering and Science, Massachusetts Institute of Technology, 77 Massachusetts Ave, Cambridge, MA 02139, USA; Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA; Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, 77 Massachusetts Ave, Cambridge, MA 02139, USA; Synthetic Biology Group, Research Laboratory of Electronics, Massachusetts Institute of Technology, Cambridge, MA 02139, USA
| | - James J Collins
- Department of Biological Engineering, Massachusetts Institute of Technology, 77 Massachusetts Ave, Cambridge, MA 02139, USA; Institute for Medical Engineering and Science, Massachusetts Institute of Technology, 77 Massachusetts Ave, Cambridge, MA 02139, USA; Wyss Institute for Biologically Inspired Engineering, Harvard University, Boston, MA 02115, USA; Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA; Harvard-MIT Program in Health Sciences and Technology, Cambridge, MA 02139, USA; Abdul Latif Jameel Clinic for Machine Learning in Health, Massachusetts Institute of Technology, Cambridge, MA 02139, USA.
|
17
|
Boyeau P, Regier J, Gayoso A, Jordan MI, Lopez R, Yosef N. An empirical Bayes method for differential expression analysis of single cells with deep generative models. Proc Natl Acad Sci U S A 2023; 120:e2209124120. [PMID: 37192164 PMCID: PMC10214125 DOI: 10.1073/pnas.2209124120] [Citation(s) in RCA: 18] [Impact Index Per Article: 9.0] [Received: 05/26/2022] [Accepted: 01/23/2023] [Indexed: 05/18/2023] Open
Abstract
Detecting differentially expressed genes is important for characterizing subpopulations of cells. In scRNA-seq data, however, nuisance variation due to technical factors like sequencing depth and RNA capture efficiency obscures the underlying biological signal. Deep generative models have been extensively applied to scRNA-seq data, with a special focus on embedding cells into a low-dimensional latent space and correcting for batch effects. However, little attention has been paid to the problem of utilizing the uncertainty from the deep generative model for differential expression (DE). Furthermore, the existing approaches do not allow for controlling for effect size or the false discovery rate (FDR). Here, we present lvm-DE, a generic Bayesian approach for performing DE predictions from a fitted deep generative model, while controlling the FDR. We apply the lvm-DE framework to scVI and scSphere, two deep generative models. The resulting approaches outperform state-of-the-art methods at estimating the log fold change in gene expression levels as well as detecting differentially expressed genes between subpopulations of cells.
Affiliation(s)
- Pierre Boyeau
- Department of Electrical Engineering and Computer Sciences, University of California, Berkeley, CA 94720
| | - Jeffrey Regier
- Department of Statistics, University of Michigan, Ann Arbor, MI 48109
| | - Adam Gayoso
- Center for Computational Biology, University of California, Berkeley, CA 94720
| | - Michael I. Jordan
- Department of Electrical Engineering and Computer Sciences, University of California, Berkeley, CA 94720
- Center for Computational Biology, University of California, Berkeley, CA 94720
- Department of Statistics, University of California, Berkeley, CA 94720
| | - Romain Lopez
- Department of Electrical Engineering and Computer Sciences, University of California, Berkeley, CA 94720
| | - Nir Yosef
- Department of Electrical Engineering and Computer Sciences, University of California, Berkeley, CA 94720
- Center for Computational Biology, University of California, Berkeley, CA 94720
- Department of Systems Immunology, Weizmann Institute of Science, Rehovot 76100, Israel
|
18
|
Lotfollahi M, Klimovskaia Susmelj A, De Donno C, Hetzel L, Ji Y, Ibarra IL, Srivatsan SR, Naghipourfar M, Daza RM, Martin B, Shendure J, McFaline-Figueroa JL, Boyeau P, Wolf FA, Yakubova N, Günnemann S, Trapnell C, Lopez-Paz D, Theis FJ. Predicting cellular responses to complex perturbations in high-throughput screens. Mol Syst Biol 2023:e11517. [PMID: 37154091 DOI: 10.15252/msb.202211517] [Citation(s) in RCA: 77] [Impact Index Per Article: 38.5] [Received: 12/21/2022] [Revised: 03/23/2023] [Accepted: 03/31/2023] [Indexed: 05/10/2023] Open
Abstract
Recent advances in multiplexed single-cell transcriptomics experiments facilitate the high-throughput study of drug and genetic perturbations. However, an exhaustive exploration of the combinatorial perturbation space is experimentally unfeasible. Therefore, computational methods are needed to predict, interpret, and prioritize perturbations. Here, we present the compositional perturbation autoencoder (CPA), which combines the interpretability of linear models with the flexibility of deep-learning approaches for single-cell response modeling. CPA learns to predict in silico the transcriptional perturbation response at the single-cell level for unseen dosages, cell types, time points, and species. Using newly generated single-cell drug combination data, we validate that CPA can predict unseen drug combinations while outperforming baseline models. Additionally, the architecture's modularity enables incorporating the chemical representation of the drugs, allowing the prediction of cellular response to completely unseen drugs. Furthermore, CPA is also applicable to genetic combinatorial screens. We demonstrate this by imputing in silico 5,329 missing combinations (97.6% of all possibilities) in a single-cell Perturb-seq experiment with diverse genetic interactions. We envision that CPA will facilitate efficient experimental design and hypothesis generation by enabling in silico response prediction at the single-cell level and thus accelerate therapeutic applications using single-cell technologies.
Affiliation(s)
- Mohammad Lotfollahi
- Helmholtz Center Munich - German Research Center for Environmental Health, Institute of Computational Biology, Munich, Germany
- Wellcome Trust Sanger Institute, Wellcome Genome Campus, Hinxton, Cambridgeshire, UK
| | | | - Carlo De Donno
- Helmholtz Center Munich - German Research Center for Environmental Health, Institute of Computational Biology, Munich, Germany
- School of Life Sciences Weihenstephan, Technical University of Munich, Munich, Germany
| | - Leon Hetzel
- Helmholtz Center Munich - German Research Center for Environmental Health, Institute of Computational Biology, Munich, Germany
- Department of Mathematics, Technical University of Munich, Munich, Germany
| | - Yuge Ji
- Helmholtz Center Munich - German Research Center for Environmental Health, Institute of Computational Biology, Munich, Germany
- School of Life Sciences Weihenstephan, Technical University of Munich, Munich, Germany
| | - Ignacio L Ibarra
- Helmholtz Center Munich - German Research Center for Environmental Health, Institute of Computational Biology, Munich, Germany
| | - Sanjay R Srivatsan
- Department of Genome Sciences, University of Washington, Seattle, WA, USA
| | | | - Riza M Daza
- Department of Genome Sciences, University of Washington, Seattle, WA, USA
| | - Beth Martin
- Department of Genome Sciences, University of Washington, Seattle, WA, USA
| | - Jay Shendure
- Department of Genome Sciences, University of Washington, Seattle, WA, USA
- Howard Hughes Medical Institute, Seattle, WA, USA
- Brotman Baty Institute for Precision Medicine, Seattle, WA, USA
- Allen Discovery Center for Cell Lineage Tracing, Seattle, WA, USA
| | | | - Pierre Boyeau
- Department of Electrical Engineering and Computer Sciences, University of California, Berkeley, CA, USA
| | - F Alexander Wolf
- Helmholtz Center Munich - German Research Center for Environmental Health, Institute of Computational Biology, Munich, Germany
| | | | - Stephan Günnemann
- Department of Computer Science, Technical University of Munich, Munich, Germany
| | - Cole Trapnell
- Department of Genome Sciences, University of Washington, Seattle, WA, USA
- Brotman Baty Institute for Precision Medicine, Seattle, WA, USA
- Allen Discovery Center for Cell Lineage Tracing, Seattle, WA, USA
| | - David Lopez-Paz
- Wellcome Trust Sanger Institute, Wellcome Genome Campus, Hinxton, Cambridgeshire, UK
| | - Fabian J Theis
- Helmholtz Center Munich - German Research Center for Environmental Health, Institute of Computational Biology, Munich, Germany
- Wellcome Trust Sanger Institute, Wellcome Genome Campus, Hinxton, Cambridgeshire, UK
- School of Life Sciences Weihenstephan, Technical University of Munich, Munich, Germany
- Department of Mathematics, Technical University of Munich, Munich, Germany
|
19
|
Dou Z, Sun Y, Jiang X, Wu X, Li Y, Gong B, Wang L. Data-driven strategies for the computational design of enzyme thermal stability: trends, perspectives, and prospects. Acta Biochim Biophys Sin (Shanghai) 2023; 55:343-355. [PMID: 37143326 PMCID: PMC10160227 DOI: 10.3724/abbs.2023033] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/06/2022] [Accepted: 11/23/2022] [Indexed: 03/05/2023] Open
Abstract
Thermal stability is one of the most important properties of enzymes, which sustains life and determines the potential for the industrial application of biocatalysts. Although traditional methods such as directed evolution and classical rational design contribute greatly to this field, the enormous sequence space of proteins implies costly and arduous experiments. The development of enzyme engineering focuses on automated and efficient strategies because of the breakthrough of high-throughput DNA sequencing and machine learning models. In this review, we propose a data-driven architecture for enzyme thermostability engineering and summarize some widely adopted datasets, as well as machine learning-driven approaches for designing the thermal stability of enzymes. In addition, we present a series of existing challenges while applying machine learning in enzyme thermostability design, such as the data dilemma, model training, and use of the proposed models. Additionally, a few promising directions for enhancing the performance of the models are discussed. We anticipate that the efficient incorporation of machine learning can provide more insights and solutions for the design of enzyme thermostability in the coming years.
Affiliation(s)
- Zhixin Dou
- State Key Laboratory of Microbial Technology, Shandong University, Qingdao 266237, China
| | - Yuqing Sun
- School of Software, Shandong University, Jinan 250101, China
| | - Xukai Jiang
- National Glycoengineering Research Center, Shandong University, Qingdao 266237, China
| | - Xiuyun Wu
- State Key Laboratory of Microbial Technology, Shandong University, Qingdao 266237, China
| | - Yingjie Li
- State Key Laboratory of Microbial Technology, Shandong University, Qingdao 266237, China
| | - Bin Gong
- School of Software, Shandong University, Jinan 250101, China
| | - Lushan Wang
- State Key Laboratory of Microbial Technology, Shandong University, Qingdao 266237, China
|
20
|
Jirsa V, Wang H, Triebkorn P, Hashemi M, Jha J, Gonzalez-Martinez J, Guye M, Makhalova J, Bartolomei F. Personalised virtual brain models in epilepsy. Lancet Neurol 2023; 22:443-454. [PMID: 36972720 DOI: 10.1016/s1474-4422(23)00008-x] [Citation(s) in RCA: 49] [Impact Index Per Article: 24.5] [Received: 08/01/2021] [Revised: 12/20/2022] [Accepted: 01/04/2023] [Indexed: 03/29/2023]
Abstract
Individuals with drug-resistant focal epilepsy are candidates for surgical treatment as a curative option. Before surgery can take place, the patient must have a presurgical evaluation to establish whether and how surgical treatment might stop their seizures without causing neurological deficits. Virtual brains are a new digital modelling technology that maps the brain network of a person with epilepsy, using data derived from MRI. This technique produces a computer simulation of seizures and brain imaging signals, such as those that would be recorded with intracranial EEG. When combined with machine learning, virtual brains can be used to estimate the extent and organisation of the epileptogenic zone (ie, the brain regions related to seizure generation and the spatiotemporal dynamics during seizure onset). Virtual brains could, in the future, be used for clinical decision making, to improve precision in localisation of seizure activity, and for surgical planning, but at the moment these models have some limitations, such as low spatial resolution. As evidence accumulates in support of the predictive power of personalised virtual brain models, and as methods are tested in clinical trials, virtual brains might inform clinical practice in the near future.
Affiliation(s)
- Viktor Jirsa
- Institut National de la Santé et de la Recherche Médicale, Institut de Neurosciences des Systèmes (INS) UMR1106, Aix Marseille Université, Marseille, France.
| | - Huifang Wang
- Institut National de la Santé et de la Recherche Médicale, Institut de Neurosciences des Systèmes (INS) UMR1106, Aix Marseille Université, Marseille, France
| | - Paul Triebkorn
- Institut National de la Santé et de la Recherche Médicale, Institut de Neurosciences des Systèmes (INS) UMR1106, Aix Marseille Université, Marseille, France
| | - Meysam Hashemi
- Institut National de la Santé et de la Recherche Médicale, Institut de Neurosciences des Systèmes (INS) UMR1106, Aix Marseille Université, Marseille, France
| | - Jayant Jha
- Institut National de la Santé et de la Recherche Médicale, Institut de Neurosciences des Systèmes (INS) UMR1106, Aix Marseille Université, Marseille, France
| | | | - Maxime Guye
- Centre National de la Recherche Scientifique, Center for Magnetic Resonance in Biology and Medicine, Aix Marseille Université, Marseille, France; Centre d'Exploration Métabolique par Résonance Magnétique, Assistance Publique - Hôpitaux de Marseille, La Timone University Hospital, Marseille, France
| | - Julia Makhalova
- Centre National de la Recherche Scientifique, Center for Magnetic Resonance in Biology and Medicine, Aix Marseille Université, Marseille, France; Centre d'Exploration Métabolique par Résonance Magnétique, Assistance Publique - Hôpitaux de Marseille, La Timone University Hospital, Marseille, France; Epileptology and Clinical Neurophysiology Department, Assistance Publique - Hôpitaux de Marseille, La Timone University Hospital, Marseille, France
| | - Fabrice Bartolomei
- Institut National de la Santé et de la Recherche Médicale, Institut de Neurosciences des Systèmes (INS) UMR1106, Aix Marseille Université, Marseille, France; Epileptology and Clinical Neurophysiology Department, Assistance Publique - Hôpitaux de Marseille, La Timone University Hospital, Marseille, France
|
21
|
Brombacher E, Hackenberg M, Kreutz C, Binder H, Treppner M. The performance of deep generative models for learning joint embeddings of single-cell multi-omics data. Front Mol Biosci 2022; 9:962644. [PMID: 36387277 PMCID: PMC9643784 DOI: 10.3389/fmolb.2022.962644] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.7] [Received: 06/06/2022] [Accepted: 10/12/2022] [Indexed: 11/07/2023] Open
Abstract
Recent extensions of single-cell studies to multiple data modalities raise new questions regarding experimental design. For example, the challenge of sparsity in single-omics data might be partly resolved by compensating for missing information across modalities. In particular, deep learning approaches, such as deep generative models (DGMs), can potentially uncover complex patterns via a joint embedding. Yet, this also raises the question of sample size requirements for identifying such patterns from single-cell multi-omics data. Here, we empirically examine the quality of DGM-based integrations for varying sample sizes. We first review the existing literature and give a short overview of deep learning methods for multi-omics integration. Next, we consider eight popular tools in more detail and examine their robustness to different cell numbers, covering two of the most common multi-omics types currently favored. Specifically, we use data featuring simultaneous gene expression measurements at the RNA level and protein abundance measurements for cell surface proteins (CITE-seq), as well as data where chromatin accessibility and RNA expression are measured in thousands of cells (10x Multiome). We examine the ability of the methods to learn joint embeddings based on biological and technical metrics. Finally, we provide recommendations for the design of multi-omics experiments and discuss potential future developments.
Affiliation(s)
- Eva Brombacher
- Institute of Medical Biometry and Statistics, Faculty of Medicine and Medical Center - University of Freiburg, Freiburg, Germany
- Freiburg Center for Data Analysis and Modeling, University of Freiburg, Freiburg, Germany
- Spemann Graduate School of Biology and Medicine (SGBM), University of Freiburg, Freiburg, Germany
- Centre for Integrative Biological Signaling Studies (CIBSS), University of Freiburg, Freiburg, Germany
- Faculty of Biology, University of Freiburg, Freiburg, Germany
| | - Maren Hackenberg
- Institute of Medical Biometry and Statistics, Faculty of Medicine and Medical Center - University of Freiburg, Freiburg, Germany
- Freiburg Center for Data Analysis and Modeling, University of Freiburg, Freiburg, Germany
| | - Clemens Kreutz
- Institute of Medical Biometry and Statistics, Faculty of Medicine and Medical Center - University of Freiburg, Freiburg, Germany
- Freiburg Center for Data Analysis and Modeling, University of Freiburg, Freiburg, Germany
- Centre for Integrative Biological Signaling Studies (CIBSS), University of Freiburg, Freiburg, Germany
| | - Harald Binder
- Institute of Medical Biometry and Statistics, Faculty of Medicine and Medical Center - University of Freiburg, Freiburg, Germany
- Freiburg Center for Data Analysis and Modeling, University of Freiburg, Freiburg, Germany
| | - Martin Treppner
- Institute of Medical Biometry and Statistics, Faculty of Medicine and Medical Center - University of Freiburg, Freiburg, Germany
- Freiburg Center for Data Analysis and Modeling, University of Freiburg, Freiburg, Germany
| |
Collapse
|
22
|
Yeo HC, Selvarajoo K. Machine learning alternative to systems biology should not solely depend on data. Brief Bioinform 2022; 23:6731718. [PMID: 36184188 PMCID: PMC9677488 DOI: 10.1093/bib/bbac436] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/06/2022] [Revised: 08/24/2022] [Accepted: 09/09/2022] [Indexed: 12/14/2022] Open
Abstract
In recent years, artificial intelligence (AI)/machine learning has emerged as a plausible alternative to systems biology for the elucidation of biological phenomena and in attaining specified design objectives in synthetic biology. Although considered highly disruptive with numerous notable successes so far, we seek to bring attention to both the fundamental and practical pitfalls of their usage, especially in illuminating emergent behaviors from chaotic or stochastic systems in biology. Without deliberating on their suitability and the required data qualities and pre-processing approaches beforehand, the research and development community could experience similar 'AI winters' that have plagued other fields. Instead, we anticipate the integration or combination of the two approaches, where appropriate, moving forward.
Affiliation(s)
- Hock Chuan Yeo
  - Bioinformatics Institute (BII), Agency for Science, Technology and Research (A*STAR), Singapore
|
23
|
Treppner M, Binder H, Hess M. Interpretable generative deep learning: an illustration with single cell gene expression data. Hum Genet 2022; 141:1481-1498. [PMID: 34988661 PMCID: PMC9360114 DOI: 10.1007/s00439-021-02417-6] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Received: 05/14/2021] [Accepted: 08/06/2021] [Indexed: 11/26/2022]
Abstract
Deep generative models can learn the underlying structure, such as pathways or gene programs, from omics data. We provide an introduction as well as an overview of such techniques, specifically illustrating their use with single-cell gene expression data. For example, the low dimensional latent representations offered by various approaches, such as variational auto-encoders, are useful to get a better understanding of the relations between observed gene expressions and experimental factors or phenotypes. Furthermore, by providing a generative model for the latent and observed variables, deep generative models can generate synthetic observations, which allow us to assess the uncertainty in the learned representations. While deep generative models are useful to learn the structure of high-dimensional omics data by efficiently capturing non-linear dependencies between genes, they are sometimes difficult to interpret due to their neural network building blocks. More precisely, to understand the relationship between learned latent variables and observed variables, e.g., gene transcript abundances and external phenotypes, is difficult. Therefore, we also illustrate current approaches that allow us to infer the relationship between learned latent variables and observed variables as well as external phenotypes. Thereby, we render deep learning approaches more interpretable. In an application with single-cell gene expression data, we demonstrate the utility of the discussed methods.
Affiliation(s)
- Martin Treppner
  - Institute of Medical Biometry and Statistics, Faculty of Medicine and Medical Center, University of Freiburg, Stefan-Meier-Str. 26, Freiburg, 79104, Germany
- Harald Binder
  - Freiburg Center for Data Analysis and Modeling, University of Freiburg, Freiburg, 79104, Germany
- Moritz Hess
  - Freiburg Center for Data Analysis and Modeling, University of Freiburg, Freiburg, 79104, Germany
|
24
|
Lopez R, Li B, Keren-Shaul H, Boyeau P, Kedmi M, Pilzer D, Jelinski A, Yofe I, David E, Wagner A, Ergen C, Addadi Y, Golani O, Ronchese F, Jordan MI, Amit I, Yosef N. DestVI identifies continuums of cell types in spatial transcriptomics data. Nat Biotechnol 2022; 40:1360-1369. [PMID: 35449415 PMCID: PMC9756396 DOI: 10.1038/s41587-022-01272-8] [Citation(s) in RCA: 103] [Impact Index Per Article: 34.3] [Received: 05/10/2021] [Accepted: 03/07/2022] [Indexed: 11/09/2022]
Abstract
Most spatial transcriptomics technologies are limited by their resolution, with spot sizes larger than that of a single cell. Although joint analysis with single-cell RNA sequencing can alleviate this problem, current methods are limited to assessing discrete cell types, revealing the proportion of cell types inside each spot. To identify continuous variation of the transcriptome within cells of the same type, we developed Deconvolution of Spatial Transcriptomics profiles using Variational Inference (DestVI). Using simulations, we demonstrate that DestVI outperforms existing methods for estimating gene expression for every cell type inside every spot. Applied to a study of infected lymph nodes and of a mouse tumor model, DestVI provides high-resolution, accurate spatial characterization of the cellular organization of these tissues and identifies cell-type-specific changes in gene expression between different tissue regions or between conditions. DestVI is available as part of the open-source software package scvi-tools (https://scvi-tools.org).
Affiliation(s)
- Romain Lopez
  - Department of Electrical Engineering and Computer Sciences, University of California, Berkeley, Berkeley CA, USA
- Baoguo Li
  - Department of Immunology, Weizmann Institute of Science, Rehovot, Israel
- Hadas Keren-Shaul
  - Department of Life Sciences Core Facilities, Weizmann Institute of Science, Rehovot, Israel
- Pierre Boyeau
  - Department of Electrical Engineering and Computer Sciences, University of California, Berkeley, Berkeley CA, USA
- Merav Kedmi
  - Department of Life Sciences Core Facilities, Weizmann Institute of Science, Rehovot, Israel
- David Pilzer
  - Department of Life Sciences Core Facilities, Weizmann Institute of Science, Rehovot, Israel
- Adam Jelinski
  - Department of Immunology, Weizmann Institute of Science, Rehovot, Israel
- Ido Yofe
  - Department of Immunology, Weizmann Institute of Science, Rehovot, Israel
- Eyal David
  - Department of Immunology, Weizmann Institute of Science, Rehovot, Israel
- Allon Wagner
  - Department of Electrical Engineering and Computer Sciences, University of California, Berkeley, Berkeley CA, USA
- Can Ergen
  - Department of Electrical Engineering and Computer Sciences, University of California, Berkeley, Berkeley CA, USA
- Yoseph Addadi
  - Department of Life Sciences Core Facilities, Weizmann Institute of Science, Rehovot, Israel
- Ofra Golani
  - Department of Life Sciences Core Facilities, Weizmann Institute of Science, Rehovot, Israel
- Franca Ronchese
  - Malaghan Institute of Medical Research, Wellington, New Zealand
- Michael I Jordan
  - Department of Immunology, Weizmann Institute of Science, Rehovot, Israel
  - Department of Statistics, University of California, Berkeley, Berkeley CA, USA
- Ido Amit
  - Department of Immunology, Weizmann Institute of Science, Rehovot, Israel
- Nir Yosef
  - Department of Electrical Engineering and Computer Sciences, University of California, Berkeley, Berkeley CA, USA
  - Center for Computational Biology, University of California, Berkeley, Berkeley CA, USA
  - Chan Zuckerberg Biohub, San Francisco CA, USA
  - Ragon Institute of MGH, MIT and Harvard, Cambridge MA, USA
|
25
|
Martinelli DD. Generative machine learning for de novo drug discovery: A systematic review. Comput Biol Med 2022; 145:105403. [PMID: 35339849 DOI: 10.1016/j.compbiomed.2022.105403] [Citation(s) in RCA: 39] [Impact Index Per Article: 13.0] [Received: 01/14/2022] [Revised: 03/10/2022] [Accepted: 03/11/2022] [Indexed: 02/08/2023]
Abstract
Recent research on artificial intelligence indicates that machine learning algorithms can auto-generate novel drug-like molecules. Generative models have revolutionized de novo drug discovery, rendering the explorative process more efficient. Several model frameworks and input formats have been proposed to enhance the performance of intelligent algorithms in generative molecular design. In this systematic literature review of experimental articles and reviews over the last five years, machine learning models, challenges associated with computational molecule design along with proposed solutions, and molecular encoding methods are discussed. A query-based search of the PubMed, ScienceDirect, Springer, Wiley Online Library, arXiv, MDPI, bioRxiv, and IEEE Xplore databases yielded 87 studies. Twelve additional studies were identified via citation searching. Of the articles in which machine learning was implemented, six prominent algorithms were identified: long short-term memory recurrent neural networks (LSTM-RNNs), variational autoencoders (VAEs), generative adversarial networks (GANs), adversarial autoencoders (AAEs), evolutionary algorithms, and gated recurrent unit (GRU-RNNs). Furthermore, eight central challenges were designated: homogeneity of generated molecular libraries, deficient synthesizability, limited assay data, model interpretability, incapacity for multi-property optimization, incomparability, restricted molecule size, and uncertainty in model evaluation. Molecules were encoded either as strings, which were occasionally augmented using randomization, as 2D graphs, or as 3D graphs. Statistical analysis and visualization are performed to illustrate how approaches to machine learning in de novo drug design have evolved over the past five years. Finally, future opportunities and reservations are discussed.
|
26
|
Benegas G, Fischer J, Song YS. Robust and annotation-free analysis of alternative splicing across diverse cell types in mice. eLife 2022; 11:73520. [PMID: 35229721 PMCID: PMC8975553 DOI: 10.7554/elife.73520] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 09/01/2021] [Accepted: 02/27/2022] [Indexed: 11/13/2022] Open
Abstract
Although alternative splicing is a fundamental and pervasive aspect of gene expression in higher eukaryotes, it is often omitted from single-cell studies due to quantification challenges inherent to commonly used short-read sequencing technologies. Here, we undertake the analysis of alternative splicing across numerous diverse murine cell types from two large-scale single-cell datasets, the Tabula Muris and BRAIN Initiative Cell Census Network, while accounting for understudied technical artifacts and unannotated events. We find strong and general cell-type-specific alternative splicing, complementary to total gene expression but of similar discriminatory value, and identify a large volume of novel splicing events. We specifically highlight splicing variation across different cell types in primary motor cortex neurons, bone marrow B cells, and various epithelial cells, and we show that the implicated transcripts include many genes which do not display total expression differences. To elucidate the regulation of alternative splicing, we build a custom predictive model based on splicing factor activity, recovering several known interactions while generating new hypotheses, including potential regulatory roles for novel alternative splicing events in critical genes like Khdrbs3 and Rbfox1. We make our results available using public interactive browsers to spur further exploration by the community.
Affiliation(s)
- Gonzalo Benegas
  - Graduate Group in Computational Biology, University of California, Berkeley, Berkeley, United States
- Jonathan Fischer
  - Department of Biostatistics, University of Florida, Gainesville, United States
- Yun S Song
  - Computer Science Division, University of California, Berkeley, Berkeley, United States
|
27
|
Spatial components of molecular tissue biology. Nat Biotechnol 2022; 40:308-318. [PMID: 35132261 DOI: 10.1038/s41587-021-01182-1] [Citation(s) in RCA: 152] [Impact Index Per Article: 50.7] [Received: 12/17/2020] [Accepted: 12/03/2021] [Indexed: 02/06/2023]
Abstract
Methods for profiling RNA and protein expression in a spatially resolved manner are rapidly evolving, making it possible to comprehensively characterize cells and tissues in health and disease. To maximize the biological insights obtained using these techniques, it is critical to both clearly articulate the key biological questions in spatial analysis of tissues and develop the requisite computational tools to address them. Developers of analytical tools need to decide on the intrinsic molecular features of each cell that need to be considered, and how cell shape and morphological features are incorporated into the analysis. Also, optimal ways to compare different tissue samples at various length scales are still being sought. Grouping these biological problems and related computational algorithms into classes across length scales, thus characterizing common issues that need to be addressed, will facilitate further progress in spatial transcriptomics and proteomics.
|
28
|
Gayoso A, Lopez R, Xing G, Boyeau P, Valiollah Pour Amiri V, Hong J, Wu K, Jayasuriya M, Mehlman E, Langevin M, Liu Y, Samaran J, Misrachi G, Nazaret A, Clivio O, Xu C, Ashuach T, Gabitto M, Lotfollahi M, Svensson V, da Veiga Beltrame E, Kleshchevnikov V, Talavera-López C, Pachter L, Theis FJ, Streets A, Jordan MI, Regier J, Yosef N. A Python library for probabilistic analysis of single-cell omics data. Nat Biotechnol 2022; 40:163-166. [DOI: 10.1038/s41587-021-01206-w] [Citation(s) in RCA: 37] [Impact Index Per Article: 12.3] [Indexed: 02/06/2023]
|
29
|
Greener JG, Kandathil SM, Moffat L, Jones DT. A guide to machine learning for biologists. Nat Rev Mol Cell Biol 2022; 23:40-55. [PMID: 34518686 DOI: 10.1038/s41580-021-00407-0] [Citation(s) in RCA: 790] [Impact Index Per Article: 263.3] [Accepted: 07/23/2021] [Indexed: 02/08/2023]
Abstract
The expanding scale and inherent complexity of biological data have encouraged a growing use of machine learning in biology to build informative and predictive models of the underlying biological processes. All machine learning techniques fit models to data; however, the specific methods are quite varied and can at first glance seem bewildering. In this Review, we aim to provide readers with a gentle introduction to a few key machine learning techniques, including the most recently developed and widely used techniques involving deep neural networks. We describe how different techniques may be suited to specific types of biological data, and also discuss some best practices and points to consider when one is embarking on experiments involving machine learning. Some emerging directions in machine learning methodology are also discussed.
Affiliation(s)
- Joe G Greener
  - Department of Computer Science, University College London, London, UK
- Shaun M Kandathil
  - Department of Computer Science, University College London, London, UK
- Lewis Moffat
  - Department of Computer Science, University College London, London, UK
- David T Jones
  - Department of Computer Science, University College London, London, UK
|
30
|
Interpretable Autoencoders Trained on Single Cell Sequencing Data Can Transfer Directly to Data from Unseen Tissues. Cells 2021; 11:cells11010085. [PMID: 35011647 PMCID: PMC8750521 DOI: 10.3390/cells11010085] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 10/27/2021] [Revised: 12/17/2021] [Accepted: 12/21/2021] [Indexed: 01/04/2023] Open
Abstract
Autoencoders have been used to model single-cell mRNA-sequencing data with the purpose of denoising, visualization, data simulation, and dimensionality reduction. We, and others, have shown that autoencoders can be explainable models and interpreted in terms of biology. Here, we show that such autoencoders can generalize to the extent that they can transfer directly without additional training. In practice, we can extract biological modules, denoise, and classify data correctly from an autoencoder that was trained on a different dataset and with different cells (a foreign model). We deconvoluted the biological signal encoded in the bottleneck layer of scRNA-models using saliency maps and mapped salient features to biological pathways. Biological concepts could be associated with specific nodes and interpreted in relation to biological pathways. Even in this unsupervised framework, with no prior information about cell types or labels, the specific biological pathways deduced from the model were in line with findings in previous research. It was hypothesized that autoencoders could learn and represent meaningful biology; here, we show with a systematic experiment that this is true and even transcends the training data. This means that carefully trained autoencoders can be used to assist the interpretation of new unseen data.
|
31
|
Kitano H. Nobel Turing Challenge: creating the engine for scientific discovery. NPJ Syst Biol Appl 2021; 7:29. [PMID: 34145287 PMCID: PMC8213706 DOI: 10.1038/s41540-021-00189-3] [Citation(s) in RCA: 34] [Impact Index Per Article: 8.5] [Received: 01/28/2021] [Accepted: 06/03/2021] [Indexed: 12/15/2022] Open
Abstract
Scientific discovery has long been one of the central driving forces in our civilization. It uncovered the principles of the world we live in, and enabled us to invent new technologies reshaping our society, cure diseases, explore unknown new frontiers, and hopefully lead us to build a sustainable society. Accelerating the speed of scientific discovery is therefore one of the most important endeavors. This requires an in-depth understanding of not only the subject areas but also the nature of scientific discoveries themselves. In other words, the "science of science" needs to be established, and has to be implemented using artificial intelligence (AI) systems to be practically executable. At the same time, what may be implemented by "AI Scientists" may not resemble the scientific process conducted by human scientists. It may be an alternative form of science that will break the limitations of current scientific practice, which is largely hampered by human cognitive limitations and sociological constraints. It could give rise to a human-AI hybrid form of science that shall bring systems biology and other sciences into the next stage. The Nobel Turing Challenge aims to develop a highly autonomous AI system that can perform top-level science, indistinguishable from the quality of that performed by the best human scientists, where some of the discoveries may be worthy of Nobel Prize level recognition and beyond.
Affiliation(s)
- Hiroaki Kitano
  - The Systems Biology Institute, Tokyo, Japan; Okinawa Institute of Science and Technology Graduate School, Okinawa, Japan; Sony Computer Science Laboratories, Inc., Tokyo, Japan; Sony AI, Inc., Tokyo, Japan; and The Alan Turing Institute, London, UK
|
32
|
Osadchy M, Kolodny R. How Deep Learning Tools Can Help Protein Engineers Find Good Sequences. J Phys Chem B 2021; 125:6440-6450. [PMID: 34105961 DOI: 10.1021/acs.jpcb.1c02449] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Indexed: 12/16/2022]
Abstract
The deep learning revolution introduced a new and efficacious way to address computational challenges in a wide range of fields, relying on large data sets and powerful computational resources. In protein engineering, we consider the challenge of computationally predicting properties of a protein and designing sequences with these properties. Indeed, accurate and fast deep network oracles for different properties of proteins have been developed. These learn to predict a property from an amino acid sequence by training on large sets of proteins that have this property. In particular, deep networks can learn from the set of all known protein sequences to identify ones that are protein-like. A fundamental challenge when engineering sequences that are both protein-like and satisfy a desired property is that these are rare instances within the vast space of all possible ones. When searching for these very rare instances, one would like to use good sampling procedures. Sampling approaches that are decoupled from the prediction of the property or in which the predictor uses only post-sampling to identify good instances are less efficient. The alternative is to use sampling methods that are geared to generate sequences satisfying and/or optimizing the predictor's desired properties. Deep learning has a class of architectures, denoted as generative models, which offer the capability of sampling from the learned distribution of a predicted property. Here, we review the use of deep learning tools to find good sequences for protein engineering, including developing oracles/predictors of a property of the proteins and methods that sample from a distribution of protein-like sequences to optimize the desired property.
Affiliation(s)
- Margarita Osadchy
  - Department of Computer Science, Jacobs Building, University of Haifa, 199 Aba Houshi Road, Mount Carmel, Haifa, Israel 3498838
- Rachel Kolodny
  - Department of Computer Science, Jacobs Building, University of Haifa, 199 Aba Houshi Road, Mount Carmel, Haifa, Israel 3498838
|
33
|
Yáñez Feliú G, Earle Gómez B, Codoceo Berrocal V, Muñoz Silva M, Nuñez IN, Matute TF, Arce Medina A, Vidal G, Vitalis C, Dahlin J, Federici F, Rudge TJ. Flapjack: Data Management and Analysis for Genetic Circuit Characterization. ACS Synth Biol 2021; 10:183-191. [PMID: 33382586 DOI: 10.1021/acssynbio.0c00554] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Indexed: 12/13/2022]
Abstract
Characterization is fundamental to the design, build, test, learn (DBTL) cycle for engineering synthetic genetic circuits. Components must be described in such a way as to account for their behavior in a range of contexts. Measurements and associated metadata, including part composition, constitute the test phase of the DBTL cycle. These data may consist of measurements of thousands of circuits, measured in hundreds of conditions, in multiple assays potentially performed in different laboratories and using different techniques. In order to inform the learn phase this large volume of data must be filtered, collated, and analyzed. Characterization consists of using this data to parametrize models of component function in different contexts, and combining them to predict behaviors of novel circuits. Tools to store, organize, share, and analyze large volumes of measurement and metadata are therefore essential to linking the test phase to the build and learn phases, closing the loop of the DBTL cycle. Here we present such a system, implemented as a web app with a backend data registry and analysis engine. An interactive frontend provides powerful querying, plotting, and analysis tools, and we provide a REST API and Python package for full integration with external build and learn software. All measurements are associated with circuit part composition via SBOL (Synthetic Biology Open Language). We demonstrate our tool by characterizing a range of genetic components and circuits according to composition and context.
Affiliation(s)
- Guillermo Yáñez Feliú
  - Department of Chemical and Bioprocess Engineering, School of Engineering, Pontificia Universidad Católica de Chile, Santiago 7820244, Chile
- Benjamín Earle Gómez
  - Institute for Biological and Medical Engineering, Schools of Engineering, Biology and Medicine, Pontificia Universidad Católica de Chile, Santiago 7820244, Chile
- Verner Codoceo Berrocal
  - Institute for Biological and Medical Engineering, Schools of Engineering, Biology and Medicine, Pontificia Universidad Católica de Chile, Santiago 7820244, Chile
- Macarena Muñoz Silva
  - Institute for Biological and Medical Engineering, Schools of Engineering, Biology and Medicine, Pontificia Universidad Católica de Chile, Santiago 7820244, Chile
- Isaac N Nuñez
  - Department of Chemical and Bioprocess Engineering, School of Engineering, Pontificia Universidad Católica de Chile, Santiago 7820244, Chile
  - Institute for Biological and Medical Engineering, Schools of Engineering, Biology and Medicine, Pontificia Universidad Católica de Chile, Santiago 7820244, Chile
  - ANID - Millennium Science Initiative Program - Millennium Institute for Integrative Biology (iBio), Pontificia Universidad Católica de Chile, Santiago 8330005, Chile
- Tamara F Matute
  - Department of Chemical and Bioprocess Engineering, School of Engineering, Pontificia Universidad Católica de Chile, Santiago 7820244, Chile
  - Institute for Biological and Medical Engineering, Schools of Engineering, Biology and Medicine, Pontificia Universidad Católica de Chile, Santiago 7820244, Chile
  - ANID - Millennium Science Initiative Program - Millennium Institute for Integrative Biology (iBio), Pontificia Universidad Católica de Chile, Santiago 8330005, Chile
- Anibal Arce Medina
  - ANID - Millennium Science Initiative Program - Millennium Institute for Integrative Biology (iBio), Pontificia Universidad Católica de Chile, Santiago 8330005, Chile
  - Departamento de Genética Molecular y Microbiología, Facultad de Ciencias Biológicas, Pontificia Universidad Católica de Chile, Santiago 8330005, Chile
- Gonzalo Vidal
  - Institute for Biological and Medical Engineering, Schools of Engineering, Biology and Medicine, Pontificia Universidad Católica de Chile, Santiago 7820244, Chile
- Carlos Vitalis
  - Institute for Biological and Medical Engineering, Schools of Engineering, Biology and Medicine, Pontificia Universidad Católica de Chile, Santiago 7820244, Chile
- Jonathan Dahlin
  - The Novo Nordisk Foundation Center for Biosustainability, Technical University of Denmark, 2800 Kgs. Lyngby, Denmark
- Fernán Federici
  - Institute for Biological and Medical Engineering, Schools of Engineering, Biology and Medicine, Pontificia Universidad Católica de Chile, Santiago 7820244, Chile
  - ANID - Millennium Science Initiative Program - Millennium Institute for Integrative Biology (iBio), Pontificia Universidad Católica de Chile, Santiago 8330005, Chile
  - FONDAP, Center for Genome Regulation, Pontificia Universidad Católica de Chile, Santiago 8330005, Chile
- Timothy J Rudge
  - Department of Chemical and Bioprocess Engineering, School of Engineering, Pontificia Universidad Católica de Chile, Santiago 7820244, Chile
  - Institute for Biological and Medical Engineering, Schools of Engineering, Biology and Medicine, Pontificia Universidad Católica de Chile, Santiago 7820244, Chile
|
34
|
Kell DB, Samanta S, Swainston N. Deep learning and generative methods in cheminformatics and chemical biology: navigating small molecule space intelligently. Biochem J 2020; 477:4559-4580. [PMID: 33290527 PMCID: PMC7733676 DOI: 10.1042/bcj20200781] [Citation(s) in RCA: 22] [Impact Index Per Article: 4.4] [Received: 10/02/2020] [Revised: 11/11/2020] [Accepted: 11/12/2020] [Indexed: 12/15/2022]
Abstract
The number of 'small' molecules that may be of interest to chemical biologists - chemical space - is enormous, but the fraction that have ever been made is tiny. Most strategies are discriminative, i.e. have involved 'forward' problems (have molecule, establish properties). However, we normally wish to solve the much harder generative or inverse problem (describe desired properties, find molecule). 'Deep' (machine) learning based on large-scale neural networks underpins technologies such as computer vision, natural language processing, driverless cars, and world-leading performance in games such as Go; it can also be applied to the solution of inverse problems in chemical biology. In particular, recent developments in deep learning admit the in silico generation of candidate molecular structures and the prediction of their properties, thereby allowing one to navigate (bio)chemical space intelligently. These methods are revolutionary but require an understanding of both (bio)chemistry and computer science to be exploited to best advantage. We give a high-level (non-mathematical) background to the deep learning revolution, and set out the crucial issue for chemical biology and informatics as a two-way mapping from the discrete nature of individual molecules to the continuous but high-dimensional latent representation that may best reflect chemical space. A variety of architectures can do this; we focus on a particular type known as variational autoencoders. We then provide some examples of recent successes of these kinds of approach, and a look towards the future.
Affiliation(s)
- Douglas B. Kell
  - Department of Biochemistry and Systems Biology, Institute of Systems, Molecular and Integrative Biology, Faculty of Health and Life Sciences, University of Liverpool, Crown St, Liverpool L69 7ZB, U.K.
  - Novo Nordisk Foundation Centre for Biosustainability, Technical University of Denmark, Building 220, Kemitorvet, 2800 Kgs. Lyngby, Denmark
- Soumitra Samanta
  - Department of Biochemistry and Systems Biology, Institute of Systems, Molecular and Integrative Biology, Faculty of Health and Life Sciences, University of Liverpool, Crown St, Liverpool L69 7ZB, U.K.
- Neil Swainston
  - Department of Biochemistry and Systems Biology, Institute of Systems, Molecular and Integrative Biology, Faculty of Health and Life Sciences, University of Liverpool, Crown St, Liverpool L69 7ZB, U.K.
|
35
|
Lopez R, Gayoso A, Yosef N. Enhancing scientific discoveries in molecular biology with deep generative models. Mol Syst Biol 2020; 16:e9198. [PMID: 32975352 PMCID: PMC7517326 DOI: 10.15252/msb.20199198] [Citation(s) in RCA: 35] [Impact Index Per Article: 7.0] [Received: 08/26/2019] [Revised: 04/10/2020] [Accepted: 07/09/2020] [Indexed: 12/15/2022] Open
Abstract
Generative models provide a well-established statistical framework for evaluating uncertainty and deriving conclusions from large data sets especially in the presence of noise, sparsity, and bias. Initially developed for computer vision and natural language processing, these models have been shown to effectively summarize the complexity that underlies many types of data and enable a range of applications including supervised learning tasks, such as assigning labels to images; unsupervised learning tasks, such as dimensionality reduction; and out-of-sample generation, such as de novo image synthesis. With this early success, the power of generative models is now being increasingly leveraged in molecular biology, with applications ranging from designing new molecules with properties of interest to identifying deleterious mutations in our genomes and to dissecting transcriptional variability between single cells. In this review, we provide a brief overview of the technical notions behind generative models and their implementation with deep learning techniques. We then describe several different ways in which these models can be utilized in practice, using several recent applications in molecular biology as examples.
Affiliation(s)
- Romain Lopez
  - Department of Electrical Engineering and Computer Sciences, University of California, Berkeley, CA, USA
- Adam Gayoso
  - Center for Computational Biology, University of California, Berkeley, CA, USA
- Nir Yosef
  - Department of Electrical Engineering and Computer Sciences, University of California, Berkeley, CA, USA
  - Center for Computational Biology, University of California, Berkeley, CA, USA
  - Chan-Zuckerberg Biohub, San Francisco, CA, USA
  - Ragon Institute of MGH, MIT, and Harvard, Cambridge, MA, USA
|