1
Qiu W, Dincer AB, Janizek JD, Celik S, Pittet MJ, Naxerova K, Lee SI. Deep profiling of gene expression across 18 human cancers. Nat Biomed Eng 2025; 9:333-355. [PMID: 39690287 DOI: 10.1038/s41551-024-01290-8] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 02/15/2022] [Accepted: 10/23/2024] [Indexed: 12/19/2024]
Abstract
Clinical and biological information in large datasets of gene expression across cancers could be tapped with unsupervised deep learning. However, difficulties associated with biological interpretability and methodological robustness have made this impractical. Here we describe an unsupervised deep-learning framework for the generation of low-dimensional latent spaces for gene-expression data from 50,211 transcriptomes across 18 human cancers. The framework, which we named DeepProfile, outperformed dimensionality-reduction methods with respect to biological interpretability and allowed us to unveil that genes that are universally important in defining latent spaces across cancer types control immune cell activation, whereas cancer-type-specific genes and pathways define molecular disease subtypes. By linking latent variables in DeepProfile to secondary characteristics of tumours, we discovered that mutation burden is closely associated with the expression of cell-cycle-related genes, and that the activity of biological pathways for DNA-mismatch repair and MHC class II antigen presentation are consistently associated with patient survival. We also found that tumour-associated macrophages are a source of survival-correlated MHC class II transcripts. Unsupervised learning can facilitate the discovery of biological insight from gene-expression data.
Affiliation(s)
- Wei Qiu
  - Paul G. Allen School of Computer Science and Engineering, University of Washington, Seattle, WA, USA
- Ayse B Dincer
  - Paul G. Allen School of Computer Science and Engineering, University of Washington, Seattle, WA, USA
- Joseph D Janizek
  - Paul G. Allen School of Computer Science and Engineering, University of Washington, Seattle, WA, USA
  - Medical Scientist Training Program, University of Washington, Seattle, WA, USA
- Safiye Celik
  - Recursion Pharmaceuticals, Salt Lake City, UT, USA
- Mikael J Pittet
  - Department of Pathology and Immunology, University of Geneva, Geneva, Switzerland
  - Ludwig Institute for Cancer Research, Lausanne Branch, Lausanne, Switzerland
  - Department of Oncology, Geneva University Hospitals, Geneva, Switzerland
  - AGORA Cancer Research Center and Swiss Cancer Center Leman, Lausanne, Switzerland
- Kamila Naxerova
  - Department of Genetics, Harvard Medical School, Boston, MA, USA
  - Center for Systems Biology, Massachusetts General Hospital and Harvard Medical School, Boston, MA, USA
- Su-In Lee
  - Paul G. Allen School of Computer Science and Engineering, University of Washington, Seattle, WA, USA
2
Abdill RJ, Graham SP, Rubinetti V, Ahmadian M, Hicks P, Chetty A, McDonald D, Ferretti P, Gibbons E, Rossi M, Krishnan A, Albert FW, Greene CS, Davis S, Blekhman R. Integration of 168,000 samples reveals global patterns of the human gut microbiome. Cell 2025; 188:1100-1118.e17. [PMID: 39848248 PMCID: PMC11848717 DOI: 10.1016/j.cell.2024.12.017] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/02/2023] [Revised: 09/09/2024] [Accepted: 12/13/2024] [Indexed: 01/25/2025]
Abstract
The factors shaping human microbiome variation are a major focus of biomedical research. While other fields have used large sequencing compendia to extract insights requiring otherwise impractical sample sizes, the microbiome field has lacked a comparably sized resource for the 16S rRNA gene amplicon sequencing commonly used to quantify microbiome composition. To address this gap, we processed 168,464 publicly available human gut microbiome samples with a uniform pipeline. We use this compendium to evaluate geographic and technical effects on microbiome variation. We find that regions such as Central and Southern Asia differ significantly from the more thoroughly characterized microbiomes of Europe and Northern America and that composition alone can be used to predict a sample's region of origin. We also find strong associations between microbiome variation and technical factors such as primers and DNA extraction. We anticipate this growing work, the Human Microbiome Compendium, will enable advanced applied and methodological research.
Affiliation(s)
- Richard J Abdill
  - Section of Genetic Medicine, Department of Medicine, University of Chicago, Chicago, IL, USA
- Samantha P Graham
  - Department of Genetics, Cell Biology, and Development, University of Minnesota, Minneapolis, MN, USA
- Vincent Rubinetti
  - Department of Biomedical Informatics, University of Colorado School of Medicine, Aurora, CO, USA; Center for Health Artificial Intelligence (CHAI), University of Colorado School of Medicine, Aurora, CO, USA
- Mansooreh Ahmadian
  - Department of Biostatistics and Informatics, University of Colorado Anschutz Medical Campus, School of Public Health, Aurora, CO, USA
- Parker Hicks
  - Department of Biomedical Informatics, University of Colorado School of Medicine, Aurora, CO, USA
- Ashwin Chetty
  - Section of Genetic Medicine, Department of Medicine, University of Chicago, Chicago, IL, USA
- Daniel McDonald
  - Department of Pediatrics, School of Medicine, University of California, San Diego, La Jolla, CA, USA
- Pamela Ferretti
  - Section of Genetic Medicine, Department of Medicine, University of Chicago, Chicago, IL, USA
- Elizabeth Gibbons
  - Section of Genetic Medicine, Department of Medicine, University of Chicago, Chicago, IL, USA
- Marco Rossi
  - Section of Genetic Medicine, Department of Medicine, University of Chicago, Chicago, IL, USA
- Arjun Krishnan
  - Department of Biomedical Informatics, University of Colorado School of Medicine, Aurora, CO, USA; Department of Biostatistics and Informatics, University of Colorado Anschutz Medical Campus, School of Public Health, Aurora, CO, USA
- Frank W Albert
  - Department of Genetics, Cell Biology, and Development, University of Minnesota, Minneapolis, MN, USA
- Casey S Greene
  - Department of Biomedical Informatics, University of Colorado School of Medicine, Aurora, CO, USA; Center for Health Artificial Intelligence (CHAI), University of Colorado School of Medicine, Aurora, CO, USA
- Sean Davis
  - Department of Biomedical Informatics, University of Colorado School of Medicine, Aurora, CO, USA; Center for Health Artificial Intelligence (CHAI), University of Colorado School of Medicine, Aurora, CO, USA
- Ran Blekhman
  - Section of Genetic Medicine, Department of Medicine, University of Chicago, Chicago, IL, USA
3
Qiu W, Dincer AB, Janizek JD, Celik S, Pittet M, Naxerova K, Lee SI. A deep profile of gene expression across 18 human cancers. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2024:2024.03.17.585426. [PMID: 38559197 PMCID: PMC10980029 DOI: 10.1101/2024.03.17.585426] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 04/04/2024]
Abstract
Clinically and biologically valuable information may reside untapped in large cancer gene expression data sets. Deep unsupervised learning has the potential to extract this information with unprecedented efficacy but has thus far been hampered by a lack of biological interpretability and robustness. Here, we present DeepProfile, a comprehensive framework that addresses current challenges in applying unsupervised deep learning to gene expression profiles. We use DeepProfile to learn low-dimensional latent spaces for 18 human cancers from 50,211 transcriptomes. DeepProfile outperforms existing dimensionality reduction methods with respect to biological interpretability. Using DeepProfile interpretability methods, we show that genes that are universally important in defining the latent spaces across all cancer types control immune cell activation, while cancer type-specific genes and pathways define molecular disease subtypes. By linking DeepProfile latent variables to secondary tumor characteristics, we discover that tumor mutation burden is closely associated with the expression of cell cycle-related genes. DNA mismatch repair and MHC class II antigen presentation pathway expression, on the other hand, are consistently associated with patient survival. We validate these results through Kaplan-Meier analyses and nominate tumor-associated macrophages as an important source of survival-correlated MHC class II transcripts. Our results illustrate the power of unsupervised deep learning for discovery of cancer biology from existing gene expression data.
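The Kaplan-Meier analyses mentioned in this abstract rest on the product-limit estimator, S(t) = Π (1 − d_i / n_i) over event times t_i ≤ t, where d_i is the number of events at t_i and n_i the number of patients still at risk. The following is a minimal pure-Python sketch of that estimator only. The function name `kaplan_meier` and the toy follow-up data are hypothetical illustrations and are not taken from the DeepProfile code or data.

```python
def kaplan_meier(times, events):
    """Product-limit survival estimate.

    times  -- follow-up time for each patient
    events -- 1 if the event (death) was observed, 0 if censored
    Returns a list of (time, survival probability) at each event time.
    """
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        removed = 0
        # Group all patients who share the same follow-up time.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            removed += 1
            i += 1
        if deaths:
            survival *= 1 - deaths / n_at_risk
            curve.append((t, survival))
        # Both deaths and censored patients leave the risk set.
        n_at_risk -= removed
    return curve


# Hypothetical toy cohort: five patients, two of them censored.
curve = kaplan_meier([2, 3, 3, 5, 8], [1, 1, 0, 1, 0])
for t, s in curve:
    print(f"t={t}: S(t)={s:.2f}")
```

Comparing two such curves (for example, patients split by high versus low pathway expression) would additionally require a test such as the log-rank test, which this sketch does not implement.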
Affiliation(s)
- Wei Qiu
  - Paul G. Allen School of Computer Science & Engineering, University of Washington, Seattle, WA
- Ayse B. Dincer
  - Paul G. Allen School of Computer Science & Engineering, University of Washington, Seattle, WA
- Joseph D. Janizek
  - Paul G. Allen School of Computer Science & Engineering, University of Washington, Seattle, WA
  - Medical Scientist Training Program, University of Washington, Seattle, WA
- Mikael Pittet
  - Department of Pathology and Immunology, University of Geneva, Switzerland
  - Ludwig Institute for Cancer Research, Lausanne Branch, Switzerland
- Kamila Naxerova
  - Department of Genetics, Harvard Medical School, Boston, MA, USA
  - Center for Systems Biology, Massachusetts General Hospital and Harvard Medical School, Boston, Massachusetts, USA
- Su-In Lee
  - Paul G. Allen School of Computer Science & Engineering, University of Washington, Seattle, WA
4
Sastry AV, Yuan Y, Poudel S, Rychel K, Yoo R, Lamoureux CR, Li G, Burrows JT, Chauhan S, Haiman ZB, Al Bulushi T, Seif Y, Palsson BO, Zielinski DC. iModulonMiner and PyModulon: Software for unsupervised mining of gene expression compendia. PLoS Comput Biol 2024; 20:e1012546. [PMID: 39441835 PMCID: PMC11534266 DOI: 10.1371/journal.pcbi.1012546] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/06/2024] [Revised: 11/04/2024] [Accepted: 10/09/2024] [Indexed: 10/25/2024] Open
Abstract
Public gene expression databases are a rapidly expanding resource of organism responses to diverse perturbations, presenting both an opportunity and a challenge for bioinformatics workflows to extract actionable knowledge of transcription regulatory network function. Here, we introduce a five-step computational pipeline, called iModulonMiner, to compile, process, curate, analyze, and characterize the totality of RNA-seq data for a given organism or cell type. This workflow is centered around the data-driven computation of co-regulated gene sets using Independent Component Analysis, called iModulons, which have been shown to have broad applications. As a demonstration, we applied this workflow to generate the iModulon structure of Bacillus subtilis using all high-quality, publicly-available RNA-seq data. Using this structure, we predicted regulatory interactions for multiple transcription factors, identified groups of co-expressed genes that are putatively regulated by undiscovered transcription factors, and predicted properties of a recently discovered single-subunit phage RNA polymerase. We also present a Python package, PyModulon, with functions to characterize, visualize, and explore computed iModulons. The pipeline, available at https://github.com/SBRG/iModulonMiner, can be readily applied to diverse organisms to gain a rapid understanding of their transcriptional regulatory network structure and condition-specific activity.
Affiliation(s)
- Anand V. Sastry
  - Department of Bioengineering, University of California, San Diego, La Jolla, California, United States of America
- Yuan Yuan
  - Department of Bioengineering, University of California, San Diego, La Jolla, California, United States of America
- Saugat Poudel
  - Department of Bioengineering, University of California, San Diego, La Jolla, California, United States of America
- Kevin Rychel
  - Department of Bioengineering, University of California, San Diego, La Jolla, California, United States of America
- Reo Yoo
  - Department of Bioengineering, University of California, San Diego, La Jolla, California, United States of America
- Cameron R. Lamoureux
  - Department of Bioengineering, University of California, San Diego, La Jolla, California, United States of America
- Gaoyuan Li
  - Department of Bioengineering, University of California, San Diego, La Jolla, California, United States of America
- Joshua T. Burrows
  - Department of Bioengineering, University of California, San Diego, La Jolla, California, United States of America
- Siddharth Chauhan
  - Department of Bioengineering, University of California, San Diego, La Jolla, California, United States of America
- Zachary B. Haiman
  - Department of Bioengineering, University of California, San Diego, La Jolla, California, United States of America
- Tahani Al Bulushi
  - Department of Bioengineering, University of California, San Diego, La Jolla, California, United States of America
- Yara Seif
  - Department of Bioengineering, University of California, San Diego, La Jolla, California, United States of America
- Bernhard O. Palsson
  - Department of Bioengineering, University of California, San Diego, La Jolla, California, United States of America
  - Bioinformatics and Systems Biology Program, University of California, San Diego, La Jolla, California, United States of America
  - Department of Pediatrics, University of California, San Diego, La Jolla, California, United States of America
  - Novo Nordisk Foundation Center for Biosustainability, Technical University of Denmark, Kemitorvet, Kongens, Lyngby, Denmark
- Daniel C. Zielinski
  - Department of Bioengineering, University of California, San Diego, La Jolla, California, United States of America
5
Joubbi S, Micheli A, Milazzo P, Maccari G, Ciano G, Cardamone D, Medini D. Antibody design using deep learning: from sequence and structure design to affinity maturation. Brief Bioinform 2024; 25:bbae307. [PMID: 38960409 PMCID: PMC11221890 DOI: 10.1093/bib/bbae307] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/03/2024] [Revised: 05/20/2024] [Accepted: 06/12/2024] [Indexed: 07/05/2024] Open
Abstract
Deep learning has achieved impressive results in various fields such as computer vision and natural language processing, making it a powerful tool in biology. Its applications now encompass cellular image classification, genomic studies and drug discovery. While drug development traditionally focused deep learning applications on small molecules, recent innovations have incorporated it in the discovery and development of biological molecules, particularly antibodies. Researchers have devised novel techniques to streamline antibody development, combining in vitro and in silico methods. In particular, computational power expedites lead candidate generation, scaling and potential antibody development against complex antigens. This survey highlights significant advancements in protein design and optimization, specifically focusing on antibodies. This includes various aspects such as design, folding, antibody-antigen interactions docking and affinity maturation.
Affiliation(s)
- Sara Joubbi
  - Department of Computer Science, University of Pisa, Largo B. Pontecorvo, 3, 56127, Pisa, Italy
  - Data Science for Health (DaScH) Lab, Fondazione Toscana Life Sciences, Via Fiorentina, 1, 53100, Siena, Italy
- Alessio Micheli
  - Department of Computer Science, University of Pisa, Largo B. Pontecorvo, 3, 56127, Pisa, Italy
- Paolo Milazzo
  - Department of Computer Science, University of Pisa, Largo B. Pontecorvo, 3, 56127, Pisa, Italy
- Giuseppe Maccari
  - Data Science for Health (DaScH) Lab, Fondazione Toscana Life Sciences, Via Fiorentina, 1, 53100, Siena, Italy
- Giorgio Ciano
  - Data Science for Health (DaScH) Lab, Fondazione Toscana Life Sciences, Via Fiorentina, 1, 53100, Siena, Italy
- Dario Cardamone
  - Data Science for Health (DaScH) Lab, Fondazione Toscana Life Sciences, Via Fiorentina, 1, 53100, Siena, Italy
- Duccio Medini
  - Data Science for Health (DaScH) Lab, Fondazione Toscana Life Sciences, Via Fiorentina, 1, 53100, Siena, Italy
6
Kion-Crosby W, Barquist L. Network depth affects inference of gene sets from bacterial transcriptomes using denoising autoencoders. BIOINFORMATICS ADVANCES 2024; 4:vbae066. [PMID: 39027639 PMCID: PMC11256956 DOI: 10.1093/bioadv/vbae066] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/01/2024] [Revised: 04/05/2024] [Accepted: 05/02/2024] [Indexed: 07/20/2024]
Abstract
Summary
The increasing number of publicly available bacterial gene expression data sets provides an unprecedented resource for the study of gene regulation in diverse conditions, but emphasizes the need for self-supervised methods for the automated generation of new hypotheses. One approach for inferring coordinated regulation from bacterial expression data is through neural networks known as denoising autoencoders (DAEs) which encode large datasets in a reduced bottleneck layer. We have generalized this application of DAEs to include deep networks and explore the effects of network architecture on gene set inference using deep learning. We developed a DAE-based pipeline to extract gene sets from transcriptomic data in Escherichia coli, validate our method by comparing inferred gene sets with known pathways, and have used this pipeline to explore how the choice of network architecture impacts gene set recovery. We find that increasing network depth leads the DAEs to explain gene expression in terms of fewer, more concisely defined gene sets, and that adjusting the width results in a tradeoff between generalizability and biological inference. Finally, leveraging our understanding of the impact of DAE architecture, we apply our pipeline to an independent uropathogenic E. coli dataset to identify genes uniquely induced during human colonization.
Availability and implementation
https://github.com/BarquistLab/DAE_architecture_exploration
Affiliation(s)
- Willow Kion-Crosby
  - Helmholtz Institute for RNA-based Infection Research (HIRI)/Helmholtz Centre for Infection Research (HZI), 97080 Würzburg, Germany
  - Faculty of Medicine, University of Würzburg, 97080 Würzburg, Germany
- Lars Barquist
  - Helmholtz Institute for RNA-based Infection Research (HIRI)/Helmholtz Centre for Infection Research (HZI), 97080 Würzburg, Germany
  - Faculty of Medicine, University of Würzburg, 97080 Würzburg, Germany
  - Department of Biology, University of Toronto, Mississauga, ON L5L 1C6, Canada
7
Neff SL, Doing G, Reiter T, Hampton TH, Greene CS, Hogan DA. Pseudomonas aeruginosa transcriptome analysis of metal restriction in ex vivo cystic fibrosis sputum. Microbiol Spectr 2024; 12:e0315723. [PMID: 38385740 PMCID: PMC10986534 DOI: 10.1128/spectrum.03157-23] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/24/2023] [Accepted: 01/22/2024] [Indexed: 02/23/2024] Open
Abstract
Chronic Pseudomonas aeruginosa lung infections are a feature of cystic fibrosis (CF) that many patients experience even with the advent of highly effective modulator therapies. Identifying factors that impact P. aeruginosa in the CF lung could yield novel strategies to eradicate infection or otherwise improve outcomes. To complement published P. aeruginosa studies using laboratory models or RNA isolated from sputum, we analyzed transcripts of strain PAO1 after incubation in sputum from different CF donors prior to RNA extraction. We compared PAO1 gene expression in this "spike-in" sputum model to that for P. aeruginosa grown in synthetic cystic fibrosis sputum medium to determine key genes, which are among the most differentially expressed or most highly expressed. Using the key genes, gene sets with correlated expression were determined using the gene expression analysis tool eADAGE. Gene sets were used to analyze the activity of specific pathways in P. aeruginosa grown in sputum from different individuals. Gene sets that we found to be more active in sputum showed similar activation in published data that included P. aeruginosa RNA isolated from sputum relative to corresponding in vitro reference cultures. In the ex vivo samples, P. aeruginosa had increased levels of genes related to zinc and iron acquisition which were suppressed by metal amendment of sputum. We also found a significant correlation between expression of the H1-type VI secretion system and CFTR corrector use by the sputum donor. An ex vivo sputum model or synthetic sputum medium formulation that imposes metal restriction may enhance future CF-related studies.
IMPORTANCE
Identifying the gene expression programs used by Pseudomonas aeruginosa to colonize the lungs of people with cystic fibrosis (CF) will illuminate new therapeutic strategies. To capture these transcriptional programs, we cultured the common P. aeruginosa laboratory strain PAO1 in expectorated sputum from CF patient donors. Through bioinformatic analysis, we defined sets of genes that are more transcriptionally active in real CF sputum compared to a synthetic cystic fibrosis sputum medium. Many of the most differentially active gene sets contained genes related to metal acquisition, suggesting that these gene sets play an active role in scavenging for metals in the CF lung environment which may be inadequately represented in some models. Future studies of P. aeruginosa transcript abundance in CF may benefit from the use of an expectorated sputum model or media supplemented with factors that induce metal restriction.
Affiliation(s)
- Samuel L. Neff
  - Department of Microbiology and Immunology, Geisel School of Medicine at Dartmouth, Hanover, New Hampshire, USA
- Georgia Doing
  - Department of Microbiology and Immunology, Geisel School of Medicine at Dartmouth, Hanover, New Hampshire, USA
- Taylor Reiter
  - Department of Biomedical Informatics, University of Colorado School of Medicine, Aurora, Colorado, USA
- Thomas H. Hampton
  - Department of Microbiology and Immunology, Geisel School of Medicine at Dartmouth, Hanover, New Hampshire, USA
- Casey S. Greene
  - Department of Biomedical Informatics, University of Colorado School of Medicine, Aurora, Colorado, USA
- Deborah A. Hogan
  - Department of Microbiology and Immunology, Geisel School of Medicine at Dartmouth, Hanover, New Hampshire, USA
8
Wang J, Wan YW, Al-Ouran R, Huang M, Liu Z. CoRegNet: unraveling gene co-regulation networks from public RNA-Seq repositories using a beta-binomial statistical model. Brief Bioinform 2023; 25:bbad380. [PMID: 38113079 PMCID: PMC10729864 DOI: 10.1093/bib/bbad380] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/16/2023] [Revised: 09/13/2023] [Indexed: 12/21/2023] Open
Abstract
Millions of RNA sequencing samples have been deposited into public databases, providing a rich resource for biological research. These datasets encompass tens of thousands of experiments and offer comprehensive insights into human cellular regulation. However, a major challenge is how to integrate these experiments, which were acquired under different conditions. We propose a new statistical tool based on beta-binomial distributions that can construct a robust gene co-regulation network (CoRegNet) across tens of thousands of experiments. Our analysis of over 12 000 experiments involving human tissues and cells shows that CoRegNet significantly outperforms existing gene co-expression-based methods. Although the majority of the genes are linearly co-regulated, we did discover an interesting set of genes that are non-linearly co-regulated; half of the time they change in the same direction and the other half they change in the opposite direction. Additionally, we identified a set of gene pairs that follow Simpson's paradox. By utilizing public domain data, CoRegNet offers a powerful approach for identifying functionally related gene pairs, thereby revealing new biological insights.
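To make the Simpson's paradox mentioned in this abstract concrete, the following minimal sketch shows two hypothetical experiments in which a gene pair is perfectly positively correlated within each experiment, yet negatively correlated once the samples are pooled. The expression values and the helper name `pearson` are invented for illustration; this is not the CoRegNet beta-binomial model, which the abstract does not specify in detail.

```python
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical expression of gene X and gene Y in two experiments.
exp_a_x, exp_a_y = [1.0, 2.0, 3.0], [8.0, 9.0, 10.0]
exp_b_x, exp_b_y = [6.0, 7.0, 8.0], [1.0, 2.0, 3.0]

r_a = pearson(exp_a_x, exp_a_y)   # within experiment A
r_b = pearson(exp_b_x, exp_b_y)   # within experiment B
r_pooled = pearson(exp_a_x + exp_b_x, exp_a_y + exp_b_y)

print(f"within A: {r_a:+.2f}, within B: {r_b:+.2f}, pooled: {r_pooled:+.2f}")
```

The sign flip arises because the two experiments have different baseline expression levels, which is one reason that naively pooling samples from many experiments can mislead a co-expression analysis.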
Affiliation(s)
- Jiasheng Wang
  - Jan and Dan Duncan Neurological Research Institute at Texas Children’s Hospital, Houston, TX 77030, USA
  - Graduate Program in Quantitative and Computational Biosciences, Baylor College of Medicine, Houston, TX, 77030, USA
  - Department of Pediatrics, Baylor College of Medicine, Houston, TX 77030, USA
- Ying-Wooi Wan
  - Jan and Dan Duncan Neurological Research Institute at Texas Children’s Hospital, Houston, TX 77030, USA
  - Department of Molecular and Human Genetics, Baylor College of Medicine, Howard Hughes Medical Institute, Houston, TX 77030, USA
- Meichen Huang
  - Jan and Dan Duncan Neurological Research Institute at Texas Children’s Hospital, Houston, TX 77030, USA
  - Department of Neurology, Baylor College of Medicine, Houston, TX 77030, USA
- Zhandong Liu
  - Jan and Dan Duncan Neurological Research Institute at Texas Children’s Hospital, Houston, TX 77030, USA
  - Graduate Program in Quantitative and Computational Biosciences, Baylor College of Medicine, Houston, TX, 77030, USA
  - Department of Pediatrics, Baylor College of Medicine, Houston, TX 77030, USA
9
Esser-Skala W, Fortelny N. Reliable interpretability of biology-inspired deep neural networks. NPJ Syst Biol Appl 2023; 9:50. [PMID: 37816807 PMCID: PMC10564878 DOI: 10.1038/s41540-023-00310-8] [Citation(s) in RCA: 7] [Impact Index Per Article: 3.5] [Received: 03/22/2023] [Accepted: 09/15/2023] [Indexed: 10/12/2023] Open
Abstract
Deep neural networks display impressive performance but suffer from limited interpretability. Biology-inspired deep learning, where the architecture of the computational graph is based on biological knowledge, enables unique interpretability where real-world concepts are encoded in hidden nodes, which can be ranked by importance and thereby interpreted. In such models trained on single-cell transcriptomes, we previously demonstrated that node-level interpretations lack robustness upon repeated training and are influenced by biases in biological knowledge. Similar studies are missing for related models. Here, we test and extend our methodology for reliable interpretability in P-NET, a biology-inspired model trained on patient mutation data. We observe variability of interpretations and susceptibility to knowledge biases, and identify the network properties that drive interpretation biases. We further present an approach to control the robustness and biases of interpretations, which leads to more specific interpretations. In summary, our study reveals the broad importance of methods to ensure robust and bias-aware interpretability in biology-inspired deep learning.
Affiliation(s)
- Wolfgang Esser-Skala
  - Computational Systems Biology Group, Department of Biosciences and Medical Biology, University of Salzburg, Hellbrunner Straße 34, 5020, Salzburg, Austria
- Nikolaus Fortelny
  - Computational Systems Biology Group, Department of Biosciences and Medical Biology, University of Salzburg, Hellbrunner Straße 34, 5020, Salzburg, Austria
10
Kahl LJ, Stremmel N, Esparza-Mora MA, Wheatley RM, MacLean RC, Ralser M. Interkingdom interactions between Pseudomonas aeruginosa and Candida albicans affect clinical outcomes and antimicrobial responses. Curr Opin Microbiol 2023; 75:102368. [PMID: 37677865 DOI: 10.1016/j.mib.2023.102368] [Citation(s) in RCA: 18] [Impact Index Per Article: 9.0] [Received: 04/21/2023] [Revised: 07/24/2023] [Accepted: 07/24/2023] [Indexed: 09/09/2023]
Abstract
Infections that involve interkingdom microbial communities, such as those between bacteria and yeast pathogens, are difficult to treat, associated with worse patient outcomes, and may be a source of antimicrobial resistance. In this review, we address co-occurrence and co-infections of Candida albicans and Pseudomonas aeruginosa, two pathogens that occupy multiple infection niches in the human body, especially in immunocompromised patients. The interaction between the pathogen species influences microbe-host interactions, the effectiveness of antimicrobials and even infection outcomes, and may thus require adapted treatment strategies. However, the molecular details of bacteria-fungal interactions both inside and outside the infection sites, are insufficiently characterised. We argue that comprehensively understanding the P. aeruginosa-C. albicans interaction network through integrated systems biology approaches will capture the highly dynamic and complex nature of these polymicrobial infections and lead to a more comprehensive understanding of clinical observations such as reshaped immune defences and low antimicrobial treatment efficacy.
Affiliation(s)
- Lisa J Kahl
  - Charité Universitätsmedizin Berlin, Department of Biochemistry, 10117 Berlin, Germany
- Nina Stremmel
  - Charité Universitätsmedizin Berlin, Department of Biochemistry, 10117 Berlin, Germany
- Rachel M Wheatley
  - University of Oxford, Department of Biology, Oxford OX1 3SZ, United Kingdom
- R Craig MacLean
  - University of Oxford, Department of Biology, Oxford OX1 3SZ, United Kingdom
- Markus Ralser
  - Charité Universitätsmedizin Berlin, Department of Biochemistry, 10117 Berlin, Germany; University of Oxford, The Wellcome Centre for Human Genetics, Nuffield Department of Medicine, Oxford OX3 7BN, United Kingdom; Max Planck Institute for Molecular Genetics, 14195 Berlin, Germany
11
Neff SL, Doing G, Reiter T, Hampton TH, Greene CS, Hogan DA. Analysis of Pseudomonas aeruginosa transcription in an ex vivo cystic fibrosis sputum model identifies metal restriction as a gene expression stimulus. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2023:2023.08.21.554169. [PMID: 37662412 PMCID: PMC10473638 DOI: 10.1101/2023.08.21.554169] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 09/05/2023]
Abstract
Chronic Pseudomonas aeruginosa lung infections are a distinctive feature of cystic fibrosis (CF) pathology and continue to challenge adults with CF even with the advent of highly effective modulator therapies. Characterizing P. aeruginosa transcription in the CF lung and identifying factors that drive gene expression could yield novel strategies to eradicate infection or otherwise improve outcomes. To complement published P. aeruginosa gene expression studies in laboratory culture models designed to mimic the CF lung environment, we employed an ex vivo sputum model in which laboratory strain PAO1 was incubated in sputum from different CF donors. As part of the analysis, we compared PAO1 gene expression in this "spike-in" sputum model to that for P. aeruginosa grown in artificial sputum medium (ASM). Analyses focused on genes that were differentially expressed between sputum and ASM and genes that were most highly expressed in sputum. We present a new approach that used sets of genes with correlated expression, identified by the gene expression analysis tool eADAGE, to analyze the differential activity of pathways in P. aeruginosa grown in CF sputum from different individuals. A key characteristic of P. aeruginosa grown in expectorated CF sputum was related to zinc and iron acquisition, but this signal varied by donor sputum. In addition, a significant correlation between P. aeruginosa expression of the H1-type VI secretion system and corrector use by the sputum donor was observed. These methods may be broadly useful in looking for variable signals across clinical samples.
Affiliation(s)
- Samuel L. Neff
- Department of Microbiology and Immunology, Geisel School of Medicine at Dartmouth, Hanover, NH, USA
| | - Georgia Doing
- Department of Microbiology and Immunology, Geisel School of Medicine at Dartmouth, Hanover, NH, USA
| | - Taylor Reiter
- Department of Biomedical Informatics, University of Colorado School of Medicine, Aurora, CO, USA
| | - Thomas H. Hampton
- Department of Microbiology and Immunology, Geisel School of Medicine at Dartmouth, Hanover, NH, USA
| | - Casey S. Greene
- Department of Biomedical Informatics, University of Colorado School of Medicine, Aurora, CO, USA
| | - Deborah A. Hogan
- Department of Microbiology and Immunology, Geisel School of Medicine at Dartmouth, Hanover, NH, USA
|
12
|
Krieger KL, Mann EK, Lee KJ, Bolterstein E, Jebakumar D, Ittmann MM, Dal Zotto VL, Shaban M, Sreekumar A, Gassman NR. Spatial mapping of the DNA adducts in cancer. DNA Repair (Amst) 2023; 128:103529. [PMID: 37390674 PMCID: PMC10330576 DOI: 10.1016/j.dnarep.2023.103529] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 03/30/2023] [Revised: 06/19/2023] [Accepted: 06/21/2023] [Indexed: 07/02/2023]
Abstract
DNA adducts and strand breaks are induced by various exogenous and endogenous agents. Accumulation of DNA damage is implicated in many disease processes, including cancer, aging, and neurodegeneration. The continuous acquisition of DNA damage from exogenous and endogenous stressors coupled with defects in DNA repair pathways contribute to the accumulation of DNA damage within the genome and genomic instability. While mutational burden offers some insight into the level of DNA damage a cell may have experienced and subsequently repaired, it does not quantify DNA adducts and strand breaks. Mutational burden also infers the identity of the DNA damage. With advances in DNA adduct detection and quantification methods, there is an opportunity to identify DNA adducts driving mutagenesis and correlate with a known exposome. However, most DNA adduct detection methods require isolation or separation of the DNA and its adducts from the context of the nuclei. Mass spectrometry, comet assays, and other techniques precisely quantify lesion types but lose the nuclear context and even tissue context of the DNA damage. The growth in spatial analysis technologies offers a novel opportunity to leverage DNA damage detection with nuclear and tissue context. However, we lack a wealth of techniques capable of detecting DNA damage in situ. Here, we review the limited existing in situ DNA damage detection methods and examine their potential to offer spatial analysis of DNA adducts in tumors or other tissues. We also offer a perspective on the need for spatial analysis of DNA damage in situ and highlight Repair Assisted Damage Detection (RADD) as an in situ DNA adduct technique with the potential to integrate with spatial analysis and the challenges to be addressed.
Affiliation(s)
- Kimiko L Krieger
- Department of Molecular and Cellular Biology, Baylor College of Medicine, Houston, TX 77030, USA; Center for Translational Metabolism and Health Disparities (C-TMH), Baylor College of Medicine, Houston, TX 77030, USA
| | - Elise K Mann
- Department of Physiology and Cell Biology, College of Medicine, University of South Alabama, Mobile, AL 36688, USA; Mitchell Cancer Institute, University of South Alabama, Mobile, AL 36604, USA
| | - Kevin J Lee
- Department of Physiology and Cell Biology, College of Medicine, University of South Alabama, Mobile, AL 36688, USA; Mitchell Cancer Institute, University of South Alabama, Mobile, AL 36604, USA
| | - Elyse Bolterstein
- Department of Biology, Northeastern Illinois University, Chicago, IL 60625, USA
| | - Deborah Jebakumar
- Department of Anatomic Pathology, Baylor Scott & White Medical Center, Temple, TX 76508, USA; Texas A&M College of Medicine, Temple, TX 76508, USA
| | - Michael M Ittmann
- Department of Pathology and Immunology, Baylor College of Medicine, Houston, TX 77030, USA; Human Tissue Acquisition & Pathology Shared Resource, Dan L. Duncan Comprehensive Cancer Center, Baylor College of Medicine, Houston, TX 77030, USA
| | - Valeria L Dal Zotto
- Department of Pathology, College of Medicine, University of South Alabama, Mobile, AL 36688, USA
| | - Mohamed Shaban
- Department of Electrical and Computer Engineering, University of South Alabama, Mobile, AL 36688, USA
| | - Arun Sreekumar
- Department of Molecular and Cellular Biology, Baylor College of Medicine, Houston, TX 77030, USA; Center for Translational Metabolism and Health Disparities (C-TMH), Baylor College of Medicine, Houston, TX 77030, USA; Dan L. Duncan Comprehensive Cancer Center, Baylor College of Medicine, Houston, TX 77030, USA; Verna and Marrs McLean Department of Biochemistry and Molecular Biology, Baylor College of Medicine, Houston, TX 77030, USA
| | - Natalie R Gassman
- Department of Pharmacology and Toxicology, University of Alabama at Birmingham, Birmingham, AL 35294, USA.
|
13
|
Janizek JD, Spiro A, Celik S, Blue BW, Russell JC, Lee TI, Kaeberlin M, Lee SI. PAUSE: principled feature attribution for unsupervised gene expression analysis. Genome Biol 2023; 24:81. [PMID: 37076856 PMCID: PMC10114348 DOI: 10.1186/s13059-023-02901-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/30/2022] [Accepted: 03/17/2023] [Indexed: 04/21/2023] Open
Abstract
As interest in using unsupervised deep learning models to analyze gene expression data has grown, an increasing number of methods have been developed to make these models more interpretable. These methods can be separated into two groups: post hoc analyses of black box models through feature attribution methods and approaches to build inherently interpretable models through biologically-constrained architectures. We argue that these approaches are not mutually exclusive, but can in fact be usefully combined. We propose PAUSE ( https://github.com/suinleelab/PAUSE ), an unsupervised pathway attribution method that identifies major sources of transcriptomic variation when combined with biologically-constrained neural network models.
Affiliation(s)
- Joseph D Janizek
- Paul G. Allen School of Computer Science and Engineering, University of Washington, Seattle, USA
- Medical Scientist Training Program, University of Washington, Seattle, USA
| | - Anna Spiro
- Paul G. Allen School of Computer Science and Engineering, University of Washington, Seattle, USA
| | | | - Ben W Blue
- Department of Pathology, University of Washington, Seattle, USA
| | - John C Russell
- Department of Pathology, University of Washington, Seattle, USA
| | - Ting-I Lee
- Department of Pathology, University of Washington, Seattle, USA
| | - Matt Kaeberlin
- Department of Pathology, University of Washington, Seattle, USA
- Department of Genome Sciences, University of Washington, Seattle, USA
| | - Su-In Lee
- Paul G. Allen School of Computer Science and Engineering, University of Washington, Seattle, USA.
|
14
|
Compendium-Wide Analysis of Pseudomonas aeruginosa Core and Accessory Genes Reveals Transcriptional Patterns across Strains PAO1 and PA14. mSystems 2023; 8:e0034222. [PMID: 36541762 PMCID: PMC9948736 DOI: 10.1128/msystems.00342-22] [Citation(s) in RCA: 6] [Impact Index Per Article: 3.0] [Indexed: 12/24/2022] Open
Abstract
Pseudomonas aeruginosa is an opportunistic pathogen that causes difficult-to-treat infections. Two well-studied divergent P. aeruginosa strain types, PAO1 and PA14, have significant genomic heterogeneity, including diverse accessory genes present in only some strains. Genome content comparisons find core genes that are conserved across both PAO1 and PA14 strains and accessory genes that are present in only a subset of PAO1 and PA14 strains. Here, we use recently assembled transcriptome compendia of publicly available P. aeruginosa RNA sequencing (RNA-seq) samples to create two smaller compendia consisting of only strain PAO1 or strain PA14 samples with each aligned to their cognate reference genome. We confirmed strain annotations and identified other samples for inclusion by assessing each sample's median expression of PAO1-only or PA14-only accessory genes. We then compared the patterns of core gene expression in each strain. To do so, we developed a method by which we analyzed genes in terms of which genes showed similar expression patterns across strain types. We found that some core genes had consistent correlated expression patterns across both compendia, while others were less stable in an interstrain comparison. For each accessory gene, we also determined core genes with correlated expression patterns. We found that stable core genes had fewer coexpressed neighbors that were accessory genes. Overall, this approach for analyzing expression patterns across strain types can be extended to other groups of genes, like phage genes, or applied for analyzing patterns beyond groups of strains, such as samples with different traits, to reveal a deeper understanding of regulation. IMPORTANCE Pseudomonas aeruginosa is a ubiquitous pathogen. There is much diversity among P. aeruginosa strains, including two divergent but well-studied strains, PAO1 and PA14. 
Understanding how these different strain-level traits manifest is important for identifying targets that regulate different traits of interest. With the availability of thousands of PAO1 and PA14 samples, we created two strain-specific RNA-seq compendia where each one contains hundreds of samples from PAO1 or PA14 strains and used them to compare the expression patterns of core genes that are conserved in both strain types and to determine which core genes have expression patterns that are similar to those of accessory genes that are unique to one strain or the other using an approach that we developed. We found a subset of core genes with different transcriptional patterns across PAO1 and PA14 strains and identified those core genes with expression patterns similar to those of strain-specific accessory genes.
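The annotation check described in this abstract, assessing each sample's median expression of PAO1-only or PA14-only accessory genes, can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the function name `call_strain`, the gene identifiers, and the `min_ratio` cutoff are all hypothetical.

```python
from statistics import median


def call_strain(expression, pao1_only, pa14_only, min_ratio=2.0):
    """Label one sample from its strain-specific accessory gene expression.

    expression: dict mapping gene id -> expression value for a single sample.
    pao1_only / pa14_only: non-empty lists of accessory gene ids (hypothetical).
    min_ratio: assumed cutoff; one median must exceed the other by this factor.
    """
    # Genes missing from the sample are treated as not expressed.
    pao1_median = median(expression.get(g, 0.0) for g in pao1_only)
    pa14_median = median(expression.get(g, 0.0) for g in pa14_only)

    if pao1_median > min_ratio * pa14_median:
        return "PAO1"
    if pa14_median > min_ratio * pao1_median:
        return "PA14"
    # Neither gene set clearly dominates, so leave the sample unlabelled.
    return "ambiguous"


# Hypothetical sample: PAO1-only genes are high, PA14-only genes are near zero.
sample = {"pao1_a": 50.0, "pao1_b": 40.0, "pa14_a": 1.0, "pa14_b": 0.0}
label = call_strain(sample, ["pao1_a", "pao1_b"], ["pa14_a", "pa14_b"])
```

Using the median rather than the mean keeps a single highly expressed gene from deciding the label, which is presumably why the abstract refers to median expression.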
|
15
|
Doing G, Lee AJ, Neff SL, Reiter T, Holt JD, Stanton BA, Greene CS, Hogan DA. Computationally Efficient Assembly of Pseudomonas aeruginosa Gene Expression Compendia. mSystems 2023; 8:e0034122. [PMID: 36541761 PMCID: PMC9948711 DOI: 10.1128/msystems.00341-22] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 04/19/2022] [Accepted: 11/09/2022] [Indexed: 12/24/2022] Open
Abstract
Thousands of Pseudomonas aeruginosa RNA sequencing (RNA-seq) gene expression profiles are publicly available via the National Center for Biotechnology Information (NCBI) Sequence Read Archive (SRA). In this work, the transcriptional profiles from hundreds of studies performed by over 75 research groups were reanalyzed in aggregate to create a powerful tool for hypothesis generation and testing. Raw sequence data were uniformly processed using the Salmon pseudoaligner, and this read mapping method was validated by comparison to a direct alignment method. We developed filtering criteria to exclude samples with aberrant levels of housekeeping gene expression or an unexpected number of genes with no reported values and normalized the filtered compendia using the ratio-of-medians method. The filtering and normalization steps greatly improved gene expression correlations for genes within the same operon or regulon across the 2,333 samples. Since the RNA-seq data were generated using diverse strains, we report the effects of mapping samples to noncognate reference genomes by separately analyzing all samples mapped to cDNA reference genomes for strains PAO1 and PA14, two divergent strains that were used to generate most of the samples. Finally, we developed an algorithm to incorporate new data as they are deposited into the SRA. Our processing and quality control methods provide a scalable framework for taking advantage of the troves of biological information hibernating in the depths of microbial gene expression data and yield useful tools for P. aeruginosa RNA-seq data to be leveraged for diverse research goals. IMPORTANCE Pseudomonas aeruginosa is a causative agent of a wide range of infections, including chronic infections associated with cystic fibrosis. These P. aeruginosa infections are difficult to treat and often have negative outcomes. To aid in the study of this problematic pathogen, we mapped, filtered for quality, and normalized thousands of P. aeruginosa RNA-seq gene expression profiles that were publicly available via the National Center for Biotechnology Information (NCBI) Sequence Read Archive (SRA). The resulting compendia facilitate analyses across experiments, strains, and conditions. Ultimately, the workflow that we present could be applied to analyses of other microbial species.
Affiliation(s)
- Georgia Doing
- Department of Microbiology and Immunology, Geisel School of Medicine at Dartmouth, Hanover, New Hampshire, USA
| | - Alexandra J. Lee
- Genomics and Computational Biology Graduate Program, University of Pennsylvania, Philadelphia, Pennsylvania, USA
| | - Samuel L. Neff
- Department of Microbiology and Immunology, Geisel School of Medicine at Dartmouth, Hanover, New Hampshire, USA
| | - Taylor Reiter
- Department of Biochemistry and Molecular Genetics, University of Colorado School of Medicine, Denver, Colorado, USA
| | - Jacob D. Holt
- Department of Microbiology and Immunology, Geisel School of Medicine at Dartmouth, Hanover, New Hampshire, USA
| | - Bruce A. Stanton
- Department of Microbiology and Immunology, Geisel School of Medicine at Dartmouth, Hanover, New Hampshire, USA
| | - Casey S. Greene
- Department of Biochemistry and Molecular Genetics, University of Colorado School of Medicine, Denver, Colorado, USA
- Department of Systems Pharmacology and Translational Therapeutics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, Pennsylvania, USA
| | - Deborah A. Hogan
- Department of Microbiology and Immunology, Geisel School of Medicine at Dartmouth, Hanover, New Hampshire, USA
|
16
|
Lee AJ, Mould DL, Crawford J, Hu D, Powers RK, Doing G, Costello JC, Hogan DA, Greene CS. SOPHIE: Generative Neural Networks Separate Common and Specific Transcriptional Responses. Genomics Proteomics Bioinformatics 2022; 20:912-927. [PMID: 36216026 PMCID: PMC10025681 DOI: 10.1016/j.gpb.2022.09.011] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 02/22/2022] [Revised: 09/09/2022] [Accepted: 09/30/2022] [Indexed: 11/06/2022]
Abstract
Genome-wide transcriptome profiling identifies genes that are prone to differential expression (DE) across contexts, as well as genes with changes specific to the experimental manipulation. Distinguishing genes that are specifically changed in a context of interest from common differentially expressed genes (DEGs) allows more efficient prediction of which genes are specific to a given biological process under scrutiny. Currently, common DEGs or pathways can only be identified through the laborious manual curation of experiments, an inordinately time-consuming endeavor. Here we pioneer an approach, Specific cOntext Pattern Highlighting In Expression data (SOPHIE), for distinguishing between common and specific transcriptional patterns using a generative neural network to create a background set of experiments from which a null distribution of gene and pathway changes can be generated. We apply SOPHIE to diverse datasets including those from human, human cancer, and bacterial pathogen Pseudomonas aeruginosa. SOPHIE identifies common DEGs in concordance with previously described, manually and systematically determined common DEGs. Further molecular validation indicates that SOPHIE detects highly specific but low-magnitude biologically relevant transcriptional changes. SOPHIE's measure of specificity can complement log2 fold change values generated from traditional DE analyses. For example, by filtering the set of DEGs, one can identify genes that are specifically relevant to the experimental condition of interest. Consequently, these results can inform future research directions. All scripts used in these analyses are available at https://github.com/greenelab/generic-expression-patterns. Users can access https://github.com/greenelab/sophie to run SOPHIE on their own data.
Affiliation(s)
- Alexandra J Lee
- Genomics and Computational Biology Graduate Program, University of Pennsylvania, Philadelphia, PA 19104, USA
| | - Dallas L Mould
- Department of Microbiology and Immunology, Geisel School of Medicine at Dartmouth, Hanover, NH 03755, USA
| | - Jake Crawford
- Genomics and Computational Biology Graduate Program, University of Pennsylvania, Philadelphia, PA 19104, USA
| | - Dongbo Hu
- Department of Systems Pharmacology and Translational Therapeutics, University of Pennsylvania, Philadelphia, PA 19104, USA
| | - Rani K Powers
- Wyss Institute for Biologically Inspired Engineering, Harvard University, Boston, MA 02115, USA
| | - Georgia Doing
- Department of Microbiology and Immunology, Geisel School of Medicine at Dartmouth, Hanover, NH 03755, USA
| | - James C Costello
- Department of Pharmacology, University of Colorado School of Medicine, Denver, CO 80045, USA
| | - Deborah A Hogan
- Department of Microbiology and Immunology, Geisel School of Medicine at Dartmouth, Hanover, NH 03755, USA
| | - Casey S Greene
- Department of Systems Pharmacology and Translational Therapeutics, University of Pennsylvania, Philadelphia, PA 19104, USA; Center for Health AI, University of Colorado School of Medicine, Denver, CO 80045, USA; Department of Biochemistry and Molecular Genetics, University of Colorado School of Medicine, Denver, CO 80045, USA.
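The idea in the SOPHIE abstract above, comparing a gene's observed change against a null distribution built from a background set of experiments, can be illustrated with a small sketch. It assumes the background log2 fold changes have already been produced (in SOPHIE, by a generative neural network that is not reproduced here) and uses a plain z-score as one simple specificity measure. The function name and the example values are hypothetical, and the published tool may score specificity differently.

```python
from statistics import mean, stdev


def specificity_z(observed_lfc, background_lfcs):
    """Z-score of an observed log2 fold change against background changes.

    background_lfcs: log2 fold changes for the same gene across simulated
    background experiments (assumed to be given; needs at least two values).
    """
    mu = mean(background_lfcs)
    sigma = stdev(background_lfcs)  # sample standard deviation
    if sigma == 0:
        raise ValueError("background distribution has no variance")
    return (observed_lfc - mu) / sigma


# Hypothetical gene that changes in nearly every background experiment:
# a large observed change is unremarkable, so the score stays small.
common_gene = specificity_z(3.0, [2.5, 3.0, 3.5])

# Hypothetical gene that rarely changes in the background:
# a modest observed change stands out strongly.
specific_gene = specificity_z(1.5, [0.0, 0.1, -0.1])
```

This mirrors the point made in the abstract that a specificity measure can complement log2 fold change: the second gene has the smaller fold change but the larger score.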
|
17
|
Dotolo S, Esposito Abate R, Roma C, Guido D, Preziosi A, Tropea B, Palluzzi F, Giacò L, Normanno N. Bioinformatics: From NGS Data to Biological Complexity in Variant Detection and Oncological Clinical Practice. Biomedicines 2022; 10:2074. [PMID: 36140175 PMCID: PMC9495893 DOI: 10.3390/biomedicines10092074] [Citation(s) in RCA: 9] [Impact Index Per Article: 3.0] [Received: 07/13/2022] [Revised: 08/12/2022] [Accepted: 08/22/2022] [Indexed: 11/22/2022] Open
Abstract
The use of next-generation sequencing (NGS) techniques for variant detection has become increasingly important in clinical research and in clinical practice in oncology. Many cancer patients are currently being treated in clinical practice or in clinical trials with drugs directed against specific genomic alterations. In this scenario, the development of reliable and reproducible bioinformatics tools is essential to derive information on the molecular characteristics of each patient’s tumor from the NGS data. The development of bioinformatics pipelines based on the use of machine learning and statistical methods is even more relevant for the determination of complex biomarkers. In this review, we describe some important technologies, computational algorithms and models that can be applied to NGS data from Whole Genome to Targeted Sequencing, to address the problem of finding complex cancer-associated biomarkers. In addition, we explore the future perspectives and challenges faced by bioinformatics for precision medicine both at a molecular and clinical level, with a focus on an emerging complex biomarker such as homologous recombination deficiency (HRD).
Affiliation(s)
- Serena Dotolo
- Cell Biology and Biotherapy Unit, Istituto Nazionale Tumori—IRCCS—Fondazione G. Pascale, 80131 Naples, Italy
| | - Riziero Esposito Abate
- Cell Biology and Biotherapy Unit, Istituto Nazionale Tumori—IRCCS—Fondazione G. Pascale, 80131 Naples, Italy
| | - Cristin Roma
- Cell Biology and Biotherapy Unit, Istituto Nazionale Tumori—IRCCS—Fondazione G. Pascale, 80131 Naples, Italy
| | - Davide Guido
- Bioinformatics Research Core Facility, Gemelli Science and Technology Park (GSTeP), Fondazione Policlinico Universitario Agostino Gemelli IRCCS, Largo A. Gemelli, 8, 00168 Rome, Italy
| | - Alessia Preziosi
- Bioinformatics Research Core Facility, Gemelli Science and Technology Park (GSTeP), Fondazione Policlinico Universitario Agostino Gemelli IRCCS, Largo A. Gemelli, 8, 00168 Rome, Italy
| | - Beatrice Tropea
- Bioinformatics Research Core Facility, Gemelli Science and Technology Park (GSTeP), Fondazione Policlinico Universitario Agostino Gemelli IRCCS, Largo A. Gemelli, 8, 00168 Rome, Italy
| | - Fernando Palluzzi
- Bioinformatics Research Core Facility, Gemelli Science and Technology Park (GSTeP), Fondazione Policlinico Universitario Agostino Gemelli IRCCS, Largo A. Gemelli, 8, 00168 Rome, Italy
| | - Luciano Giacò
- Bioinformatics Research Core Facility, Gemelli Science and Technology Park (GSTeP), Fondazione Policlinico Universitario Agostino Gemelli IRCCS, Largo A. Gemelli, 8, 00168 Rome, Italy
| | - Nicola Normanno
- Cell Biology and Biotherapy Unit, Istituto Nazionale Tumori—IRCCS—Fondazione G. Pascale, 80131 Naples, Italy
|
18
|
Lee B, Shin MK, Yoo JS, Jang W, Sung JS. Identifying novel antimicrobial peptides from venom gland of spider Pardosa astrigera by deep multi-task learning. Front Microbiol 2022; 13:971503. [PMID: 36090084 PMCID: PMC9449525 DOI: 10.3389/fmicb.2022.971503] [Citation(s) in RCA: 7] [Impact Index Per Article: 2.3] [Received: 06/21/2022] [Accepted: 07/27/2022] [Indexed: 11/22/2022] Open
Abstract
Antimicrobial peptides (AMPs) show promise as valuable compounds for developing therapeutic agents to control the worldwide health threat posed by the increasing prevalence of antibiotic-resistant bacteria. Animal venom can be a useful source for screening AMPs due to its various bioactive components. Here, a deep learning model was developed to predict species-specific antimicrobial activity. To overcome the data deficiency, a multi-task learning method was implemented, achieving F1 scores of 0.818, 0.696, 0.814, 0.787, and 0.719 for Bacillus subtilis, Escherichia coli, Pseudomonas aeruginosa, Staphylococcus aureus, and Staphylococcus epidermidis, respectively. Peptides PA-Full and PA-Win were identified from the model using different inputs of full and partial sequences, broadening the application of transcriptome data of the spider Pardosa astrigera. Two peptides exhibited strong antimicrobial activity against all five strains along with cytocompatibility. Our approach enables excavating AMPs with high potency, which can be expanded into the fields of biology to address data insufficiency.
Affiliation(s)
- Byungjo Lee
- Department of Life Science, Dongguk University-Seoul, Goyang-si, South Korea
| | - Min Kyoung Shin
- Department of Life Science, Dongguk University-Seoul, Goyang-si, South Korea
| | - Jung Sun Yoo
- Animal Resources Division, National Institute of Biological Resources, Incheon, South Korea
| | - Wonhee Jang
- Department of Life Science, Dongguk University-Seoul, Goyang-si, South Korea
| | - Jung-Suk Sung
- Department of Life Science, Dongguk University-Seoul, Goyang-si, South Korea
- Correspondence: Jung-Suk Sung
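For readers unfamiliar with the F1 score reported per species in the abstract above, the short sketch below shows how it is computed from a classifier's confusion counts. The counts in the example are hypothetical and are not taken from the paper.

```python
def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall from confusion counts.

    tp: true positives, fp: false positives, fn: false negatives.
    Returns 0.0 when there are no true positives, where F1 is conventionally 0.
    """
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts for one species-specific task:
# 8 active peptides found, 2 false alarms, 2 active peptides missed.
example_f1 = f1_score(tp=8, fp=2, fn=2)
```

Because F1 ignores true negatives, it is a common choice when active peptides are much rarer than inactive ones, which fits the data deficiency the abstract mentions.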
|
19
|
Jia X, Wen X, Russo DP, Aleksunes LM, Zhu H. Mechanism-driven modeling of chemical hepatotoxicity using structural alerts and an in vitro screening assay. J Hazard Mater 2022; 436:129193. [PMID: 35739723 PMCID: PMC9262097 DOI: 10.1016/j.jhazmat.2022.129193] [Citation(s) in RCA: 20] [Impact Index Per Article: 6.7] [Received: 01/07/2022] [Revised: 05/13/2022] [Accepted: 05/17/2022] [Indexed: 05/20/2023]
Abstract
Traditional experimental approaches to evaluate hepatotoxicity are expensive and time-consuming. As an advanced framework of risk assessment, adverse outcome pathways (AOPs) describe the sequence of molecular and cellular events underlying chemical toxicities. We aimed to develop an AOP that can be used to predict hepatotoxicity by leveraging computational modeling and in vitro assays. We curated 869 compounds with known hepatotoxicity classifications as a modeling set and extracted assay data from PubChem. The antioxidant response element (ARE) assay, which quantifies transcriptional responses to oxidative stress, showed a high correlation to hepatotoxicity (PPV=0.82). Next, we developed quantitative structure-activity relationship (QSAR) models to predict ARE activation for compounds lacking testing results. Potential toxicity alerts were identified and used to construct a mechanistic hepatotoxicity model. For experimental validation, 16 compounds in the modeling set and 12 new compounds were selected and tested using an in-house ARE-luciferase assay in HepG2-C8 cells. The mechanistic model showed good hepatotoxicity predictivity (accuracy = 0.82) for these compounds. Potential false positive hepatotoxicity predictions by only using ARE results can be corrected by incorporating structural alerts and vice versa. This mechanistic model illustrates a potential toxicity pathway for hepatotoxicity, and this strategy can be expanded to develop predictive models for other complex toxicities.
Affiliation(s)
- Xuelian Jia
- The Rutgers Center for Computational and Integrative Biology, Camden, NJ 08102, USA
| | - Xia Wen
- Department of Pharmacology and Toxicology, Ernest Mario School of Pharmacy, Rutgers University, Piscataway, NJ 08854, USA
| | - Daniel P Russo
- The Rutgers Center for Computational and Integrative Biology, Camden, NJ 08102, USA
| | - Lauren M Aleksunes
- Department of Pharmacology and Toxicology, Ernest Mario School of Pharmacy, Rutgers University, Piscataway, NJ 08854, USA
| | - Hao Zhu
- The Rutgers Center for Computational and Integrative Biology, Camden, NJ 08102, USA; Department of Chemistry, Rutgers University, Camden, NJ 08102, USA.
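The two metrics quoted in the abstract above, positive predictive value (PPV) for the ARE assay and accuracy for the mechanistic model, are defined from a confusion matrix as sketched below. The counts used in the example are hypothetical and do not come from the study.

```python
def ppv(tp, fp):
    """Positive predictive value: fraction of positive calls that are correct."""
    return tp / (tp + fp)


def accuracy(tp, tn, fp, fn):
    """Fraction of all predictions, positive and negative, that are correct."""
    return (tp + tn) / (tp + tn + fp + fn)


# Hypothetical screen: 10 compounds flagged as hepatotoxic, 9 of them correctly.
example_ppv = ppv(tp=9, fp=1)

# Hypothetical validation set of 20 compounds with 16 correct predictions.
example_accuracy = accuracy(tp=9, tn=7, fp=1, fn=3)
```

A high PPV says little about compounds the assay calls negative, which is consistent with the abstract's remark that ARE results and structural alerts can correct each other's errors.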
|
20
|
Lee AJ, Reiter T, Doing G, Oh J, Hogan DA, Greene CS. Using genome-wide expression compendia to study microorganisms. Comput Struct Biotechnol J 2022; 20:4315-4324. [PMID: 36016717 PMCID: PMC9396250 DOI: 10.1016/j.csbj.2022.08.012] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/10/2022] [Revised: 08/07/2022] [Accepted: 08/07/2022] [Indexed: 11/30/2022] Open
Abstract
A gene expression compendium is a heterogeneous collection of gene expression experiments assembled from data collected for diverse purposes. The widely varied experimental conditions and genetic backgrounds across samples create a tremendous opportunity for gaining a systems level understanding of the transcriptional responses that influence phenotypes. Variety in experimental design is particularly important for studying microbes, where the transcriptional responses integrate many signals and demonstrate plasticity across strains including response to what nutrients are available and what microbes are present. Advances in high-throughput measurement technology have made it feasible to construct compendia for many microbes. In this review we discuss how these compendia are constructed and analyzed to reveal transcriptional patterns.
Affiliation(s)
- Alexandra J. Lee
- Genomics and Computational Biology Graduate Program, University of Pennsylvania, Philadelphia, PA, USA
| | - Taylor Reiter
- Department of Biochemistry and Molecular Genetics, University of Colorado School of Medicine, Denver, CO, USA
| | - Georgia Doing
- The Jackson Laboratory for Genomic Medicine, Farmington, CT, USA
| | - Julia Oh
- The Jackson Laboratory for Genomic Medicine, Farmington, CT, USA
| | - Deborah A. Hogan
- Department of Microbiology and Immunology, Geisel School of Medicine, Dartmouth, Hanover, NH, USA
| | - Casey S. Greene
- Department of Biochemistry and Molecular Genetics, University of Colorado School of Medicine, Denver, CO, USA
|
21
|
Hesami M, Alizadeh M, Jones AMP, Torkamaneh D. Machine learning: its challenges and opportunities in plant system biology. Appl Microbiol Biotechnol 2022; 106:3507-3530. [PMID: 35575915 DOI: 10.1007/s00253-022-11963-6] [Citation(s) in RCA: 24] [Impact Index Per Article: 8.0] [Received: 03/14/2022] [Revised: 03/14/2022] [Accepted: 05/07/2022] [Indexed: 12/25/2022]
Abstract
Sequencing technologies are evolving at a rapid pace, enabling the generation of massive amounts of data in multiple dimensions (e.g., genomics, epigenomics, transcriptomics, metabolomics, proteomics, and single-cell omics) in plants. To provide comprehensive insights into the complexity of plant biological systems, it is important to integrate different omics datasets. Although recent advances in computational analytical pipelines have enabled efficient and high-quality exploration and exploitation of single omics data, the integration of multidimensional, heterogeneous, and large datasets (i.e., multi-omics) remains a challenge. In this regard, machine learning (ML) offers promising approaches to integrate large datasets and to recognize fine-grained patterns and relationships. Nevertheless, they require rigorous optimizations to process multi-omics-derived datasets. In this review, we discuss the main concepts of machine learning as well as the key challenges and solutions related to the big data derived from plant system biology. We also provide in-depth insight into the principles of data integration using ML, as well as challenges and opportunities in different contexts including multi-omics, single-cell omics, protein function, and protein-protein interaction. KEY POINTS: • The key challenges and solutions related to the big data derived from plant system biology have been highlighted. • Different methods of data integration have been discussed. • Challenges and opportunities of the application of machine learning in plant system biology have been highlighted and discussed.
Affiliation(s)
- Mohsen Hesami
- Department of Plant Agriculture, University of Guelph, Guelph, ON, N1G 2W1, Canada
| | - Milad Alizadeh
- Department of Botany, University of British Columbia, Vancouver, BC, V6T 1Z4, Canada
| | | | - Davoud Torkamaneh
- Département de Phytologie, Université Laval, Québec City, QC, G1V 0A6, Canada; Institut de Biologie Intégrative Et Des Systèmes (IBIS), Université Laval, Québec City, QC, G1V 0A6, Canada.
|
22
|
Xu Z, Luo M, Lin W, Xue G, Wang P, Jin X, Xu C, Zhou W, Cai Y, Yang W, Nie H, Jiang Q. DLpTCR: an ensemble deep learning framework for predicting immunogenic peptide recognized by T cell receptor. Brief Bioinform 2021; 22:6355415. [PMID: 34415016 DOI: 10.1093/bib/bbab335] [Citation(s) in RCA: 50] [Impact Index Per Article: 12.5] [Received: 05/20/2021] [Revised: 07/25/2021] [Accepted: 07/28/2021] [Indexed: 12/30/2022] Open
Abstract
Accurate prediction of immunogenic peptides recognized by the T cell receptor (TCR) can greatly benefit vaccine development and cancer immunotherapy. However, identifying immunogenic peptides accurately is still a huge challenge. When TCR is not considered as a key factor, most of the antigen peptides predicted in silico fail to elicit immune responses in vivo. This inevitably leads to costly and time-consuming experimental validation tests for predicted antigens. Therefore, it is necessary to develop novel computational methods for precisely and effectively predicting immunogenic peptides recognized by TCR. Here, we describe DLpTCR, a multimodal ensemble deep learning framework for predicting the likelihood of interaction between single/paired chain(s) of TCR and peptide presented by major histocompatibility complex molecules. To investigate the generality and robustness of the proposed model, COVID-19 data and IEDB data were constructed for independent evaluation. The DLpTCR model exhibited high predictive power with area under the curve up to 0.91 on COVID-19 data while predicting the interaction between peptide and single TCR chain. Additionally, the DLpTCR model achieved the overall accuracy of 81.03% on IEDB data while predicting the interaction between peptide and paired TCR chains. The results demonstrate that DLpTCR has the ability to learn general interaction rules and generalize to antigen peptide recognition by TCR. A user-friendly webserver is available at http://jianglab.org.cn/DLpTCR/. Additionally, a stand-alone software package can be downloaded from https://github.com/jiangBiolab/DLpTCR.
Affiliation(s)
- Zhaochun Xu
- Center for Bioinformatics, School of Life Science and Technology, Harbin Institute of Technology, Harbin 150000, China
| | - Meng Luo
- Center for Bioinformatics, School of Life Science and Technology, Harbin Institute of Technology, Harbin 150000, China
| | - Weizhong Lin
- Center for Bioinformatics, Computer Department, Jingdezhen Ceramic Institute, Jingdezhen 333403, China
| | - Guangfu Xue
- Center for Bioinformatics, School of Life Science and Technology, Harbin Institute of Technology, Harbin 150000, China
| | - Pingping Wang
- Center for Bioinformatics, School of Life Science and Technology, Harbin Institute of Technology, Harbin 150000, China
| | - Xiyun Jin
- Center for Bioinformatics, School of Life Science and Technology, Harbin Institute of Technology, Harbin 150000, China
| | - Chang Xu
- Center for Bioinformatics, School of Life Science and Technology, Harbin Institute of Technology, Harbin 150000, China
| | - Wenyang Zhou
- Center for Bioinformatics, School of Life Science and Technology, Harbin Institute of Technology, Harbin 150000, China
| | - Yideng Cai
- Center for Bioinformatics, School of Life Science and Technology, Harbin Institute of Technology, Harbin 150000, China
| | - Wenyi Yang
- Center for Bioinformatics, School of Life Science and Technology, Harbin Institute of Technology, Harbin 150000, China
| | - Huan Nie
- Center for Bioinformatics, School of Life Science and Technology, Harbin Institute of Technology, Harbin 150000, China
| | - Qinghua Jiang
- Center for Bioinformatics, School of Life Science and Technology, Harbin Institute of Technology, Harbin 150000, China; Key Laboratory of Biological Data (Harbin Institute of Technology), Ministry of Education, China
| |
|
23
|
Rincón-Riveros A, Morales D, Rodríguez JA, Villegas VE, López-Kleine L. Bioinformatic Tools for the Analysis and Prediction of ncRNA Interactions. Int J Mol Sci 2021; 22:11397. [PMID: 34768830 PMCID: PMC8583695 DOI: 10.3390/ijms222111397] [Citation(s) in RCA: 23] [Impact Index Per Article: 5.8] [Received: 09/11/2021] [Revised: 09/30/2021] [Accepted: 09/30/2021] [Indexed: 12/16/2022]
Abstract
Noncoding RNAs (ncRNAs) play prominent roles in the regulation of gene expression via their interactions with other biological molecules such as proteins and nucleic acids. Although much of our knowledge about how these ncRNAs operate in different biological processes has been obtained from experimental findings, computational biology can also substantially boost this knowledge by suggesting possible novel interactions of these ncRNAs with other molecules. Computational predictions are thus used as an alternative source of new insights through a process of mutual enrichment, because the information obtained through experiments continuously feeds into computational methods. The results of these predictions in turn shed light on possible interactions that are subsequently validated experimentally. This review describes the latest advances in databases, bioinformatic tools, and new in silico strategies that allow the establishment or prediction of biological interactions of ncRNAs, particularly miRNAs and lncRNAs. This work places special emphasis on ncRNA species found in humans, but information on ncRNAs of other species is also included.
Affiliation(s)
- Andrés Rincón-Riveros
- Bioinformatics and Systems Biology Group, Universidad Nacional de Colombia, Bogotá 111221, Colombia
| | - Duvan Morales
- Centro de Investigaciones en Microbiología y Biotecnología-UR (CIMBIUR), Facultad de Ciencias Naturales, Universidad del Rosario, Bogotá 111221, Colombia
| | - Josefa Antonia Rodríguez
- Grupo de Investigación en Biología del Cáncer, Instituto Nacional de Cancerología, Bogotá 111221, Colombia
| | - Victoria E. Villegas
- Centro de Investigaciones en Microbiología y Biotecnología-UR (CIMBIUR), Facultad de Ciencias Naturales, Universidad del Rosario, Bogotá 111221, Colombia
| | - Liliana López-Kleine
- Department of Statistics, Faculty of Science, Universidad Nacional de Colombia, Bogotá 111221, Colombia
| |
|
24
|
Wei H, Zhao Z, Luo R. Machine-Learned Molecular Surface and Its Application to Implicit Solvent Simulations. J Chem Theory Comput 2021; 17:6214-6224. [PMID: 34516109 DOI: 10.1021/acs.jctc.1c00492] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Indexed: 11/30/2022]
Abstract
Implicit solvent models, such as Poisson-Boltzmann models, play important roles in computational studies of biomolecules. A vital step in almost all implicit solvent models is to determine the solvent-solute interface, and the solvent excluded surface (SES) is the most widely used interface definition in these models. However, classical algorithms used for computing SES are geometry-based, so that they are neither suitable for parallel implementations nor convenient for obtaining surface derivatives. To address the limitations, we explored a machine learning strategy to obtain a level set formulation for the SES. The training process was conducted in three steps, eventually leading to a model with over 95% agreement with the classical SES. Visualization of tested molecular surfaces shows that the machine-learned SES overlaps with the classical SES in almost all situations. Further analyses show that the machine-learned SES is incredibly stable in terms of rotational variation of tested molecules. Our timing analysis shows that the machine-learned SES is roughly 2.5 times as efficient as the classical SES routine implemented in Amber/PBSA on a tested central processing unit (CPU) platform. We expect further performance gain on massively parallel platforms such as graphics processing units (GPUs) given the ease in converting the machine-learned SES to a parallel procedure. We also implemented the machine-learned SES into the Amber/PBSA program to study its performance on reaction field energy calculation. The analysis shows that the two sets of reaction field energies are highly consistent with a 1% deviation on average. Given its level set formulation, we expect the machine-learned SES to be applied in molecular simulations that require either surface derivatives or high efficiency on parallel computing platforms.
Affiliation(s)
- Haixin Wei
- Departments of Materials Science and Engineering, Molecular Biology and Biochemistry, Chemical and Biomolecular Engineering, and Biomedical Engineering, Graduate Program in Chemical and Materials Physics, University of California, Irvine, California 92697, United States
| | - Zekai Zhao
- Departments of Materials Science and Engineering, Molecular Biology and Biochemistry, Chemical and Biomolecular Engineering, and Biomedical Engineering, Graduate Program in Chemical and Materials Physics, University of California, Irvine, California 92697, United States
| | - Ray Luo
- Departments of Materials Science and Engineering, Molecular Biology and Biochemistry, Chemical and Biomolecular Engineering, and Biomedical Engineering, Graduate Program in Chemical and Materials Physics, University of California, Irvine, California 92697, United States
| |
|
25
|
Caudai C, Galizia A, Geraci F, Le Pera L, Morea V, Salerno E, Via A, Colombo T. AI applications in functional genomics. Comput Struct Biotechnol J 2021; 19:5762-5790. [PMID: 34765093 PMCID: PMC8566780 DOI: 10.1016/j.csbj.2021.10.009] [Citation(s) in RCA: 33] [Impact Index Per Article: 8.3] [Received: 04/16/2021] [Revised: 10/05/2021] [Accepted: 10/05/2021] [Indexed: 12/13/2022]
Abstract
We review the current applications of artificial intelligence (AI) in functional genomics. The recent explosion of AI follows the remarkable achievements made possible by "deep learning", along with a burst of "big data" that can meet its hunger. Biology is about to overtake astronomy as the paradigmatic producer of big data. This has been made possible by huge advancements in the field of high-throughput technologies, applied to determine how the individual components of a biological system work together to accomplish different processes. The disciplines contributing to this bulk of data are collectively known as functional genomics. They consist of studies of: i) the information contained in the DNA (genomics); ii) the modifications that DNA can reversibly undergo (epigenomics); iii) the RNA transcripts originated by a genome (transcriptomics); iv) the ensemble of chemical modifications decorating different types of RNA transcripts (epitranscriptomics); v) the products of protein-coding transcripts (proteomics); and vi) the small molecules produced from cell metabolism (metabolomics) present in an organism or system at a given time, in physiological or pathological conditions. After reviewing the main applications of AI in functional genomics, we discuss important accompanying issues, including ethical, legal and economic issues and the importance of explainability.
Affiliation(s)
- Claudia Caudai
- CNR, Institute of Information Science and Technologies “A. Faedo” (ISTI), Pisa, Italy
| | - Antonella Galizia
- CNR, Institute of Applied Mathematics and Information Technologies (IMATI), Genoa, Italy
| | - Filippo Geraci
- CNR, Institute for Informatics and Telematics (IIT), Pisa, Italy
| | - Loredana Le Pera
- CNR, Institute of Biomembranes, Bioenergetics and Molecular Biotechnologies (IBIOM), Bari, Italy
- CNR, Institute of Molecular Biology and Pathology (IBPM), Rome, Italy
| | - Veronica Morea
- CNR, Institute of Molecular Biology and Pathology (IBPM), Rome, Italy
| | - Emanuele Salerno
- CNR, Institute of Information Science and Technologies “A. Faedo” (ISTI), Pisa, Italy
| | - Allegra Via
- CNR, Institute of Molecular Biology and Pathology (IBPM), Rome, Italy
| | - Teresa Colombo
- CNR, Institute of Molecular Biology and Pathology (IBPM), Rome, Italy
| |
|
26
|
Pratella D, Ait-El-Mkadem Saadi S, Bannwarth S, Paquis-Fluckinger V, Bottini S. A Survey of Autoencoder Algorithms to Pave the Diagnosis of Rare Diseases. Int J Mol Sci 2021; 22:10891. [PMID: 34639231 PMCID: PMC8509321 DOI: 10.3390/ijms221910891] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 09/14/2021] [Revised: 10/04/2021] [Accepted: 10/07/2021] [Indexed: 12/28/2022]
Abstract
Rare diseases (RDs) concern a broad range of disorders and can result from various origins. For a long time, the scientific community was unaware of RDs. Impressive progress has already been made for certain RDs; however, due to the lack of sufficient knowledge, many patients are not diagnosed. Nowadays, advances in high-throughput sequencing technologies, such as whole genome sequencing, single-cell sequencing and others, have boosted the understanding of RDs. To extract biological meaning from the data generated by these methods, different analysis techniques have been proposed, including machine learning algorithms. These methods have recently proven to be valuable in the medical field. Among such approaches, unsupervised learning methods via neural networks, including autoencoders (AEs) and variational autoencoders (VAEs), have shown promising performance with applications on various types of data and in different contexts, from cancer to healthy patient tissues. In this review, we discuss how AEs and VAEs have been used in biomedical settings. Specifically, we discuss their current applications and the improvements achieved in the diagnosis and survival of patients. We focus on the applications in the field of RDs, and we discuss how the employment of AEs and VAEs would enhance RD understanding and diagnosis.
Affiliation(s)
- David Pratella
- Center of Modeling, Simulation and Interactions, Université Côte d’Azur, 06200 Nice, France
| | - Samira Ait-El-Mkadem Saadi
- Centre Hospitalier Universitaire (CHU) de Nice, Institute for Research on Cancer and Aging, Nice (IRCAN), Université Côte d’Azur, Inserm U1081, CNRS UMR 7284, 06200 Nice, France
| | - Sylvie Bannwarth
- Centre Hospitalier Universitaire (CHU) de Nice, Institute for Research on Cancer and Aging, Nice (IRCAN), Université Côte d’Azur, Inserm U1081, CNRS UMR 7284, 06200 Nice, France
| | - Véronique Paquis-Fluckinger
- Centre Hospitalier Universitaire (CHU) de Nice, Institute for Research on Cancer and Aging, Nice (IRCAN), Université Côte d’Azur, Inserm U1081, CNRS UMR 7284, 06200 Nice, France
| | - Silvia Bottini
- Center of Modeling, Simulation and Interactions, Université Côte d’Azur, 06200 Nice, France
| |
|
27
|
Umarov R, Li Y, Arner E. DeepCellState: An autoencoder-based framework for predicting cell type specific transcriptional states induced by drug treatment. PLoS Comput Biol 2021; 17:e1009465. [PMID: 34610009 PMCID: PMC8519465 DOI: 10.1371/journal.pcbi.1009465] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Received: 05/03/2021] [Revised: 10/15/2021] [Accepted: 09/20/2021] [Indexed: 11/18/2022]
Abstract
Drug treatment induces cell type specific transcriptional programs, and as the number of combinations of drugs and cell types grows, the cost of exhaustive screens measuring the transcriptional drug response becomes intractable. We developed DeepCellState, a deep learning autoencoder-based framework, for predicting the induced transcriptional state in a cell type after drug treatment, based on the drug response in another cell type. When the method is trained on a large collection of transcriptional drug perturbation profiles and applied to two cell types, prediction accuracy improves significantly over baseline and alternative deep learning approaches, with improved accuracy when generalizing the framework to additional cell types. Treatments with drugs or whole drug families not seen during training are predicted with similar accuracy, and the same framework can be used for predicting the results from other interventions, such as gene knock-downs. Finally, analysis of the trained model shows that the internal representation is able to learn regulatory relationships between genes in a fully data-driven manner.
Affiliation(s)
- Ramzan Umarov
- Graduate School of Integrated Sciences for Life, Hiroshima University, Higashi-Hiroshima, Japan
| | - Yu Li
- Department of Computer Science and Engineering (CSE), The Chinese University of Hong Kong (CUHK), Hong Kong, People’s Republic of China
| | - Erik Arner
- Graduate School of Integrated Sciences for Life, Hiroshima University, Higashi-Hiroshima, Japan
- Laboratory for Applied Regulatory Genomics Network Analysis, RIKEN Center for Integrative Medical Sciences, Yokohama, Kanagawa, Japan
| |
|
28
|
Calprotectin-Mediated Zinc Chelation Inhibits Pseudomonas aeruginosa Protease Activity in Cystic Fibrosis Sputum. J Bacteriol 2021; 203:e0010021. [PMID: 33927050 DOI: 10.1128/jb.00100-21] [Citation(s) in RCA: 15] [Impact Index Per Article: 3.8] [Indexed: 12/13/2022]
Abstract
Pseudomonas aeruginosa induces pathways indicative of low zinc availability in the cystic fibrosis (CF) lung environment. To learn more about P. aeruginosa zinc access in CF, we grew P. aeruginosa strain PAO1 directly in expectorated CF sputum. The P. aeruginosa Zur transcriptional repressor controls the response to low intracellular zinc, and we used the NanoString methodology to monitor levels of Zur-regulated transcripts, including those encoding a zincophore system, a zinc importer, and paralogs of zinc containing proteins that do not require zinc for activity. Zur-controlled transcripts were induced in sputum-grown P. aeruginosa compared to those grown in control cultures but not if the sputum was amended with zinc. Amendment of sputum with ferrous iron did not reduce expression of Zur-regulated genes. A reporter fusion to a Zur-regulated promoter had variable activity in P. aeruginosa grown in sputum from different donors, and this variation inversely correlated with sputum zinc concentrations. Recombinant human calprotectin (CP), a divalent-metal binding protein released by neutrophils, was sufficient to induce a zinc starvation response in P. aeruginosa grown in laboratory medium or zinc-amended CF sputum, indicating that CP is functional in the sputum environment. Zinc metalloproteases comprise a large fraction of secreted zinc-binding P. aeruginosa proteins. Here, we show that recombinant CP inhibited both LasB-mediated casein degradation and LasA-mediated lysis of Staphylococcus aureus, which was reversible with added zinc. These studies reveal the potential for CP-mediated zinc chelation to posttranslationally inhibit zinc metalloprotease activity and thereby affect the protease-dependent physiology and/or virulence of P. aeruginosa in the CF lung environment.
IMPORTANCE The factors that contribute to worse outcomes in individuals with cystic fibrosis (CF) with chronic Pseudomonas aeruginosa infections are not well understood. Therefore, there is a need to understand environmental factors within the CF airway that contribute to P. aeruginosa colonization and infection. We demonstrate that growing bacteria in CF sputum induces a zinc starvation response that inversely correlates with sputum zinc levels. Additionally, both calprotectin and a chemical zinc chelator inhibit the proteolytic activities of LasA and LasB proteases, suggesting that extracellular zinc chelators can influence proteolytic activity and thus P. aeruginosa virulence and nutrient acquisition in vivo.
|
29
|
Molina Mora JA, Montero-Manso P, García-Batán R, Campos-Sánchez R, Vilar-Fernández J, García F. A first perturbome of Pseudomonas aeruginosa: Identification of core genes related to multiple perturbations by a machine learning approach. Biosystems 2021; 205:104411. [PMID: 33757842 DOI: 10.1016/j.biosystems.2021.104411] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Received: 07/24/2020] [Revised: 03/11/2021] [Accepted: 03/12/2021] [Indexed: 01/27/2023]
Abstract
Tolerance to stress conditions is vital for organismal survival, including that of bacteria exposed to specific environmental conditions, antibiotics, and other perturbations. Some studies have described common modulation and shared genes during the stress response to different types of disturbances (termed the perturbome), leading to the idea of central control at the molecular level. We implemented a robust machine learning approach to identify and describe genes associated with multiple perturbations, or the perturbome, in a Pseudomonas aeruginosa PAO1 model. Using microarray datasets from the Gene Expression Omnibus (GEO), we evaluated six approaches to rank and select genes: using two methodologies for building training and testing datasets, a single data partition (SP method) or multiple partitions (MP method), we evaluated three classification algorithms (support vector machine, SVM; k-nearest neighbors, KNN; and random forest, RF). Gene expression patterns and topological features at the systems level were included to describe the perturbome elements. We were able to select and describe 46 core response genes associated with multiple perturbations in P. aeruginosa PAO1, which can be considered a first report of the P. aeruginosa perturbome. Molecular annotations, patterns in expression levels, and topological features in molecular networks revealed biological functions of biosynthesis, binding, and metabolism, many of them related to DNA damage repair and aerobic respiration in the context of tolerance to stress. We also discuss different issues related to the implemented and assessed algorithms, including data partitioning, classification approaches, and metrics. Altogether, this work offers a different and robust framework to select genes using a machine learning approach.
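The contrast between a single train/test partition and multiple partitions mentioned in this abstract can be illustrated with a minimal sketch. This is not the authors' pipeline: the function names, the plain k-nearest-neighbor vote, the 30% test fraction, and the 20 repeats are all assumptions chosen for illustration, and gene ranking itself is not shown.

```python
import random


def knn_predict(train, query, k=3):
    """Majority vote among the k training samples closest to `query`."""
    nearest = sorted(
        train,
        key=lambda item: sum((a - b) ** 2 for a, b in zip(item[0], query)),
    )[:k]
    votes = {}
    for _, label in nearest:
        votes[label] = votes.get(label, 0) + 1
    return max(votes, key=votes.get)


def split_accuracy(samples, test_fraction, rng, k=3):
    """Shuffle, hold out a test fraction, and score KNN on the held-out part."""
    shuffled = samples[:]
    rng.shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    test, train = shuffled[:n_test], shuffled[n_test:]
    hits = sum(knn_predict(train, x, k) == y for x, y in test)
    return hits / n_test


def single_partition(samples, test_fraction=0.3, seed=0):
    """SP-style estimate (hypothetical): one random train/test split."""
    return split_accuracy(samples, test_fraction, random.Random(seed))


def multiple_partitions(samples, n_repeats=20, test_fraction=0.3, seed=0):
    """MP-style estimate (hypothetical): average over repeated random splits."""
    rng = random.Random(seed)
    scores = [split_accuracy(samples, test_fraction, rng) for _ in range(n_repeats)]
    return sum(scores) / len(scores), scores
```

Each sample is a `(features, label)` pair. Averaging over repeated splits mainly reduces the dependence of the estimate on one particular partition, which is the usual motivation for comparing the two schemes.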
Affiliation(s)
- Jose Arturo Molina Mora
- Centro de Investigacion en Enfermedades Tropicales (CIET) and Facultad de Microbiología, Universidad de Costa Rica, San Jose, Costa Rica.
| | | | - Raquel García-Batán
- Centro de Investigacion en Enfermedades Tropicales (CIET) and Facultad de Microbiología, Universidad de Costa Rica, San Jose, Costa Rica.
| | - Rebeca Campos-Sánchez
- Centro de Investigación en Biología Celular y Molecular (CIBCM), Universidad de Costa Rica, San José, Costa Rica.
| | | | - Fernando García
- Centro de Investigacion en Enfermedades Tropicales (CIET) and Facultad de Microbiología, Universidad de Costa Rica, San Jose, Costa Rica.
| |
|
30
|
Wang J, Xie X, Shi J, He W, Chen Q, Chen L, Gu W, Zhou T. Denoising Autoencoder, A Deep Learning Algorithm, Aids the Identification of A Novel Molecular Signature of Lung Adenocarcinoma. Genomics Proteomics & Bioinformatics 2020; 18:468-480. [PMID: 33346087 PMCID: PMC8242334 DOI: 10.1016/j.gpb.2019.02.003] [Citation(s) in RCA: 12] [Impact Index Per Article: 2.4] [Received: 09/17/2018] [Revised: 01/11/2019] [Accepted: 03/01/2019] [Indexed: 02/06/2023]
Abstract
Precise biomarker development is a key step in disease management. However, most published biomarkers were derived from a relatively small number of samples with supervised approaches. Recent advances in unsupervised machine learning promise to leverage very large datasets for making better predictions of disease biomarkers. The denoising autoencoder (DA) is an unsupervised deep learning algorithm and a stochastic version of the autoencoder. The principle of the DA is to force the hidden layer of the autoencoder to capture more robust features by reconstructing a clean input from a corrupted one. Here, a DA model was applied to analyze integrated transcriptomic data from 13 published lung cancer studies, which consisted of 1916 human lung tissue samples. Using the DA, we discovered a molecular signature composed of multiple genes for lung adenocarcinoma (ADC). In independent validation cohorts, the proposed molecular signature proved to be an effective classifier for lung cancer histological subtypes. This signature also successfully predicts clinical outcome in lung ADC, independently of traditional prognostic factors. More importantly, this signature exhibits superior prognostic power compared with other published prognostic genes. Our study suggests that unsupervised learning is helpful for biomarker development in the era of precision medicine.
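The principle stated in this abstract, reconstructing a clean input from a corrupted one, can be sketched as follows. This is a toy illustration under stated assumptions rather than the model used in the study: masking noise, a single tanh hidden layer, a linear decoder, per-sample gradient descent, and the synthetic data generator are all hypothetical choices, as are the function names.

```python
import math
import random


def corrupt(x, drop_prob, rng):
    """Masking noise: set each feature to zero with probability drop_prob."""
    return [0.0 if rng.random() < drop_prob else v for v in x]


def encode(x, w_enc, b_enc):
    """Hidden representation: tanh of an affine map of the input."""
    return [math.tanh(sum(w * v for w, v in zip(row, x)) + b)
            for row, b in zip(w_enc, b_enc)]


def decode(h, w_dec, b_dec):
    """Linear reconstruction of the input from the hidden representation."""
    return [sum(w * v for w, v in zip(row, h)) + b
            for row, b in zip(w_dec, b_dec)]


def train_denoising_autoencoder(data, n_hidden=3, drop_prob=0.2,
                                lr=0.05, epochs=150, seed=0):
    """Per-sample gradient descent; returns (params, per-epoch mean loss)."""
    rng = random.Random(seed)
    n_features = len(data[0])
    w_enc = [[rng.gauss(0, 0.1) for _ in range(n_features)] for _ in range(n_hidden)]
    b_enc = [0.0] * n_hidden
    w_dec = [[rng.gauss(0, 0.1) for _ in range(n_hidden)] for _ in range(n_features)]
    b_dec = [0.0] * n_features
    history = []
    for _ in range(epochs):
        total = 0.0
        for x in data:
            x_noisy = corrupt(x, drop_prob, rng)
            h = encode(x_noisy, w_enc, b_enc)
            x_hat = decode(h, w_dec, b_dec)
            # The error is measured against the CLEAN input, not the noisy one.
            err = [p - t for p, t in zip(x_hat, x)]
            total += sum(e * e for e in err) / n_features
            # Backpropagate 0.5 * sum(err^2) through the decoder, then tanh.
            grad_h = [sum(err[j] * w_dec[j][k] for j in range(n_features))
                      * (1.0 - h[k] ** 2) for k in range(n_hidden)]
            for j in range(n_features):
                for k in range(n_hidden):
                    w_dec[j][k] -= lr * err[j] * h[k]
                b_dec[j] -= lr * err[j]
            for k in range(n_hidden):
                for i in range(n_features):
                    w_enc[k][i] -= lr * grad_h[k] * x_noisy[i]
                b_enc[k] -= lr * grad_h[k]
        history.append(total / len(data))
    return (w_enc, b_enc, w_dec, b_dec), history


def make_toy_data(n_samples=60, seed=1):
    """Synthetic 6-feature samples driven by 2 hidden factors (illustrative only)."""
    rng = random.Random(seed)
    load_a = [1.0, 0.8, -0.6, 0.2, 0.0, -0.9]
    load_b = [0.0, 0.5, 0.7, -1.0, 0.9, 0.3]
    data = []
    for _ in range(n_samples):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        data.append([a * z1 + b * z2 + rng.gauss(0, 0.05)
                     for a, b in zip(load_a, load_b)])
    return data
```

Calling `train_denoising_autoencoder(make_toy_data())` returns the learned parameters and the per-epoch reconstruction loss; `encode` applied to an uncorrupted sample then gives the hidden features. Because the target is always the clean input, the hidden layer cannot simply copy individual features and has to rely on correlations between them.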
Affiliation(s)
- Jun Wang
- Department of Thoracic Surgery, Jiangsu Province People's Hospital and the First Affiliated Hospital of Nanjing Medical University, Nanjing 210029, China
| | - Xueying Xie
- State Key Laboratory of Bioelectronics, School of Biological Sciences and Medical Engineering, Southeast University, Nanjing 210096, China
| | - Junchao Shi
- Department of Physiology and Cell Biology, University of Nevada, Reno School of Medicine, Reno, NV 89557, USA
| | - Wenjun He
- State Key Lab of Respiratory Disease, Guangzhou Medical University, Guangzhou 510000, China
| | - Qi Chen
- Department of Physiology and Cell Biology, University of Nevada, Reno School of Medicine, Reno, NV 89557, USA
| | - Liang Chen
- Department of Thoracic Surgery, Jiangsu Province People's Hospital and the First Affiliated Hospital of Nanjing Medical University, Nanjing 210029, China.
| | - Wanjun Gu
- State Key Laboratory of Bioelectronics, School of Biological Sciences and Medical Engineering, Southeast University, Nanjing 210096, China.
| | - Tong Zhou
- Department of Physiology and Cell Biology, University of Nevada, Reno School of Medicine, Reno, NV 89557, USA.
| |
|
31
|
Simon LM, Yan F, Zhao Z. DrivAER: Identification of driving transcriptional programs in single-cell RNA sequencing data. Gigascience 2020; 9:giaa122. [PMID: 33301553 PMCID: PMC7727875 DOI: 10.1093/gigascience/giaa122] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.0] [Received: 02/05/2020] [Revised: 05/27/2020] [Accepted: 10/07/2020] [Indexed: 11/12/2022]
Abstract
BACKGROUND Single-cell RNA sequencing (scRNA-seq) unfolds complex transcriptomic datasets into detailed cellular maps. Despite recent success, there is a pressing need for specialized methods tailored towards the functional interpretation of these cellular maps. FINDINGS Here, we present DrivAER, a machine learning approach for the identification of driving transcriptional programs using autoencoder-based relevance scores. DrivAER scores annotated gene sets on the basis of their relevance to user-specified outcomes such as pseudotemporal ordering or disease status. DrivAER iteratively evaluates the information content of each gene set with respect to the outcome variable using autoencoders. We benchmark our method using extensive simulation analysis as well as comparison to existing methods for functional interpretation of scRNA-seq data. Furthermore, we demonstrate that DrivAER extracts key pathways and transcription factors that regulate complex biological processes from scRNA-seq data. CONCLUSIONS By quantifying the relevance of annotated gene sets with respect to specified outcome variables, DrivAER greatly enhances our ability to understand the underlying molecular mechanisms.
Affiliation(s)
- Lukas M Simon
- Center for Precision Health, School of Biomedical Informatics, The University of Texas Health Science Center at Houston, 7000 Fannin St, Houston, TX 77030, USA
| | - Fangfang Yan
- Center for Precision Health, School of Biomedical Informatics, The University of Texas Health Science Center at Houston, 7000 Fannin St, Houston, TX 77030, USA
| | - Zhongming Zhao
- Center for Precision Health, School of Biomedical Informatics, The University of Texas Health Science Center at Houston, 7000 Fannin St, Houston, TX 77030, USA
- Human Genetics Center, School of Public Health, The University of Texas Health Science Center at Houston, 7000 Fannin St, Houston, TX 77030, USA
- MD Anderson Cancer Center UTHealth Graduate School of Biomedical Sciences, 6767 Bertner Ave, Houston, TX 77030, USA
- Department of Biomedical Informatics, Vanderbilt University Medical Center, 2525 West End, Nashville, TN 37203, USA
| |
|
32
|
Lee AJ, Park Y, Doing G, Hogan DA, Greene CS. Correcting for experiment-specific variability in expression compendia can remove underlying signals. Gigascience 2020; 9:giaa117. [PMID: 33140829 PMCID: PMC7607552 DOI: 10.1093/gigascience/giaa117] [Citation(s) in RCA: 12] [Impact Index Per Article: 2.4] [Received: 05/06/2020] [Revised: 08/28/2020] [Accepted: 09/29/2020] [Indexed: 11/14/2022]
Abstract
MOTIVATION In the past two decades, scientists in different laboratories have assayed gene expression from millions of samples. These experiments can be combined into compendia and analyzed collectively to extract novel biological patterns. Technical variability, or "batch effects," may result from combining samples collected and processed at different times and in different settings. Such variability may distort our ability to extract true underlying biological patterns. As more integrative analysis methods arise and data collections get bigger, we must determine how technical variability affects our ability to detect desired patterns when many experiments are combined. OBJECTIVE We sought to determine the extent to which an underlying signal was masked by technical variability by simulating compendia comprising data aggregated across multiple experiments. METHOD We developed a generative multi-layer neural network to simulate compendia of gene expression experiments from large-scale microbial and human datasets. We compared simulated compendia before and after introducing varying numbers of sources of undesired variability. RESULTS The signal from a baseline compendium was obscured when the number of added sources of variability was small. Applying statistical correction methods rescued the underlying signal in these cases. However, as the number of sources of variability increased, it became easier to detect the original signal even without correction. In fact, statistical correction reduced our power to detect the underlying signal. CONCLUSION When combining a modest number of experiments, it is best to correct for experiment-specific noise. However, when many experiments are combined, statistical correction reduces our ability to extract underlying patterns.
Affiliation(s)
- Alexandra J Lee
- Genomics and Computational Biology Graduate Program, University of Pennsylvania, 3400 Civic Center Blvd, Philadelphia, PA, 19104, USA
- Department of Systems Pharmacology and Translational Therapeutics, University of Pennsylvania, 3400 Civic Center Blvd, Philadelphia, PA, 19104, USA
| | - YoSon Park
- Department of Systems Pharmacology and Translational Therapeutics, University of Pennsylvania, 3400 Civic Center Blvd, Philadelphia, PA, 19104, USA
| | - Georgia Doing
- Department of Microbiology and Immunology, Geisel School of Medicine, Dartmouth, 1 Rope Ferry Rd, Hanover, NH, 03755, USA
| | - Deborah A Hogan
- Department of Microbiology and Immunology, Geisel School of Medicine, Dartmouth, 1 Rope Ferry Rd, Hanover, NH, 03755, USA
| | - Casey S Greene
- Department of Systems Pharmacology and Translational Therapeutics, University of Pennsylvania, 3400 Civic Center Blvd, Philadelphia, PA, 19104, USA
- Childhood Cancer Data Lab, Alex's Lemonade Stand Foundation, 1429 Walnut St, Floor 10, Philadelphia, PA, 19102, USA
| |
|
33
|
Byrd JB, Greene AC, Prasad DV, Jiang X, Greene CS. Responsible, practical genomic data sharing that accelerates research. Nat Rev Genet 2020; 21:615-629. [PMID: 32694666 PMCID: PMC7974070 DOI: 10.1038/s41576-020-0257-5] [Citation(s) in RCA: 65] [Impact Index Per Article: 13.0] [Accepted: 06/08/2020] [Indexed: 12/13/2022]
Abstract
Data sharing anchors reproducible science, but expectations and best practices are often nebulous. Communities of funders, researchers and publishers continue to grapple with what should be required or encouraged. To illuminate the rationales for sharing data, the technical challenges and the social and cultural challenges, we consider the stakeholders in the scientific enterprise. In biomedical research, participants are key among those stakeholders. Ethical sharing requires considering both the value of research efforts and the privacy costs for participants. We discuss current best practices for various types of genomic data, as well as opportunities to promote ethical data sharing that accelerates science by aligning incentives.
Affiliation(s)
- James Brian Byrd
- Department of Internal Medicine, Medical School, University of Michigan, Ann Arbor, MI, USA
| | - Anna C Greene
- Alex's Lemonade Stand Foundation, Bala Cynwyd, PA, USA
| | | | - Xiaoqian Jiang
- School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX, USA
| | - Casey S Greene
- Childhood Cancer Data Lab, Alex's Lemonade Stand Foundation, Philadelphia, PA, USA.
- Department of Systems Pharmacology and Translational Therapeutics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA.
| |
|
34
|
Makrodimitris S, Reinders MJT, van Ham RCHJ. Metric learning on expression data for gene function prediction. Bioinformatics 2020; 36:1182-1190. [PMID: 31562759 PMCID: PMC7703756 DOI: 10.1093/bioinformatics/btz731] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Received: 06/04/2019] [Revised: 08/31/2019] [Accepted: 09/25/2019] [Indexed: 01/02/2023] Open
Abstract
MOTIVATION Co-expression of two genes across different conditions is indicative of their involvement in the same biological process. However, when using RNA-Seq datasets with many experimental conditions from diverse sources, only a subset of the experimental conditions is expected to be relevant for finding genes related to a particular Gene Ontology (GO) term. Therefore, we hypothesize that when the purpose is to find similarly functioning genes, the co-expression of genes should not be determined on all samples but only on those samples informative for the GO term of interest. RESULTS To address this, we developed Metric Learning for Co-expression (MLC), a fast algorithm that assigns a GO-term-specific weight to each expression sample. The goal is to obtain a weighted co-expression measure that is more suitable than the unweighted Pearson correlation for applying Guilt-By-Association-based function predictions. More specifically, if two genes are annotated with a given GO term, MLC tries to maximize their weighted co-expression and, in addition, if one of them is not annotated with that term, the weighted co-expression is minimized. Our experiments on publicly available Arabidopsis thaliana RNA-Seq data demonstrate that MLC outperforms standard Pearson correlation in term-centric performance. Moreover, our method is particularly good at more specific terms, which are the most interesting. Finally, by observing the sample weights for a particular GO term, one can identify which experiments are important for learning that term and potentially identify novel conditions that are relevant, as demonstrated by experiments in both A. thaliana and Pseudomonas aeruginosa. AVAILABILITY AND IMPLEMENTATION MLC is available as a Python package at www.github.com/stamakro/MLC. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
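The idea of a sample-weighted co-expression measure described in the abstract above can be illustrated with a standard weighted Pearson correlation. This is only a minimal sketch: the function name `weighted_pearson`, the toy expression values, and the hand-set weights are all assumptions made for illustration. It does not reproduce the MLC package, its exact formulation, or the way MLC learns its weights from GO annotations.

```python
import numpy as np


def weighted_pearson(x, y, w):
    """Weighted Pearson correlation of two expression profiles.

    x, y: expression of two genes across the same samples.
    w:    non-negative per-sample weights (hypothetical here; in the
          abstract they would be learned per GO term).
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    w = np.asarray(w, dtype=float)
    w = w / w.sum()                      # normalise weights to sum to 1
    mx = np.sum(w * x)                   # weighted means
    my = np.sum(w * y)
    cov = np.sum(w * (x - mx) * (y - my))
    var_x = np.sum(w * (x - mx) ** 2)
    var_y = np.sum(w * (y - my) ** 2)
    return cov / np.sqrt(var_x * var_y)


# Toy data: the two genes track each other only in the first three samples.
gene_a = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
gene_b = [2.0, 4.0, 6.0, 1.0, 0.0, 5.0]

uniform = np.ones(6)                                # ordinary Pearson
focused = np.array([1.0, 1.0, 1.0, 0.0, 0.0, 0.0])  # only informative samples

r_uniform = weighted_pearson(gene_a, gene_b, uniform)
r_focused = weighted_pearson(gene_a, gene_b, focused)
print(r_uniform, r_focused)
```

With uniform weights the function reduces to the ordinary Pearson correlation. Putting all the weight on the first three samples recovers the perfect co-expression that the remaining samples dilute, which is the intuition behind restricting co-expression to informative samples.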
Affiliation(s)
- Stavros Makrodimitris
- Delft Bioinformatics Lab, Delft University of Technology, Delft 2628 XE, The Netherlands; Keygene N.V., Wageningen 6708 PW, The Netherlands
| | - Marcel J T Reinders
- Delft Bioinformatics Lab, Delft University of Technology, Delft 2628 XE, The Netherlands; Leiden Computational Biology Center, Leiden University Medical Center, Leiden 2333 ZC, The Netherlands
| | - Roeland C H J van Ham
- Delft Bioinformatics Lab, Delft University of Technology, Delft 2628 XE, The Netherlands; Keygene N.V., Wageningen 6708 PW, The Netherlands
| |
|
35
|
Conditional antagonism in co-cultures of Pseudomonas aeruginosa and Candida albicans: An intersection of ethanol and phosphate signaling distilled from dual-seq transcriptomics. PLoS Genet 2020; 16:e1008783. [PMID: 32813693 PMCID: PMC7480860 DOI: 10.1371/journal.pgen.1008783] [Citation(s) in RCA: 25] [Impact Index Per Article: 5.0] [Received: 04/13/2020] [Revised: 09/09/2020] [Accepted: 06/20/2020] [Indexed: 12/12/2022] Open
Abstract
Pseudomonas aeruginosa and Candida albicans are opportunistic pathogens whose interactions involve the secreted products ethanol and phenazines. Here, we describe the role of ethanol in mixed-species co-cultures by dual-seq analyses. P. aeruginosa and C. albicans transcriptomes were assessed after growth in mono-culture or co-culture with either ethanol-producing C. albicans or a C. albicans mutant lacking the primary ethanol dehydrogenase, Adh1. Analysis of the RNA-Seq data using KEGG pathway enrichment and eADAGE methods revealed several P. aeruginosa responses to C. albicans-produced ethanol including the induction of a non-canonical low-phosphate response regulated by PhoB. C. albicans wild type, but not C. albicans adh1Δ/Δ, induces P. aeruginosa production of 5-methyl-phenazine-1-carboxylic acid (5-MPCA), which forms a red derivative within fungal cells and exhibits antifungal activity. Here, we show that C. albicans adh1Δ/Δ no longer activates P. aeruginosa PhoB and PhoB-regulated phosphatase activity, that exogenous ethanol complements this defect, and that ethanol is sufficient to activate PhoB in single-species P. aeruginosa cultures at permissive phosphate levels. The intersection of ethanol and phosphate in co-culture is inversely reflected in C. albicans; C. albicans adh1Δ/Δ had increased expression of genes regulated by Pho4, the C. albicans transcription factor that responds to low phosphate, and Pho4-dependent phosphatase activity. Together, these results show that C. albicans-produced ethanol stimulates P. aeruginosa PhoB activity and 5-MPCA-mediated antagonism, and that both responses are dependent on local phosphate concentrations. Further, our data suggest that phosphate scavenging by one species improves phosphate access for the other, thus highlighting the complex dynamics at play in microbial communities. Pseudomonas aeruginosa and Candida albicans are opportunistic pathogens that are frequently isolated from co-infections. 
Using a combination of dual-seq transcriptomics and genetics approaches, we found that ethanol produced by C. albicans stimulates the PhoB regulon in P. aeruginosa asynchronously with activation of the Pho4 regulon in C. albicans. We validated our result by showing that PhoB plays multiple roles in co-culture including orchestrating the competition for phosphate and the production of 5-methyl-phenazine-1-carboxylic acid; the P. aeruginosa phenazine response to C. albicans-produced ethanol depends on phosphate availability. The conditional stimulation of antifungal production in response to sub-inhibitory concentrations of ethanol only under phosphate limitation highlights the importance of considering nutrient concentrations in the analysis of co-culture interactions and suggests that the low-phosphate response in one species influences phosphate availability for the other.
|
36
|
|
37
|
Fortelny N, Bock C. Knowledge-primed neural networks enable biologically interpretable deep learning on single-cell sequencing data. Genome Biol 2020; 21:190. [PMID: 32746932 PMCID: PMC7397672 DOI: 10.1186/s13059-020-02100-5] [Citation(s) in RCA: 61] [Impact Index Per Article: 12.2] [Received: 06/15/2020] [Accepted: 07/10/2020] [Indexed: 12/23/2022] Open
Abstract
BACKGROUND Deep learning has emerged as a versatile approach for predicting complex biological phenomena. However, its utility for biological discovery has so far been limited, given that generic deep neural networks provide little insight into the biological mechanisms that underlie a successful prediction. Here we demonstrate deep learning on biological networks, where every node has a molecular equivalent, such as a protein or gene, and every edge has a mechanistic interpretation, such as a regulatory interaction along a signaling pathway. RESULTS With knowledge-primed neural networks (KPNNs), we exploit the ability of deep learning algorithms to assign meaningful weights in multi-layered networks, resulting in a widely applicable approach for interpretable deep learning. We present a learning method that enhances the interpretability of trained KPNNs by stabilizing node weights in the presence of redundancy, enhancing the quantitative interpretability of node weights, and controlling for uneven connectivity in biological networks. We validate KPNNs on simulated data with known ground truth and demonstrate their practical use and utility in five biological applications with single-cell RNA-seq data for cancer and immune cells. CONCLUSIONS We introduce KPNNs as a method that combines the predictive power of deep learning with the interpretability of biological networks. While demonstrated here on single-cell sequencing data, this method is broadly relevant to other research areas where prior domain knowledge can be represented as networks.
Affiliation(s)
- Nikolaus Fortelny
- CeMM Research Center for Molecular Medicine of the Austrian Academy of Sciences, Vienna, Austria
| | - Christoph Bock
- CeMM Research Center for Molecular Medicine of the Austrian Academy of Sciences, Vienna, Austria.
- Department of Laboratory Medicine, Medical University of Vienna, Vienna, Austria.
| |
|
38
|
Current Knowledge and Future Directions in Developing Strategies to Combat Pseudomonas aeruginosa Infection. J Mol Biol 2020; 432:5509-5528. [PMID: 32750389 DOI: 10.1016/j.jmb.2020.07.021] [Citation(s) in RCA: 26] [Impact Index Per Article: 5.2] [Received: 04/08/2020] [Revised: 07/17/2020] [Accepted: 07/22/2020] [Indexed: 12/21/2022]
Abstract
In the face of growing antimicrobial resistance, there is an urgent need for the development of effective strategies to target Pseudomonas aeruginosa. This metabolically versatile bacterium can cause a wide range of severe opportunistic infections in patients with serious underlying medical conditions, such as those with burns, surgical wounds or people with cystic fibrosis. Many of the key adaptations that arise in this organism during infection are centered on core metabolism and virulence factor synthesis. Interfering with these processes may provide a new strategy to combat infection which could be combined with conventional antibiotics. This review will provide an overview of the most recent work that has advanced our understanding of P. aeruginosa infection. Strategies that exploit this recent knowledge to combat infection will be highlighted alongside potential alternative therapeutic options and their limitations.
|
39
|
Koumakis L. Deep learning models in genomics; are we there yet? Comput Struct Biotechnol J 2020; 18:1466-1473. [PMID: 32637044 PMCID: PMC7327302 DOI: 10.1016/j.csbj.2020.06.017] [Citation(s) in RCA: 69] [Impact Index Per Article: 13.8] [Received: 03/10/2020] [Revised: 06/07/2020] [Accepted: 06/08/2020] [Indexed: 12/23/2022] Open
Abstract
With the evolution of biotechnology and the introduction of the high throughput sequencing, researchers have the ability to produce and analyze vast amounts of genomics data. Since genomics produce big data, most of the bioinformatics algorithms are based on machine learning methodologies, and lately deep learning, to identify patterns, make predictions and model the progression or treatment of a disease. Advances in deep learning created an unprecedented momentum in biomedical informatics and have given rise to new bioinformatics and computational biology research areas. It is evident that deep learning models can provide higher accuracies in specific tasks of genomics than the state of the art methodologies. Given the growing trend on the application of deep learning architectures in genomics research, in this mini review we outline the most prominent models, we highlight possible pitfalls and discuss future directions. We foresee deep learning accelerating changes in the area of genomics, especially for multi-scale and multimodal data analysis for precision medicine.
Affiliation(s)
- Lefteris Koumakis
- Foundation for Research and Technology - Hellas (FORTH), Institute of Computer Science, Heraklion, Crete, Greece
| |
|
40
|
Way GP, Zietz M, Rubinetti V, Himmelstein DS, Greene CS. Compressing gene expression data using multiple latent space dimensionalities learns complementary biological representations. Genome Biol 2020; 21:109. [PMID: 32393369 PMCID: PMC7212571 DOI: 10.1186/s13059-020-02021-3] [Citation(s) in RCA: 37] [Impact Index Per Article: 7.4] [Received: 05/07/2019] [Accepted: 04/16/2020] [Indexed: 12/27/2022] Open
Abstract
BACKGROUND Unsupervised compression algorithms applied to gene expression data extract latent or hidden signals representing technical and biological sources of variation. However, these algorithms require a user to select a biologically appropriate latent space dimensionality. In practice, most researchers fit a single algorithm and latent dimensionality. We sought to determine the extent by which selecting only one fit limits the biological features captured in the latent representations and, consequently, limits what can be discovered with subsequent analyses. RESULTS We compress gene expression data from three large datasets consisting of adult normal tissue, adult cancer tissue, and pediatric cancer tissue. We train many different models across a large range of latent space dimensionalities and observe various performance differences. We identify more curated pathway gene sets significantly associated with individual dimensions in denoising autoencoder and variational autoencoder models trained using an intermediate number of latent dimensionalities. Combining compressed features across algorithms and dimensionalities captures the most pathway-associated representations. When trained with different latent dimensionalities, models learn strongly associated and generalizable biological representations including sex, neuroblastoma MYCN amplification, and cell types. Stronger signals, such as tumor type, are best captured in models trained at lower dimensionalities, while more subtle signals such as pathway activity are best identified in models trained with more latent dimensionalities. CONCLUSIONS There is no single best latent dimensionality or compression algorithm for analyzing gene expression data. Instead, using features derived from different compression models across multiple latent space dimensionalities enhances biological representations.
Affiliation(s)
- Gregory P Way
- Genomics and Computational Biology Graduate Group, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, 19104, USA
- Department of Systems Pharmacology and Translational Therapeutics, University of Pennsylvania, 10-131 SCTR 34th and Civic Center Blvd, Philadelphia, PA, 19104, USA
- Imaging Platform, Broad Institute of MIT and Harvard, Cambridge, MA, 02142, USA
| | - Michael Zietz
- Department of Systems Pharmacology and Translational Therapeutics, University of Pennsylvania, 10-131 SCTR 34th and Civic Center Blvd, Philadelphia, PA, 19104, USA
| | - Vincent Rubinetti
- Department of Systems Pharmacology and Translational Therapeutics, University of Pennsylvania, 10-131 SCTR 34th and Civic Center Blvd, Philadelphia, PA, 19104, USA
| | - Daniel S Himmelstein
- Department of Systems Pharmacology and Translational Therapeutics, University of Pennsylvania, 10-131 SCTR 34th and Civic Center Blvd, Philadelphia, PA, 19104, USA
| | - Casey S Greene
- Department of Systems Pharmacology and Translational Therapeutics, University of Pennsylvania, 10-131 SCTR 34th and Civic Center Blvd, Philadelphia, PA, 19104, USA.
- Childhood Cancer Data Lab, Alex's Lemonade Stand Foundation, Philadelphia, PA, 19102, USA.
| |
|
41
|
Li Y, Huang H, Chen H, Liu T. Deep Neural Networks for In Situ Hybridization Grid Completion and Clustering. IEEE/ACM Trans Comput Biol Bioinform 2020; 17:536-546. [PMID: 30106689 PMCID: PMC7199204 DOI: 10.1109/tcbb.2018.2864262] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Indexed: 06/08/2023]
Abstract
Transcriptome in brain plays a crucial role in understanding the cortical organization and the development of brain structure and function. Two challenges, incomplete data and high dimensionality of transcriptome, remain unsolved. Here, we present a novel training scheme that successfully adapts the U-net architecture to the problem of volume recovery. By analogy to denoising autoencoder, we hide a portion of each training sample so that the network can learn to recover missing voxels from context. Then on the completed volumes, we show that Restricted Boltzmann Machines (RBMs) can be used to infer co-occurrences among voxels, providing foundations for dividing the cortex into discrete subregions. As we stack multiple RBMs to form a deep belief network (DBN), we progressively map the high-dimensional raw input into abstract representations and create a hierarchy of transcriptome architecture. A coarse to fine organization emerges from the network layers. This organization incidentally corresponds to the anatomical structures, suggesting a close link between structures and the genetic underpinnings. Thus, we demonstrate a new way of learning transcriptome-based hierarchical organization using RBM and DBN.
|
42
|
Pseudomonas aeruginosa lasR mutant fitness in microoxia is supported by an Anr-regulated oxygen-binding hemerythrin. Proc Natl Acad Sci U S A 2020; 117:3167-3173. [PMID: 31980538 DOI: 10.1073/pnas.1917576117] [Citation(s) in RCA: 42] [Impact Index Per Article: 8.4] [Indexed: 12/31/2022] Open
Abstract
Pseudomonas aeruginosa strains with loss-of-function mutations in the transcription factor LasR are frequently encountered in the clinic and the environment. Among the characteristics common to LasR-defective (LasR-) strains is increased activity of the transcription factor Anr, relative to their LasR+ counterparts, in low-oxygen conditions. One of the Anr-regulated genes found to be highly induced in LasR- strains was PA14_42860 (PA1673), which we named mhr for microoxic hemerythrin. Purified P. aeruginosa Mhr protein contained the predicted di-iron center and bound molecular oxygen with an apparent Kd of ∼1 µM. Both Anr and Mhr were necessary for fitness in lasR+ and lasR mutant strains in colony biofilms grown in microoxic conditions, and the effects were more striking in the lasR mutant. Among genes in the Anr regulon, mhr was most closely coregulated with the Anr-controlled high-affinity cytochrome c oxidase genes. In the absence of high-affinity cytochrome c oxidases, deletion of mhr no longer caused a fitness disadvantage, suggesting that Mhr works in concert with microoxic respiration. We demonstrate that Anr and Mhr contribute to LasR- strain fitness even in biofilms grown in normoxic conditions. Furthermore, metabolomics data indicate that, in a lasR mutant, expression of Anr-regulated mhr leads to differences in metabolism in cells grown on lysogeny broth or artificial sputum medium. We propose that increased Anr activity leads to higher levels of the oxygen-binding protein Mhr, which confers an advantage to lasR mutants in microoxic conditions.
|
43
|
Valli RXE, Lyng M, Kirkpatrick CL. There is no hiding if you Seq: recent breakthroughs in Pseudomonas aeruginosa research revealed by genomic and transcriptomic next-generation sequencing. J Med Microbiol 2020; 69:162-175. [PMID: 31935190 DOI: 10.1099/jmm.0.001135] [Citation(s) in RCA: 12] [Impact Index Per Article: 2.4] [Indexed: 01/19/2023] Open
Abstract
The advent of next-generation sequencing technology has revolutionized the field of prokaryotic genetics and genomics by allowing interrogation of entire genomes, transcriptomes and global transcription factor binding profiles. As more studies employing these techniques have been performed, the state of the art regarding prokaryotic gene regulation has developed from the level of individual genes to genetic regulatory networks and systems biology. When applied to bacterial pathogens, particularly valuable insights have been gained into the regulation of virulence-associated genes, their relative importance to bacterial survival in planktonic, biofilm or host infection scenarios, antimicrobial resistance and the molecular details of host-pathogen interactions. This review outlines some of the latest developments and applications of next-generation sequencing techniques that have used primarily Pseudomonas aeruginosa as a model system. We focus particularly on insights into Pseudomonas virulence and infection that have been gained from these approaches and the future directions in which this field could develop.
Affiliation(s)
- Richard X E Valli
- Department of Biochemistry and Molecular Biology, University of Southern Denmark, Campusvej 55, 5230 Odense M, Denmark
| | - Mark Lyng
- Department of Biochemistry and Molecular Biology, University of Southern Denmark, Campusvej 55, 5230 Odense M, Denmark
| | - Clare L Kirkpatrick
- Department of Biochemistry and Molecular Biology, University of Southern Denmark, Campusvej 55, 5230 Odense M, Denmark
| |
|
44
|
Zhou N, Jiang Y, Bergquist TR, Lee AJ, Kacsoh BZ, Crocker AW, Lewis KA, Georghiou G, Nguyen HN, Hamid MN, Davis L, Dogan T, Atalay V, Rifaioglu AS, Dalkıran A, Cetin Atalay R, Zhang C, Hurto RL, Freddolino PL, Zhang Y, Bhat P, Supek F, Fernández JM, Gemovic B, Perovic VR, Davidović RS, Sumonja N, Veljkovic N, Asgari E, Mofrad MRK, Profiti G, Savojardo C, Martelli PL, Casadio R, Boecker F, Schoof H, Kahanda I, Thurlby N, McHardy AC, Renaux A, Saidi R, Gough J, Freitas AA, Antczak M, Fabris F, Wass MN, Hou J, Cheng J, Wang Z, Romero AE, Paccanaro A, Yang H, Goldberg T, Zhao C, Holm L, Törönen P, Medlar AJ, Zosa E, Borukhov I, Novikov I, Wilkins A, Lichtarge O, Chi PH, Tseng WC, Linial M, Rose PW, Dessimoz C, Vidulin V, Dzeroski S, Sillitoe I, Das S, Lees JG, Jones DT, Wan C, Cozzetto D, Fa R, Torres M, Warwick Vesztrocy A, Rodriguez JM, Tress ML, Frasca M, Notaro M, Grossi G, Petrini A, Re M, Valentini G, Mesiti M, Roche DB, Reeb J, Ritchie DW, Aridhi S, Alborzi SZ, Devignes MD, Koo DCE, Bonneau R, Gligorijević V, Barot M, Fang H, Toppo S, Lavezzo E, Falda M, Berselli M, Tosatto SCE, Carraro M, Piovesan D, Ur Rehman H, Mao Q, Zhang S, Vucetic S, Black GS, Jo D, Suh E, Dayton JB, Larsen DJ, Omdahl AR, McGuffin LJ, Brackenridge DA, Babbitt PC, Yunes JM, Fontana P, Zhang F, Zhu S, You R, Zhang Z, Dai S, Yao S, Tian W, Cao R, Chandler C, Amezola M, Johnson D, Chang JM, Liao WH, Liu YW, Pascarelli S, Frank Y, Hoehndorf R, Kulmanov M, Boudellioua I, Politano G, Di Carlo S, Benso A, Hakala K, Ginter F, Mehryary F, Kaewphan S, Björne J, Moen H, Tolvanen MEE, Salakoski T, Kihara D, Jain A, Šmuc T, Altenhoff A, Ben-Hur A, Rost B, Brenner SE, Orengo CA, Jeffery CJ, Bosco G, Hogan DA, Martin MJ, O'Donovan C, Mooney SD, Greene CS, Radivojac P, Friedberg I. The CAFA challenge reports improved protein function prediction and new functional annotations for hundreds of genes through experimental screens. Genome Biol 2019; 20:244. [PMID: 31744546 PMCID: PMC6864930 DOI: 10.1186/s13059-019-1835-8] [Citation(s) in RCA: 219] [Impact Index Per Article: 36.5] [Received: 05/16/2019] [Accepted: 09/24/2019] [Indexed: 12/23/2022] Open
Abstract
BACKGROUND The Critical Assessment of Functional Annotation (CAFA) is an ongoing, global, community-driven effort to evaluate and improve the computational annotation of protein function. RESULTS Here, we report on the results of the third CAFA challenge, CAFA3, that featured an expanded analysis over the previous CAFA rounds, both in terms of volume of data analyzed and the types of analysis performed. In a novel and major new development, computational predictions and assessment goals drove some of the experimental assays, resulting in new functional annotations for more than 1000 genes. Specifically, we performed experimental whole-genome mutation screening in Candida albicans and Pseudomonas aeruginosa genomes, which provided us with genome-wide experimental data for genes associated with biofilm formation and motility. We further performed targeted assays on selected genes in Drosophila melanogaster, which we suspected of being involved in long-term memory. CONCLUSION We conclude that while predictions of the molecular function and biological process annotations have slightly improved over time, those of the cellular component have not. Term-centric prediction of experimental annotations remains equally challenging; although the performance of the top methods is significantly better than the expectations set by baseline methods in C. albicans and D. melanogaster, it leaves considerable room and need for improvement. Finally, we report that the CAFA community now involves a broad range of participants with expertise in bioinformatics, biological experimentation, biocuration, and bio-ontologies, working together to improve functional annotation, computational function prediction, and our ability to manage big data in the era of large experimental screens.
Affiliation(s)
- Naihui Zhou
- Veterinary Microbiology and Preventive Medicine, Iowa State University, Ames, IA, USA; Program in Bioinformatics and Computational Biology, Ames, IA, USA
| | - Yuxiang Jiang
- Indiana University Bloomington, Bloomington, Indiana, USA
| | - Timothy R Bergquist
- Department of Biomedical Informatics and Medical Education, University of Washington, Seattle, WA, USA
| | - Alexandra J Lee
- Department of Systems Pharmacology and Translational Therapeutics, University of Pennsylvania, Philadelphia, PA, USA
| | - Balint Z Kacsoh
- Geisel School of Medicine at Dartmouth, Hanover, NH, USA; Department of Molecular and Systems Biology, Hanover, NH, USA
| | - Alex W Crocker
- Department of Microbiology and Immunology, Geisel School of Medicine at Dartmouth, Hanover, NH, USA
| | - Kimberley A Lewis
- Department of Microbiology and Immunology, Geisel School of Medicine at Dartmouth, Hanover, NH, USA
| | - George Georghiou
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Hinxton, United Kingdom
| | - Huy N Nguyen
- Veterinary Microbiology and Preventive Medicine, Iowa State University, Ames, IA, USA; Program in Computer Science, Ames, IA, USA
| | - Md Nafiz Hamid
- Veterinary Microbiology and Preventive Medicine, Iowa State University, Ames, IA, USA; Program in Bioinformatics and Computational Biology, Ames, IA, USA
| | - Larry Davis
- Program in Bioinformatics and Computational Biology, Ames, IA, USA
| | - Tunca Dogan
- Department of Computer Engineering, Hacettepe University, Ankara, Turkey; European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Cambridge, UK
| | - Volkan Atalay
- Department of Computer Engineering, Middle East Technical University (METU), Ankara, Turkey
| | - Ahmet S Rifaioglu
- Department of Computer Engineering, Middle East Technical University (METU), Ankara, Turkey; Department of Computer Engineering, Iskenderun Technical University, Hatay, Turkey
| | - Alperen Dalkıran
- Department of Computer Engineering, Middle East Technical University (METU), Ankara, Turkey
| | - Rengul Cetin Atalay
- CanSyL, Graduate School of Informatics, Middle East Technical University, Ankara, Turkey
| | - Chengxin Zhang
- Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI, USA
| | - Rebecca L Hurto
- Department of Biological Chemistry, University of Michigan, Ann Arbor, MI, USA
| | - Peter L Freddolino
- Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI, USA; Department of Biological Chemistry, University of Michigan, Ann Arbor, MI, USA
| | - Yang Zhang
- Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI, USA; Department of Biological Chemistry, University of Michigan, Ann Arbor, MI, USA
| | | | - Fran Supek
- Institute for Research in Biomedicine (IRB Barcelona), Barcelona, Spain.,Institució Catalana de Recerca i Estudis Avançats (ICREA), Barcelona, Spain
| | - José M Fernández
- INB Coordination Unit, Life Sciences Department, Barcelona Supercomputing Center, Barcelona, Catalonia, Spain.,(former) INB GN2, Structural and Computational Biology Programme, Spanish National Cancer Research Centre, Barcelona, Catalonia, Spain
| | - Branislava Gemovic
- Laboratory for Bioinformatics and Computational Chemistry, Institute of Nuclear Sciences VINCA, University of Belgrade, Belgrade, Serbia
| | - Vladimir R Perovic
- Laboratory for Bioinformatics and Computational Chemistry, Institute of Nuclear Sciences VINCA, University of Belgrade, Belgrade, Serbia
| | - Radoslav S Davidović
- Laboratory for Bioinformatics and Computational Chemistry, Institute of Nuclear Sciences VINCA, University of Belgrade, Belgrade, Serbia
| | - Neven Sumonja
- Laboratory for Bioinformatics and Computational Chemistry, Institute of Nuclear Sciences VINCA, University of Belgrade, Belgrade, Serbia
| | - Nevena Veljkovic
- Laboratory for Bioinformatics and Computational Chemistry, Institute of Nuclear Sciences VINCA, University of Belgrade, Belgrade, Serbia
| | - Ehsaneddin Asgari
- Molecular Cell Biomechanics Laboratory, Departments of Bioengineering, University of California Berkeley, Berkeley, CA, USA.,Computational Biology of Infection Research, Helmholtz Centre for Infection Research, Berkeley, CA, USA
| | | | - Giuseppe Profiti
- Bologna Biocomputing Group, Department of Pharmacy and Biotechnology, University of Bologna, Bologna, Italy.,National Research Council, IBIOM, Bologna, Italy
| | - Castrense Savojardo
- Bologna Biocomputing Group, Department of Pharmacy and Biotechnology, University of Bologna, Bologna, Italy
| | - Pier Luigi Martelli
- Bologna Biocomputing Group, Department of Pharmacy and Biotechnology, University of Bologna, Bologna, Italy
| | - Rita Casadio
- Bologna Biocomputing Group, Department of Pharmacy and Biotechnology, University of Bologna, Bologna, Italy
| | - Florian Boecker
- University of Bonn: INRES Crop Bioinformatics, Bonn, North Rhine-Westphalia, Germany
| | - Heiko Schoof
- INRES Crop Bioinformatics, University of Bonn, Bonn, Germany
| | - Indika Kahanda
- Gianforte School of Computing, Montana State University, Bozeman, Montana, USA
| | - Natalie Thurlby
- University of Bristol, Computer Science, Bristol, Bristol, United Kingdom
| | - Alice C McHardy
- Computational Biology of Infection Research, Helmholtz Centre for Infection Research, Brunswick, Germany.,RESIST, DFG Cluster of Excellence 2155, Brunswick, Germany
| | - Alexandre Renaux
- Interuniversity Institute of Bioinformatics in Brussels, Université libre de Bruxelles - Vrije Universiteit Brussel, Brussels, Belgium.,Machine Learning Group, Université libre de Bruxelles, Brussels, Belgium.,Artificial Intelligence lab, Vrije Universiteit Brussel, Brussels, Belgium
| | - Rabie Saidi
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Cambridge, UK
| | - Julian Gough
- MRC Laboratory of Molecular Biology, Cambridge, United Kingdom
| | - Alex A Freitas
- University of Kent, School of Computing, Canterbury, United Kingdom
| | - Magdalena Antczak
- School of Biosciences, University of Kent, Canterbury, Kent, United Kingdom
| | - Fabio Fabris
- University of Kent, School of Computing, Canterbury, United Kingdom
| | - Mark N Wass
- School of Biosciences, University of Kent, Canterbury, Kent, United Kingdom
| | - Jie Hou
- University of Missouri, Computer Science, Columbia, Missouri, USA.,Department of Electrical Engineering and Computer Science, University of Missouri, Columbia, MO, USA
| | - Jianlin Cheng
- Department of Electrical Engineering and Computer Science, University of Missouri, Columbia, MO, USA
| | - Zheng Wang
- University of Miami, Coral Gables, Florida, USA
| | - Alfonso E Romero
- Centre for Systems and Synthetic Biology, Department of Computer Science, Royal Holloway, University of London, Egham, Surrey, United Kingdom
| | - Alberto Paccanaro
- Centre for Systems and Synthetic Biology, Department of Computer Science, Royal Holloway, University of London, Egham, Surrey, United Kingdom
| | - Haixuan Yang
- School of Mathematics, Statistics and Applied Mathematics, National University of Ireland, Galway, Galway, Ireland.,Technical University of Munich, Garching, Germany
| | - Tatyana Goldberg
- Department of Informatics, Bioinformatics & Computational Biology-i12, Technische Universität München, Munich, Germany
| | - Chenguang Zhao
- Faculty for Informatics, Garching, Germany.,Department for Bioinformatics and Computational Biology, Garching, Germany.,School of Computing Sciences and Computer Engineering, Hattiesburg, Mississippi, USA
| | - Liisa Holm
- Institute of Biotechnology, Helsinki Institute of Life Sciences, University of Helsinki, Finland, Helsinki, Finland
| | - Petri Törönen
- Institute of Biotechnology, Helsinki Institute of Life Sciences, University of Helsinki, Finland, Helsinki, Finland
| | - Alan J Medlar
- Institute of Biotechnology, Helsinki Institute of Life Sciences, University of Helsinki, Finland, Helsinki, Finland
| | - Elaine Zosa
- Institute of Biotechnology, University of Helsinki, Helsinki, Finland
| | | | - Ilya Novikov
- Baylor College of Medicine, Department of Biochemistry and Molecular Biology, Houston, TX, USA
| | - Angela Wilkins
- Baylor College of Medicine, Department of Molecular and Human Genetics, Houston, TX, USA
| | - Olivier Lichtarge
- Baylor College of Medicine, Department of Molecular and Human Genetics, Houston, TX, USA
| | - Po-Han Chi
- National TsingHua University, Hsinchu, Taiwan
| | - Wei-Cheng Tseng
- Department of Electrical Engineering in National Tsing Hua University, Hsinchu City, Taiwan
| | - Michal Linial
- The Hebrew University of Jerusalem, Jerusalem, Israel
| | - Peter W Rose
- University of California San Diego, San Diego Supercomputer Center, La Jolla, California, USA
| | - Christophe Dessimoz
- Department of Computational Biology and Center for Integrative Genomics, University of Lausanne, Lausanne, Switzerland.,Department of Genetics, Evolution & Environment, and Department of Computer Science, University College London, London, UK.,Swiss Institute of Bioinformatics, Lausanne, Switzerland
| | - Vedrana Vidulin
- Department of Knowledge Technologies, Jozef Stefan Institute, Ljubljana, Slovenia
| | - Saso Dzeroski
- Jozef Stefan Institute, Ljubljana, Slovenia.,Jozef Stefan International Postgraduate School, Ljubljana, Slovenia
| | - Ian Sillitoe
- Research Department of Structural and Molecular Biology, University College London, London, England
| | - Sayoni Das
- Research Department of Structural and Molecular Biology, University College London, London, United Kingdom
| | - Jonathan Gill Lees
- Research Department of Structural and Molecular Biology, University College London, London, United Kingdom.,Department of Health and Life Sciences, Oxford Brookes University, London, UK
| | - David T Jones
- The Francis Crick Institute, Biomedical Data Science Laboratory, London, United Kingdom.,Department of Genetics, Evolution and Environment, University College London, Gower Street, London, WC1E 6BT, United Kingdom
| | - Cen Wan
- Department of Computer Science, University College London, London, United Kingdom.,The Francis Crick Institute, Biomedical Data Science Laboratory, London, United Kingdom
| | - Domenico Cozzetto
- Department of Computer Science, University College London, London, United Kingdom.,The Francis Crick Institute, Biomedical Data Science Laboratory, London, United Kingdom
| | - Rui Fa
- Department of Computer Science, University College London, London, United Kingdom.,The Francis Crick Institute, Biomedical Data Science Laboratory, London, United Kingdom
| | - Mateo Torres
- Centre for Systems and Synthetic Biology, Department of Computer Science, Royal Holloway, University of London, Egham, Surrey, United Kingdom
| | - Alex Warwick Vesztrocy
- Department of Genetics, Evolution and Environment, University College London, Gower Street, London, WC1E 6BT, United Kingdom.,SIB Swiss Institute of Bioinformatics, Lausanne, 1015, Switzerland
| | - Jose Manuel Rodriguez
- Cardiovascular Proteomics Laboratory, Centro Nacional de Investigaciones Cardiovasculares Carlos III (CNIC), Madrid, Spain
| | - Michael L Tress
- Spanish National Cancer Research Centre (CNIO), Madrid, Spain
| | - Marco Frasca
- Università degli Studi di Milano - Computer Science Department - AnacletoLab, Milan, Milan, Italy
| | - Marco Notaro
- Università degli Studi di Milano - Computer Science Department - AnacletoLab, Milan, Milan, Italy
| | - Giuliano Grossi
- Università degli Studi di Milano - Computer Science Department - AnacletoLab, Milan, Milan, Italy
| | - Alessandro Petrini
- Università degli Studi di Milano - Computer Science Department - AnacletoLab, Milan, Milan, Italy
| | - Matteo Re
- Università degli Studi di Milano - Computer Science Department - AnacletoLab, Milan, Milan, Italy
| | - Giorgio Valentini
- Università degli Studi di Milano - Computer Science Department - AnacletoLab, Milan, Milan, Italy
| | - Marco Mesiti
- Università degli Studi di Milano - Computer Science Department - AnacletoLab, Milan, Milan, Italy.,Institut de Biologie Computationnelle, LIRMM, CNRS-UMR 5506, Universite de Montpellier, Montpellier, France
| | - Daniel B Roche
- Department of Informatics, Bioinformatics and Computational Biology-i12, Technische Universität München, Munich, Germany
| | - Jonas Reeb
- Department of Informatics, Bioinformatics and Computational Biology-i12, Technische Universität München, Munich, Germany
| | - David W Ritchie
- University of Lorraine, CNRS, Inria, LORIA, Nancy, 54000, France
| | - Sabeur Aridhi
- University of Lorraine, CNRS, Inria, LORIA, Nancy, 54000, France
| | | | - Marie-Dominique Devignes
- University of Lorraine, CNRS, Inria, LORIA, Nancy, 54000, France.,University of Lorraine, Nancy, Lorraine, France.,Inria, Nancy, France
| | | | - Richard Bonneau
- NYU Center for Data Science, New York, 10010, NY, USA.,Flatiron Institute, CCB, New York, 10010, NY, USA
| | - Vladimir Gligorijević
- Center for Computational Biology (CCB), Flatiron Institute, Simons Foundation, New York, New York, USA
| | - Meet Barot
- Center for Data Science, New York University, New York, 10011, NY, USA
| | - Hai Fang
- Wellcome Centre for Human Genetics, University of Oxford, Oxford, UK
| | - Stefano Toppo
- Department of Molecular Medicine, University of Padova, Padova, Italy
| | - Enrico Lavezzo
- Department of Molecular Medicine, University of Padova, Padova, Italy
| | - Marco Falda
- Department of Biology, University of Padova, Padova, Italy
| | - Michele Berselli
- Department of Molecular Medicine, University of Padova, Padova, Italy
| | - Silvio C E Tosatto
- CNR Institute of Neuroscience, Padova, Italy.,Department of Biomedical Sciences, University of Padua, Padova, Italy
| | - Marco Carraro
- Department of Biomedical Sciences, University of Padua, Padova, Italy
| | - Damiano Piovesan
- Department of Biomedical Sciences, University of Padua, Padova, Italy
| | - Hafeez Ur Rehman
- Department of Computer Science, National University of Computer and Emerging Sciences, Peshawar, Khyber Pakhtoonkhwa, Pakistan
| | - Qizhong Mao
- Department of Computer and Information Sciences, Temple University, Philadelphia, PA, USA.,University of California, Riverside, Philadelphia, PA, USA
| | - Shanshan Zhang
- Department of Computer and Information Sciences, Temple University, Philadelphia, PA, USA
| | - Slobodan Vucetic
- Department of Computer and Information Sciences, Temple University, Philadelphia, PA, USA
| | - Gage S Black
- Department of Biology, Brigham Young University, Provo, UT, USA.,Bioinformatics Research Group, Provo, UT, USA
| | - Dane Jo
- Department of Biology, Brigham Young University, Provo, UT, USA.,Bioinformatics Research Group, Provo, UT, USA
| | - Erica Suh
- Department of Biology, Brigham Young University, Provo, UT, USA
| | - Jonathan B Dayton
- Department of Biology, Brigham Young University, Provo, UT, USA.,Bioinformatics Research Group, Provo, UT, USA
| | - Dallas J Larsen
- Department of Biology, Brigham Young University, Provo, UT, USA.,Bioinformatics Research Group, Provo, UT, USA
| | - Ashton R Omdahl
- Department of Biology, Brigham Young University, Provo, UT, USA.,Bioinformatics Research Group, Provo, UT, USA
| | - Liam J McGuffin
- School of Biological Sciences, University of Reading, Reading, England, United Kingdom
| | | | - Patricia C Babbitt
- Department of Pharmaceutical Chemistry, San Francisco, CA, USA.,Department of Bioengineering and Therapeutic Sciences, University of California, San Francisco, 94158, CA, USA
| | - Jeffrey M Yunes
- UC Berkeley - UCSF Graduate Program in Bioengineering, University of California, San Francisco, 94158, CA, USA.,Department of Bioengineering and Therapeutic Sciences, University of California, San Francisco, 94158, CA, USA
| | - Paolo Fontana
- Research and Innovation Center, Edmund Mach Foundation, San Michele all'Adige, Italy
| | - Feng Zhang
- State Key Laboratory of Genetic Engineering and Collaborative Innovation Center for Genetics and Development, Fudan University, Shanghai, Shanghai, China.,Department of Biostatistics and Computational Biology, School of Life Sciences, Fudan University, Shanghai, Shanghai, China
| | - Shanfeng Zhu
- School of Computer Science and Shanghai Key Lab of Intelligent Information Processing, Fudan University, Shanghai, China.,Institute of Science and Technology for Brain-Inspired Intelligence and Shanghai Institute of Artificial Intelligence Algorithms, Fudan University, Shanghai, China.,Key Laboratory of Computational Neuroscience and Brain-Inspired Intelligence (Fudan University), Ministry of Education, Shanghai, China
| | - Ronghui You
- School of Computer Science and Shanghai Key Lab of Intelligent Information Processing, Fudan University, Shanghai, China.,Institute of Science and Technology for Brain-Inspired Intelligence and Shanghai Institute of Artificial Intelligence Algorithms, Fudan University, Shanghai, China.,Key Laboratory of Computational Neuroscience and Brain-Inspired Intelligence (Fudan University), Ministry of Education, Shanghai, China
| | - Zihan Zhang
- School of Computer Science and Shanghai Key Lab of Intelligent Information Processing, Fudan University, Shanghai, China.,Key Laboratory of Computational Neuroscience and Brain-Inspired Intelligence (Fudan University), Ministry of Education, Shanghai, China
| | - Suyang Dai
- School of Computer Science and Shanghai Key Lab of Intelligent Information Processing, Fudan University, Shanghai, China.,Key Laboratory of Computational Neuroscience and Brain-Inspired Intelligence (Fudan University), Ministry of Education, Shanghai, China
| | - Shuwei Yao
- School of Computer Science and Shanghai Key Lab of Intelligent Information Processing, Fudan University, Shanghai, China.,Institute of Science and Technology for Brain-Inspired Intelligence and Shanghai Institute of Artificial Intelligence Algorithms, Fudan University, Shanghai, China
| | - Weidong Tian
- State Key Laboratory of Genetic Engineering and Collaborative Innovation Center for Genetics and Development, Department of Biostatistics and Computational Biology, School of Life Sciences, Fudan University, Shanghai, Shanghai, China.,Department of Pediatrics, Brain Tumor Center, Division of Experimental Hematology and Cancer Biology, Cincinnati Children's Hospital Medical Center, Cincinnati, OH, USA
| | - Renzhi Cao
- Department of Computer Science, Pacific Lutheran University, Tacoma, WA, USA
| | - Caleb Chandler
- Department of Computer Science, Pacific Lutheran University, Tacoma, WA, USA
| | - Miguel Amezola
- Department of Computer Science, Pacific Lutheran University, Tacoma, WA, USA
| | - Devon Johnson
- Department of Computer Science, Pacific Lutheran University, Tacoma, WA, USA
| | - Jia-Ming Chang
- Department of Computer Science, National Chengchi University, Taipei, Taiwan
| | - Wen-Hung Liao
- Department of Computer Science, National Chengchi University, Taipei, Taiwan
| | - Yi-Wei Liu
- Department of Computer Science, National Chengchi University, Taipei, Taiwan
| | | | | | - Robert Hoehndorf
- Computer, Electrical and Mathematical Sciences & Engineering Division, Computational Bioscience Research Center, King Abdullah University of Science and Technology, Thuwal, Jeddah, Saudi Arabia
| | - Maxat Kulmanov
- Computer, Electrical and Mathematical Sciences & Engineering Division, Computational Bioscience Research Center, King Abdullah University of Science and Technology, Thuwal, Jeddah, Saudi Arabia
| | - Imane Boudellioua
- Computational Bioscience Research Center (CBRC), King Abdullah University of Science and Technology, Thuwal, Saudi Arabia.,Computer, Electrical and Mathematical Sciences Engineering Division (CEMSE), King Abdullah University of Science and Technology, Thuwal, Saudi Arabia
| | - Gianfranco Politano
- Control and Computer Engineering Department, Politecnico di Torino, Torino, TO, Italy
| | - Stefano Di Carlo
- Control and Computer Engineering Department, Politecnico di Torino, Torino, TO, Italy
| | - Alfredo Benso
- Control and Computer Engineering Department, Politecnico di Torino, Torino, TO, Italy
| | - Kai Hakala
- Department of Future Technologies, Turku NLP Group, University of Turku, Turku, Finland.,University of Turku Graduate School (UTUGS), Turku, Finland
| | - Filip Ginter
- Department of Future Technologies, Turku NLP Group, University of Turku, Turku, Finland.,University of Turku, Turku, Finland
| | - Farrokh Mehryary
- Department of Future Technologies, Turku NLP Group, University of Turku, Turku, Finland.,University of Turku Graduate School (UTUGS), Turku, Finland
| | - Suwisa Kaewphan
- Department of Future Technologies, Turku NLP Group, University of Turku, Turku, Finland.,University of Turku Graduate School (UTUGS), Turku, Finland.,Turku Centre for Computer Science (TUCS), Turku, Finland
| | - Jari Björne
- Department of Future Technologies, Faculty of Science and Engineering, University of Turku, Turku, FI-20014, Finland.,Turku Centre for Computer Science (TUCS), Agora, Vesilinnantie 3, Turku, FI-20500, Finland
| | | | | | - Tapio Salakoski
- Department of Future Technologies, Faculty of Science and Engineering, University of Turku, Turku, FI-20014, Finland.,Turku Centre for Computer Science (TUCS), Agora, Vesilinnantie 3, Turku, FI-20500, Finland
| | - Daisuke Kihara
- Department of Biological Sciences, Department of Computer Science, Purdue University, 47907, IN, USA.,Department of Pediatrics, University of Cincinnati, Cincinnati, 45229, OH, USA
| | - Aashish Jain
- Department of Computer Science, Purdue University, West Lafayette, IN, USA
| | - Tomislav Šmuc
- Division of Electronics, Rudjer Boskovic Institute, Zagreb, Croatia
| | - Adrian Altenhoff
- Department of Computer Science, ETH Zurich, Zurich, Switzerland.,SIB Swiss Institute of Bioinformatics, Zurich, Switzerland
| | - Asa Ben-Hur
- Department of Computer Science, Colorado State University, Fort Collins, CO, USA
| | - Burkhard Rost
- Department of Informatics, Bioinformatics & Computational Biology-i12, Technische Universität München, Munich, Germany.,Institute for Food and Plant Sciences WZW, Technische Universität München, Freising, Germany
| | | | - Christine A Orengo
- Research Department of Structural and Molecular Biology, University College London, London, United Kingdom
| | - Constance J Jeffery
- Biological Sciences, University of Illinois at Chicago, Chicago, Illinois, USA
| | - Giovanni Bosco
- Department of Molecular and Systems Biology, Geisel School of Medicine at Dartmouth, Hanover, NH, USA
| | - Deborah A Hogan
- Geisel School of Medicine at Dartmouth, Hanover, NH, USA.,Department of Microbiology and Immunology, Geisel School of Medicine at Dartmouth, Hanover, NH, USA
| | - Maria J Martin
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Hinxton, United Kingdom
| | - Claire O'Donovan
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Hinxton, United Kingdom
| | - Sean D Mooney
- Department of Biomedical Informatics and Medical Education, University of Washington, Seattle, WA, USA
| | - Casey S Greene
- Department of Systems Pharmacology and Translational Therapeutics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, Pennsylvania, USA.,Childhood Cancer Data Lab, Alex's Lemonade Stand Foundation, Philadelphia, Pennsylvania, USA
| | - Predrag Radivojac
- Khoury College of Computer Sciences, Northeastern University, Boston, MA, USA.
| | - Iddo Friedberg
- Veterinary Microbiology and Preventive Medicine, Iowa State University, Ames, IA, USA.
| |
|
45
|
Arisdakessian C, Poirion O, Yunits B, Zhu X, Garmire LX. DeepImpute: an accurate, fast, and scalable deep neural network method to impute single-cell RNA-seq data. Genome Biol 2019; 20:211. [PMID: 31627739 PMCID: PMC6798445 DOI: 10.1186/s13059-019-1837-6] [Citation(s) in RCA: 155] [Impact Index Per Article: 25.8] [Received: 07/06/2019] [Accepted: 09/26/2019] [Indexed: 12/12/2022]
Abstract
Single-cell RNA sequencing (scRNA-seq) offers new opportunities to study gene expression of tens of thousands of single cells simultaneously. We present DeepImpute, a deep neural network-based imputation algorithm that uses dropout layers and loss functions to learn patterns in the data, allowing for accurate imputation. Overall, DeepImpute yields better accuracy than six other publicly available scRNA-seq imputation methods on experimental data, as measured by the mean squared error or Pearson's correlation coefficient. DeepImpute is an accurate, fast, and scalable imputation tool that is suited to handle the ever-increasing volume of scRNA-seq data, and is freely available at https://github.com/lanagarmire/DeepImpute.
Affiliation(s)
- Cédric Arisdakessian
- Department of Information and Computer Science, University of Hawaii at Manoa, Honolulu, HI, 96816, USA
| | - Olivier Poirion
- Department of Epidemiology, University of Hawaii Cancer Center, 701 Ilalo Street, Honolulu, HI, 96813, USA
| | - Breck Yunits
- Department of Epidemiology, University of Hawaii Cancer Center, 701 Ilalo Street, Honolulu, HI, 96813, USA
| | - Xun Zhu
- Department of Epidemiology, University of Hawaii Cancer Center, 701 Ilalo Street, Honolulu, HI, 96813, USA
- Department of Molecular Biology and Bioengineering, University of Hawaii at Manoa, Honolulu, HI, 96816, USA
| | - Lana X Garmire
- Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI, 48105, USA.
| |
|
46
|
Exploring single-cell data with deep multitasking neural networks. Nat Methods 2019; 16:1139-1145. [PMID: 31591579 PMCID: PMC10164410 DOI: 10.1038/s41592-019-0576-7] [Citation(s) in RCA: 175] [Impact Index Per Article: 29.2] [Received: 08/24/2018] [Accepted: 08/19/2019] [Indexed: 01/22/2023]
Abstract
It is currently challenging to analyze single-cell data consisting of many cells and samples, and to address variations arising from batch effects and different sample preparations. For this purpose, we present SAUCIE, a deep neural network that combines parallelization and scalability offered by neural networks, with the deep representation of data that can be learned by them to perform many single-cell data analysis tasks. Our regularizations (penalties) render features learned in hidden layers of the neural network interpretable. On large, multi-patient datasets, SAUCIE's various hidden layers contain denoised and batch-corrected data, a low-dimensional visualization and unsupervised clustering, as well as other information that can be used to explore the data. We analyze a 180-sample dataset consisting of 11 million T cells from dengue patients in India, measured with mass cytometry. SAUCIE can batch correct and identify cluster-based signatures of acute dengue infection and create a patient manifold, stratifying immune response to dengue.
|
47
|
Zitnik M, Nguyen F, Wang B, Leskovec J, Goldenberg A, Hoffman MM. Machine Learning for Integrating Data in Biology and Medicine: Principles, Practice, and Opportunities. Inf Fusion 2019; 50:71-91. [PMID: 30467459 PMCID: PMC6242341 DOI: 10.1016/j.inffus.2018.09.012] [Citation(s) in RCA: 262] [Impact Index Per Article: 43.7] [Indexed: 05/10/2023]
Abstract
New technologies have enabled the investigation of biology and human health at an unprecedented scale and in multiple dimensions. These dimensions include myriad properties describing genome, epigenome, transcriptome, microbiome, phenotype, and lifestyle. No single data type, however, can capture the complexity of all the factors relevant to understanding a phenomenon such as a disease. Integrative methods that combine data from multiple technologies have thus emerged as critical statistical and computational approaches. The key challenge in developing such approaches is the identification of effective models to provide a comprehensive and relevant systems view. An ideal method can answer a biological or medical question, identifying important features and predicting outcomes, by harnessing heterogeneous data across several dimensions of biological variation. In this Review, we describe the principles of data integration and discuss current methods and available implementations. We provide examples of successful data integration in biology and medicine. Finally, we discuss current challenges in biomedical integrative methods and our perspective on the future development of the field.
Affiliation(s)
- Marinka Zitnik
- Department of Computer Science, Stanford University, Stanford, CA, USA
| | - Francis Nguyen
- Department of Medical Biophysics, University of Toronto, Toronto, ON, Canada
- Princess Margaret Cancer Centre, Toronto, ON, Canada
| | - Bo Wang
- Hikvision Research Institute, Santa Clara, CA, USA
| | - Jure Leskovec
- Department of Computer Science, Stanford University, Stanford, CA, USA
- Chan Zuckerberg Biohub, San Francisco, CA, USA
| | - Anna Goldenberg
- Genetics & Genome Biology, SickKids Research Institute, Toronto, ON, Canada
- Department of Computer Science, University of Toronto, Toronto, ON, Canada
- Vector Institute, Toronto, ON, Canada
| | - Michael M. Hoffman
- Department of Medical Biophysics, University of Toronto, Toronto, ON, Canada
- Princess Margaret Cancer Centre, Toronto, ON, Canada
- Department of Computer Science, University of Toronto, Toronto, ON, Canada
- Vector Institute, Toronto, ON, Canada
| |
|
48
|
Eraslan G, Avsec Ž, Gagneur J, Theis FJ. Deep learning: new computational modelling techniques for genomics. Nat Rev Genet 2019; 20:389-403. [PMID: 30971806 DOI: 10.1038/s41576-019-0122-6] [Citation(s) in RCA: 580] [Impact Index Per Article: 96.7] [Indexed: 12/17/2022]
Abstract
As a data-driven science, genomics largely utilizes machine learning to capture dependencies in data and derive novel biological hypotheses. However, the ability to extract new insights from the exponentially increasing volume of genomics data requires more expressive machine learning models. By effectively leveraging large data sets, deep learning has transformed fields such as computer vision and natural language processing. Now, it is becoming the method of choice for many genomics modelling tasks, including predicting the impact of genetic variation on gene regulatory mechanisms such as DNA accessibility and splicing.
Affiliation(s)
- Gökcen Eraslan
- Institute of Computational Biology, Helmholtz Zentrum München, Neuherberg, Germany.,School of Life Sciences Weihenstephan, Technical University of Munich, Freising, Germany
| | - Žiga Avsec
- Department of Informatics, Technical University of Munich, Garching, Germany
| | - Julien Gagneur
- Department of Informatics, Technical University of Munich, Garching, Germany
| | - Fabian J Theis
- Institute of Computational Biology, Helmholtz Zentrum München, Neuherberg, Germany.,School of Life Sciences Weihenstephan, Technical University of Munich, Freising, Germany.,Department of Mathematics, Technical University of Munich, Garching, Germany
| |
|
49
|
Way GP, Greene CS. Discovering Pathway and Cell Type Signatures in Transcriptomic Compendia with Machine Learning. Annu Rev Biomed Data Sci 2019. [DOI: 10.1146/annurev-biodatasci-072018-021348] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.5] [Indexed: 11/09/2022]
Abstract
Pathway and cell type signatures are patterns present in transcriptome data that are associated with biological processes or phenotypic consequences. These signatures result from specific cell type and pathway expression but can require large transcriptomic compendia to detect. Machine learning techniques can be powerful tools for signature discovery through their ability to provide accurate and interpretable results. In this review, we discuss various machine learning applications to extract pathway and cell type signatures from transcriptomic compendia. We focus on the biological motivations and interpretation for both supervised and unsupervised learning approaches in this setting. We consider recent advances, including deep learning, and their applications to expanding bulk and single-cell RNA data. As data and computational resources increase, there will be more opportunities for machine learning to aid in revealing biological signatures.
Affiliation(s)
- Gregory P. Way
- Genomics and Computational Biology Graduate Group, Perelman School of Medicine, University of Pennsylvania, Philadelphia, Pennsylvania 19104, USA
- Department of Systems Pharmacology and Translational Therapeutics, University of Pennsylvania, Philadelphia, Pennsylvania 19104, USA
| | - Casey S. Greene
- Department of Systems Pharmacology and Translational Therapeutics, University of Pennsylvania, Philadelphia, Pennsylvania 19104, USA
| |
|
50
|
Taroni JN, Grayson PC, Hu Q, Eddy S, Kretzler M, Merkel PA, Greene CS. MultiPLIER: A Transfer Learning Framework for Transcriptomics Reveals Systemic Features of Rare Disease. Cell Syst 2019; 8:380-394.e4. [PMID: 31121115 PMCID: PMC6538307 DOI: 10.1016/j.cels.2019.04.003] [Citation(s) in RCA: 65] [Impact Index Per Article: 10.8] [Received: 08/28/2018] [Revised: 01/15/2019] [Accepted: 04/12/2019] [Indexed: 12/22/2022]
Abstract
Most gene expression datasets generated by individual researchers are too small to fully benefit from unsupervised machine-learning methods. In the case of rare diseases, there may be too few cases available, even when multiple studies are combined. To address this challenge, we utilize transfer learning to extract coordinated expression patterns and use learned patterns to analyze small rare disease datasets. We trained a pathway-level information extractor (PLIER) model on a large public data compendium comprising multiple experiments, tissues, and biological conditions and then transferred the model to small datasets in an approach we call MultiPLIER. Models constructed from the public data compendium included features that aligned well to known biological factors and were more comprehensive than those constructed from individual datasets or conditions. When transferred to rare disease datasets, the models describe biological processes related to disease severity more effectively than models trained only on a given dataset.
Affiliation(s)
- Jaclyn N Taroni
- Systems Pharmacology and Translational Therapeutics, University of Pennsylvania, Philadelphia, PA, USA; Childhood Cancer Data Laboratory, Alex's Lemonade Stand Foundation, Philadelphia, PA, USA
| | - Peter C Grayson
- National Institute of Arthritis and Musculoskeletal and Skin Diseases, National Institutes of Health, Bethesda, MD, USA
| | - Qiwen Hu
- Systems Pharmacology and Translational Therapeutics, University of Pennsylvania, Philadelphia, PA, USA
| | - Sean Eddy
- Division of Nephrology, Department of Internal Medicine, Michigan Medicine, Ann Arbor, MI, USA
| | - Matthias Kretzler
- Division of Nephrology, Department of Internal Medicine, Michigan Medicine, Ann Arbor, MI, USA; Department of Computational Medicine and Bioinformatics, Michigan Medicine, Ann Arbor, MI, USA
| | - Peter A Merkel
- Division of Rheumatology and the Department of Biostatistics, Epidemiology, and Informatics, University of Pennsylvania, Philadelphia, PA, USA
| | - Casey S Greene
- Systems Pharmacology and Translational Therapeutics, University of Pennsylvania, Philadelphia, PA, USA; Childhood Cancer Data Laboratory, Alex's Lemonade Stand Foundation, Philadelphia, PA, USA; Institute of Translational Medicine and Therapeutics, University of Pennsylvania, Philadelphia, PA, USA; Institute of Biomedical Informatics, University of Pennsylvania, Philadelphia, PA, USA.
| |
|