1
Schäffer DE, Li W, Elbasir A, Altieri DC, Long Q, Auslander N. Microbial gene expression analysis of healthy and cancerous esophagus uncovers bacterial biomarkers of clinical outcomes. ISME Commun 2023; 3:128. [PMID: 38049632 PMCID: PMC10696091 DOI: 10.1038/s43705-023-00338-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/20/2023] [Revised: 11/16/2023] [Accepted: 11/21/2023] [Indexed: 12/06/2023]
Abstract
Local microbiome shifts are implicated in the development and progression of gastrointestinal cancers, and in particular, esophageal carcinoma (ESCA), which is among the most aggressive malignancies. Short-read RNA sequencing (RNAseq) is currently the leading technology to study gene expression changes in cancer. However, using RNAseq to study microbial gene expression is challenging. Here, we establish a new tool to efficiently detect viral and bacterial expression in human tissues through RNAseq. This approach employs a neural network to predict reads of likely microbial origin, which are targeted for assembly into longer contigs, improving identification of microbial species and genes. This approach is applied to perform a systematic comparison of bacterial expression in ESCA and healthy esophagi. We uncover bacterial genera that are over or underabundant in ESCA vs healthy esophagi both before and after correction for possible covariates, including patient metadata. However, we find that bacterial taxonomies are not significantly associated with clinical outcomes. Strikingly, in contrast, dozens of microbial proteins were significantly associated with poor patient outcomes and in particular, proteins that perform mitochondrial functions and iron-sulfur coordination. We further demonstrate associations between these microbial proteins and dysregulated host pathways in ESCA patients. Overall, these results suggest possible influences of bacteria on the development of ESCA and uncover new prognostic biomarkers based on microbial genes. In addition, this study provides a framework for the analysis of other human malignancies whose development may be driven by pathogens.
Affiliation(s)
- Daniel E Schäffer
- Computational Biology Department, Carnegie Mellon University, Pittsburgh, PA, 15213, USA
- The Wistar Institute, Philadelphia, PA, 19104, USA
- Massachusetts Institute of Technology, Cambridge, MA, 02139, USA
- Wenrui Li
- University of Pennsylvania, Philadelphia, PA, USA
- Qi Long
- University of Pennsylvania, Philadelphia, PA, USA
- Noam Auslander
- The Wistar Institute, Philadelphia, PA, 19104, USA.
- Department of Cancer Biology, University of Pennsylvania, Philadelphia, PA, 19104, USA.
2
Patterson A, Elbasir A, Tian B, Auslander N. Computational Methods Summarizing Mutational Patterns in Cancer: Promise and Limitations for Clinical Applications. Cancers (Basel) 2023; 15:cancers15071958. [PMID: 37046619 PMCID: PMC10093138 DOI: 10.3390/cancers15071958] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/27/2022] [Revised: 02/24/2023] [Accepted: 03/09/2023] [Indexed: 03/29/2023]
Abstract
Since the rise of next-generation sequencing technologies, the catalogue of mutations in cancer has been continuously expanding. To address the complexity of the cancer-genomic landscape and extract meaningful insights, numerous computational approaches have been developed over the last two decades. In this review, we survey the current leading computational methods to derive intricate mutational patterns in the context of clinical relevance. We begin with mutation signatures, explaining first how mutation signatures were developed and then examining the utility of studies using mutation signatures to correlate environmental effects on the cancer genome. Next, we examine current clinical research that employs mutation signatures and discuss the potential use cases and challenges of mutation signatures in clinical decision-making. We then examine computational studies developing tools to investigate complex patterns of mutations beyond the context of mutational signatures. We survey methods to identify cancer-driver genes, from single-driver studies to pathway and network analyses. In addition, we review methods inferring complex combinations of mutations for clinical tasks and using mutations integrated with multi-omics data to better predict cancer phenotypes. We examine the use of these tools for either discovery or prediction, including prediction of tumor origin, treatment outcomes, prognosis, and cancer typing. We further discuss the main limitations preventing widespread clinical integration of computational tools for the diagnosis and treatment of cancer. We end by proposing solutions to address these challenges using recent advances in machine learning.
Affiliation(s)
- Andrew Patterson
- Genomics and Computational Biology Graduate Group, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA
- The Wistar Institute, Philadelphia, PA 19104, USA
- Bin Tian
- The Wistar Institute, Philadelphia, PA 19104, USA
- Noam Auslander
- The Wistar Institute, Philadelphia, PA 19104, USA
- Department of Cancer Biology, University of Pennsylvania, Philadelphia, PA 19104, USA
- Correspondence:
3
Elbasir A, Ye Y, Schäffer DE, Hao X, Wickramasinghe J, Tsingas K, Lieberman PM, Long Q, Morris Q, Zhang R, Schäffer AA, Auslander N. A deep learning approach reveals unexplored landscape of viral expression in cancer. Nat Commun 2023; 14:785. [PMID: 36774364 PMCID: PMC9922274 DOI: 10.1038/s41467-023-36336-z] [Citation(s) in RCA: 7] [Impact Index Per Article: 7.0] [Received: 08/20/2022] [Accepted: 01/25/2023] [Indexed: 02/13/2023]
Abstract
About 15% of human cancer cases are attributed to viral infections. To date, virus expression in tumor tissues has been mostly studied by aligning tumor RNA sequencing reads to databases of known viruses. To allow identification of divergent viruses and rapid characterization of the tumor virome, we develop viRNAtrap, an alignment-free pipeline to identify viral reads and assemble viral contigs. We utilize viRNAtrap, which is based on a deep learning model trained to discriminate viral RNAseq reads, to explore viral expression in cancers and apply it to 14 cancer types from The Cancer Genome Atlas (TCGA). Using viRNAtrap, we uncover expression of unexpected and divergent viruses that have not previously been implicated in cancer and disclose human endogenous viruses whose expression is associated with poor overall survival. The viRNAtrap pipeline provides a way forward to study viral infections associated with different clinical conditions.
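The two stages named in the abstract (score reads with a model, then assemble contigs) can be illustrated with a toy sketch. This is not the viRNAtrap implementation: the scores below are hypothetical stand-ins for a neural network's output, and the names `pick_seed`, `extend_right`, the threshold, and the minimum overlap are all assumptions made for illustration. The assembly step shown is a simple greedy suffix-prefix extension.

```python
def pick_seed(read_scores, threshold):
    """Return the highest-scoring read at or above threshold, or None.

    read_scores maps read sequence -> hypothetical model score in [0, 1].
    """
    candidates = [(score, read) for read, score in read_scores.items()
                  if score >= threshold]
    if not candidates:
        return None
    return max(candidates)[1]


def extend_right(seed, reads, min_overlap):
    """Greedily extend a seed to the right using suffix-prefix overlaps.

    Toy illustration only: each read is used at most once, and the
    longest overlap of at least min_overlap bases is taken first.
    """
    contig = seed
    unused = [r for r in reads if r != seed]
    extended = True
    while extended:
        extended = False
        for read in unused:
            longest = min(len(contig), len(read)) - 1
            for k in range(longest, min_overlap - 1, -1):
                if contig.endswith(read[:k]):
                    contig += read[k:]   # append the non-overlapping tail
                    unused.remove(read)
                    extended = True
                    break
            if extended:
                break  # restart the scan with the longer contig
    return contig


# Hypothetical scores; only the first read passes the seed threshold.
read_scores = {"ACGTAC": 0.95, "GTACGG": 0.40, "CGGTTA": 0.20}
seed = pick_seed(read_scores, threshold=0.9)
contig = extend_right(seed, list(read_scores), min_overlap=3)
print(contig)  # ACGTACGGTTA
```

The point of the sketch is that low-scoring reads can still contribute to a contig once a confident seed anchors the extension.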
Affiliation(s)
- Ying Ye
- The Wistar Institute, Philadelphia, PA, 19104, USA
- Daniel E Schäffer
- The Wistar Institute, Philadelphia, PA, 19104, USA
- Computational Biology Department, Carnegie Mellon University, Pittsburgh, PA, 15213, USA
- Xue Hao
- The Wistar Institute, Philadelphia, PA, 19104, USA
- Konstantinos Tsingas
- The Wistar Institute, Philadelphia, PA, 19104, USA
- University of Pennsylvania, Philadelphia, PA, USA
- Qi Long
- University of Pennsylvania, Philadelphia, PA, USA
- Quaid Morris
- Computational and Systems Biology, Sloan Kettering Institute, New York City, NY, 10065, USA
- Rugang Zhang
- The Wistar Institute, Philadelphia, PA, 19104, USA
- Alejandro A Schäffer
- Cancer Data Science Laboratory (CDSL), National Cancer Institute, National Institutes of Health, Bethesda, MD, 20892, USA
4
Mall R, Elbasir A, Almeer H, Islam Z, Kolatkar PR, Chawla S, Ullah E. A Modelling Framework for Embedding-based Predictions for Compound-Viral Protein Activity. Bioinformatics 2021; 37:2544-2555. [PMID: 33638345 PMCID: PMC8163000 DOI: 10.1093/bioinformatics/btab130] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 07/09/2020] [Revised: 02/16/2021] [Accepted: 02/24/2021] [Indexed: 11/14/2022]
Abstract
Motivation A global effort is underway to identify compounds for the treatment of COVID-19. Since de novo compound design is an extremely long, time-consuming, and expensive process, efforts are underway to discover existing compounds that can be repurposed for COVID-19 and new viral diseases. Model We propose a machine learning representation framework that uses deep learning induced vector embeddings of compounds and viral proteins as features to predict compound-viral protein activity. The prediction model in turn uses a consensus framework to rank approved compounds against viral proteins of interest. Results Our consensus framework achieves a high mean Pearson correlation of 0.916, mean R2 of 0.840 and a low mean squared error of 0.313 for the task of compound-viral protein activity prediction on an independent test set. As a use case, we identify a ranked list of 47 compounds common to three main proteins of SARS-COV-2 virus (PL-PRO, 3CL-PRO and Spike protein) as potential targets including 21 antivirals, 15 anticancer, 5 antibiotics and 6 other investigational human compounds. We perform additional molecular docking simulations to demonstrate that the majority of these compounds have low binding energies and thus high binding affinity with the potential to be effective against the SARS-COV-2 virus. Availability All the source code and data are available at: https://github.com/raghvendra5688/Drug-Repurposing and https://dx.doi.org/10.17632/8rrwnbcgmx.3. We also implemented a web-server at: https://machinelearning-protein.qcri.org/index.html. Supplementary information Supplementary data are available at Bioinformatics online.
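The three regression metrics reported in the Results (Pearson correlation, R2, and mean squared error) measure different things, which a small worked example makes concrete. The sketch below uses the standard textbook definitions in plain Python; the function name `regression_metrics` and the toy values are illustrative assumptions, not taken from the paper's code or data.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return (pearson_r, r2, mse) for two equal-length numeric sequences."""
    n = len(y_true)
    if n == 0 or n != len(y_pred):
        raise ValueError("inputs must be non-empty and of equal length")
    mean_t = sum(y_true) / n
    mean_p = sum(y_pred) / n
    # Centered sums used by both Pearson r and R2.
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred))
    var_t = sum((t - mean_t) ** 2 for t in y_true)
    var_p = sum((p - mean_p) ** 2 for p in y_pred)
    # Residual sum of squares between observations and predictions.
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    pearson = cov / math.sqrt(var_t * var_p)
    r2 = 1.0 - ss_res / var_t
    mse = ss_res / n
    return pearson, r2, mse


# Toy data: predictions are perfectly correlated but offset by +1.
observed = [1.0, 2.0, 3.0, 4.0]
predicted = [2.0, 3.0, 4.0, 5.0]
r, r2, mse = regression_metrics(observed, predicted)
print(round(r, 3), round(r2, 3), round(mse, 3))  # 1.0 0.2 1.0
```

A constant offset leaves the Pearson correlation at 1 while lowering R2 and raising the error, which is one reason to report all three together.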
Affiliation(s)
- Raghvendra Mall
- Qatar Computing Research Institute, Hamad Bin Khalifa University, Doha, 34110, Qatar
- Abdurrahman Elbasir
- ICT Division, College of Science and Engineering, Hamad Bin Khalifa University, Doha, 34110, Qatar
- Hossam Almeer
- Qatar Computing Research Institute, Hamad Bin Khalifa University, Doha, 34110, Qatar
- Zeyaul Islam
- Qatar Biomedical Research Institute, Hamad Bin Khalifa University, Doha, 34110, Qatar
- Prasanna R Kolatkar
- Qatar Biomedical Research Institute, Hamad Bin Khalifa University, Doha, 34110, Qatar
- Sanjay Chawla
- Qatar Computing Research Institute, Hamad Bin Khalifa University, Doha, 34110, Qatar
- Ehsan Ullah
- Qatar Computing Research Institute, Hamad Bin Khalifa University, Doha, 34110, Qatar
5
Liu A, Walter M, Wright P, Bartosik A, Dolciami D, Elbasir A, Yang H, Bender A. Prediction and mechanistic analysis of drug-induced liver injury (DILI) based on chemical structure. Biol Direct 2021; 16:6. [PMID: 33461600 PMCID: PMC7814730 DOI: 10.1186/s13062-020-00285-0] [Citation(s) in RCA: 19] [Impact Index Per Article: 6.3] [Received: 02/27/2020] [Accepted: 12/01/2020] [Indexed: 12/16/2022]
Abstract
BACKGROUND Drug-induced liver injury (DILI) is a major safety concern characterized by a complex and diverse pathogenesis. In order to identify DILI early in drug development, a better understanding of the injury and models with better predictivity are urgently needed. One approach in this regard is in silico models, which aim at predicting the risk of DILI based on the compound structure. However, these models do not yet show sufficient predictive performance or interpretability to be useful for decision making by themselves, the former partially stemming from the underlying problem of labeling the in vivo DILI risk of compounds in a meaningful way for generating machine learning models. RESULTS As part of the Critical Assessment of Massive Data Analysis (CAMDA) "CMap Drug Safety Challenge" 2019 ( http://camda2019.bioinf.jku.at ), chemical structure-based models were generated using the binarized DILIrank annotations. Support Vector Machine (SVM) and Random Forest (RF) classifiers showed comparable performance to previously published models with a mean balanced accuracy over models generated using 5-fold LOCO-CV inside a 10-fold training scheme of 0.759 ± 0.027 when predicting an external test set. In the models which used predicted protein targets as compound descriptors, we identified the most information-rich proteins which agreed with the mechanisms of action and toxicity of nonsteroidal anti-inflammatory drugs (NSAIDs), one of the most important drug classes causing DILI, stress response via TP53 and biotransformation. In addition, we identified multiple proteins involved in xenobiotic metabolism which could be novel DILI-related off-targets, such as CLK1 and DYRK2. Moreover, we derived potential structural alerts for DILI with high precision, including furan and hydrazine derivatives; however, all derived alerts were present in approved drugs and were overly specific, indicating the need to consider quantitative variables such as dose.
CONCLUSION Using chemical structure-based descriptors such as structural fingerprints and predicted protein targets, DILI prediction models were built with a predictive performance comparable to previous literature. In addition, we derived insights on proteins and pathways statistically (and potentially causally) linked to DILI from these models and inferred new structural alerts related to this adverse endpoint.
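The balanced accuracy quoted in the Results is the mean of sensitivity and specificity, so it is not inflated when one class dominates. The sketch below shows the standard binary definition in plain Python; the function name and the toy labels are illustrative assumptions and are unrelated to the DILIrank data.

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of sensitivity and specificity for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("both classes must be present in y_true")
    sensitivity = tp / (tp + fn)  # recall on the positive class
    specificity = tn / (tn + fp)  # recall on the negative class
    return (sensitivity + specificity) / 2


# Toy labels with more positives than negatives.
labels = [1, 1, 1, 1, 0, 0]
predictions = [1, 1, 1, 0, 0, 1]
plain = sum(t == p for t, p in zip(labels, predictions)) / len(labels)
print(round(plain, 3), balanced_accuracy(labels, predictions))  # 0.667 0.625
```

Here plain accuracy (4 of 6 correct) exceeds balanced accuracy because the classifier does better on the larger class.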
Affiliation(s)
- Anika Liu
- Department of Chemistry, Centre for Molecular Informatics, University of Cambridge, Lensfield Road, Cambridge, CB2 1EW, UK.
- Moritz Walter
- Department of Chemistry, Centre for Molecular Informatics, University of Cambridge, Lensfield Road, Cambridge, CB2 1EW, UK
- Peter Wright
- Department of Chemistry, Centre for Molecular Informatics, University of Cambridge, Lensfield Road, Cambridge, CB2 1EW, UK
- Aleksandra Bartosik
- Department of Chemistry, Centre for Molecular Informatics, University of Cambridge, Lensfield Road, Cambridge, CB2 1EW, UK
- Daniela Dolciami
- Department of Chemistry, Centre for Molecular Informatics, University of Cambridge, Lensfield Road, Cambridge, CB2 1EW, UK
- Department of Pharmaceutical Sciences, University of Perugia, Via del Liceo 1, 06123, Perugia, Italy
- Abdurrahman Elbasir
- Department of Chemistry, Centre for Molecular Informatics, University of Cambridge, Lensfield Road, Cambridge, CB2 1EW, UK
- ICT Department, College of Science and Engineering, Hamad Bin Khalifa University, Doha, Qatar
- Hongbin Yang
- Department of Chemistry, Centre for Molecular Informatics, University of Cambridge, Lensfield Road, Cambridge, CB2 1EW, UK
- Andreas Bender
- Department of Chemistry, Centre for Molecular Informatics, University of Cambridge, Lensfield Road, Cambridge, CB2 1EW, UK.
6
Elbasir A, Mall R, Kunji K, Rawi R, Islam Z, Chuang GY, Kolatkar PR, Bensmail H. BCrystal: an interpretable sequence-based protein crystallization predictor. Bioinformatics 2020; 36:1429-1438. [PMID: 31603511 DOI: 10.1093/bioinformatics/btz762] [Citation(s) in RCA: 16] [Impact Index Per Article: 4.0] [Received: 04/01/2019] [Revised: 09/19/2019] [Accepted: 10/08/2019] [Indexed: 02/01/2023]
Abstract
MOTIVATION X-ray crystallography has facilitated the majority of protein structures determined to date. Sequence-based predictors that can accurately estimate protein crystallization propensities would be highly beneficial to overcome the high expenditure and large attrition rate, and to reduce the trial-and-error settings required for crystallization. RESULTS In this study, we present a novel model, BCrystal, which uses an optimized gradient boosting machine (XGBoost) on sequence, structural and physico-chemical features extracted from the proteins of interest. BCrystal also provides explanations, highlighting the most important features for the predicted crystallization propensity of an individual protein using the SHAP algorithm. On three independent test sets, BCrystal outperforms state-of-the-art sequence-based methods by more than 12.5% in accuracy, 18% in recall and 0.253 in Matthews correlation coefficient, with an average accuracy of 93.7%, recall of 96.63% and Matthews correlation coefficient of 0.868. For relative solvent accessibility of exposed residues, we observed higher values to associate positively with protein crystallizability, while the number of disordered regions, fraction of coils and tripeptide stretches that contain multiple histidines associate negatively with crystallizability. The higher accuracy of BCrystal enables it to accurately screen for sequence variants with enhanced crystallizability. AVAILABILITY AND IMPLEMENTATION Our BCrystal webserver is at https://machinelearning-protein.qcri.org/ and source code is available at https://github.com/raghvendra5688/BCrystal. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
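The Matthews correlation coefficient reported alongside accuracy and recall summarizes all four cells of a binary confusion matrix in a single value between -1 and 1. The sketch below applies the standard formula in plain Python; the function name `matthews_cc` and the counts are hypothetical and are not BCrystal's results.

```python
import math


def matthews_cc(tp, tn, fp, fn):
    """Matthews correlation coefficient from binary confusion-matrix counts.

    Returns 0.0 when any marginal total is zero, a common convention
    for the otherwise undefined case.
    """
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0
    return (tp * tn - fp * fn) / denominator


# Hypothetical counts for a 200-protein test set.
tp, tn, fp, fn = 90, 85, 15, 10
accuracy = (tp + tn) / (tp + tn + fp + fn)
recall = tp / (tp + fn)
print(accuracy, recall, round(matthews_cc(tp, tn, fp, fn), 3))  # 0.875 0.9 0.751
```

Because the numerator subtracts the product of the two error counts, MCC stays low for a classifier that scores well on recall only by over-predicting the positive class.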
Affiliation(s)
- Abdurrahman Elbasir
- ICT Division, College of Science and Engineering, Hamad Bin Khalifa University
- Raghvendra Mall
- Data Analytics, Qatar Computing Research Institute, Hamad Bin Khalifa University, Doha 34110, Qatar
- Khalid Kunji
- Data Analytics, Qatar Computing Research Institute, Hamad Bin Khalifa University, Doha 34110, Qatar
- Reda Rawi
- Vaccine Research Center, National Institute of Allergy and Infectious Diseases, National Institutes of Health, Bethesda, MD 20892, USA
- Zeyaul Islam
- Diabetes Research Center, Qatar Biomedical Research Institute, Hamad Bin Khalifa University, Doha 34100, Qatar
- Gwo-Yu Chuang
- Vaccine Research Center, National Institute of Allergy and Infectious Diseases, National Institutes of Health, Bethesda, MD 20892, USA
- Prasanna R Kolatkar
- Diabetes Research Center, Qatar Biomedical Research Institute, Hamad Bin Khalifa University, Doha 34100, Qatar
- Halima Bensmail
- Data Analytics, Qatar Computing Research Institute, Hamad Bin Khalifa University, Doha 34110, Qatar
7
Elbasir A, Moovarkumudalvan B, Kunji K, Kolatkar PR, Mall R, Bensmail H. DeepCrystal: a deep learning framework for sequence-based protein crystallization prediction. Bioinformatics 2018; 35:2216-2225. [DOI: 10.1093/bioinformatics/bty953] [Citation(s) in RCA: 24] [Impact Index Per Article: 4.0] [Received: 08/01/2018] [Revised: 10/31/2018] [Accepted: 11/17/2018] [Indexed: 12/11/2022]
Abstract
Motivation
Protein structure determination has primarily been performed using X-ray crystallography. To overcome the expensive cost, high attrition rate and series of trial-and-error settings, many in-silico methods have been developed to predict crystallization propensities of proteins based on their sequences. However, the majority of these methods build their predictors by extracting features from protein sequences, which is computationally expensive and can explode the feature space. We propose DeepCrystal, a deep learning framework for sequence-based protein crystallization prediction. It uses deep learning to identify proteins which can produce diffraction-quality crystals without the need to manually engineer additional biochemical and structural features from sequence. Our model is based on convolutional neural networks, which can exploit frequently occurring k-mers and sets of k-mers from the protein sequences to distinguish proteins that will result in diffraction-quality crystals from those that will not.
Results
Our model surpasses previous sequence-based protein crystallization predictors in terms of recall, F-score, accuracy and Matthews correlation coefficient (MCC) on three independent test sets. DeepCrystal achieves an average improvement of 1.4% and 12.1% in recall when compared to its closest competitors, Crysalis II and Crysf, respectively. In addition, DeepCrystal attains an average improvement of 2.1% and 6.0% for F-score, 1.9% and 3.9% for accuracy, and 3.8% and 7.0% for MCC w.r.t. Crysalis II and Crysf, respectively, on independent test sets.
Availability and implementation
The standalone source code and models are available at https://github.com/elbasir/DeepCrystal and a web-server is also available at https://deeplearning-protein.qcri.org.
Supplementary information
Supplementary data are available at Bioinformatics online.
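The Motivation section says a convolutional network can exploit frequently occurring k-mers in the raw protein sequence. One way to see why is that a convolution filter of width k, slid over a one-hot encoded sequence, scores how well each window matches a k-mer pattern. The sketch below demonstrates this with a hand-set filter in plain Python; it is not the DeepCrystal architecture, and the names `one_hot`, `kmer_filter`, `convolve` and the example sequence are assumptions for illustration. In a trained network the filter weights would be learned rather than fixed.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def one_hot(sequence):
    """Encode a protein sequence as a list of 20-dimensional 0/1 rows."""
    rows = []
    for residue in sequence:
        row = [0] * len(AMINO_ACIDS)
        if residue in INDEX:  # unknown residues stay all-zero
            row[INDEX[residue]] = 1
        rows.append(row)
    return rows


def kmer_filter(kmer):
    """Build a filter whose weights are the one-hot encoding of a k-mer."""
    return one_hot(kmer)


def convolve(encoded, filt):
    """Slide the filter along the sequence; one score per window."""
    k = len(filt)
    scores = []
    for start in range(len(encoded) - k + 1):
        score = sum(
            encoded[start + i][j] * filt[i][j]
            for i in range(k)
            for j in range(len(AMINO_ACIDS))
        )
        scores.append(score)
    return scores


# A window scores k only when it matches the k-mer exactly.
scores = convolve(one_hot("MHHHK"), kmer_filter("HHH"))
print(scores)  # [2, 3, 2]
```

Taking the maximum score over all windows (max pooling) then reports whether the k-mer occurs anywhere in the sequence, regardless of position.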
Affiliation(s)
- Abdurrahman Elbasir
- College of Science and Engineering, Hamad Bin Khalifa University, Doha, Qatar
- Khalid Kunji
- Qatar Computing Research Institute, Hamad Bin Khalifa University, Doha, Qatar
- Prasanna R Kolatkar
- Qatar Biomedical Research Institute and Hamad Bin Khalifa University, Doha, Qatar
- Raghvendra Mall
- Qatar Computing Research Institute, Hamad Bin Khalifa University, Doha, Qatar
- Halima Bensmail
- College of Science and Engineering, Hamad Bin Khalifa University, Doha, Qatar
- Qatar Computing Research Institute, Hamad Bin Khalifa University, Doha, Qatar