1
Cheng Y, Xu SM, Santucci K, Lindner G, Janitz M. Machine learning and related approaches in transcriptomics. Biochem Biophys Res Commun 2024; 724:150225. [PMID: 38852503 DOI: 10.1016/j.bbrc.2024.150225] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/25/2024] [Revised: 05/18/2024] [Accepted: 06/03/2024] [Indexed: 06/11/2024]
Abstract
Data acquisition for transcriptomic studies used to be the bottleneck in the transcriptomic analytical pipeline. However, recent developments in transcriptome profiling technologies have increased researchers' ability to obtain data, resulting in a shift in focus to data analysis. Incorporating machine learning (ML) into traditional analytical methods makes it possible to handle larger volumes of complex data more efficiently. Many bioinformaticians, especially those unfamiliar with ML in the study of human transcriptomics and complex biological systems, face a significant barrier stemming from their limited awareness of the current landscape of ML utilisation in this field. To address this gap, this review endeavours to introduce those individuals to the general types of ML, followed by a comprehensive range of more specific techniques, demonstrated through examples of their incorporation into analytical pipelines for human transcriptome investigations. Important computational aspects such as data pre-processing, task formulation, results (performance of ML models), and validation methods are covered. For better practical relevance, there is a strong focus on studies published within the last five years, almost exclusively examining human transcriptomes, with outcomes compared with standard non-ML tools.
Affiliation(s)
- Yuning Cheng
  - School of Biotechnology and Biomolecular Sciences, University of New South Wales, Sydney, NSW, 2052, Australia
- Si-Mei Xu
  - School of Biotechnology and Biomolecular Sciences, University of New South Wales, Sydney, NSW, 2052, Australia
- Kristina Santucci
  - School of Biotechnology and Biomolecular Sciences, University of New South Wales, Sydney, NSW, 2052, Australia
- Grace Lindner
  - School of Biotechnology and Biomolecular Sciences, University of New South Wales, Sydney, NSW, 2052, Australia
- Michael Janitz
  - School of Biotechnology and Biomolecular Sciences, University of New South Wales, Sydney, NSW, 2052, Australia
2
Wang B, Luan Y. Evaluation of normalization methods for predicting quantitative phenotypes in metagenomic data analysis. Front Genet 2024; 15:1369628. [PMID: 38903761 PMCID: PMC11188486 DOI: 10.3389/fgene.2024.1369628] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/12/2024] [Accepted: 05/13/2024] [Indexed: 06/22/2024] Open
Abstract
Genotype-to-phenotype mapping is an essential problem in the current genomic era. While qualitative case-control predictions have received significant attention, less emphasis has been placed on predicting quantitative phenotypes. This emerging field holds great promise in revealing intricate connections between microbial communities and host health. However, the presence of heterogeneity in microbiome datasets poses a substantial challenge to the accuracy of predictions and undermines the reproducibility of models. To tackle this challenge, we investigated 22 normalization methods aimed at removing heterogeneity across multiple datasets, conducted a comprehensive review of them, and evaluated their effectiveness in predicting quantitative phenotypes in three simulation scenarios and 31 real datasets. The results indicate that none of these methods demonstrates significant superiority in predicting quantitative phenotypes or attains a noteworthy reduction in Root Mean Squared Error (RMSE) of the predictions. Given the frequent occurrence of batch effects and the satisfactory performance of batch correction methods in predicting datasets affected by these effects, we strongly recommend utilizing batch correction methods as the initial step in predicting quantitative phenotypes. In summary, the performance of normalization methods in predicting metagenomic data remains a dynamic and ongoing research area. Our study contributes to this field by undertaking a comprehensive evaluation of diverse methods and offering valuable insights into their effectiveness in predicting quantitative phenotypes.
Affiliation(s)
- Beibei Wang
  - Frontier Science Center for Nonlinear Expectations, Ministry of Education, Qingdao, China
  - Research Center for Mathematics and Interdisciplinary Sciences, Shandong University, Qingdao, China
  - School of Mathematics, Shandong University, Jinan, China
- Yihui Luan
  - Frontier Science Center for Nonlinear Expectations, Ministry of Education, Qingdao, China
  - Research Center for Mathematics and Interdisciplinary Sciences, Shandong University, Qingdao, China
  - School of Mathematics, Shandong University, Jinan, China
3
Skubleny D, Ghosh S, Spratlin J, Schiller DE, Rayat GR. Feature-specific quantile normalization and feature-specific mean-variance normalization deliver robust bi-directional classification and feature selection performance between microarray and RNAseq data. BMC Bioinformatics 2024; 25:136. [PMID: 38549046 PMCID: PMC11265146 DOI: 10.1186/s12859-024-05759-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/06/2023] [Accepted: 03/20/2024] [Indexed: 04/02/2024] Open
Abstract
BACKGROUND: Cross-platform normalization seeks to minimize technological bias between microarray and RNAseq whole-transcriptome data. Incorporating multiple gene expression platforms permits external validation of experimental findings, and augments training sets for machine learning models. Here, we compare the performance of Feature Specific Quantile Normalization (FSQN) to a previously used but unvalidated and uncharacterized method we label as Feature Specific Mean Variance Normalization (FSMVN). We evaluate the performance of these methods for bidirectional normalization in the context of nested feature selection. RESULTS: FSQN and FSMVN provided clinically equivalent bidirectional model performance with and without feature selection for colon CMS and breast PAM50 classification. Using principal component analysis, we determine that these methods eliminate batch effects related to technological platforms. Without feature selection, no statistical difference was identified between the performance of FSQN and FSMVN on cross-platform data compared to within-platform distributions. Under optimal feature selection conditions, the balanced accuracy of FSQN and FSMVN was statistically equivalent to the within-platform distribution performance in multivariable linear regression analysis. FSQN and FSMVN also provided similar performance to within-platform distributions as the number of selected genes used to create models decreased. CONCLUSIONS: In the context of generating supervised machine learning classifiers for molecular subtypes, FSQN and FSMVN are equally effective. Under optimal modeling conditions, FSQN and FSMVN provide equivalent model accuracy on cross-platform normalized data compared to within-platform data. Using cross-platform data should still be approached with caution as subtle performance differences may exist depending on the classification problem, training, and testing distributions.
Affiliation(s)
- Daniel Skubleny
  - Department of Surgery, Faculty of Medicine and Dentistry, University of Alberta, Edmonton, AB, T6G 2R3, Canada
- Sunita Ghosh
  - Department of Oncology, Faculty of Medicine and Dentistry, University of Alberta, Edmonton, AB, T6G 2R3, Canada
  - Department of Mathematical and Statistical Sciences, Faculty of Science, University of Alberta, Edmonton, AB, T6G 2R3, Canada
- Jennifer Spratlin
  - Department of Oncology, Faculty of Medicine and Dentistry, University of Alberta, Edmonton, AB, T6G 2R3, Canada
- Daniel E Schiller
  - Department of Surgery, Faculty of Medicine and Dentistry, University of Alberta, Edmonton, AB, T6G 2R3, Canada
- Gina R Rayat
  - Department of Surgery, Faculty of Medicine and Dentistry, University of Alberta, Edmonton, AB, T6G 2R3, Canada
4
Wang B, Sun F, Luan Y. Comparison of the effectiveness of different normalization methods for metagenomic cross-study phenotype prediction under heterogeneity. Sci Rep 2024; 14:7024. [PMID: 38528097 DOI: 10.1038/s41598-024-57670-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/28/2023] [Accepted: 03/20/2024] [Indexed: 03/27/2024] Open
Abstract
The human microbiome, comprising microorganisms residing within and on the human body, plays a crucial role in various physiological processes and has been linked to numerous diseases. To analyze microbiome data, it is essential to account for inherent heterogeneity and variability across samples. Normalization methods have been proposed to mitigate these variations and enhance comparability. However, the performance of these methods in predicting binary phenotypes remains understudied. This study systematically evaluates different normalization methods in microbiome data analysis and their impact on disease prediction. Our findings highlight the strengths and limitations of scaling, compositional data analysis, transformation, and batch correction methods. Scaling methods like TMM show consistent performance, while compositional data analysis methods exhibit mixed results. Transformation methods, such as Blom and NPN, demonstrate promise in capturing complex associations. Batch correction methods, including BMC and Limma, consistently outperform other approaches. However, the influence of normalization methods is constrained by population effects, disease effects, and batch effects. These results provide insights for selecting appropriate normalization approaches in microbiome research, improving predictive models, and advancing personalized medicine. Future research should explore larger and more diverse datasets and develop tailored normalization strategies for microbiome data analysis.
Affiliation(s)
- Beibei Wang
  - Frontier Science Center for Nonlinear Expectations, Ministry of Education, Qingdao, 266237, China
  - Research Center for Mathematics and Interdisciplinary Sciences, Shandong University, Qingdao, 266237, China
  - School of Mathematics, Shandong University, Jinan, 250100, China
- Fengzhu Sun
  - Quantitative and Computational Biology Department, University of Southern California, Los Angeles, 90089, USA
- Yihui Luan
  - Frontier Science Center for Nonlinear Expectations, Ministry of Education, Qingdao, 266237, China
  - Research Center for Mathematics and Interdisciplinary Sciences, Shandong University, Qingdao, 266237, China
  - School of Mathematics, Shandong University, Jinan, 250100, China
5
Borisov N, Tkachev V, Simonov A, Sorokin M, Kim E, Kuzmin D, Karademir-Yilmaz B, Buzdin A. Uniformly shaped harmonization combines human transcriptomic data from different platforms while retaining their biological properties and differential gene expression patterns. Front Mol Biosci 2023; 10:1237129. [PMID: 37745690 PMCID: PMC10511763 DOI: 10.3389/fmolb.2023.1237129] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/08/2023] [Accepted: 08/28/2023] [Indexed: 09/26/2023] Open
Abstract
Introduction: Co-normalization of RNA profiles obtained using different experimental platforms and protocols opens an avenue for comprehensive comparison of relevant features such as differentially expressed genes associated with disease. Currently, most bioinformatic tools perform normalization in a flexible format that depends on the individual datasets under analysis. Thus, the output data of such normalizations are poorly compatible with each other. Recently we proposed a new approach to gene expression data normalization termed Shambhala, which returns harmonized data in a uniform shape, where every expression profile is transformed into a pre-defined universal format. We previously showed that following shambhalization of human RNA profiles, overall tissue-specific clustering features are strongly retained while platform-specific clustering is dramatically reduced. Methods: Here, we tested Shambhala performance in retention of fold-change gene expression features and other functional characteristics of gene clusters such as pathway activation levels and predicted cancer drug activity scores. Results: Using 6,793 cancer and 11,135 normal tissue gene expression profiles from the literature and experimental datasets, we applied twelve performance criteria to different versions of Shambhala and to other methods of transcriptomic harmonization with flexible output data format. These criteria dealt with biological type classifiers, hierarchical clustering, correlation/regression properties, stability of drug efficiency scores, and data quality for use with machine learning classifiers. Discussion: The Shambhala-2 harmonizer demonstrated the best results, with correlation and linear regression coefficients close to 1 for the comparison of training vs validation datasets and more than twofold lower instability in the calculation of drug efficiency scores compared to other methods.
Affiliation(s)
- Nicolas Borisov
  - Omicsway Corp, Walnut, CA, United States
  - Moscow Institute of Physics and Technology, Dolgoprudny, Russia
- Alexander Simonov
  - Moscow Institute of Physics and Technology, Dolgoprudny, Russia
  - Oncobox Ltd., Moscow, Russia
- Maxim Sorokin
  - Moscow Institute of Physics and Technology, Dolgoprudny, Russia
  - Oncobox Ltd., Moscow, Russia
  - World-Class Research Center “Digital Biodesign and Personalized Healthcare”, Sechenov First Moscow State Medical University, Moscow, Russia
- Ella Kim
  - Clinic for Neurosurgery, Laboratory of Experimental Neurooncology, Johannes Gutenberg University Medical Centre, Mainz, Germany
- Denis Kuzmin
  - Moscow Institute of Physics and Technology, Dolgoprudny, Russia
- Betul Karademir-Yilmaz
  - Department of Biochemistry, School of Medicine/Genetic and Metabolic Diseases Research and Investigation Center (GEMHAM) Marmara University, Istanbul, Türkiye
- Anton Buzdin
  - Moscow Institute of Physics and Technology, Dolgoprudny, Russia
  - World-Class Research Center “Digital Biodesign and Personalized Healthcare”, Sechenov First Moscow State Medical University, Moscow, Russia
  - Shemyakin-Ovchinnikov Institute of Bioorganic Chemistry, Moscow, Russia
  - PathoBiology Group, European Organization for Research and Treatment of Cancer (EORTC), Brussels, Belgium
6
Zhou M, Bao S, Gong T, Wang Q, Sun J, Li J, Lu M, Sun W, Su J, Chen H, Liu Z. The transcriptional landscape and diagnostic potential of long non-coding RNAs in esophageal squamous cell carcinoma. Nat Commun 2023; 14:3799. [PMID: 37365153 DOI: 10.1038/s41467-023-39530-1] [Citation(s) in RCA: 5] [Impact Index Per Article: 5.0] [Received: 11/22/2022] [Accepted: 06/14/2023] [Indexed: 06/28/2023] Open
Abstract
Esophageal squamous cell carcinoma (ESCC) is a deadly cancer with no clinically relevant biomarkers for early detection. Here, we comprehensively characterized the transcriptional landscape of long non-coding RNAs (lncRNAs) in paired tumor and normal tissue specimens from 93 ESCC patients, and identified six key malignancy-specific lncRNAs that were integrated into a Multi-LncRNA Malignancy Risk Probability model (MLMRPscore). The MLMRPscore performed robustly in distinguishing ESCC from normal controls in multiple in-house and external multicenter validation cohorts, including early-stage I/II cancer. In addition, five candidate lncRNAs were confirmed to have non-invasive diagnostic potential in our institute plasma cohort, showing superior or comparable diagnostic accuracy to current clinical serological markers. Overall, this study highlights the profound and robust dysregulation of lncRNAs in ESCC and demonstrates the potential of lncRNAs as non-invasive biomarkers for the early detection of ESCC.
Affiliation(s)
- Meng Zhou
  - School of Biomedical Engineering, Eye Hospital, Wenzhou Medical University, 325027, Wenzhou, P. R. China
- Siqi Bao
  - School of Biomedical Engineering, Eye Hospital, Wenzhou Medical University, 325027, Wenzhou, P. R. China
- Tongyang Gong
  - State Key Laboratory of Molecular Oncology, National Cancer Center/National Clinical Research Center for Cancer/Cancer Hospital, Chinese Academy of Medical Sciences and Peking Union Medical College, 100021, Beijing, P. R. China
- Qiang Wang
  - Department of Anesthesiology, National Cancer Center/National Clinical Research Center for Cancer/Cancer Hospital, Chinese Academy of Medical Sciences and Peking Union Medical College, 100021, Beijing, P. R. China
- Jie Sun
  - School of Biomedical Engineering, Eye Hospital, Wenzhou Medical University, 325027, Wenzhou, P. R. China
- Jiaqi Li
  - School of Biomedical Engineering, Eye Hospital, Wenzhou Medical University, 325027, Wenzhou, P. R. China
- Minyi Lu
  - State Key Laboratory of Molecular Oncology, National Cancer Center/National Clinical Research Center for Cancer/Cancer Hospital, Chinese Academy of Medical Sciences and Peking Union Medical College, 100021, Beijing, P. R. China
- Wanyuan Sun
  - State Key Laboratory of Molecular Oncology, National Cancer Center/National Clinical Research Center for Cancer/Cancer Hospital, Chinese Academy of Medical Sciences and Peking Union Medical College, 100021, Beijing, P. R. China
- Jianzhong Su
  - School of Biomedical Engineering, Eye Hospital, Wenzhou Medical University, 325027, Wenzhou, P. R. China
- Hongyan Chen
  - State Key Laboratory of Molecular Oncology, National Cancer Center/National Clinical Research Center for Cancer/Cancer Hospital, Chinese Academy of Medical Sciences and Peking Union Medical College, 100021, Beijing, P. R. China
  - Key Laboratory of Cancer and Microbiome, National Cancer Center/National Clinical Research Center for Cancer/Cancer Hospital, Chinese Academy of Medical Sciences and Peking Union Medical College, 100021, Beijing, P. R. China
- Zhihua Liu
  - State Key Laboratory of Molecular Oncology, National Cancer Center/National Clinical Research Center for Cancer/Cancer Hospital, Chinese Academy of Medical Sciences and Peking Union Medical College, 100021, Beijing, P. R. China
7
Qin ZX, Chen GZ, Yang QQ, Wu YJ, Sun CQ, Yang XM, Luo M, Yi CR, Zhu J, Chen WH, Liu Z. Cross-Platform Transcriptomic Data Integration, Profiling, and Mining in Vibrio cholerae. Microbiol Spectr 2023; 11:e0536922. [PMID: 37191528 PMCID: PMC10269641 DOI: 10.1128/spectrum.05369-22] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/31/2022] [Accepted: 04/24/2023] [Indexed: 05/17/2023] Open
Abstract
A large number of transcriptome studies generate important data and information for the study of pathogenic mechanisms of pathogens, including Vibrio cholerae. V. cholerae transcriptome data include RNA-seq and microarray data: microarray data mainly cover clinical human and environmental samples, and RNA-seq data mainly focus on laboratory processing conditions, including different stresses and experimental animals in vivo. In this study, we integrated the data sets of both platforms using Rank-in and the limma R package normalizeBetweenArrays function, achieving the first cross-platform transcriptome data integration of V. cholerae. By integrating the entire transcriptome data, we obtained the profiles of the most active or silent genes. By transferring the integrated expression profiles into the weighted correlation network analysis (WGCNA) pipeline, we identified the important functional modules of V. cholerae in vitro stress treatment, gene manipulation, and in vitro culture as DNA transposon, chemotaxis and signaling, signal transduction, and secondary metabolic pathways, respectively. The analysis of functional module hub genes revealed the uniqueness of clinical human samples; however, under specific expression patterning, the Δhns, ΔoxyR1 strains, and tobramycin treatment group showed high expression profile similarity with human samples. By constructing a protein-protein interaction (PPI) network, we discovered several previously unreported protein interactions within transposon functional modules. IMPORTANCE: We used two techniques to integrate RNA-seq data for laboratory studies with clinical microarray data for the first time. We obtained the interactions between V. cholerae genes from a global perspective, compared the similarity between clinical human samples and the current experimental conditions, and uncovered the functional modules that play a major role under different conditions. We believe that this data integration can provide some insight and a basis for elucidating the pathogenesis and clinical control of V. cholerae.
Affiliation(s)
- Zi-Xin Qin
  - Department of Biotechnology, College of Life Science and Technology, Huazhong University of Science and Technology, Wuhan, Hubei, China
- Guo-Zhong Chen
  - Department of Biotechnology, College of Life Science and Technology, Huazhong University of Science and Technology, Wuhan, Hubei, China
- Qian-Qian Yang
  - Department of Biotechnology, College of Life Science and Technology, Huazhong University of Science and Technology, Wuhan, Hubei, China
- Ying-Jian Wu
  - Department of Bioinformatics and Systems Biology, Huazhong University of Science and Technology College of Life Sciences and Technology, Wuhan, Hubei, China
- Chu-Qing Sun
  - Department of Bioinformatics and Systems Biology, Huazhong University of Science and Technology College of Life Sciences and Technology, Wuhan, Hubei, China
- Xiao-Man Yang
  - Department of Biotechnology, College of Life Science and Technology, Huazhong University of Science and Technology, Wuhan, Hubei, China
- Mei Luo
  - Department of Biotechnology, College of Life Science and Technology, Huazhong University of Science and Technology, Wuhan, Hubei, China
- Chun-Rong Yi
  - Department of Biotechnology, College of Life Science and Technology, Huazhong University of Science and Technology, Wuhan, Hubei, China
- Jun Zhu
  - Department of Biotechnology, College of Life Science and Technology, Huazhong University of Science and Technology, Wuhan, Hubei, China
- Wei-Hua Chen
  - Department of Bioinformatics and Systems Biology, Huazhong University of Science and Technology College of Life Sciences and Technology, Wuhan, Hubei, China
- Zhi Liu
  - Department of Biotechnology, College of Life Science and Technology, Huazhong University of Science and Technology, Wuhan, Hubei, China
8
Sun R, Zhu H, Wang Y, Wang J, Jiang C, Cao Q, Zhang Y, Zhang Y, Yuan S, Liu Q. Circular RNA expression and the competitive endogenous RNA network in pathological, age-related macular degeneration events: A cross-platform normalization study. J Biomed Res 2023; 37:367-381. [PMID: 37366063 PMCID: PMC10541779 DOI: 10.7555/jbr.37.20230010] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/16/2023] [Revised: 02/20/2023] [Accepted: 02/20/2023] [Indexed: 06/28/2023] Open
Abstract
Age-related macular degeneration (AMD) causes irreversible blindness in people aged over 50 worldwide. The dysfunction of the retinal pigment epithelium is the primary cause of atrophic AMD. In the current study, we used the ComBat and Training Distribution Matching methods to integrate data obtained from the Gene Expression Omnibus database. We analyzed the integrated sequencing data using Gene Set Enrichment Analysis. Peroxisome and tumor necrosis factor-α (TNF-α) signaling and nuclear factor kappa B (NF-κB) were among the top 10 pathways, and thus we selected them to construct AMD cell models to identify differentially expressed circular RNAs (circRNAs). We then constructed a competing endogenous RNA network related to the differentially expressed circRNAs. This network included seven circRNAs, 15 microRNAs, and 82 mRNAs. The Kyoto Encyclopedia of Genes and Genomes analysis of mRNAs in this network showed that the hypoxia-inducible factor-1 (HIF-1) signaling pathway was a common downstream event. The results of the current study may provide insights into the pathological processes of atrophic AMD.
Affiliation(s)
- Ruxu Sun
  - Department of Ophthalmology, the First Affiliated Hospital of Nanjing Medical University, Nanjing Medical University, Nanjing, Jiangsu 211166, China
- Hongjing Zhu
  - Department of Ophthalmology, the First Affiliated Hospital of Nanjing Medical University, Nanjing Medical University, Nanjing, Jiangsu 211166, China
- Ying Wang
  - Department of Ophthalmology, the First Affiliated Hospital of Nanjing Medical University, Nanjing Medical University, Nanjing, Jiangsu 211166, China
- Jianan Wang
  - Department of Ophthalmology, the First Affiliated Hospital of Nanjing Medical University, Nanjing Medical University, Nanjing, Jiangsu 211166, China
- Chao Jiang
  - Department of Ophthalmology, the First Affiliated Hospital of Nanjing Medical University, Nanjing Medical University, Nanjing, Jiangsu 211166, China
- Qiuchen Cao
  - Department of Ophthalmology, the First Affiliated Hospital of Nanjing Medical University, Nanjing Medical University, Nanjing, Jiangsu 211166, China
- Yeran Zhang
  - Department of Ophthalmology, the First Affiliated Hospital of Nanjing Medical University, Nanjing Medical University, Nanjing, Jiangsu 211166, China
- Yichen Zhang
  - Department of Ophthalmology, the First Affiliated Hospital of Nanjing Medical University, Nanjing Medical University, Nanjing, Jiangsu 211166, China
- Songtao Yuan
  - Department of Ophthalmology, the First Affiliated Hospital of Nanjing Medical University, Nanjing Medical University, Nanjing, Jiangsu 211166, China
- Qinghuai Liu
  - Department of Ophthalmology, the First Affiliated Hospital of Nanjing Medical University, Nanjing Medical University, Nanjing, Jiangsu 211166, China
9
Sadeghi M, Karimi MR, Karimi AH, Ghorbanpour Farshbaf N, Barzegar A, Schmitz U. Network-Based and Machine-Learning Approaches Identify Diagnostic and Prognostic Models for EMT-Type Gastric Tumors. Genes (Basel) 2023; 14:genes14030750. [PMID: 36981021 PMCID: PMC10048224 DOI: 10.3390/genes14030750] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/17/2023] [Revised: 03/10/2023] [Accepted: 03/14/2023] [Indexed: 03/30/2023] Open
Abstract
The microsatellite stable/epithelial-mesenchymal transition (MSS/EMT) subtype of gastric cancer represents a highly aggressive class of tumors associated with low rates of survival and considerably high probabilities of recurrence. In the era of precision medicine, the accurate and prompt diagnosis of tumors of this subtype is of vital importance. In this study, we used Weighted Gene Co-expression Network Analysis (WGCNA) to identify a differentially expressed co-expression module of mRNAs in EMT-type gastric tumors. Using network analysis and linear discriminant analysis, we identified mRNA motifs and microRNA-based models with strong prognostic and diagnostic relevance: three models comprised of (i) the microRNAs miR-199a-5p and miR-141-3p, (ii) EVC/EVC2/GLI3, and (iii) PDE2A/GUCY1A1/GUCY1B1 gene expression profiles distinguish EMT-type tumors from other gastric tumors with high accuracy (Area Under the Receiver Operating Characteristic Curve (AUC) = 0.995, AUC = 0.9742, and AUC = 0.9717; respectively). Additionally, the DMD/ITGA1/CAV1 motif was identified as the top motif with consistent relevance to prognosis (hazard ratio > 3). Molecular functions of the members of the identified models highlight the central roles of MAPK, Hh, and cGMP/cAMP signaling in the pathology of the EMT subtype of gastric cancer and underscore their potential utility in precision therapeutic approaches.
Affiliation(s)
- Mehdi Sadeghi
  - Department of Cell & Molecular Biology, Semnan University, Semnan 3513119111, Iran
- Mohammad Reza Karimi
  - Department of Cell & Molecular Biology, Semnan University, Semnan 3513119111, Iran
- Amir Hossein Karimi
  - Department of Cell & Molecular Biology, Semnan University, Semnan 3513119111, Iran
- Abolfazl Barzegar
  - Department of Biology, Faculty of Natural Science, University of Tabriz, Tabriz 5166616471, Iran
- Ulf Schmitz
  - Department of Molecular & Cell Biology, James Cook University, Townsville, QLD 4811, Australia
  - Centre for Tropical Bioinformatics and Molecular Biology, Australian Institute of Tropical Health and Medicine, James Cook University, Cairns, QLD 4878, Australia
10
Foltz SM, Greene CS, Taroni JN. Cross-platform normalization enables machine learning model training on microarray and RNA-seq data simultaneously. Commun Biol 2023; 6:222. [PMID: 36841852 PMCID: PMC9968332 DOI: 10.1038/s42003-023-04588-6] [Citation(s) in RCA: 7] [Impact Index Per Article: 7.0] [Received: 10/02/2017] [Accepted: 02/13/2023] [Indexed: 02/27/2023] Open
Abstract
Large compendia of gene expression data have proven valuable for the discovery of novel biological relationships. Historically, most available RNA assays were run on microarray, while RNA-seq is now the platform of choice for many new experiments. The data structure and distributions between the platforms differ, making it challenging to combine them directly. Here we perform supervised and unsupervised machine learning evaluations to assess which existing normalization methods are best suited for combining microarray and RNA-seq data. We find that quantile and Training Distribution Matching normalization allow for supervised and unsupervised model training on microarray and RNA-seq data simultaneously. Nonparanormal normalization and z-scores are also appropriate for some applications, including pathway analysis with Pathway-Level Information Extractor (PLIER). We demonstrate that it is possible to perform effective cross-platform normalization using existing methods to combine microarray and RNA-seq data for machine learning applications.
Collapse
Affiliation(s)
- Steven M Foltz
- Department of Systems Pharmacology and Translational Therapeutics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA
- Childhood Cancer Data Lab, Alex's Lemonade Stand Foundation, Wynnewood, PA, USA
| | - Casey S Greene
- Department of Systems Pharmacology and Translational Therapeutics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA.
- Center for Health AI, University of Colorado School of Medicine, Aurora, CO, USA.
- Department of Biomedical Informatics, University of Colorado School of Medicine, Aurora, CO, USA.
| | - Jaclyn N Taroni
- Department of Systems Pharmacology and Translational Therapeutics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA.
- Childhood Cancer Data Lab, Alex's Lemonade Stand Foundation, Wynnewood, PA, USA.
| |
|
11
|
Zeng W, Li W, Huang K, Lin Z, Dai H, He Z, Liu R, Zeng Z, Qin G, Chen W, Wu Y. Predicting futile recanalization, malignant cerebral edema, and cerebral herniation using intelligible ensemble machine learning following mechanical thrombectomy for acute ischemic stroke. Front Neurol 2022; 13:982783. [PMID: 36247767 PMCID: PMC9554641 DOI: 10.3389/fneur.2022.982783] [Citation(s) in RCA: 10] [Impact Index Per Article: 5.0] [Received: 06/30/2022] [Accepted: 09/08/2022] [Indexed: 11/13/2022] Open
Abstract
Purpose: To establish an ensemble machine learning (ML) model for predicting the risk of futile recanalization, malignant cerebral edema (MCE), and cerebral herniation (CH) in patients with acute ischemic stroke (AIS) who underwent mechanical thrombectomy (MT) and recanalization. Methods: This prospective study included 110 patients with premorbid mRS ≤ 2 who met the inclusion criteria. Futile recanalization was defined as a 90-day modified Rankin Scale score >2. Clinical and imaging data were used to construct five ML models that were fused into a logistic regression algorithm using the stacking method (LR-Stacking). We added the Shapley Additive Explanation method to display crucial factors and explain the decision process of models for each patient. Prediction performances were compared using area under the receiver operating characteristic curve (AUC), F1-score, and decision curve analysis (DCA). Results: A total of 61 patients (55.5%) experienced futile recanalization, and 34 (30.9%) and 22 (20.0%) patients developed MCE and CH, respectively. In the test set, the AUCs for the LR-Stacking model were 0.949, 0.885, and 0.904 for the three outcomes mentioned above. The F1-scores were 0.882, 0.895, and 0.909, respectively. The DCA showed that the LR-Stacking model provided more net benefits for predicting MCE and CH. The most important factors were the hypodensity volume and proportion in the corresponding vascular supply area. Conclusion: Using the ensemble ML model to analyze the clinical and imaging data of AIS patients with successful recanalization at admission and within 24 h after MT allowed for accurately predicting the risks of futile recanalization, MCE, and CH.
Affiliation(s)
- Weixiong Zeng
- Department of Radiology, Nanfang Hospital, Southern Medical University, Guangzhou, China
| | - Wei Li
- Department of Neurology, The Second Hospital of Jilin University, Changchun, China
| | - Kaibin Huang
- Department of Neurology, Nanfang Hospital, Southern Medical University, Guangzhou, China
| | - Zhenzhou Lin
- Department of Neurology, Nanfang Hospital, Southern Medical University, Guangzhou, China
| | - Hui Dai
- Hospital Office, Ganzhou People's Hospital, Ganzhou, China
- Hospital Office, Ganzhou Hospital-Nanfang Hospital, Southern Medical University, Ganzhou, China
| | - Zilong He
- Department of Radiology, Nanfang Hospital, Southern Medical University, Guangzhou, China
| | - Renyi Liu
- Department of Radiology, Nanfang Hospital, Southern Medical University, Guangzhou, China
| | - Zhaodong Zeng
- Department of Radiology, Nanfang Hospital, Southern Medical University, Guangzhou, China
| | - Genggeng Qin
- Department of Radiology, Nanfang Hospital, Southern Medical University, Guangzhou, China
| | - Weiguo Chen
- Department of Radiology, Nanfang Hospital, Southern Medical University, Guangzhou, China
| | - Yongming Wu
- Department of Neurology, Nanfang Hospital, Southern Medical University, Guangzhou, China
- *Correspondence: Yongming Wu
| |
|
12
|
Borisov N, Buzdin A. Transcriptomic Harmonization as the Way for Suppressing Cross-Platform Bias and Batch Effect. Biomedicines 2022; 10:2318. [PMID: 36140419 PMCID: PMC9496268 DOI: 10.3390/biomedicines10092318] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Received: 08/22/2022] [Revised: 09/14/2022] [Accepted: 09/16/2022] [Indexed: 11/16/2022] Open
Abstract
(1) Background: The emergence of methods interrogating gene expression at high throughput gave birth to quantitative transcriptomics, but also raised the question of how to compare expression profiles obtained using different equipment and protocols and/or in different series of experiments. Addressing this issue is challenging, because all of the above variables can dramatically influence gene expression signals and, therefore, cause a plethora of peculiar features in the transcriptomic profiles. Millions of transcriptomic profiles have been obtained and deposited in public databases, but their usefulness is strongly limited by these inter-comparison issues; (2) Methods: Dozens of methods and software packages that can be generally classified as either flexible or predefined format harmonizers have been proposed, but none has to date become the gold standard for unification of this type of Big Data; (3) Results: However, recent developments show that platform/protocol/batch bias can be efficiently reduced, and not only for comparisons of limited transcriptomic datasets. Instruments have been proposed for transforming gene expression profiles into a universal, uniformly shaped format that can support multiple inter-comparisons at reasonable calculation costs. This forms a basis for universal indexing of all or most types of RNA sequencing and microarray hybridization profiles; (4) Conclusions: In this paper, we attempted to overview the landscape of modern approaches and methods in transcriptomic harmonization and focused on the practical aspects of their application.
Affiliation(s)
- Nicolas Borisov
- World-Class Research Center “Digital Biodesign and Personalized Healthcare”, Sechenov First Moscow State Medical University, 119435 Moscow, Russia
- Moscow Institute of Physics and Technology, 141701 Dolgoprudny, Russia
| | - Anton Buzdin
- World-Class Research Center “Digital Biodesign and Personalized Healthcare”, Sechenov First Moscow State Medical University, 119435 Moscow, Russia
- Moscow Institute of Physics and Technology, 141701 Dolgoprudny, Russia
- Shemyakin-Ovchinnikov Institute of Bioorganic Chemistry, 117997 Moscow, Russia
- PathoBiology Group, European Organization for Research and Treatment of Cancer (EORTC), 1200 Brussels, Belgium
| |
|
13
|
Borisov N, Sorokin M, Zolotovskaya M, Borisov C, Buzdin A. Shambhala-2: A Protocol for Uniformly Shaped Harmonization of Gene Expression Profiles of Various Formats. Curr Protoc 2022; 2:e444. [PMID: 35617464 DOI: 10.1002/cpz1.444] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Indexed: 12/11/2022]
Abstract
Uniformly shaped harmonization of gene expression profiles is central for the simultaneous comparison of multiple gene expression datasets. It is expected to operate with the gene expression data obtained using various experimental methods and equipment, and to return harmonized profiles in a uniform shape. Such uniformly shaped expression profiles from different initial datasets can be further compared directly. However, current harmonization techniques have strong limitations that prevent their broad use for bioinformatic applications. They can either operate with only up to two datasets/platforms or return data in a dynamic format that will be different for every comparison under analysis. This also does not allow for adding new data to the previously harmonized dataset(s), which complicates the analysis and increases calculation costs. We propose here a new method termed Shambhala-2 that can transform multi-platform expression data into a universal format that is identical for all harmonizations made using this technique. Shambhala-2 is based on sample-by-sample cubic conversion of the initial expression dataset into a preselected shape of the reference definitive dataset. Using 8390 samples of 12 healthy human tissue types and 4086 samples of colorectal, kidney, and lung cancer tissues, we verified Shambhala-2's capacity in restoring tissue-specific expression patterns for seven microarray and three RNA sequencing platforms. Shambhala-2 performed well for all tested combinations of RNAseq and microarray profiles, and retained gene-expression ranks, as evidenced by high correlations between different single- or aggregated gene expression metrics in pre- and post-Shambhalized samples, including preserving cancer-specific gene expression and pathway activation features. © 2022 Wiley Periodicals LLC. 
Basic Protocol: Shambhala-2 harmonizer; Alternate Protocol 1: Linear Shambhala/Shambhala-1; Alternate Protocol 2: Alternative (flexible-format and uniformly shaped) normalization methods; Support Protocol 1: Watermelon multisection (WM); Support Protocol 2: Calculation of cancer-to-normal log-fold-change (LFC) and pathway activation level (PAL).
Affiliation(s)
- Nicolas Borisov
- Omicsway Corp., Walnut, California.,Moscow Institute of Physics and Technology, Dolgoprudny, Moscow Region, Russia
| | - Maksim Sorokin
- Omicsway Corp., Walnut, California.,Moscow Institute of Physics and Technology, Dolgoprudny, Moscow Region, Russia.,I.M. Sechenov First Moscow State Medical University, Moscow, Russia
| | - Marianna Zolotovskaya
- Moscow Institute of Physics and Technology, Dolgoprudny, Moscow Region, Russia.,Oncobox Ltd., Moscow, Russia
| | | | - Anton Buzdin
- Moscow Institute of Physics and Technology, Dolgoprudny, Moscow Region, Russia.,Shemyakin-Ovchinnikov Institute of Bioorganic Chemistry, Moscow, Russia.,World-Class Research Center "Digital biodesign and personalized healthcare", Sechenov First Moscow State Medical University, Moscow, Russia.,PathoBiology Group, European Organization for Research and Treatment of Cancer (EORTC), Brussels, Belgium
| |
|
14
|
Isali I, McClellan P, Calaway A, Prunty M, Abbosh P, Mishra K, Ponsky L, Markt S, Psutka SP, Bukavina L. Gene network profiling in muscle-invasive bladder cancer: A systematic review and meta-analysis. Urol Oncol 2022; 40:197.e11-197.e23. [PMID: 35039218 PMCID: PMC10123538 DOI: 10.1016/j.urolonc.2021.11.003] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/12/2021] [Revised: 10/17/2021] [Accepted: 11/02/2021] [Indexed: 10/19/2022]
Abstract
BACKGROUND: A meta-analysis of transcriptional profiling of muscle-invasive bladder cancer (MIBC) through Gene Expression Omnibus (GEO) datasets has not been investigated. This study aims to define gene expression profiles in MIBC and to identify potential candidate genes and pathways. OBJECTIVES: To review and evaluate gene expression studies in MIBC through publicly available RNA sequencing (RNA-Seq) and microarray data in order to identify potential prognostic and therapeutic targets for MIBC. METHODS: A systematic literature search of the Ovid MEDLINE, PubMed, and Wiley Cochrane Central Register of Controlled Trials databases was performed using the terms "gene," "gene expression," and "bladder cancer" from January 1, 1990 through March 2021, focused on populations with MIBC. RESULTS: In the final analysis, GEO datasets were included. A fixed effect model was employed in the meta-analysis. Gene networking connections and gene-set functional analyses of the genes identified as differentially expressed in MIBC were performed using ImaGEO and GeneMANIA software. A heatmap for the upregulated and downregulated genes was generated along with the correlated pathways. CONCLUSION: A total of 9 genes were reported in this analysis. Six genes were reported as upregulated (ProTα, SPINT1, UBE2E1, RAB25, KPNB1, HDAC1) and 3 genes as downregulated (NUP188, IPO13, NUP124). Genes were found to be involved in "ubiquitin mediated proteolysis," "protein processing in endoplasmic reticulum," "transcriptional misregulation in cancer," and "RNA transport" pathways.
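For readers unfamiliar with the fixed effect model named in this abstract, the sketch below shows the standard inverse-variance estimator: each study's effect is weighted by 1/SE², and the pooled standard error is the square root of the reciprocal of the summed weights. The abstract does not state which estimator the software applied, so treating it as inverse-variance pooling is an assumption; the function name `fixed_effect_pool` and the numbers are illustrative only.

```python
import math


def fixed_effect_pool(effects, std_errors):
    """Inverse-variance fixed-effect pooling (hypothetical helper).

    effects: per-study effect sizes for one gene (e.g. log fold changes).
    std_errors: matching per-study standard errors.
    Returns (pooled_effect, pooled_standard_error).
    """
    if len(effects) != len(std_errors) or not effects:
        raise ValueError("need one standard error per effect")
    # Each study is weighted by the inverse of its variance.
    weights = [1.0 / se ** 2 for se in std_errors]
    total_weight = sum(weights)
    pooled = sum(w * e for w, e in zip(weights, effects)) / total_weight
    pooled_se = math.sqrt(1.0 / total_weight)
    return pooled, pooled_se


# Made-up log fold changes for a single gene in two datasets.
pooled, se = fixed_effect_pool([1.0, 3.0], [1.0, 0.5])
print(round(pooled, 3), round(se, 3))
```

The more precise study (smaller standard error) dominates the pooled estimate, which is the defining behaviour of a fixed effect model as opposed to a random effects model.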
Affiliation(s)
- Ilaha Isali
- Department of Urology, University Hospitals Cleveland Medical Center, Case Western Reserve University, Cleveland, OH
| | - Phillip McClellan
- Department of Mechanical and Aerospace Engineering, Case Western Reserve University, Cleveland, OH
| | - Adam Calaway
- Department of Urology, University Hospitals Cleveland Medical Center, Case Western Reserve University, Cleveland, OH; Case Comprehensive Cancer Center, Case Western Reserve School of Medicine, Cleveland, OH
| | - Megan Prunty
- Department of Urology, University Hospitals Cleveland Medical Center, Case Western Reserve University, Cleveland, OH
| | - Phillip Abbosh
- Department of Urology, Fox Chase Cancer Center, Philadelphia, PA
| | - Kirtishri Mishra
- Department of Urology, Fox Chase Cancer Center, Philadelphia, PA
| | - Lee Ponsky
- Department of Urology, University Hospitals Cleveland Medical Center, Case Western Reserve University, Cleveland, OH; Case Comprehensive Cancer Center, Case Western Reserve School of Medicine, Cleveland, OH
| | - Sarah Markt
- Department of Population and Quantitative Health Science, Case Western Reserve School of Medicine, Cleveland, OH
| | - Sarah P Psutka
- Department of Urology, University of Washington School of Medicine, Seattle, WA
| | - Laura Bukavina
- Department of Urology, University Hospitals Cleveland Medical Center, Case Western Reserve University, Cleveland, OH; Case Comprehensive Cancer Center, Case Western Reserve School of Medicine, Cleveland, OH.
| |
|
15
|
Belotti Y, Lim EH, Lim CT. The Role of the Extracellular Matrix and Tumor-Infiltrating Immune Cells in the Prognostication of High-Grade Serous Ovarian Cancer. Cancers (Basel) 2022; 14:404. [PMID: 35053566 PMCID: PMC8773831 DOI: 10.3390/cancers14020404] [Citation(s) in RCA: 10] [Impact Index Per Article: 5.0] [Received: 11/30/2021] [Revised: 01/05/2022] [Accepted: 01/11/2022] [Indexed: 12/12/2022] Open
Abstract
Ovarian cancer is the eighth global leading cause of cancer-related death among women. The most common form is the high-grade serous ovarian carcinoma (HGSOC). No further improvements in the 5-year overall survival have been seen over the last 40 years since the adoption of platinum- and taxane-based chemotherapy. Hence, a better understanding of the mechanisms governing this aggressive phenotype would help identify better therapeutic strategies. Recent research linked onset, progression, and response to treatment with dysregulated components of the tumor microenvironment (TME) in many types of cancer. In this study, using bioinformatic approaches, we identified a 19-gene TME-related HGSOC prognostic genetic panel (PLXNB2, HMCN2, NDNF, NTN1, TGFBI, CHAD, CLEC5A, PLXNA1, CST9, LOXL4, MMP17, PI3, PRSS1, SERPINA10, TLL1, CBLN2, IL26, NRG4, and WNT9A) by assessing the RNA sequencing data of 342 tumors available in the TCGA database. Using machine learning, we found that specific patterns of infiltrating immune cells characterized each risk group. Furthermore, we demonstrated the predictive potential of our risk score across different platforms and its improved prognostic performance compared with other gene panels.
Affiliation(s)
- Yuri Belotti
- Institute for Health Innovation and Technology, National University of Singapore, 14 Medical Drive, Singapore 117599, Singapore;
| | - Elaine Hsuen Lim
- Division of Medical Oncology, National Cancer Center Singapore, 11 Hospital Drive, Singapore 169610, Singapore;
| | - Chwee Teck Lim
- Institute for Health Innovation and Technology, National University of Singapore, 14 Medical Drive, Singapore 117599, Singapore;
- Department of Biomedical Engineering, National University of Singapore, 4 Engineering Drive 3, Singapore 117583, Singapore
- Mechanobiology Institute, National University of Singapore, 5A Engineering Drive 1, Singapore 117411, Singapore
| |
|
16
|
Udaondo Z. Big data and computational advancements for next generation of Microbial Biotechnology. Microb Biotechnol 2022; 15:107-109. [PMID: 34713973 PMCID: PMC8719813 DOI: 10.1111/1751-7915.13936] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 09/07/2021] [Accepted: 09/09/2021] [Indexed: 11/30/2022] Open
Affiliation(s)
- Zulema Udaondo
- Department of Biomedical Informatics, University of Arkansas for Medical Sciences, Little Rock, AR, USA
| |
|
17
|
Prognostic Matrisomal Gene Panel and Its Association with Immune Cell Infiltration in Head and Neck Carcinomas. Cancers (Basel) 2021; 13:cancers13225761. [PMID: 34830910 PMCID: PMC8616409 DOI: 10.3390/cancers13225761] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 09/27/2021] [Revised: 11/09/2021] [Accepted: 11/13/2021] [Indexed: 01/04/2023] Open
Abstract
Simple Summary: Squamous cell carcinoma of the head and neck (SCCHN) is a heterogeneous group of tumors arising from squamous cells lining different anatomic sites. This type of malignancy has been mainly investigated by focusing primarily on tumor cells, but recent evidence highlighted the importance of the tumor microenvironment (TME) in cancer growth, progression and metastasis. Hence, we hypothesized that dysregulated matrisomal components could have a common association with patient survival, irrespective of the subsite of origin of the SCCHN. Using bioinformatic methods and public datasets, we successfully identified a gene panel with prognostic value in HPV-negative and non-metastatic node-negative tumors and demonstrated its association with immune cell infiltration.
Abstract: Squamous cell carcinoma of the head and neck (SCCHN) is common worldwide and related to several risk factors including smoking, alcohol consumption, poor dentition and human papillomavirus (HPV) infection. Different etiological factors may influence the tumor microenvironment and play a role in dictating response to therapeutics. Here, we sought to investigate whether an early-stage SCCHN-specific prognostic matrisome-derived gene signature could be identified for HPV-negative SCCHN patients (n = 168), by applying a bioinformatics pipeline to the publicly available SCCHN-TCGA dataset. We identified six matrisome-derived genes with high association with prognostic outcomes in SCCHN. A six-gene risk score, the SCCHN TMI (SCCHN-tumor matrisome index: composed of MASP1, EGFL6, SFRP5, SPP1, MMP8 and P4HA1) was constructed and used to stratify patients into risk groups. Using machine learning-based deconvolution methods, we found that the risk groups were characterized by a differing abundance of infiltrating immune cells. This work highlights the key role of immune infiltration cells in the overall survival of patients affected by HPV-negative SCCHN. The identified SCCHN TMI represents a genomic tool that could potentially aid patient stratification and selection for therapy in these patients.
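To make the idea of a gene-panel risk score concrete, the sketch below computes a weighted sum of expression values per patient and splits patients at the cohort median. This is a common construction, but the abstract does not give the coefficients, the exact formula, or the cutoff rule, so everything here is an assumption: the gene names `GENE_A`/`GENE_B`, the weights, the median split, and the function names are hypothetical placeholders rather than the published index.

```python
from statistics import median


def risk_scores(patients, coefficients):
    """Weighted sum of expression values per patient (hypothetical helper).

    patients: list of dicts mapping gene name -> expression value.
    coefficients: dict mapping gene name -> weight (placeholder values,
    not taken from the publication).
    """
    return [
        sum(weight * patient[gene] for gene, weight in coefficients.items())
        for patient in patients
    ]


def stratify(scores):
    """Label each score 'high' if above the cohort median, else 'low'."""
    cutoff = median(scores)
    return ["high" if s > cutoff else "low" for s in scores]


# Placeholder panel and made-up expression values.
coefficients = {"GENE_A": 0.5, "GENE_B": -1.0}
patients = [
    {"GENE_A": 2.0, "GENE_B": 1.0},
    {"GENE_A": 4.0, "GENE_B": 0.5},
    {"GENE_A": 1.0, "GENE_B": 2.0},
]
scores = risk_scores(patients, coefficients)
print(scores, stratify(scores))
```

In practice the weights would typically come from a survival model fitted to the training cohort, and the resulting groups would be compared with a survival analysis; neither step is shown here.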
|
18
|
Andrieux G, Chakraborty S. Editorial: Integration of Multi-Omics Techniques in Cancer. Front Genet 2021; 12:733965. [PMID: 34434225 PMCID: PMC8380985 DOI: 10.3389/fgene.2021.733965] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 06/30/2021] [Accepted: 07/19/2021] [Indexed: 12/13/2022] Open
Affiliation(s)
- Geoffroy Andrieux
- Institute of Medical Bioinformatics and Systems Medicine, Medical Center - University of Freiburg, Faculty of Medicine, University of Freiburg, Freiburg, Germany.,German Cancer Consortium (DKTK) and German Cancer Research Center (DKFZ), Freiburg, Germany
| | - Sajib Chakraborty
- Molecular Systems Biology Laboratory, Department of Biochemistry and Molecular Biology, University of Dhaka, Dhaka, Bangladesh
| |
|
19
|
Tang K, Ji X, Zhou M, Deng Z, Huang Y, Zheng G, Cao Z. Rank-in: enabling integrative analysis across microarray and RNA-seq for cancer. Nucleic Acids Res 2021; 49:e99. [PMID: 34214174 PMCID: PMC8464058 DOI: 10.1093/nar/gkab554] [Citation(s) in RCA: 26] [Impact Index Per Article: 8.7] [Received: 01/07/2021] [Revised: 05/10/2021] [Accepted: 06/25/2021] [Indexed: 12/13/2022] Open
Abstract
Though transcriptomics technologies evolve rapidly in the past decades, integrative analysis of mixed data between microarray and RNA-seq remains challenging due to the inherent variability difference between them. Here, Rank-In was proposed to correct the nonbiological effects across the two technologies, enabling freely blended data for consolidated analysis. Rank-In was rigorously validated via the public cell and tissue samples tested by both technologies. On the two reference samples of the SEQC project, Rank-In not only perfectly classified the 44 profiles but also achieved the best accuracy of 0.9 on predicting TaqMan-validated DEGs. More importantly, on 327 Glioblastoma (GBM) profiles and 248, 523 heterogeneous colon cancer profiles respectively, only Rank-In can successfully discriminate every single cancer profile from normal controls, while the others cannot. Further on different sizes of mixed seq-array GBM profiles, Rank-In can robustly reproduce a median range of DEG overlapping from 0.74 to 0.83 among top genes, whereas the others never exceed 0.72. Being the first effective method enabling mixed data of cross-technology analysis, Rank-In welcomes hybrid of array and seq profiles for integrative study on large/small, paired/unpaired and balanced/imbalanced samples, opening possibility to reduce sampling space of clinical cancer patients. Rank-In can be accessed at http://www.badd-cao.net/rank-in/index.html.
Affiliation(s)
- Kailin Tang
- Department of Gastroenterology, Shanghai 10th People's Hospital and School of Life Sciences and Technology, Tongji University, 1239 Siping Road, Shanghai 200092, P.R. China
| | - Xuejie Ji
- Department of Gastroenterology, Shanghai 10th People's Hospital and School of Life Sciences and Technology, Tongji University, 1239 Siping Road, Shanghai 200092, P.R. China
| | - Mengdi Zhou
- Department of Gastroenterology, Shanghai 10th People's Hospital and School of Life Sciences and Technology, Tongji University, 1239 Siping Road, Shanghai 200092, P.R. China
| | - Zeliang Deng
- Department of Gastroenterology, Shanghai 10th People's Hospital and School of Life Sciences and Technology, Tongji University, 1239 Siping Road, Shanghai 200092, P.R. China
| | - Yuwei Huang
- Department of Gastroenterology, Shanghai 10th People's Hospital and School of Life Sciences and Technology, Tongji University, 1239 Siping Road, Shanghai 200092, P.R. China.,CAS Key Laboratory of Computational Biology, Bio-Med Big Data Center, Shanghai Institute of Nutrition and Health, University of Chinese Academy of Sciences, Chinese Academy of Science, Shanghai 200031, P.R. China
| | - Genhui Zheng
- Department of Gastroenterology, Shanghai 10th People's Hospital and School of Life Sciences and Technology, Tongji University, 1239 Siping Road, Shanghai 200092, P.R. China
| | - Zhiwei Cao
- Department of Gastroenterology, Shanghai 10th People's Hospital and School of Life Sciences and Technology, Tongji University, 1239 Siping Road, Shanghai 200092, P.R. China
| |
|
20
|
Identification of transcriptional subtypes in lung adenocarcinoma and squamous cell carcinoma through integrative analysis of microarray and RNA sequencing data. Sci Rep 2021; 11:8709. [PMID: 33888829 PMCID: PMC8062554 DOI: 10.1038/s41598-021-88209-4] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.7] [Received: 11/26/2020] [Accepted: 04/08/2021] [Indexed: 02/02/2023] Open
Abstract
Classification of tumors into subtypes can inform personalized approaches to treatment including the choice of targeted therapies. The two most common lung cancer histological subtypes, lung adenocarcinoma and lung squamous cell carcinoma, have been previously divided into transcriptional subtypes using microarray data, and corresponding signatures were subsequently used to classify RNA-seq data. Cross-platform unsupervised classification facilitates the identification of robust transcriptional subtypes by combining vast amounts of publicly available microarray and RNA-seq data. However, cross-platform classification is challenging because of intrinsic differences in data generated using the two gene expression profiling technologies. In this report, we show that robust gene expression subtypes can be identified in integrated data representing over 3500 normal and tumor lung samples profiled using two widely used platforms, Affymetrix HG-U133 Plus 2.0 Array and Illumina HiSeq RNA sequencing. We tested and analyzed consensus clustering for 384 combinations of data processing methods. The agreement between subtypes identified in single-platform and cross-platform normalized data was then evaluated using a variety of statistics. Results show that unsupervised learning can be achieved with combined microarray and RNA-seq data using selected preprocessing, cross-platform normalization, and unsupervised feature selection methods. Our analysis confirmed three lung adenocarcinoma transcriptional subtypes, but only two consistent subtypes in squamous cell carcinoma, as opposed to four subtypes previously identified. Further analysis showed that tumor subtypes were associated with distinct patterns of genomic alterations in genes coding for therapeutic targets. Importantly, by integrating quantitative proteomics data, we were able to identify tumor subtype biomarkers that effectively classify samples on the basis of both gene and protein expression. 
This study provides the basis for further integrative data analysis across gene and protein expression profiling platforms.
|
21
|
Chen W, Alexandre PA, Ribeiro G, Fukumasu H, Sun W, Reverter A, Li Y. Identification of Predictor Genes for Feed Efficiency in Beef Cattle by Applying Machine Learning Methods to Multi-Tissue Transcriptome Data. Front Genet 2021; 12:619857. [PMID: 33664767 PMCID: PMC7921797 DOI: 10.3389/fgene.2021.619857] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Received: 10/21/2020] [Accepted: 01/15/2021] [Indexed: 12/22/2022] Open
Abstract
Machine learning (ML) methods have shown promising results in identifying genes when applied to large transcriptome datasets. However, no attempt has been made to compare the performance of combining different ML methods together in the prediction of high feed efficiency (HFE) and low feed efficiency (LFE) animals. In this study, using RNA sequencing data of five tissues (adrenal gland, hypothalamus, liver, skeletal muscle, and pituitary) from nine HFE and nine LFE Nellore bulls, we evaluated the prediction accuracies of five analytical methods in classifying FE animals. These included two conventional methods for differential gene expression (DGE) analysis (t-test and edgeR) as benchmarks, and three ML methods: Random Forests (RFs), Extreme Gradient Boosting (XGBoost), and combination of both RF and XGBoost (RX). Utility of a subset of candidate genes selected from each method for classification of FE animals was assessed by support vector machine (SVM). Among all methods, the smallest subsets of genes (117) identified by RX outperformed those chosen by t-test, edgeR, RF, or XGBoost in classification accuracy of animals. Gene co-expression network analysis confirmed the interactivity existing among these genes and their relevance within the network related to their prediction ranking based on ML. The results demonstrate a great potential for applying a combination of ML methods to large transcriptome datasets to identify biologically important genes for accurately classifying FE animals.
Affiliation(s)
- Weihao Chen
- College of Animal Science and Technology, Yangzhou University, Yangzhou, China.,CSIRO Agriculture and Food, St Lucia, QLD, Australia
| | | | - Gabriela Ribeiro
- School of Animal Science and Food Engineering, University of São Paulo, Pirassununga, Brazil
| | - Heidge Fukumasu
- School of Animal Science and Food Engineering, University of São Paulo, Pirassununga, Brazil
| | - Wei Sun
- College of Animal Science and Technology, Yangzhou University, Yangzhou, China.,Institute of Agriculture Science and Technology Development, Yangzhou University, Yangzhou, China.,Joint International Research Laboratory of Agriculture and Agri-Product Safety of Ministry of Education of China, Yangzhou University, Yangzhou, China
| | | | - Yutao Li
- CSIRO Agriculture and Food, St Lucia, QLD, Australia
| |
|
22
|
Abstract
Advances in next generation sequencing (NGS) technologies resulted in a broad array of large-scale gene expression studies and an unprecedented volume of whole messenger RNA (mRNA) sequencing data, or the transcriptome (also known as RNA sequencing, or RNA-seq). These include the Genotype Tissue Expression project (GTEx) and The Cancer Genome Atlas (TCGA), among others. Here we cover some of the commonly used datasets, provide an overview on how to begin the analysis pipeline, and how to explore and interpret the data provided by these publicly available resources.
Affiliation(s)
- Yazeed Zoabi
- Faculty of Medicine, Tel Aviv University, Tel Aviv, Israel
| | - Noam Shomron
- Faculty of Medicine, Tel Aviv University, Tel Aviv, Israel.
| |
|
23
|
Linke F, Aldighieri M, Lourdusamy A, Grabowska AM, Stolnik S, Kerr ID, Merry CL, Coyle B. 3D hydrogels reveal medulloblastoma subgroup differences and identify extracellular matrix subtypes that predict patient outcome. J Pathol 2020; 253:326-338. [PMID: 33206391 PMCID: PMC7986745 DOI: 10.1002/path.5591] [Citation(s) in RCA: 9] [Impact Index Per Article: 2.3] [Received: 01/27/2020] [Revised: 10/19/2020] [Accepted: 11/10/2020] [Indexed: 12/13/2022]
Abstract
Medulloblastoma (MB) is the most common malignant brain tumour in children and is subdivided into four subgroups: WNT, SHH, Group 3, and Group 4. These molecular subgroups differ in their metastasis patterns and related prognosis rates. Conventional 2D cell culture methods fail to recapitulate these clinical differences. Realistic 3D models of the cerebellum are therefore necessary to investigate subgroup‐specific functional differences and their role in metastasis and chemoresistance. A major component of the brain extracellular matrix (ECM) is the glycosaminoglycan hyaluronan. MB cell lines encapsulated in hyaluronan hydrogels grew as tumour nodules, with Group 3 and Group 4 cell lines displaying clinically characteristic laminar metastatic patterns and levels of chemoresistance. The glycoproteins, laminin and vitronectin, were identified as subgroup‐specific, tumour‐secreted ECM factors. Gels of higher complexity, formed by incorporation of laminin or vitronectin, revealed subgroup‐specific adhesion and growth patterns closely mimicking clinical phenotypes. ECM subtypes, defined by relative levels of laminin and vitronectin expression in patient tissue microarrays and gene expression data sets, were able to identify novel high‐risk MB patient subgroups and predict overall survival. Our hyaluronan model system has therefore allowed us to functionally characterize the interaction between different MB subtypes and their environment. It highlights the prognostic and pathological role of specific ECM factors and enables preclinical development of subgroup‐specific therapies. © 2020 The Authors. The Journal of Pathology published by John Wiley & Sons, Ltd. on behalf of The Pathological Society of Great Britain and Ireland.
Affiliation(s)
- Franziska Linke
- Children's Brain Tumour Research Centre, School of Medicine, Biodiscovery Institute, University of Nottingham, Nottingham, UK
- Macha Aldighieri
- Children's Brain Tumour Research Centre, School of Medicine, Biodiscovery Institute, University of Nottingham, Nottingham, UK
- Anbarasu Lourdusamy
- Children's Brain Tumour Research Centre, School of Medicine, Biodiscovery Institute, University of Nottingham, Nottingham, UK
- Anna M Grabowska
- Division of Cancer and Stem Cells, School of Medicine, Biodiscovery Institute, University of Nottingham, Nottingham, UK
- Snow Stolnik
- Division of Molecular Therapeutics and Formulation, School of Pharmacy, University of Nottingham, Nottingham, UK
- Ian D Kerr
- School of Life Sciences, University of Nottingham, Nottingham, UK
- Catherine Lr Merry
- Division of Cancer and Stem Cells, School of Medicine, Biodiscovery Institute, University of Nottingham, Nottingham, UK
- Beth Coyle
- Children's Brain Tumour Research Centre, School of Medicine, Biodiscovery Institute, University of Nottingham, Nottingham, UK
|
24
|
Liu Z, Jiang Z, Wu N, Zhou G, Wang X. Classification of gastric cancers based on immunogenomic profiling. Transl Oncol 2020; 14:100888. [PMID: 33096337 PMCID: PMC7576512 DOI: 10.1016/j.tranon.2020.100888] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/18/2020] [Revised: 09/03/2020] [Accepted: 09/21/2020] [Indexed: 12/24/2022]
Abstract
BACKGROUND: Extensive evidence has shown that gastric cancer (GC) is heterogeneous, and many studies have focused on identifying GC subtypes based on genomic profiles. However, few studies have specifically explored GC classification and predicted the classification accuracy, which may help facilitate the optimal stratification of GC patients responsive to immunotherapy. METHODS: Using two publicly available GC genomics datasets, we classified GC on the basis of 797 immune-related genes. Unsupervised and supervised machine learning methods were used to predict the classification. RESULTS: We identified two GC subtypes that we named Immunity-High (IM-H) and Immunity-Low (IM-L), and demonstrated that this classification was reproducible and predictable by analyzing other datasets. The IM-H subtype was characterized by greater immune cell infiltration, stronger immune activities, lower tumor purity, and worse survival prognosis compared to the IM-L subtype. Besides the immune signatures, some cancer-associated pathways were hyperactivated in IM-H, including the TGF-beta signaling pathway, focal adhesion, cell adhesion molecules (CAMs), the calcium signaling pathway, the mTOR signaling pathway, the MAPK signaling pathway and the Wnt signaling pathway. In contrast, IM-L presented depressed immune signatures and increased activation of the base excision repair, DNA replication, homologous recombination, non-homologous end-joining and nucleotide excision repair pathways. Furthermore, we identified subtype-specific genomic or clinical features, and subtype-specific gene ontology and networks in the IM-H and IM-L subtypes. CONCLUSIONS: We proposed and validated two reproducible immune molecular subtypes of GC, which have potential clinical implications for the selection of GC patients for immunotherapy.
Affiliation(s)
- Zhixian Liu
- The Affiliated Cancer Hospital of Nanjing Medical University, Jiangsu Institute of Cancer Research, Jiangsu Cancer Hospital, 42 Baiziting, Nanjing 210009, Jiangsu, China; Biomedical Informatics Research Lab, School of Basic Medicine and Clinical Pharmacy, China Pharmaceutical University, Nanjing 211198, Jiangsu, China
- Zehang Jiang
- State Key Laboratory of Ophthalmology, Zhongshan Ophthalmic Center, Sun Yat-sen University, Guangzhou, 500040, Guangdong, China; Biomedical Informatics Research Lab, School of Basic Medicine and Clinical Pharmacy, China Pharmaceutical University, Nanjing 211198, Jiangsu, China
- Nan Wu
- The Affiliated Cancer Hospital of Nanjing Medical University, Jiangsu Institute of Cancer Research, Jiangsu Cancer Hospital, 42 Baiziting, Nanjing 210009, Jiangsu, China
- Guoren Zhou
- The Affiliated Cancer Hospital of Nanjing Medical University, Jiangsu Institute of Cancer Research, Jiangsu Cancer Hospital, 42 Baiziting, Nanjing 210009, Jiangsu, China
- Xiaosheng Wang
- Biomedical Informatics Research Lab, School of Basic Medicine and Clinical Pharmacy, China Pharmaceutical University, Nanjing 211198, Jiangsu, China
|
25
|
van der Kloet FM, Buurmans J, Jonker MJ, Smilde AK, Westerhuis JA. Increased comparability between RNA-Seq and microarray data by utilization of gene sets. PLoS Comput Biol 2020; 16:e1008295. [PMID: 32997685 PMCID: PMC7549825 DOI: 10.1371/journal.pcbi.1008295] [Citation(s) in RCA: 12] [Impact Index Per Article: 3.0] [Received: 11/01/2019] [Revised: 10/12/2020] [Accepted: 08/27/2020] [Indexed: 12/30/2022]
Abstract
The field of transcriptomics uses and measures mRNA as a proxy of gene expression. There are currently two major platforms in use for quantifying mRNA, microarray and RNA-Seq. Many comparative studies have shown that their results are not always consistent. In this study we aim to find a robust method to increase comparability of both platforms enabling data analysis of merged data from both platforms. We transformed high dimensional transcriptomics data from two different platforms into a lower dimensional, and biologically relevant dataset by calculating enrichment scores based on gene set collections for all samples. We compared the similarity between data from both platforms based on the raw data and on the enrichment scores. We show that this transformation alters the data in a biologically relevant way and filters out noise, which leads to increased platform concordance. We validate the procedure using predictive models built with microarray based enrichment scores to predict subtypes of breast cancer using enrichment scores based on sequenced data. Although microarray and RNA-Seq expression levels might appear different, transforming them into biologically relevant gene set enrichment scores significantly increases their correlation, which is a step forward in data integration of the two platforms. The gene set collections were shown to contain biologically relevant gene sets. More in-depth investigation on the effect of the composition, size, and number of gene sets that are used for the transformation is suggested for future research.
We transformed the high dimensional transcriptomics data from the two different platforms into lower dimensional, and biologically relevant gene set scores. These gene sets were defined a priori as specific combinations of genes (e.g. up-regulated in a certain pathway). We observed that although microarray and RNA-Seq expression levels might appear different, using these gene sets to transform the data significantly increases their correlation. This is a step forward in data integration of the two platforms. More in-depth investigation on the effect of the composition, size, and number of gene sets that are used for the transformation is suggested for future research.
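To make the idea of collapsing gene-level values into gene set scores concrete, here is a minimal sketch. It is not the enrichment method used in the paper: it swaps in a much simpler score (the mean of per-gene z-scores within each platform), and the function names and toy data layout are assumptions made for illustration only.

```python
from statistics import mean, pstdev


def zscore_genes(expr):
    """Z-score each gene across samples. expr: {gene: [value per sample]}."""
    scaled = {}
    for gene, values in expr.items():
        mu, sd = mean(values), pstdev(values)
        # Genes with no variation carry no information; score them as 0.
        scaled[gene] = [(v - mu) / sd if sd > 0 else 0.0 for v in values]
    return scaled


def gene_set_scores(expr, gene_sets):
    """Per-sample score for each gene set: mean z-score of its measured genes.

    This is a simplified stand-in for an enrichment score, not the
    published procedure. Genes absent from expr are ignored.
    """
    z = zscore_genes(expr)
    n_samples = len(next(iter(expr.values())))
    scores = {}
    for name, genes in gene_sets.items():
        present = [g for g in genes if g in z]
        if not present:
            continue  # nothing measured for this set on this platform
        scores[name] = [mean([z[g][i] for g in present]) for i in range(n_samples)]
    return scores
```

In this toy setting, two platforms that report the same samples on different scales (for example, one an affine rescaling of the other per gene) yield identical gene set scores, which illustrates why a set-level representation can be easier to compare across platforms than raw values.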
Affiliation(s)
- Jeroen Buurmans
- Swammerdam Institute for Life Sciences, University of Amsterdam
- Age K. Smilde
- Swammerdam Institute for Life Sciences, University of Amsterdam
|
26
|
Angel PW, Rajab N, Deng Y, Pacheco CM, Chen T, Lê Cao KA, Choi J, Wells CA. A simple, scalable approach to building a cross-platform transcriptome atlas. PLoS Comput Biol 2020; 16:e1008219. [PMID: 32986694 PMCID: PMC7544119 DOI: 10.1371/journal.pcbi.1008219] [Citation(s) in RCA: 9] [Impact Index Per Article: 2.3] [Received: 03/25/2020] [Revised: 10/08/2020] [Accepted: 08/04/2020] [Indexed: 12/21/2022]
Abstract
Gene expression atlases have transformed our understanding of the development, composition and function of human tissues. New technologies promise improved cellular or molecular resolution, and have led to the identification of new cell types, or better defined cell states. But as new technologies emerge, information derived on old platforms becomes obsolete. We demonstrate that it is possible to combine a large number of different profiling experiments summarised from dozens of laboratories and representing hundreds of donors, to create an integrated molecular map of human tissue. As an example, we combine 850 samples from 38 platforms to build an integrated atlas of human blood cells. We achieve robust and unbiased cell type clustering using a variance partitioning method, selecting genes with low platform bias relative to biological variation. Other than an initial rescaling, no other transformation to the primary data is applied through batch correction or renormalisation. Additional data, including single-cell datasets, can be projected for comparison, classification and annotation. The resulting atlas provides a multi-scaled approach to visualise and analyse the relationships between sets of genes and blood cell lineages, including the maturation and activation of leukocytes in vivo and in vitro. In allowing for data integration across hundreds of studies, we address a key reproducibility challenge which is faced by any new technology. This allows us to draw on the deep phenotypes and functional annotations that accompany traditional profiling methods, and provide important context to the high cellular resolution of single cell profiling. Here, we have implemented the blood atlas in the open access Stemformatics.org platform, drawing on its extensive collection of curated transcriptome data. The method is simple, scalable and amenable for rapid deployment in other biological systems or computational workflows.
Affiliation(s)
- Paul W. Angel
- Centre for Stem Cell Systems, The University of Melbourne, Melbourne, Victoria, Australia
- Nadia Rajab
- Centre for Stem Cell Systems, The University of Melbourne, Melbourne, Victoria, Australia
- Yidi Deng
- Melbourne Integrative Genomics, School of Mathematics and Statistics, The University of Melbourne, Melbourne, Victoria, Australia
- Chris M. Pacheco
- Centre for Stem Cell Systems, The University of Melbourne, Melbourne, Victoria, Australia
- Tyrone Chen
- Centre for Stem Cell Systems, The University of Melbourne, Melbourne, Victoria, Australia
- Kim-Anh Lê Cao
- Melbourne Integrative Genomics, School of Mathematics and Statistics, The University of Melbourne, Melbourne, Victoria, Australia
- Jarny Choi
- Centre for Stem Cell Systems, The University of Melbourne, Melbourne, Victoria, Australia
- Christine A. Wells
- Centre for Stem Cell Systems, The University of Melbourne, Melbourne, Victoria, Australia
|
27
|
Li X, Liu L, Goodall GJ, Schreiber A, Xu T, Li J, Le TD. A novel single-cell based method for breast cancer prognosis. PLoS Comput Biol 2020; 16:e1008133. [PMID: 32833968 PMCID: PMC7470419 DOI: 10.1371/journal.pcbi.1008133] [Citation(s) in RCA: 13] [Impact Index Per Article: 3.3] [Received: 03/30/2020] [Revised: 09/03/2020] [Accepted: 07/09/2020] [Indexed: 12/12/2022]
Abstract
Breast cancer prognosis is challenging due to the heterogeneity of the disease. Various computational methods using bulk RNA-seq data have been proposed for breast cancer prognosis. However, these methods suffer from limited performance or ambiguous biological relevance because they neglect intra-tumor heterogeneity. Recently, single cell RNA-sequencing (scRNA-seq) has emerged for studying tumor heterogeneity at the cellular level. In this paper, we propose a novel method, scPrognosis, to improve breast cancer prognosis with scRNA-seq data. scPrognosis uses the scRNA-seq data of the biological process Epithelial-to-Mesenchymal Transition (EMT). It first infers the EMT pseudotime and a dynamic gene co-expression network, then uses an integrative model to select genes important in EMT based on their expression variation and differentiation in different stages of EMT, and their roles in the dynamic gene co-expression network. To validate and apply the selected signatures to breast cancer prognosis, we use them as the features to build a prediction model with bulk RNA-seq data. The experimental results show that scPrognosis outperforms other benchmark breast cancer prognosis methods that use bulk RNA-seq data. Moreover, the dynamic changes in the expression of the selected signature genes in EMT may provide clues to the link between EMT and clinical outcomes of breast cancer. scPrognosis will also be useful when applied to scRNA-seq datasets of biological processes other than EMT.
Affiliation(s)
- Xiaomei Li
- UniSA STEM, University of South Australia, Mawson Lakes, SA, Australia
- Lin Liu
- UniSA STEM, University of South Australia, Mawson Lakes, SA, Australia
- Gregory J. Goodall
- Centre for Cancer Biology, an alliance of SA Pathology and University of South Australia, Adelaide, SA, Australia
- School of Medicine, Discipline of Medicine, University of Adelaide, SA, Australia
- Andreas Schreiber
- Centre for Cancer Biology, an alliance of SA Pathology and University of South Australia, Adelaide, SA, Australia
- Taosheng Xu
- Institute of Intelligent Machines, Hefei Institutes of Physical Science, Chinese Academy of Sciences, Hefei, 230031, China
- Jiuyong Li
- UniSA STEM, University of South Australia, Mawson Lakes, SA, Australia
- Thuc D. Le
- UniSA STEM, University of South Australia, Mawson Lakes, SA, Australia
|
28
|
Fajarda O, Duarte-Pereira S, Silva RM, Oliveira JL. Merging microarray studies to identify a common gene expression signature to several structural heart diseases. BioData Min 2020; 13:8. [PMID: 32670412 PMCID: PMC7346458 DOI: 10.1186/s13040-020-00217-8] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.0] [Received: 02/18/2020] [Accepted: 06/05/2020] [Indexed: 12/22/2022]
Abstract
BACKGROUND: Heart disease is the leading cause of death worldwide. Knowing a gene expression signature in heart disease can lead to the development of more efficient diagnosis and treatments that may prevent premature deaths. A large amount of microarray data is available in public repositories and can be used to identify differentially expressed genes. However, most microarray datasets are composed of a small number of samples, so several datasets have to be merged to obtain more reliable results, which is a challenging task. The identification of differentially expressed genes is commonly done using statistical methods. Nonetheless, these methods are based on the definition of an arbitrary threshold to select the differentially expressed genes and there is no consensus on the values that should be used. RESULTS: Nine publicly available microarray datasets from studies of different heart diseases were merged to form a dataset composed of 689 samples and 8354 features. Subsequently, the adjusted p-value and fold change were determined, and by combining a set of adjusted p-value cutoffs with a list of different fold change thresholds, 12 sets of differentially expressed genes were obtained. To select the set of differentially expressed genes that has the best accuracy in classifying samples from patients with heart diseases and samples from patients with no heart condition, the random forest algorithm was used. A set of 62 differentially expressed genes having a classification accuracy of approximately 95% was identified. CONCLUSIONS: We identified a gene expression signature common to different cardiac diseases and supported our findings by showing their involvement in the pathophysiology of the heart. The approach used in this study is suitable for the identification of gene expression signatures, and can be extended to different diseases.
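The threshold-grid step described in this abstract can be sketched roughly as follows. The cutoff values in the usage note, the function names, and the input layout are all hypothetical (the abstract does not list the actual cutoffs), and the scoring function is left to the caller: the study used random forest classification accuracy, which is not implemented here.

```python
from itertools import product


def deg_sets(stats, padj_cutoffs, fc_thresholds):
    """Build one candidate gene set per (adjusted p-value, fold change) pair.

    stats: {gene: (adjusted_p_value, log2_fold_change)} (assumed layout).
    A gene is kept when its adjusted p-value is below the cutoff and its
    absolute log2 fold change reaches the threshold.
    """
    sets = {}
    for padj_cut, fc_cut in product(padj_cutoffs, fc_thresholds):
        sets[(padj_cut, fc_cut)] = {
            gene
            for gene, (padj, lfc) in stats.items()
            if padj < padj_cut and abs(lfc) >= fc_cut
        }
    return sets


def select_best(sets, score_fn):
    """Return the threshold pair whose gene set maximises score_fn.

    score_fn is supplied by the caller, for example a cross-validated
    classifier accuracy computed on the samples restricted to those genes.
    """
    return max(sets, key=lambda key: score_fn(sets[key]))
```

For instance, three adjusted p-value cutoffs combined with four fold change thresholds, such as `deg_sets(stats, [0.01, 0.05, 0.1], [0.5, 1.0, 1.5, 2.0])`, would give 12 candidate sets; that particular 3 × 4 split is only one assumed way of arriving at the 12 sets the abstract mentions.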
Affiliation(s)
- Olga Fajarda
- IEETA/DETI, University of Aveiro, Aveiro, 3810-193 Portugal
- Sara Duarte-Pereira
- IEETA/DETI, University of Aveiro, Aveiro, 3810-193 Portugal
- Department of Medical Sciences and iBiMED-Institute of Biomedicine, University of Aveiro, Aveiro, 3810-193 Portugal
- Raquel M. Silva
- IEETA/DETI, University of Aveiro, Aveiro, 3810-193 Portugal
- Department of Medical Sciences and iBiMED-Institute of Biomedicine, University of Aveiro, Aveiro, 3810-193 Portugal
- Current Address: Universidade Católica Portuguesa, Faculdade de Medicina Dentária, CIIS-Centro de Investigação Interdisciplinar em Saúde, Campus de Viseu, Viseu, 3504-505 Portugal
|
29
|
Analysis of the Circadian Regulation of Cancer Hallmarks by a Cross-Platform Study of Colorectal Cancer Time-Series Data Reveals an Association with Genes Involved in Huntington's Disease. Cancers (Basel) 2020; 12:cancers12040963. [PMID: 32295075 PMCID: PMC7226183 DOI: 10.3390/cancers12040963] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.0] [Received: 03/04/2020] [Revised: 04/07/2020] [Accepted: 04/10/2020] [Indexed: 02/06/2023]
Abstract
Accumulating evidence points to a link between circadian clock dysfunction and the molecular events that drive tumorigenesis. Here, we investigated the connection between the circadian clock and the hallmarks of cancer in an in vitro model of colorectal cancer (CRC). We used a cross-platform data normalization method to concatenate and compare available microarray and RNA-sequencing time series data of CRC cell lines derived from the same patient at different disease stages. Our data analysis suggests differential regulation of molecular pathways between the CRC cells and identifies several of the circadian and likely clock-controlled genes (CCGs) as cancer hallmarks and circadian drug targets. Notably, we found links of the CCGs to Huntington’s disease (HD) in the metastasis-derived cells. We then investigated the impact of perturbations of our candidate genes in a cohort of 439 patients with colon adenocarcinoma retrieved from the Cancer Genome Atlas (TCGA). The analysis revealed a correlation of the differential expression levels of the candidate genes with the survival of patients. Thus, our study provides a bioinformatics workflow that allows for a comprehensive analysis of circadian properties at different stages of colorectal cancer, and identifies a new association between cancer and HD.
|
30
|
A Systematic Review on Supervised and Unsupervised Machine Learning Algorithms for Data Science. UNSUPERVISED AND SEMI-SUPERVISED LEARNING 2020. [DOI: 10.1007/978-3-030-22475-2_1] [Citation(s) in RCA: 88] [Impact Index Per Article: 22.0] [Indexed: 02/06/2023]
|
31
|
Emmett MJ, Lazar MA. Integrative regulation of physiology by histone deacetylase 3. Nat Rev Mol Cell Biol 2019; 20:102-115. [PMID: 30390028 DOI: 10.1038/s41580-018-0076-0] [Citation(s) in RCA: 109] [Impact Index Per Article: 21.8] [Indexed: 01/10/2023]
Abstract
Cell-type-specific gene expression is physiologically modulated by the binding of transcription factors to genomic enhancer sequences, to which chromatin modifiers such as histone deacetylases (HDACs) are recruited. Drugs that inhibit HDACs are in clinical use but lack specificity. HDAC3 is a stoichiometric component of nuclear receptor co-repressor complexes whose enzymatic activity depends on this interaction. HDAC3 is required for many aspects of mammalian development and physiology, for example, for controlling metabolism and circadian rhythms. In this Review, we discuss the mechanisms by which HDAC3 regulates cell type-specific enhancers, the structure of HDAC3 and its function as part of nuclear receptor co-repressors, its enzymatic activity and its post-translational modifications. We then discuss the plethora of tissue-specific physiological functions of HDAC3.
Affiliation(s)
- Matthew J Emmett
- Institute for Diabetes, Obesity, and Metabolism, Department of Medicine, University of Pennsylvania Perelman School of Medicine, Philadelphia, PA, USA; Division of Endocrinology, Diabetes, and Metabolism, Department of Medicine, University of Pennsylvania Perelman School of Medicine, Philadelphia, PA, USA
- Mitchell A Lazar
- Institute for Diabetes, Obesity, and Metabolism, Department of Medicine, University of Pennsylvania Perelman School of Medicine, Philadelphia, PA, USA; Division of Endocrinology, Diabetes, and Metabolism, Department of Medicine, University of Pennsylvania Perelman School of Medicine, Philadelphia, PA, USA
|
32
|
Jiang S, Cheng SJ, Ren LC, Wang Q, Kang YJ, Ding Y, Hou M, Yang XX, Lin Y, Liang N, Gao G. An expanded landscape of human long noncoding RNA. Nucleic Acids Res 2019; 47:7842-7856. [PMID: 31350901 PMCID: PMC6735957 DOI: 10.1093/nar/gkz621] [Citation(s) in RCA: 74] [Impact Index Per Article: 14.8] [Received: 04/28/2019] [Revised: 06/18/2019] [Accepted: 07/11/2019] [Indexed: 12/21/2022]
Abstract
Long noncoding RNAs (lncRNAs) are emerging as key regulators of multiple essential biological processes involved in physiology and pathology. By analyzing the largest compendium of 14,166 samples from normal and tumor tissues, we significantly expand the landscape of human long noncoding RNA with a high-quality atlas: RefLnc (Reference catalog of LncRNA). Powered by comprehensive annotation across multiple sources, RefLnc helps to pinpoint 275 novel intergenic lncRNAs correlated with sex, age or race as well as 369 novel ones associated with patient survival, clinical stage, tumor metastasis or recurrence. Integrated in a user-friendly online portal, the expanded catalog of human lncRNAs provides a valuable resource for investigating lncRNA function in both human biology and cancer development.
Affiliation(s)
- Shuai Jiang
- Biomedical Pioneering Innovation Center (BIOPIC), Beijing Advanced Innovation Center for Genomics (ICG), Center for Bioinformatics (CBI), and State Key Laboratory of Protein and Plant Gene Research at School of Life Sciences, Peking University, Beijing 100871, China
- Si-Jin Cheng
- Biomedical Pioneering Innovation Center (BIOPIC), Beijing Advanced Innovation Center for Genomics (ICG), Center for Bioinformatics (CBI), and State Key Laboratory of Protein and Plant Gene Research at School of Life Sciences, Peking University, Beijing 100871, China
- Li-Chen Ren
- Biomedical Pioneering Innovation Center (BIOPIC), Beijing Advanced Innovation Center for Genomics (ICG), Center for Bioinformatics (CBI), and State Key Laboratory of Protein and Plant Gene Research at School of Life Sciences, Peking University, Beijing 100871, China
- Qian Wang
- Biomedical Pioneering Innovation Center (BIOPIC), Beijing Advanced Innovation Center for Genomics (ICG), Center for Bioinformatics (CBI), and State Key Laboratory of Protein and Plant Gene Research at School of Life Sciences, Peking University, Beijing 100871, China
- Yu-Jian Kang
- Biomedical Pioneering Innovation Center (BIOPIC), Beijing Advanced Innovation Center for Genomics (ICG), Center for Bioinformatics (CBI), and State Key Laboratory of Protein and Plant Gene Research at School of Life Sciences, Peking University, Beijing 100871, China
- Yang Ding
- Biomedical Pioneering Innovation Center (BIOPIC), Beijing Advanced Innovation Center for Genomics (ICG), Center for Bioinformatics (CBI), and State Key Laboratory of Protein and Plant Gene Research at School of Life Sciences, Peking University, Beijing 100871, China
- Mei Hou
- Biomedical Pioneering Innovation Center (BIOPIC), Beijing Advanced Innovation Center for Genomics (ICG), Center for Bioinformatics (CBI), and State Key Laboratory of Protein and Plant Gene Research at School of Life Sciences, Peking University, Beijing 100871, China
- Xiao-Xu Yang
- Biomedical Pioneering Innovation Center (BIOPIC), Beijing Advanced Innovation Center for Genomics (ICG), Center for Bioinformatics (CBI), and State Key Laboratory of Protein and Plant Gene Research at School of Life Sciences, Peking University, Beijing 100871, China
- Yuan Lin
- Biomedical Pioneering Innovation Center (BIOPIC), Beijing Advanced Innovation Center for Genomics (ICG), Center for Bioinformatics (CBI), and State Key Laboratory of Protein and Plant Gene Research at School of Life Sciences, Peking University, Beijing 100871, China
- Nan Liang
- Biomedical Pioneering Innovation Center (BIOPIC), Beijing Advanced Innovation Center for Genomics (ICG), Center for Bioinformatics (CBI), and State Key Laboratory of Protein and Plant Gene Research at School of Life Sciences, Peking University, Beijing 100871, China
- Ge Gao
- Biomedical Pioneering Innovation Center (BIOPIC), Beijing Advanced Innovation Center for Genomics (ICG), Center for Bioinformatics (CBI), and State Key Laboratory of Protein and Plant Gene Research at School of Life Sciences, Peking University, Beijing 100871, China
|
33
|
Zhang L, Thapa I, Haas C, Bastola D. Multiplatform biomarker identification using a data-driven approach enables single-sample classification. BMC Bioinformatics 2019; 20:601. [PMID: 31752658 PMCID: PMC6868758 DOI: 10.1186/s12859-019-3140-7] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Received: 03/20/2019] [Accepted: 10/09/2019] [Indexed: 11/10/2022]
Abstract
BACKGROUND: High-throughput gene expression profiles have allowed discovery of potential biomarkers enabling early diagnosis, prognosis and the development of individualized treatment. However, it remains a challenge to identify a set of reliable and reproducible biomarkers across various gene expression platforms and laboratories for single-sample diagnosis and prognosis. We address this need with our Data-Driven Reference (DDR) approach, which employs stably expressed housekeeping genes as references to eliminate platform-specific biases and non-biological variabilities. RESULTS: Our method identifies biomarkers with "built-in" features, and these features can be interpreted consistently regardless of profiling technology, which enables platform-independent classification of single samples. Validation with RNA-seq data of blood platelets shows that DDR achieves superior performance in classification of six different tumor types as well as molecular target statuses (such as MET or HER2-positive, and mutant KRAS, EGFR or PIK3CA) with smaller sets of biomarkers. We demonstrate on three microarray datasets that our method is capable of identifying robust biomarkers for subgrouping medulloblastoma samples with data perturbation due to different microarray platforms. In addition to identifying the majority of subgroup-specific biomarkers in the nanoString CodeSet, some potential new biomarkers for subgrouping medulloblastoma were detected by our method. CONCLUSIONS: In this study, we present a simple, yet powerful data-driven method which contributes significantly to the identification of robust cross-platform gene signatures for single-patient disease classification to facilitate precision medicine. In addition, our method provides a new strategy for transcriptome analysis.
Affiliation(s)
- Ling Zhang
- School of Interdisciplinary Informatics, University of Nebraska at Omaha, 110 S 67th St, Omaha, 68182, NE, USA
- Ishwor Thapa
- School of Interdisciplinary Informatics, University of Nebraska at Omaha, 110 S 67th St, Omaha, 68182, NE, USA
- Christian Haas
- School of Interdisciplinary Informatics, University of Nebraska at Omaha, 110 S 67th St, Omaha, 68182, NE, USA
- Dhundy Bastola
- School of Interdisciplinary Informatics, University of Nebraska at Omaha, 110 S 67th St, Omaha, 68182, NE, USA
|
34
|
Peters TJ, French HJ, Bradford ST, Pidsley R, Stirzaker C, Varinli H, Nair S, Qu W, Song J, Giles KA, Statham AL, Speirs H, Speed TP, Clark SJ. Evaluation of cross-platform and interlaboratory concordance via consensus modelling of genomic measurements. Bioinformatics 2019; 35:560-570. [PMID: 30084929 PMCID: PMC6378945 DOI: 10.1093/bioinformatics/bty675] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.2] [Received: 05/02/2018] [Revised: 07/10/2018] [Accepted: 07/31/2018] [Indexed: 01/23/2023]
Abstract
Motivation: A synoptic view of the human genome benefits chiefly from the application of nucleic acid sequencing and microarray technologies. These platforms allow interrogation of patterns such as gene expression and DNA methylation at the vast majority of canonical loci, allowing granular insights and opportunities for validation of original findings. However, problems arise when validating against a “gold standard” measurement, since this immediately biases all subsequent measurements towards that particular technology or protocol. Since all genomic measurements are estimates, in the absence of a “gold standard” we instead empirically assess the measurement precision and sensitivity of a large suite of genomic technologies via a consensus modelling method called the row-linear model. This method is an application of the American Society for Testing and Materials Standard E691 for assessing interlaboratory precision and sources of variability across multiple testing sites. Both cross-platform and cross-locus comparisons can be made across all common loci, allowing identification of technology- and locus-specific tendencies. Results: We assess technologies including the Infinium MethylationEPIC BeadChip, whole genome bisulfite sequencing (WGBS), two different RNA-Seq protocols (PolyA+ and Ribo-Zero) and five different gene expression array platforms. Each technology thus is characterised herein, relative to the consensus. We showcase a number of applications of the row-linear model, including correlation with known interfering traits. We demonstrate a clear effect of cross-hybridisation on the sensitivity of Infinium methylation arrays. Additionally, we perform a true interlaboratory test on a set of samples interrogated on the same platform across twenty-one separate testing laboratories.
Availability and implementation: A full implementation of the row-linear model, plus extra functions for visualisation, are found in the R package consensus at https://github.com/timpeters82/consensus. Supplementary information: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Timothy J Peters
- Epigenetics Laboratory, Genomics and Epigenetics Division, Garvan Institute of Medical Research, Darlinghurst, NSW, Australia
| | - Hugh J French
- Epigenetics Laboratory, Genomics and Epigenetics Division, Garvan Institute of Medical Research, Darlinghurst, NSW, Australia; South Western Sydney Clinical School, Faculty of Medicine, University of New South Wales, Liverpool, NSW, Australia
| | - Stephen T Bradford
- Epigenetics Laboratory, Genomics and Epigenetics Division, Garvan Institute of Medical Research, Darlinghurst, NSW, Australia; CSIRO Health and Biosecurity, North Ryde, NSW, Australia
| | - Ruth Pidsley
- Epigenetics Laboratory, Genomics and Epigenetics Division, Garvan Institute of Medical Research, Darlinghurst, NSW, Australia
| | - Clare Stirzaker
- Epigenetics Laboratory, Genomics and Epigenetics Division, Garvan Institute of Medical Research, Darlinghurst, NSW, Australia; St Vincent's Clinical School, Faculty of Medicine, UNSW, Darlinghurst, NSW, Australia
| | - Hilal Varinli
- Epigenetics Laboratory, Genomics and Epigenetics Division, Garvan Institute of Medical Research, Darlinghurst, NSW, Australia; CSIRO Health and Biosecurity, North Ryde, NSW, Australia; Department of Biological Sciences, Macquarie University, North Ryde, NSW, Australia; NSW Ministry of Health, LMB 961, North Sydney, NSW, Australia
| | - Shalima Nair
- Epigenetics Laboratory, Genomics and Epigenetics Division, Garvan Institute of Medical Research, Darlinghurst, NSW, Australia
| | - Wenjia Qu
- Epigenetics Laboratory, Genomics and Epigenetics Division, Garvan Institute of Medical Research, Darlinghurst, NSW, Australia
| | - Jenny Song
- Epigenetics Laboratory, Genomics and Epigenetics Division, Garvan Institute of Medical Research, Darlinghurst, NSW, Australia
| | - Katherine A Giles
- Epigenetics Laboratory, Genomics and Epigenetics Division, Garvan Institute of Medical Research, Darlinghurst, NSW, Australia
| | - Aaron L Statham
- Epigenetics Laboratory, Genomics and Epigenetics Division, Garvan Institute of Medical Research, Darlinghurst, NSW, Australia
| | - Helen Speirs
- Ramaciotti Centre for Genomics, University of New South Wales, Randwick, NSW, Australia
| | - Terence P Speed
- Bioinformatics Division, The Walter and Eliza Hall Institute of Medical Research, Parkville, VIC, Australia; Department of Mathematics & Statistics, University of Melbourne, Melbourne, VIC, Australia
| | - Susan J Clark
- Epigenetics Laboratory, Genomics and Epigenetics Division, Garvan Institute of Medical Research, Darlinghurst, NSW, Australia; St Vincent's Clinical School, Faculty of Medicine, UNSW, Darlinghurst, NSW, Australia
|
35
|
Lim SB, Tan SJ, Lim WT, Lim CT. Compendiums of cancer transcriptomes for machine learning applications. Sci Data 2019; 6:194. [PMID: 31594947 PMCID: PMC6783425 DOI: 10.1038/s41597-019-0207-2] [Citation(s) in RCA: 13] [Impact Index Per Article: 2.6] [Received: 02/15/2019] [Accepted: 07/25/2019] [Indexed: 12/18/2022] Open
Abstract
Massive numbers of transcriptome profiles exist in the form of microarray data. The challenge is that they are processed using diverse platforms and preprocessing tools, requiring considerable time and informatics expertise for cross-dataset analyses. If a single, integrated data source existed, data reuse could be facilitated for discovery, analysis, and validation of biomarker-based clinical strategies. Here, we present merged microarray-acquired datasets (MMDs) across 11 major cancer types, curating 8,386 patient-derived tumor and tumor-free samples from 95 GEO datasets. Using machine learning algorithms, we show that diagnostic models trained on MMDs can be directly applied to RNA-seq-acquired TCGA data with high classification accuracy. Machine-learning-optimized MMDs further help reveal the immune landscape across various carcinomas, which is critically needed in disease management and clinical interventions. This unified data source may serve as an excellent training or test set to apply, develop, and refine machine learning algorithms that can be tapped to better define the genomic landscape of human cancers.
Affiliation(s)
- Su Bin Lim
- NUS Graduate School for Integrative Sciences & Engineering, National University of Singapore, Singapore, Singapore
- Department of Biomedical Engineering, National University of Singapore, Singapore, Singapore
| | - Swee Jin Tan
- Regional Scientific Affairs, Sysmex Asia Pacific, Singapore, Singapore
| | - Wan-Teck Lim
- Division of Medical Oncology, National Cancer Centre Singapore, Singapore, Singapore
- Office of Academic and Clinical Development, Duke-NUS Medical School, Singapore, Singapore
- IMCB NCC MPI Singapore Oncogenome Laboratory, Institute of Molecular and Cell Biology (IMCB), A*STAR, Singapore, Singapore
| | - Chwee Teck Lim
- NUS Graduate School for Integrative Sciences & Engineering, National University of Singapore, Singapore, Singapore.
- Department of Biomedical Engineering, National University of Singapore, Singapore, Singapore.
- Mechanobiology Institute, National University of Singapore, Singapore, Singapore.
- Institute for Health Innovation and Technology (iHealthtech), National University of Singapore, Singapore, Singapore.
|
36
|
Akter S, Xu D, Nagel SC, Bromfield JJ, Pelch K, Wilshire GB, Joshi T. Machine Learning Classifiers for Endometriosis Using Transcriptomics and Methylomics Data. Front Genet 2019; 10:766. [PMID: 31552087 PMCID: PMC6737999 DOI: 10.3389/fgene.2019.00766] [Citation(s) in RCA: 26] [Impact Index Per Article: 5.2] [Received: 04/18/2019] [Accepted: 07/19/2019] [Indexed: 12/29/2022] Open
Abstract
Endometriosis is a common, complex, yet poorly understood gynecological disorder that affects about 176 million women worldwide, significantly impairing their quality of life and imposing an economic burden. Neither a definitive clinical symptom nor a minimally invasive diagnostic method is available, leading to an average diagnostic latency of 4 to 11 years. Discovery of relevant biological patterns from microarray expression or next generation sequencing (NGS) data has been advanced over the last several decades by applying various machine learning tools. We performed machine learning analysis using 38 RNA-seq and 80 enrichment-based DNA methylation (MBD-seq) datasets. We examined how well various supervised machine learning methods, such as decision tree, partial least squares discriminant analysis (PLSDA), support vector machine, and random forest, perform in classifying endometriosis samples from control samples when trained on both transcriptomics and methylomics data. The assessment was done from two different perspectives for improving classification performance: a) the effect of three different normalization techniques and b) the effect of differential analysis using the generalized linear model (GLM). Several candidate biomarker genes were identified by multiple machine learning experiments, including NOTCH3, SNAPC2, B4GALNT1, SMAP2, DDB2, GTF3C5, and PTOV1 from the transcriptomics data analysis and TRPM6, RASSF2, TNIP2, RP3-522J7.6, FGD3, and MFSD14B from the methylomics data analysis. We concluded that an appropriate machine learning diagnostic pipeline for endometriosis should use TMM normalization for transcriptomics data, quantile or voom normalization for methylomics data, and GLM for feature space reduction and classification performance maximization.
Affiliation(s)
- Sadia Akter
- Informatics Institute, University of Missouri, Columbia, MO, United States
| | - Dong Xu
- Informatics Institute, University of Missouri, Columbia, MO, United States
- Electrical Engineering and Computer Science, University of Missouri, Columbia, MO, United States
- Christopher S. Bond Life Sciences Center, University of Missouri, Columbia, MO, United States
| | - Susan C. Nagel
- OB/GYN and Women’s Health, University of Missouri School of Medicine, Columbia, MO, United States
| | - John J. Bromfield
- OB/GYN and Women’s Health, University of Missouri School of Medicine, Columbia, MO, United States
| | - Katherine Pelch
- OB/GYN and Women’s Health, University of Missouri School of Medicine, Columbia, MO, United States
| | | | - Trupti Joshi
- Informatics Institute, University of Missouri, Columbia, MO, United States
- Christopher S. Bond Life Sciences Center, University of Missouri, Columbia, MO, United States
- Health Management and Informatics, University of Missouri, Columbia, MO, United States
|
37
|
Franks JM, Cai G, Whitfield ML. Feature specific quantile normalization enables cross-platform classification of molecular subtypes using gene expression data. Bioinformatics 2019; 34:1868-1874. [PMID: 29360996 DOI: 10.1093/bioinformatics/bty026] [Citation(s) in RCA: 37] [Impact Index Per Article: 7.4] [Received: 07/18/2017] [Accepted: 01/16/2018] [Indexed: 12/22/2022] Open
Abstract
Motivation: Molecular subtypes of cancers and autoimmune disease, defined by transcriptomic profiling, have provided insight into disease pathogenesis, molecular heterogeneity and therapeutic responses. However, technical biases inherent to different gene expression profiling platforms present a unique problem when analyzing data generated from different studies. Currently, there is a lack of effective methods designed to eliminate platform-based bias. We present a method to normalize and classify RNA-seq data using machine learning classifiers trained on DNA microarray data and molecular subtypes in two datasets: breast invasive carcinoma (BRCA) and colorectal cancer (CRC). Results: Multiple analyses show that feature specific quantile normalization (FSQN) successfully removes platform-based bias from RNA-seq data, regardless of feature scaling or machine learning algorithm. We achieve up to 98% accuracy for BRCA data and 97% accuracy for CRC data in assigning molecular subtypes to RNA-seq data normalized using FSQN and a support vector machine trained exclusively on DNA microarray data. We find that maximum accuracy was achieved when normalizing RNA-seq datasets that contain at least 25 samples. FSQN allows comparison of RNA-seq data to existing DNA microarray datasets. Using these techniques, we can successfully leverage information from existing gene expression data in new analyses despite different platforms used for gene expression profiling. Availability and implementation: FSQN has been submitted as an R package to CRAN. All code used for this study is available on Github (https://github.com/jenniferfranks/FSQN). Contact: michael.l.whitfield@dartmouth.edu. Supplementary information: Supplementary data are available at Bioinformatics online.
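As an illustration of the idea named in the abstract above, the following is a minimal Python sketch of per-feature quantile normalization. It is not the authors' R implementation: the function name `fsqn_sketch`, the rank-to-quantile mapping, and the tie handling are assumptions made for this example. Each feature (gene) in the test matrix, for instance RNA-seq, is mapped by rank onto the empirical distribution of the same feature in the target matrix, for instance microarray.

```python
import numpy as np


def fsqn_sketch(test, target):
    """Per-feature quantile normalization (illustrative sketch only).

    test   : (n_test_samples, n_features) array, e.g. RNA-seq expression
    target : (n_target_samples, n_features) array, e.g. microarray expression
    Returns an array shaped like `test` in which each column follows the
    empirical distribution of the matching column in `target`.
    """
    test = np.asarray(test, dtype=float)
    target = np.asarray(target, dtype=float)
    if test.shape[1] != target.shape[1]:
        raise ValueError("test and target must share the same features (columns)")

    n_test = test.shape[0]
    # One probability per test-sample rank (midpoint rule; an assumption).
    probs = (np.arange(n_test) + 0.5) / n_test
    out = np.empty_like(test)
    for j in range(test.shape[1]):
        # Quantiles of the target distribution for this feature.
        target_quantiles = np.quantile(target[:, j], probs)
        # Rank of each test sample within this feature (ties broken by order).
        ranks = np.argsort(np.argsort(test[:, j], kind="stable"), kind="stable")
        out[:, j] = target_quantiles[ranks]
    return out


target = np.array([[0.0, 10.0], [1.0, 20.0], [2.0, 30.0], [3.0, 40.0]])
test = np.array([[100.0, 5.0], [300.0, 1.0], [200.0, 3.0]])
normalized = fsqn_sketch(test, target)
```

In this sketch the sample ordering within each feature is preserved while the values are moved onto the target platform's scale, which is why a classifier trained on the target platform can then be applied to the normalized data.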
Affiliation(s)
| | - Guoshuai Cai
- Department of Environmental Health Sciences, Arnold School of Public Health, University of South Carolina, Columbia, SC, 29208, USA
| | - Michael L Whitfield
- Department of Molecular and Systems Biology; Department of Biomedical Data Science, Geisel School of Medicine at Dartmouth, Lebanon, NH, 03756, USA
|
38
|
Taroni JN, Grayson PC, Hu Q, Eddy S, Kretzler M, Merkel PA, Greene CS. MultiPLIER: A Transfer Learning Framework for Transcriptomics Reveals Systemic Features of Rare Disease. Cell Syst 2019; 8:380-394.e4. [PMID: 31121115 PMCID: PMC6538307 DOI: 10.1016/j.cels.2019.04.003] [Citation(s) in RCA: 62] [Impact Index Per Article: 12.4] [Received: 08/28/2018] [Revised: 01/15/2019] [Accepted: 04/12/2019] [Indexed: 12/22/2022]
Abstract
Most gene expression datasets generated by individual researchers are too small to fully benefit from unsupervised machine-learning methods. In the case of rare diseases, there may be too few cases available, even when multiple studies are combined. To address this challenge, we utilize transfer learning to extract coordinated expression patterns and use learned patterns to analyze small rare disease datasets. We trained a pathway-level information extractor (PLIER) model on a large public data compendium comprising multiple experiments, tissues, and biological conditions and then transferred the model to small datasets in an approach we call MultiPLIER. Models constructed from the public data compendium included features that aligned well to known biological factors and were more comprehensive than those constructed from individual datasets or conditions. When transferred to rare disease datasets, the models describe biological processes related to disease severity more effectively than models trained only on a given dataset.
Affiliation(s)
- Jaclyn N Taroni
- Systems Pharmacology and Translational Therapeutics, University of Pennsylvania, Philadelphia, PA, USA; Childhood Cancer Data Laboratory, Alex's Lemonade Stand Foundation, Philadelphia, PA, USA
| | - Peter C Grayson
- National Institute of Arthritis and Musculoskeletal and Skin Diseases, National Institutes of Health, Bethesda, MD, USA
| | - Qiwen Hu
- Systems Pharmacology and Translational Therapeutics, University of Pennsylvania, Philadelphia, PA, USA
| | - Sean Eddy
- Division of Nephrology, Department of Internal Medicine, Michigan Medicine, Ann Arbor, MI, USA
| | - Matthias Kretzler
- Division of Nephrology, Department of Internal Medicine, Michigan Medicine, Ann Arbor, MI, USA; Department of Computational Medicine and Bioinformatics, Michigan Medicine, Ann Arbor, MI, USA
| | - Peter A Merkel
- Division of Rheumatology and the Department of Biostatistics, Epidemiology, and Informatics, University of Pennsylvania, Philadelphia, PA, USA
| | - Casey S Greene
- Systems Pharmacology and Translational Therapeutics, University of Pennsylvania, Philadelphia, PA, USA; Childhood Cancer Data Laboratory, Alex's Lemonade Stand Foundation, Philadelphia, PA, USA; Institute of Translational Medicine and Therapeutics, University of Pennsylvania, Philadelphia, PA, USA; Institute of Biomedical Informatics, University of Pennsylvania, Philadelphia, PA, USA.
|
39
|
Computational methods for Gene Regulatory Networks reconstruction and analysis: A review. Artif Intell Med 2019; 95:133-145. [DOI: 10.1016/j.artmed.2018.10.006] [Citation(s) in RCA: 71] [Impact Index Per Article: 14.2] [Received: 06/21/2018] [Revised: 10/23/2018] [Accepted: 10/23/2018] [Indexed: 01/14/2023]
|
40
|
Bobak CA, Titus AJ, Hill JE. Comparison of common machine learning models for classification of tuberculosis using transcriptional biomarkers from integrated datasets. Appl Soft Comput 2019. [DOI: 10.1016/j.asoc.2018.10.005] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.2] [Indexed: 10/28/2022]
|
41
|
Chen C, Meng Q, Xia Y, Ding C, Wang L, Dai R, Cheng L, Gunaratne P, Gibbs RA, Min S, Coarfa C, Reid JG, Zhang C, Jiao C, Jiang Y, Giase G, Thomas A, Fitzgerald D, Brunetti T, Shieh A, Xia C, Wang Y, Wang Y, Badner JA, Gershon ES, White KP, Liu C. The transcription factor POU3F2 regulates a gene coexpression network in brain tissue from patients with psychiatric disorders. Sci Transl Med 2018; 10:eaat8178. [PMID: 30545964 PMCID: PMC6494100 DOI: 10.1126/scitranslmed.aat8178] [Citation(s) in RCA: 66] [Impact Index Per Article: 11.0] [Received: 04/09/2018] [Revised: 07/26/2018] [Accepted: 11/07/2018] [Indexed: 12/22/2022]
Abstract
Schizophrenia and bipolar disorder are complex psychiatric diseases with risks contributed by multiple genes. Dysregulation of gene expression has been implicated in these disorders, but little is known about such dysregulation in the human brain. We analyzed three transcriptome datasets from 394 postmortem brain tissue samples from patients with schizophrenia or bipolar disorder or from healthy control individuals without a known history of psychiatric disease. We built genome-wide coexpression networks that included microRNAs (miRNAs). We identified a coexpression network module that was differentially expressed in the brain tissue from patients compared to healthy control individuals. This module contained genes that were principally involved in glial and neural cell genesis and glial cell differentiation, and included schizophrenia risk genes carrying rare variants. This module included five miRNAs and 545 mRNAs, with six transcription factors serving as hub genes in this module. We found that the most connected transcription factor gene, POU3F2, also identified in a genome-wide association study for bipolar disorder, could regulate the miRNA hsa-miR-320e and other putative target mRNAs. These regulatory relationships were replicated using PsychENCODE/BrainGVEX datasets and validated by knockdown and overexpression experiments in SH-SY5Y cells and human neural progenitor cells in vitro. Thus, we identified a brain gene expression module that was enriched for rare coding variants in genes associated with schizophrenia and that contained the putative bipolar disorder risk gene POU3F2. The transcription factor POU3F2 may be a key regulator of gene expression in this disease-associated gene coexpression module.
Affiliation(s)
- Chao Chen
- Center for Medical Genetics, School of Life Sciences, Central South University, Changsha, China.
- National Clinical Research Center for Geriatric Disorders, Xiangya Hospital, Central South University, Changsha, China
| | - Qingtuan Meng
- Center for Medical Genetics, School of Life Sciences, Central South University, Changsha, China
| | - Yan Xia
- Center for Medical Genetics, School of Life Sciences, Central South University, Changsha, China
- Department of Psychiatry, SUNY Upstate Medical University, Syracuse, NY, USA
| | - Chaodong Ding
- Center for Medical Genetics, School of Life Sciences, Central South University, Changsha, China
- Department of Psychiatry, SUNY Upstate Medical University, Syracuse, NY, USA
| | - Le Wang
- Center for Medical Genetics, School of Life Sciences, Central South University, Changsha, China
- Child Health Institute of New Jersey, Department of Neuroscience, Rutgers Robert Wood Johnson Medical School, New Brunswick, NJ, USA
| | - Rujia Dai
- Center for Medical Genetics, School of Life Sciences, Central South University, Changsha, China
- Department of Psychiatry, SUNY Upstate Medical University, Syracuse, NY, USA
| | - Lijun Cheng
- Institute for Genomics and Systems Biology, University of Chicago, Chicago, IL, USA
| | - Preethi Gunaratne
- Department of Biology and Biochemistry, University of Houston, Houston, TX, USA
| | - Richard A Gibbs
- Human Genome Sequencing Center, Baylor College of Medicine, Houston, TX, USA
| | - Shishi Min
- Center for Medical Genetics, School of Life Sciences, Central South University, Changsha, China
| | - Cristian Coarfa
- Human Genome Sequencing Center, Baylor College of Medicine, Houston, TX, USA
| | - Jeffrey G Reid
- Regeneron Genetics Center, Regeneron Pharmaceuticals, Tarrytown, NY, USA
| | - Chunling Zhang
- Department of Neuroscience and Physiology, SUNY Upstate Medical University, Syracuse, NY, USA
| | - Chuan Jiao
- Department of Psychiatry, SUNY Upstate Medical University, Syracuse, NY, USA
| | - Yi Jiang
- Center for Medical Genetics, School of Life Sciences, Central South University, Changsha, China
- Department of Molecular Physiology and Biophysics, Vanderbilt University, Nashville, TN, USA
| | - Gina Giase
- School of Public Health, University of Illinois at Chicago, Chicago, IL, USA
| | - Amber Thomas
- Institute for Genomics and Systems Biology, University of Chicago, Chicago, IL, USA
| | - Dominic Fitzgerald
- Institute for Genomics and Systems Biology, University of Chicago, Chicago, IL, USA
| | - Tonya Brunetti
- Institute for Genomics and Systems Biology, University of Chicago, Chicago, IL, USA
- Colorado Center for Personalized Medicine, University of Colorado Anschutz Medical Campus, Aurora, CO 80045, USA
| | - Annie Shieh
- Department of Psychiatry, SUNY Upstate Medical University, Syracuse, NY, USA
| | - Cuihua Xia
- Center for Medical Genetics, School of Life Sciences, Central South University, Changsha, China
| | - Yongjun Wang
- The Second Xiangya Hospital, Central South University, Changsha, China
| | - Yunpeng Wang
- Norwegian Centre for Mental Disorders Research, Institute of Clinical Medicine, University of Oslo, Oslo, Norway
- LifeSpan Changes in Brain and Cognition (LCBC), Department of Psychology, University of Oslo, Oslo, Norway
| | - Judith A Badner
- Department of Psychiatry, Rush University Medical Center, Chicago, IL, USA
| | - Elliot S Gershon
- Department of Psychiatry and Behavioral Neuroscience, University of Chicago, Chicago, IL, USA
| | - Kevin P White
- Institute for Genomics and Systems Biology, University of Chicago, Chicago, IL, USA
- Tempus Labs Inc., Chicago, IL, USA
| | - Chunyu Liu
- Center for Medical Genetics, School of Life Sciences, Central South University, Changsha, China.
- Department of Psychiatry, SUNY Upstate Medical University, Syracuse, NY, USA
- Department of Psychology, Shaanxi Normal University, Xi'an, China
|
42
|
Pedersen CB, Nielsen FC, Rossing M, Olsen LR. Using microarray-based subtyping methods for breast cancer in the era of high-throughput RNA sequencing. Mol Oncol 2018; 12:2136-2146. [PMID: 30289602 PMCID: PMC6275246 DOI: 10.1002/1878-0261.12389] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.7] [Received: 05/09/2018] [Revised: 09/19/2018] [Accepted: 09/25/2018] [Indexed: 11/30/2022] Open
Abstract
Breast cancer is a highly heterogeneous disease that can be classified into multiple subtypes based on the tumor transcriptome. Most of the subtyping schemes used in clinics today are derived from analyses of microarray data from thousands of different tumors together with clinical data for the patients from which the tumors were isolated. However, RNA sequencing (RNA‐Seq) is gradually replacing microarrays as the preferred transcriptomics platform, and although transcript abundances measured by the two different technologies are largely compatible, subtyping methods developed for probe‐based microarray data are incompatible with RNA‐Seq as input data. Here, we present an RNA‐Seq data processing pipeline, which relies on the mapping of sequencing reads to the probe set target sequences instead of the human reference genome, thereby enabling probe‐based subtyping of breast cancer tumor tissue using sequencing‐based transcriptomics. By analyzing 66 breast cancer tumors for which gene expression was measured using both microarrays and RNA‐Seq, we show that RNA‐Seq data can be directly compared to microarray data using our pipeline. Additionally, we demonstrate that the established subtyping method CITBCMST (Guedj et al., 2012), which relies on a 375 probe set‐signature to classify samples into the six subtypes basL, lumA, lumB, lumC, mApo, and normL, can be applied without further modifications. This pipeline enables a seamless transition to sequencing‐based transcriptomics for future clinical purposes.
Affiliation(s)
- Christina Bligaard Pedersen
- Department of Bio and Health Informatics, Technical University of Denmark, Kemitorvet, Kongens Lyngby, Denmark; Center for Genomic Medicine, Rigshospitalet - Copenhagen University Hospital, Denmark
| | - Finn Cilius Nielsen
- Center for Genomic Medicine, Rigshospitalet - Copenhagen University Hospital, Denmark
| | - Maria Rossing
- Center for Genomic Medicine, Rigshospitalet - Copenhagen University Hospital, Denmark
| | - Lars Rønn Olsen
- Department of Bio and Health Informatics, Technical University of Denmark, Kemitorvet, Kongens Lyngby, Denmark; Center for Genomic Medicine, Rigshospitalet - Copenhagen University Hospital, Denmark
|
43
|
Johnson NT, Dhroso A, Hughes KJ, Korkin D. Biological classification with RNA-seq data: Can alternatively spliced transcript expression enhance machine learning classifiers? RNA (New York, N.Y.) 2018; 24:1119-1132. [PMID: 29941426 PMCID: PMC6097660 DOI: 10.1261/rna.062802.117] [Citation(s) in RCA: 20] [Impact Index Per Article: 3.3] [Received: 06/30/2017] [Accepted: 06/03/2018] [Indexed: 05/09/2023]
Abstract
RNA sequencing (RNA-seq) is becoming a prevalent approach to quantify gene expression and is expected to gain better insights into a number of biological and biomedical questions compared to DNA microarrays. Most importantly, RNA-seq allows us to quantify expression at the gene or transcript levels. However, leveraging the RNA-seq data requires development of new data mining and analytics methods. Supervised learning methods are commonly used approaches for biological data analysis that have recently gained attention for their applications to RNA-seq data. Here, we assess the utility of supervised learning methods trained on RNA-seq data for a diverse range of biological classification tasks. We hypothesize that the transcript-level expression data are more informative for biological classification tasks than the gene-level expression data. Our large-scale assessment utilizes multiple data sets, organisms, lab groups, and RNA-seq analysis pipelines. Overall, we performed and assessed 61 biological classification problems that leverage three independent RNA-seq data sets and include over 2000 samples that come from multiple organisms, lab groups, and RNA-seq analyses. These 61 problems include predictions of the tissue type, sex, or age of the sample, healthy or cancerous phenotypes, and pathological tumor stages for the samples from the cancerous tissue. For each problem, the performance of three normalization techniques and six machine learning classifiers was explored. We find that for every single classification problem, the transcript-based classifiers outperform or are comparable with gene expression-based methods. The top-performing techniques reached a near perfect classification accuracy, demonstrating the utility of supervised learning for RNA-seq based data analysis.
Affiliation(s)
- Nathan T Johnson
- Worcester Polytechnic Institute, Bioinformatics and Computational Biology Program, Worcester, Massachusetts 01609, USA
| | - Andi Dhroso
- Worcester Polytechnic Institute, Bioinformatics and Computational Biology Program, Worcester, Massachusetts 01609, USA
| | - Katelyn J Hughes
- Worcester Polytechnic Institute, Bioinformatics and Computational Biology Program, Worcester, Massachusetts 01609, USA
| | - Dmitry Korkin
- Worcester Polytechnic Institute, Bioinformatics and Computational Biology Program, Worcester, Massachusetts 01609, USA
- Worcester Polytechnic Institute, Department of Computer Science, Worcester, Massachusetts 01609, USA
|
44
|
Xiang R, Hayes BJ, Vander Jagt CJ, MacLeod IM, Khansefid M, Bowman PJ, Yuan Z, Prowse-Wilkins CP, Reich CM, Mason BA, Garner JB, Marett LC, Chen Y, Bolormaa S, Daetwyler HD, Chamberlain AJ, Goddard ME. Genome variants associated with RNA splicing variations in bovine are extensively shared between tissues. BMC Genomics 2018; 19:521. [PMID: 29973141 PMCID: PMC6032541 DOI: 10.1186/s12864-018-4902-8] [Citation(s) in RCA: 26] [Impact Index Per Article: 4.3] [Received: 11/16/2017] [Accepted: 06/27/2018] [Indexed: 12/12/2022] Open
Abstract
Background: Mammalian phenotypes are shaped by numerous genome variants, many of which may regulate gene transcription or RNA splicing. To identify variants with regulatory functions in cattle, an important economic and model species, we used sequence variants to map a type of expression quantitative trait loci (expression QTLs) that are associated with variations in RNA splicing, i.e., sQTLs. To further the understanding of regulatory variants, sQTLs were compared with two other types of expression QTLs, 1) variants associated with variations in gene expression, i.e., geQTLs and 2) variants associated with variations in exon expression, i.e., eeQTLs, in different tissues. Results: Using whole genome and RNA sequence data from four tissues of over 200 cattle, sQTLs identified using exon inclusion ratios were verified by matching their effects on adjacent intron excision ratios. sQTLs contained the highest percentage of variants that are within the intronic region of genes and contained the lowest percentage of variants that are within intergenic regions, compared to eeQTLs and geQTLs. Many geQTLs and sQTLs are also detected as eeQTLs. Many expression QTLs, including sQTLs, were significant in all four tissues and had a similar effect in each tissue. To verify such expression QTL sharing between tissues, variants surrounding (±1 Mb) the exon or gene were used to build local genomic relationship matrices (LGRM) and estimate genetic correlations between tissues. For many exons, the splicing and expression level was determined by the same cis additive genetic variance in different tissues. Thus, an effective but simple-to-implement meta-analysis combining information from three tissues is introduced to increase power to detect and validate sQTLs. sQTLs and eeQTLs together were more enriched for variants associated with cattle complex traits, compared to geQTLs. Several putative causal mutations were identified, including an sQTL at Chr6:87392580 within the 5th exon of kappa casein (CSN3) associated with milk production traits. Conclusions: Using novel analytical approaches, we report the first identification of numerous bovine sQTLs which are extensively shared between multiple tissue types. The significant overlaps between bovine sQTLs and complex traits QTL highlight the contribution of regulatory mutations to phenotypic variations. Electronic supplementary material: The online version of this article (10.1186/s12864-018-4902-8) contains supplementary material, which is available to authorized users.
Affiliation(s)
- Ruidong Xiang
- Faculty of Veterinary & Agricultural Science, University of Melbourne, Parkville, VIC, 3010, Australia; Agriculture Victoria, AgriBio, Centre for AgriBioscience, Bundoora, VIC, 3083, Australia
| | - Ben J Hayes
- Agriculture Victoria, AgriBio, Centre for AgriBioscience, Bundoora, VIC, 3083, Australia; Queensland Alliance for Agriculture and Food Innovation, Centre for Animal Science, University of Queensland, St. Lucia, QLD, 4067, Australia
| | - Christy J Vander Jagt
- Agriculture Victoria, AgriBio, Centre for AgriBioscience, Bundoora, VIC, 3083, Australia
| | - Iona M MacLeod
- Agriculture Victoria, AgriBio, Centre for AgriBioscience, Bundoora, VIC, 3083, Australia
| | - Majid Khansefid
- Agriculture Victoria, AgriBio, Centre for AgriBioscience, Bundoora, VIC, 3083, Australia
| | - Phil J Bowman
- Agriculture Victoria, AgriBio, Centre for AgriBioscience, Bundoora, VIC, 3083, Australia; School of Applied Systems Biology, La Trobe University, Bundoora, VIC, 3083, Australia
| | - Zehu Yuan
- Agriculture Victoria, AgriBio, Centre for AgriBioscience, Bundoora, VIC, 3083, Australia
| | - Coralie M Reich
- Agriculture Victoria, AgriBio, Centre for AgriBioscience, Bundoora, VIC, 3083, Australia
| | - Brett A Mason
- Agriculture Victoria, AgriBio, Centre for AgriBioscience, Bundoora, VIC, 3083, Australia
| | - Josie B Garner
- Agriculture Victoria, Dairy Production Science, Ellinbank, VIC, 3821, Australia
| | - Leah C Marett
- Agriculture Victoria, Dairy Production Science, Ellinbank, VIC, 3821, Australia
| | - Yizhou Chen
- Elizabeth Macarthur Agricultural Institute, New South Wales Department of Primary Industries, Camden, NSW, 2570, Australia
| | - Sunduimijid Bolormaa
- Agriculture Victoria, AgriBio, Centre for AgriBioscience, Bundoora, VIC, 3083, Australia
| | - Hans D Daetwyler
- Agriculture Victoria, AgriBio, Centre for AgriBioscience, Bundoora, VIC, 3083, Australia.,School of Applied Systems Biology, La Trobe University, Bundoora, VIC, 3083, Australia
| | - Amanda J Chamberlain
- Agriculture Victoria, AgriBio, Centre for AgriBioscience, Bundoora, VIC, 3083, Australia
| | - Michael E Goddard
- Faculty of Veterinary & Agricultural Science, University of Melbourne, Parkville, VIC, 3010, Australia.,Agriculture Victoria, AgriBio, Centre for AgriBioscience, Bundoora, VIC, 3083, Australia
| |
|
45
|
Thompson JA, Christensen BC, Marsit CJ. Methylation-to-Expression Feature Models of Breast Cancer Accurately Predict Overall Survival, Distant-Recurrence Free Survival, and Pathologic Complete Response in Multiple Cohorts. Sci Rep 2018; 8:5190. [PMID: 29581450 PMCID: PMC5979962 DOI: 10.1038/s41598-018-23494-0] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.8] [Received: 12/13/2017] [Accepted: 03/13/2018] [Indexed: 12/03/2022]
Abstract
Prognostic biomarkers serve a variety of purposes in cancer treatment and research, such as prediction of cancer progression, and treatment eligibility. Despite growing interest in multi-omic data integration for defining prognostic biomarkers, validated methods have been slow to emerge. Given that breast cancer has been the focus of intense research, it is amenable to studying the benefits of multi-omic prognostic models due to the availability of datasets. Thus, we examined the efficacy of our methylation-to-expression feature model (M2EFM) approach to combining molecular and clinical predictors to create risk scores for overall survival, distant metastasis, and chemosensitivity in breast cancer. Gene expression, DNA methylation, and clinical variables were integrated via M2EFM to build models of overall survival using 1028 breast tumor samples and applied to validation cohorts of 61 and 327 samples. Models of distant recurrence-free survival and pathologic complete response were built using 306 samples and validated on 182 samples. Despite different populations and assays, M2EFM models validated with good accuracy (C-index or AUC ≥ 0.7) for all outcomes and had the most consistent performance compared to other methods. Finally, we demonstrated that M2EFM identifies functionally relevant genes, which could be useful in translating an M2EFM biomarker to the clinic.
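The abstract above reports validation accuracy as a C-index. As a rough illustration of what that metric measures, the following is a minimal sketch of Harrell's concordance index in plain Python. It is not the M2EFM code: the function name `concordance_index` and the toy inputs are hypothetical, and pairs with tied observed times are simply skipped, which is a simplification.

```python
from itertools import combinations


def concordance_index(times, events, risk_scores):
    """Harrell's C: share of comparable pairs ranked correctly by risk.

    times       -- observed follow-up time per subject
    events      -- 1 if the event was observed, 0 if censored
    risk_scores -- model output, higher meaning higher predicted risk
    """
    concordant = 0.0
    comparable = 0
    for i, j in combinations(range(len(times)), 2):
        # Let `a` be the subject with the shorter observed time.
        a, b = (i, j) if times[i] < times[j] else (j, i)
        # Simplification (assumption): skip tied times. A pair is also not
        # comparable when the shorter time is censored.
        if times[a] == times[b] or not events[a]:
            continue
        comparable += 1
        if risk_scores[a] > risk_scores[b]:
            concordant += 1.0   # earlier failure had the higher risk score
        elif risk_scores[a] == risk_scores[b]:
            concordant += 0.5   # tied scores count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Hypothetical toy cohort: the third subject is censored.
times = [5, 10, 15, 20]
events = [1, 1, 0, 1]
risk = [0.9, 0.7, 0.3, 0.1]
print(concordance_index(times, events, risk))
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking, so the reported threshold of 0.7 sits between the two.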
Affiliation(s)
- Jeffrey A Thompson
- Department of Biostatistics, University of Kansas Medical Center, Kansas City, USA.
| | - Brock C Christensen
- Department of Epidemiology, Geisel School of Medicine at Dartmouth College, Hanover, USA
| | - Carmen J Marsit
- Department of Environmental Health, Rollins School of Public Health at Emory University, Atlanta, USA
| |
|
46
|
Song Y, Yan Z. Exploring of the molecular mechanism of rhinitis via bioinformatics methods. Mol Med Rep 2017; 17:3014-3020. [PMID: 29257233 PMCID: PMC5783521 DOI: 10.3892/mmr.2017.8213] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.4] [Received: 03/22/2017] [Accepted: 10/06/2017] [Indexed: 12/27/2022]
Abstract
The aim of this study was to analyze gene expression profiles for exploring the function and regulatory network of differentially expressed genes (DEGs) in pathogenesis of rhinitis by a bioinformatics method. The gene expression profile of GSE43523 was downloaded from the Gene Expression Omnibus database. The dataset contained 7 seasonal allergic rhinitis samples and 5 non-allergic normal samples. DEGs between rhinitis samples and normal samples were identified via the limma package of R. The WebGestalt database was used to identify enriched Kyoto Encyclopedia of Genes and Genomes (KEGG) pathways of the DEGs. The differentially co-expressed pairs of the DEGs were identified via the DCGL package in R, and the differential co-expression network was constructed based on these pairs. A protein-protein interaction (PPI) network of the DEGs was constructed based on the Search Tool for the Retrieval of Interacting Genes database. A total of 263 DEGs were identified in rhinitis samples compared with normal samples, including 125 downregulated ones and 138 upregulated ones. The DEGs were enriched in 7 KEGG pathways. 308 differential co-expression gene pairs were obtained. A differential co-expression network was constructed, containing 212 nodes. In total, 148 PPI pairs of the DEGs were identified, and a PPI network was constructed based on these pairs. Bioinformatics methods could help us identify significant genes and pathways related to the pathogenesis of rhinitis. Steroid biosynthesis pathway and metabolic pathways might play important roles in the development of allergic rhinitis (AR). Genes such as CDC42 effector protein 5, solute carrier family 39 member A11 and PR/SET domain 10 might be also associated with the pathogenesis of AR, which provided references for the molecular mechanisms of AR.
Affiliation(s)
- Yufen Song
- Department of Otolaryngology, The Third Central Hospital of Tianjin, Tianjin 300170, P.R. China
| | - Zhaohui Yan
- Department of Otolaryngology, The Third Central Hospital of Tianjin, Tianjin 300170, P.R. China
| |
|
47
|
Zhang W, Wang J, Menon S. Advancing cancer drug development through precision medicine and innovative designs. J Biopharm Stat 2017; 28:229-244. [PMID: 29173004 DOI: 10.1080/10543406.2017.1402784] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 01/18/2023]
Abstract
Precision medicine has been a hot topic in drug development over the last decade. Biomarkers have been proven useful for understanding the disease progression and treatment response in precision medicine development. Advancement of high-throughput omics technologies has enabled fast identification of molecular biomarkers with low cost. Although biomarkers have brought many promises to drug development, steep challenges arise due to a large amount of data, complexity of technology, and lack of full understanding of biology. In this article, we discuss the technologies and statistical issues that are related to omics biomarker discovery. We also provide an overview of the current development of biomarker-enabled cancer clinical trial designs.
Affiliation(s)
- Weidong Zhang
- Global Product Development, Pfizer Inc., Cambridge, MA, USA
| | - Jing Wang
- Global Product Development, Pfizer Inc., Cambridge, MA, USA
| | - Sandeep Menon
- World Research and Development, Pfizer Inc., Cambridge, MA, USA
| |
|
48
|
Tan J, Huyck M, Hu D, Zelaya RA, Hogan DA, Greene CS. ADAGE signature analysis: differential expression analysis with data-defined gene sets. BMC Bioinformatics 2017; 18:512. [PMID: 29166858 PMCID: PMC5700673 DOI: 10.1186/s12859-017-1905-4] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.9] [Received: 06/27/2017] [Accepted: 11/01/2017] [Indexed: 12/18/2022]
Abstract
BACKGROUND Gene set enrichment analysis and overrepresentation analyses are commonly used methods to determine the biological processes affected by a differential expression experiment. This approach requires biologically relevant gene sets, which are currently curated manually, limiting their availability and accuracy in many organisms without extensively curated resources. New feature learning approaches can now be paired with existing data collections to directly extract functional gene sets from big data. RESULTS Here we introduce a method to identify perturbed processes. In contrast with methods that use curated gene sets, this approach uses signatures extracted from public expression data. We first extract expression signatures from public data using ADAGE, a neural network-based feature extraction approach. We next identify signatures that are differentially active under a given treatment. Our results demonstrate that these signatures represent biological processes that are perturbed by the experiment. Because these signatures are directly learned from data without supervision, they can identify uncurated or novel biological processes. We implemented ADAGE signature analysis for the bacterial pathogen Pseudomonas aeruginosa. For the convenience of different user groups, we implemented both an R package (ADAGEpath) and a web server ( http://adage.greenelab.com ) to run these analyses. Both are open-source to allow easy expansion to other organisms or signature generation methods. We applied ADAGE signature analysis to an example dataset in which wild-type and ∆anr mutant cells were grown as biofilms on the Cystic Fibrosis genotype bronchial epithelial cells. We mapped active signatures in the dataset to KEGG pathways and compared with pathways identified using GSEA. The two approaches generally return consistent results; however, ADAGE signature analysis also identified a signature that revealed the molecularly supported link between the MexT regulon and Anr. 
CONCLUSIONS We designed ADAGE signature analysis to perform gene set analysis using data-defined functional gene signatures. This approach addresses an important gap for biologists studying non-traditional model organisms and those without extensive curated resources available. We built both an R package and web server to provide ADAGE signature analysis to the community.
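The abstract describes extracting signatures and then finding those that are differentially active under a treatment. The sketch below illustrates that idea only in outline and is not the ADAGEpath implementation. It assumes a signature can be represented as per-gene weights and that its activity in a sample is the weighted sum of that sample's expression values. The function names, gene labels, and numbers are all hypothetical, and a plain difference of group means stands in for a proper statistical test.

```python
from statistics import mean


def signature_activity(expression, weights):
    """Weighted sum of one sample's expression over a signature's genes.

    expression -- dict mapping gene name to expression value for one sample
    weights    -- dict mapping gene name to its weight in the signature
    Genes missing from the sample contribute 0 (an assumption of this sketch).
    """
    return sum(w * expression.get(gene, 0.0) for gene, w in weights.items())


def activity_difference(group_a, group_b, weights):
    """Mean signature activity in group_a minus mean activity in group_b."""
    mean_a = mean(signature_activity(sample, weights) for sample in group_a)
    mean_b = mean(signature_activity(sample, weights) for sample in group_b)
    return mean_a - mean_b


# Hypothetical signature with one positively and one negatively weighted gene.
weights = {"g1": 1.0, "g2": -1.0}
treated = [{"g1": 2.0, "g2": 0.5}, {"g1": 3.0, "g2": 0.5}]
control = [{"g1": 1.0, "g2": 1.0}]
print(activity_difference(treated, control, weights))
```

In practice the comparison would be made for every signature and followed by a test that accounts for variance and multiple comparisons, and the abstract does not specify which test the authors used.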
Affiliation(s)
- Jie Tan
- Department of Molecular and Systems Biology, Geisel School of Medicine at Dartmouth, Hanover, NH, 03755, USA
| | - Matthew Huyck
- Department of Systems Pharmacology and Translational Therapeutics, University of Pennsylvania, Philadelphia, PA, 19104, USA.,Department of Microbiology and Immunology, Geisel School of Medicine at Dartmouth, Hanover, NH, 03755, USA
| | - Dongbo Hu
- Department of Systems Pharmacology and Translational Therapeutics, University of Pennsylvania, Philadelphia, PA, 19104, USA
| | - René A Zelaya
- Department of Systems Pharmacology and Translational Therapeutics, University of Pennsylvania, Philadelphia, PA, 19104, USA
| | - Deborah A Hogan
- Department of Microbiology and Immunology, Geisel School of Medicine at Dartmouth, Hanover, NH, 03755, USA
| | - Casey S Greene
- Department of Systems Pharmacology and Translational Therapeutics, University of Pennsylvania, Philadelphia, PA, 19104, USA.
| |
|
49
|
Tan J, Doing G, Lewis KA, Price CE, Chen KM, Cady KC, Perchuk B, Laub MT, Hogan DA, Greene CS. Unsupervised Extraction of Stable Expression Signatures from Public Compendia with an Ensemble of Neural Networks. Cell Syst 2017; 5:63-71.e6. [PMID: 28711280 PMCID: PMC5532071 DOI: 10.1016/j.cels.2017.06.003] [Citation(s) in RCA: 55] [Impact Index Per Article: 7.9] [Received: 12/02/2016] [Revised: 04/11/2017] [Accepted: 06/08/2017] [Indexed: 01/18/2023]
Abstract
Cross-experiment comparisons in public data compendia are challenged by unmatched conditions and technical noise. The ADAGE method, which performs unsupervised integration with denoising autoencoder neural networks, can identify biological patterns, but because ADAGE models, like many neural networks, are over-parameterized, different ADAGE models perform equally well. To enhance model robustness and better build signatures consistent with biological pathways, we developed an ensemble ADAGE (eADAGE) that integrated stable signatures across models. We applied eADAGE to a compendium of Pseudomonas aeruginosa gene expression profiling experiments performed in 78 media. eADAGE revealed a phosphate starvation response controlled by PhoB in media with moderate phosphate and predicted that a second stimulus provided by the sensor kinase, KinB, is required for this PhoB activation. We validated this relationship using both targeted and unbiased genetic approaches. eADAGE, which captures stable biological patterns, enables cross-experiment comparisons that can highlight measured but undiscovered relationships.
Affiliation(s)
- Jie Tan
- Department of Molecular and Systems Biology, Geisel School of Medicine at Dartmouth, Hanover, NH, USA
| | - Georgia Doing
- Department of Microbiology and Immunology, Geisel School of Medicine at Dartmouth, Hanover, NH, USA
| | - Kimberley A Lewis
- Department of Microbiology and Immunology, Geisel School of Medicine at Dartmouth, Hanover, NH, USA
| | - Courtney E Price
- Department of Microbiology and Immunology, Geisel School of Medicine at Dartmouth, Hanover, NH, USA
| | - Kathleen M Chen
- Department of Systems Pharmacology and Translational Therapeutics, University of Pennsylvania, Philadelphia, PA, USA
| | - Kyle C Cady
- Department of Biology, Massachusetts Institute of Technology, Cambridge, MA, USA; Howard Hughes Medical Institute, Cambridge, MA, USA
| | - Barret Perchuk
- Department of Biology, Massachusetts Institute of Technology, Cambridge, MA, USA; Howard Hughes Medical Institute, Cambridge, MA, USA
| | - Michael T Laub
- Department of Biology, Massachusetts Institute of Technology, Cambridge, MA, USA; Howard Hughes Medical Institute, Cambridge, MA, USA
| | - Deborah A Hogan
- Department of Microbiology and Immunology, Geisel School of Medicine at Dartmouth, Hanover, NH, USA
| | - Casey S Greene
- Department of Systems Pharmacology and Translational Therapeutics, University of Pennsylvania, Philadelphia, PA, USA.
| |
|
50
|
Way GP, Allaway RJ, Bouley SJ, Fadul CE, Sanchez Y, Greene CS. A machine learning classifier trained on cancer transcriptomes detects NF1 inactivation signal in glioblastoma. BMC Genomics 2017; 18:127. [PMID: 28166733 PMCID: PMC5292791 DOI: 10.1186/s12864-017-3519-7] [Citation(s) in RCA: 27] [Impact Index Per Article: 3.9] [Received: 09/17/2016] [Accepted: 01/26/2017] [Indexed: 12/14/2022]
Abstract
BACKGROUND We have identified molecules that exhibit synthetic lethality in cells with loss of the neurofibromin 1 (NF1) tumor suppressor gene. However, recognizing tumors that have inactivation of the NF1 tumor suppressor function is challenging because the loss may occur via mechanisms that do not involve mutation of the genomic locus. Degradation of the NF1 protein, independent of NF1 mutation status, phenocopies inactivating mutations to drive tumors in human glioma cell lines. NF1 inactivation may alter the transcriptional landscape of a tumor and allow a machine learning classifier to detect which tumors will benefit from synthetic lethal molecules. RESULTS We developed a strategy to predict tumors with low NF1 activity and hence tumors that may respond to treatments that target cells lacking NF1. Using RNAseq data from The Cancer Genome Atlas (TCGA), we trained an ensemble of 500 logistic regression classifiers that integrates mutation status with whole transcriptomes to predict NF1 inactivation in glioblastoma (GBM). On TCGA data, the classifier detected NF1 mutated tumors (test set area under the receiver operating characteristic curve (AUROC) mean = 0.77, 95% quantile = 0.53 - 0.95) over 50 random initializations. On RNA-Seq data transformed into the space of gene expression microarrays, this method produced a classifier with similar performance (test set AUROC mean = 0.77, 95% quantile = 0.53 - 0.96). We applied our ensemble classifier trained on the transformed TCGA data to a microarray validation set of 12 samples with matched RNA and NF1 protein-level measurements. The classifier's NF1 score was associated with NF1 protein concentration in these samples. CONCLUSIONS We demonstrate that TCGA can be used to train accurate predictors of NF1 inactivation in GBM. The ensemble classifier performed well for samples with very high or very low NF1 protein concentrations but had mixed performance in samples with intermediate NF1 concentrations. 
Nevertheless, high-performing and validated predictors have the potential to be paired with targeted therapies and personalized medicine.
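The abstract evaluates an ensemble of classifiers by test-set AUROC. The sketch below shows, under stated assumptions, how an ensemble score and an AUROC could be computed in plain Python. Averaging per-model scores is an assumption, since the abstract does not say how the 500 classifiers are combined, and the names `ensemble_score` and `auroc` and the toy numbers are hypothetical rather than taken from the authors' code.

```python
def ensemble_score(per_model_scores):
    """Average each sample's score across models.

    per_model_scores -- list of lists, one inner list of sample scores per model.
    Averaging is an assumed combination rule, not necessarily the authors'.
    """
    n_models = len(per_model_scores)
    return [sum(sample_scores) / n_models for sample_scores in zip(*per_model_scores)]


def auroc(labels, scores):
    """AUROC as the probability that a positive sample outscores a negative one.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical scores from two models on four samples (1 = NF1 inactivated).
labels = [1, 1, 0, 0]
model_scores = [
    [0.95, 0.30, 0.55, 0.05],
    [0.85, 0.50, 0.45, 0.15],
]
combined = ensemble_score(model_scores)
print(auroc(labels, combined))
```

The pairwise form of AUROC used here is quadratic in the number of samples, which is fine for small validation sets such as the 12-sample microarray cohort mentioned in the abstract.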
Affiliation(s)
- Gregory P. Way
- Genomics and Computational Biology Graduate Program, University of Pennsylvania, Philadelphia, PA USA
- Department of Systems Pharmacology and Translational Therapeutics, University of Pennsylvania, 10-131 SCTR 34th and Civic Center Blvd, Philadelphia, PA 19104 USA
| | - Robert J. Allaway
- Department of Molecular and Systems Biology, Geisel School of Medicine at Dartmouth, Dartmouth College, HB 7650, Hanover, NH 03755 USA
| | - Stephanie J. Bouley
- Department of Molecular and Systems Biology, Geisel School of Medicine at Dartmouth, Dartmouth College, HB 7650, Hanover, NH 03755 USA
| | - Camilo E. Fadul
- Department of Neurology, University of Virginia, Charlottesville, VA USA
| | - Yolanda Sanchez
- Department of Molecular and Systems Biology, Geisel School of Medicine at Dartmouth, Dartmouth College, HB 7650, Hanover, NH 03755 USA
- Norris Cotton Cancer Center, Dartmouth-Hitchcock Medical Center, Lebanon, NH USA
| | - Casey S. Greene
- Department of Systems Pharmacology and Translational Therapeutics, University of Pennsylvania, 10-131 SCTR 34th and Civic Center Blvd, Philadelphia, PA 19104 USA
| |
|